qemu-devel.nongnu.org archive mirror
* [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces
       [not found] <CGME20200204095215eucas1p1bb0d5a3c183f7531d8b0e5e081f1ae6b@eucas1p1.samsung.com>
@ 2020-02-04  9:51 ` Klaus Jensen
       [not found]   ` <CGME20200204095216eucas1p2cb2b4772c04b92c97b0690c8e565234c@eucas1p2.samsung.com>
                     ` (27 more replies)
  0 siblings, 28 replies; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Hi,


Changes since v4
 - Changed the vendor and device id to use a Red Hat allocated one. For
   backwards compatibility, add the 'x-use-intel-id' nvme device
   parameter. It is off by default, but is set to true through a machine
   compat property for machine types <= 4.2.

 - SGL mapping code has been refactored.
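
   As a usage sketch (the drive setup and machine type below are made
   up for illustration; the parameter name and the <= 4.2 compat
   behavior are as described above), the legacy Intel ID can be
   requested explicitly on a newer machine type:

   ```shell
   # Hypothetical invocation: keep the legacy Intel vendor/device ID on a
   # machine type newer than 4.2, where the Red Hat ID is the new default.
   qemu-system-x86_64 -M pc-q35-5.0 \
       -drive file=nvm.img,if=none,id=nvm \
       -device nvme,serial=deadbeef,drive=nvm,x-use-intel-id=on
   ```

   On machine types <= 4.2 the compat property enables this
   automatically, so existing guests see an unchanged device.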


Comments specific to Beata's review:
 - [PATCH v4 19/24] nvme: handle dma errors
   I ended up not including any specific code for resetting the device
   when DMA transfers fail for too long. If I/O is in flight when bus
   mastering is disabled, the OS will (should) eventually reset the
   device and re-enable bus mastering (this is the behavior in Linux,
   at least). The device could also set CFS ("Controller Fatal Status")
   in the CSTS register, but I have not explored that for now.

 - [PATCH v4 17/24] nvme: allow multiple aios per command
   I forgot to answer your comment on the correctness of:

     if (unlikely((slba + nlb) > nsze)) {

   `slba` *is* the "address" of the first logical block, but it is
   expressed in units of logical blocks, so the condition should be
   correct. (and at this point `nlb` is no longer a 0's based value)
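
   To make the units concrete, here is a minimal standalone sketch of
   that check (the helper name and the early +1 conversion are
   illustrative, not the actual hw/block/nvme.c code):

   ```c
   #include <assert.h>
   #include <stdbool.h>
   #include <stdint.h>

   /* Toy model of the bounds check discussed above. In the NVMe command,
    * the NLB field is a 0's based value (0 means one block), so the
    * device adds 1 before validating the range. */
   static bool lba_range_valid(uint64_t slba, uint16_t nlb_field,
                               uint64_t nsze)
   {
       uint32_t nlb = (uint32_t)nlb_field + 1; /* convert from 0's based */

       /* slba is counted in logical blocks, so slba + nlb is directly
        * comparable against the namespace size (also in logical blocks). */
       return slba + nlb <= nsze;
   }

   int main(void)
   {
       /* namespace of 100 logical blocks: valid LBAs are 0..99 */
       assert(lba_range_valid(0, 99, 100));   /* blocks 0..99: ok */
       assert(!lba_range_valid(1, 99, 100));  /* blocks 1..100: too far */
       assert(!lba_range_valid(100, 0, 100)); /* starts past the end */
       return 0;
   }
   ```

   The point is that no unit conversion is needed in the comparison
   itself: both operands of the sum are already in logical blocks.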


Klaus Jensen (26):
  nvme: rename trace events to nvme_dev
  nvme: remove superfluous breaks
  nvme: move device parameters to separate struct
  nvme: add missing fields in the identify data structures
  nvme: populate the mandatory subnqn and ver fields
  nvme: refactor nvme_addr_read
  nvme: add support for the abort command
  nvme: refactor device realization
  nvme: add temperature threshold feature
  nvme: add support for the get log page command
  nvme: add support for the asynchronous event request command
  nvme: add missing mandatory features
  nvme: additional tracing
  nvme: make sure ncqr and nsqr is valid
  nvme: bump supported specification to 1.3
  nvme: refactor prp mapping
  nvme: allow multiple aios per command
  nvme: use preallocated qsg/iov in nvme_dma_prp
  pci: pass along the return value of dma_memory_rw
  nvme: handle dma errors
  nvme: add support for scatter gather lists
  nvme: support multiple namespaces
  pci: allocate pci id for nvme
  nvme: change controller pci id
  nvme: remove redundant NvmeCmd pointer parameter
  nvme: make lba data size configurable

 MAINTAINERS            |    1 +
 block/nvme.c           |   18 +-
 docs/specs/nvme.txt    |   10 +
 docs/specs/pci-ids.txt |    1 +
 hw/block/Makefile.objs |    2 +-
 hw/block/nvme-ns.c     |  158 ++++
 hw/block/nvme-ns.h     |   62 ++
 hw/block/nvme.c        | 2012 +++++++++++++++++++++++++++++++---------
 hw/block/nvme.h        |  201 +++-
 hw/block/trace-events  |  204 ++--
 hw/core/machine.c      |    1 +
 include/block/nvme.h   |  143 ++-
 include/hw/pci/pci.h   |    4 +-
 13 files changed, 2266 insertions(+), 551 deletions(-)
 create mode 100644 docs/specs/nvme.txt
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

-- 
2.25.0




* [PATCH v5 01/26] nvme: rename trace events to nvme_dev
       [not found]   ` <CGME20200204095216eucas1p2cb2b4772c04b92c97b0690c8e565234c@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:08       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Change the prefix of all nvme device related trace events to 'nvme_dev'
to not clash with trace events from the nvme block driver.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 185 +++++++++++++++++++++---------------------
 hw/block/trace-events | 172 +++++++++++++++++++--------------------
 2 files changed, 178 insertions(+), 179 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index d28335cbf377..dd548d9b6605 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -112,16 +112,16 @@ static void nvme_irq_assert(NvmeCtrl *n, NvmeCQueue *cq)
 {
     if (cq->irq_enabled) {
         if (msix_enabled(&(n->parent_obj))) {
-            trace_nvme_irq_msix(cq->vector);
+            trace_nvme_dev_irq_msix(cq->vector);
             msix_notify(&(n->parent_obj), cq->vector);
         } else {
-            trace_nvme_irq_pin();
+            trace_nvme_dev_irq_pin();
             assert(cq->cqid < 64);
             n->irq_status |= 1 << cq->cqid;
             nvme_irq_check(n);
         }
     } else {
-        trace_nvme_irq_masked();
+        trace_nvme_dev_irq_masked();
     }
 }
 
@@ -146,7 +146,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
     int num_prps = (len >> n->page_bits) + 1;
 
     if (unlikely(!prp1)) {
-        trace_nvme_err_invalid_prp();
+        trace_nvme_dev_err_invalid_prp();
         return NVME_INVALID_FIELD | NVME_DNR;
     } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
                prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
@@ -160,7 +160,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
     len -= trans_len;
     if (len) {
         if (unlikely(!prp2)) {
-            trace_nvme_err_invalid_prp2_missing();
+            trace_nvme_dev_err_invalid_prp2_missing();
             goto unmap;
         }
         if (len > n->page_size) {
@@ -176,7 +176,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
 
                 if (i == n->max_prp_ents - 1 && len > n->page_size) {
                     if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
-                        trace_nvme_err_invalid_prplist_ent(prp_ent);
+                        trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
                         goto unmap;
                     }
 
@@ -189,7 +189,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
                 }
 
                 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
-                    trace_nvme_err_invalid_prplist_ent(prp_ent);
+                    trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
                     goto unmap;
                 }
 
@@ -204,7 +204,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
             }
         } else {
             if (unlikely(prp2 & (n->page_size - 1))) {
-                trace_nvme_err_invalid_prp2_align(prp2);
+                trace_nvme_dev_err_invalid_prp2_align(prp2);
                 goto unmap;
             }
             if (qsg->nsg) {
@@ -252,20 +252,20 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     QEMUIOVector iov;
     uint16_t status = NVME_SUCCESS;
 
-    trace_nvme_dma_read(prp1, prp2);
+    trace_nvme_dev_dma_read(prp1, prp2);
 
     if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
     if (qsg.nsg > 0) {
         if (unlikely(dma_buf_read(ptr, len, &qsg))) {
-            trace_nvme_err_invalid_dma();
+            trace_nvme_dev_err_invalid_dma();
             status = NVME_INVALID_FIELD | NVME_DNR;
         }
         qemu_sglist_destroy(&qsg);
     } else {
         if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
-            trace_nvme_err_invalid_dma();
+            trace_nvme_dev_err_invalid_dma();
             status = NVME_INVALID_FIELD | NVME_DNR;
         }
         qemu_iovec_destroy(&iov);
@@ -354,7 +354,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
     uint32_t count = nlb << data_shift;
 
     if (unlikely(slba + nlb > ns->id_ns.nsze)) {
-        trace_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
+        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
@@ -382,11 +382,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
     int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
     enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
 
-    trace_nvme_rw(is_write ? "write" : "read", nlb, data_size, slba);
+    trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
 
     if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
-        trace_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
+        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
@@ -421,7 +421,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     uint32_t nsid = le32_to_cpu(cmd->nsid);
 
     if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
-        trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
+        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
@@ -435,7 +435,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     case NVME_CMD_READ:
         return nvme_rw(n, ns, cmd, req);
     default:
-        trace_nvme_err_invalid_opc(cmd->opcode);
+        trace_nvme_dev_err_invalid_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
     }
 }
@@ -460,11 +460,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
     uint16_t qid = le16_to_cpu(c->qid);
 
     if (unlikely(!qid || nvme_check_sqid(n, qid))) {
-        trace_nvme_err_invalid_del_sq(qid);
+        trace_nvme_dev_err_invalid_del_sq(qid);
         return NVME_INVALID_QID | NVME_DNR;
     }
 
-    trace_nvme_del_sq(qid);
+    trace_nvme_dev_del_sq(qid);
 
     sq = n->sq[qid];
     while (!QTAILQ_EMPTY(&sq->out_req_list)) {
@@ -528,26 +528,26 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
     uint16_t qflags = le16_to_cpu(c->sq_flags);
     uint64_t prp1 = le64_to_cpu(c->prp1);
 
-    trace_nvme_create_sq(prp1, sqid, cqid, qsize, qflags);
+    trace_nvme_dev_create_sq(prp1, sqid, cqid, qsize, qflags);
 
     if (unlikely(!cqid || nvme_check_cqid(n, cqid))) {
-        trace_nvme_err_invalid_create_sq_cqid(cqid);
+        trace_nvme_dev_err_invalid_create_sq_cqid(cqid);
         return NVME_INVALID_CQID | NVME_DNR;
     }
     if (unlikely(!sqid || !nvme_check_sqid(n, sqid))) {
-        trace_nvme_err_invalid_create_sq_sqid(sqid);
+        trace_nvme_dev_err_invalid_create_sq_sqid(sqid);
         return NVME_INVALID_QID | NVME_DNR;
     }
     if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
-        trace_nvme_err_invalid_create_sq_size(qsize);
+        trace_nvme_dev_err_invalid_create_sq_size(qsize);
         return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
     }
     if (unlikely(!prp1 || prp1 & (n->page_size - 1))) {
-        trace_nvme_err_invalid_create_sq_addr(prp1);
+        trace_nvme_dev_err_invalid_create_sq_addr(prp1);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
     if (unlikely(!(NVME_SQ_FLAGS_PC(qflags)))) {
-        trace_nvme_err_invalid_create_sq_qflags(NVME_SQ_FLAGS_PC(qflags));
+        trace_nvme_dev_err_invalid_create_sq_qflags(NVME_SQ_FLAGS_PC(qflags));
         return NVME_INVALID_FIELD | NVME_DNR;
     }
     sq = g_malloc0(sizeof(*sq));
@@ -573,17 +573,17 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
     uint16_t qid = le16_to_cpu(c->qid);
 
     if (unlikely(!qid || nvme_check_cqid(n, qid))) {
-        trace_nvme_err_invalid_del_cq_cqid(qid);
+        trace_nvme_dev_err_invalid_del_cq_cqid(qid);
         return NVME_INVALID_CQID | NVME_DNR;
     }
 
     cq = n->cq[qid];
     if (unlikely(!QTAILQ_EMPTY(&cq->sq_list))) {
-        trace_nvme_err_invalid_del_cq_notempty(qid);
+        trace_nvme_dev_err_invalid_del_cq_notempty(qid);
         return NVME_INVALID_QUEUE_DEL;
     }
     nvme_irq_deassert(n, cq);
-    trace_nvme_del_cq(qid);
+    trace_nvme_dev_del_cq(qid);
     nvme_free_cq(cq, n);
     return NVME_SUCCESS;
 }
@@ -616,27 +616,27 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
     uint16_t qflags = le16_to_cpu(c->cq_flags);
     uint64_t prp1 = le64_to_cpu(c->prp1);
 
-    trace_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
+    trace_nvme_dev_create_cq(prp1, cqid, vector, qsize, qflags,
                          NVME_CQ_FLAGS_IEN(qflags) != 0);
 
     if (unlikely(!cqid || !nvme_check_cqid(n, cqid))) {
-        trace_nvme_err_invalid_create_cq_cqid(cqid);
+        trace_nvme_dev_err_invalid_create_cq_cqid(cqid);
         return NVME_INVALID_CQID | NVME_DNR;
     }
     if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
-        trace_nvme_err_invalid_create_cq_size(qsize);
+        trace_nvme_dev_err_invalid_create_cq_size(qsize);
         return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
     }
     if (unlikely(!prp1)) {
-        trace_nvme_err_invalid_create_cq_addr(prp1);
+        trace_nvme_dev_err_invalid_create_cq_addr(prp1);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
     if (unlikely(vector > n->num_queues)) {
-        trace_nvme_err_invalid_create_cq_vector(vector);
+        trace_nvme_dev_err_invalid_create_cq_vector(vector);
         return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
     }
     if (unlikely(!(NVME_CQ_FLAGS_PC(qflags)))) {
-        trace_nvme_err_invalid_create_cq_qflags(NVME_CQ_FLAGS_PC(qflags));
+        trace_nvme_dev_err_invalid_create_cq_qflags(NVME_CQ_FLAGS_PC(qflags));
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -651,7 +651,7 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
     uint64_t prp1 = le64_to_cpu(c->prp1);
     uint64_t prp2 = le64_to_cpu(c->prp2);
 
-    trace_nvme_identify_ctrl();
+    trace_nvme_dev_identify_ctrl();
 
     return nvme_dma_read_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
         prp1, prp2);
@@ -664,10 +664,10 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
     uint64_t prp1 = le64_to_cpu(c->prp1);
     uint64_t prp2 = le64_to_cpu(c->prp2);
 
-    trace_nvme_identify_ns(nsid);
+    trace_nvme_dev_identify_ns(nsid);
 
     if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
-        trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
+        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
@@ -687,7 +687,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
     uint16_t ret;
     int i, j = 0;
 
-    trace_nvme_identify_nslist(min_nsid);
+    trace_nvme_dev_identify_nslist(min_nsid);
 
     list = g_malloc0(data_len);
     for (i = 0; i < n->num_namespaces; i++) {
@@ -716,14 +716,14 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
     case 0x02:
         return nvme_identify_nslist(n, c);
     default:
-        trace_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
+        trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 }
 
 static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
 {
-    trace_nvme_setfeat_timestamp(ts);
+    trace_nvme_dev_setfeat_timestamp(ts);
 
     n->host_timestamp = le64_to_cpu(ts);
     n->timestamp_set_qemu_clock_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
@@ -756,7 +756,7 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
     /* If the host timestamp is non-zero, set the timestamp origin */
     ts.origin = n->host_timestamp ? 0x01 : 0x00;
 
-    trace_nvme_getfeat_timestamp(ts.all);
+    trace_nvme_dev_getfeat_timestamp(ts.all);
 
     return cpu_to_le64(ts.all);
 }
@@ -780,17 +780,17 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     switch (dw10) {
     case NVME_VOLATILE_WRITE_CACHE:
         result = blk_enable_write_cache(n->conf.blk);
-        trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
+        trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
         break;
     case NVME_NUMBER_OF_QUEUES:
         result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
-        trace_nvme_getfeat_numq(result);
+        trace_nvme_dev_getfeat_numq(result);
         break;
     case NVME_TIMESTAMP:
         return nvme_get_feature_timestamp(n, cmd);
         break;
     default:
-        trace_nvme_err_invalid_getfeat(dw10);
+        trace_nvme_dev_err_invalid_getfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
@@ -826,9 +826,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
         break;
     case NVME_NUMBER_OF_QUEUES:
-        trace_nvme_setfeat_numq((dw11 & 0xFFFF) + 1,
-                                ((dw11 >> 16) & 0xFFFF) + 1,
-                                n->num_queues - 1, n->num_queues - 1);
+        trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
+            ((dw11 >> 16) & 0xFFFF) + 1, n->num_queues - 1, n->num_queues - 1);
         req->cqe.result =
             cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
         break;
@@ -838,7 +837,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
 
     default:
-        trace_nvme_err_invalid_setfeat(dw10);
+        trace_nvme_dev_err_invalid_setfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
     return NVME_SUCCESS;
@@ -862,7 +861,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     case NVME_ADM_CMD_GET_FEATURES:
         return nvme_get_feature(n, cmd, req);
     default:
-        trace_nvme_err_invalid_admin_opc(cmd->opcode);
+        trace_nvme_dev_err_invalid_admin_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
     }
 }
@@ -925,77 +924,77 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     uint32_t page_size = 1 << page_bits;
 
     if (unlikely(n->cq[0])) {
-        trace_nvme_err_startfail_cq();
+        trace_nvme_dev_err_startfail_cq();
         return -1;
     }
     if (unlikely(n->sq[0])) {
-        trace_nvme_err_startfail_sq();
+        trace_nvme_dev_err_startfail_sq();
         return -1;
     }
     if (unlikely(!n->bar.asq)) {
-        trace_nvme_err_startfail_nbarasq();
+        trace_nvme_dev_err_startfail_nbarasq();
         return -1;
     }
     if (unlikely(!n->bar.acq)) {
-        trace_nvme_err_startfail_nbaracq();
+        trace_nvme_dev_err_startfail_nbaracq();
         return -1;
     }
     if (unlikely(n->bar.asq & (page_size - 1))) {
-        trace_nvme_err_startfail_asq_misaligned(n->bar.asq);
+        trace_nvme_dev_err_startfail_asq_misaligned(n->bar.asq);
         return -1;
     }
     if (unlikely(n->bar.acq & (page_size - 1))) {
-        trace_nvme_err_startfail_acq_misaligned(n->bar.acq);
+        trace_nvme_dev_err_startfail_acq_misaligned(n->bar.acq);
         return -1;
     }
     if (unlikely(NVME_CC_MPS(n->bar.cc) <
                  NVME_CAP_MPSMIN(n->bar.cap))) {
-        trace_nvme_err_startfail_page_too_small(
+        trace_nvme_dev_err_startfail_page_too_small(
                     NVME_CC_MPS(n->bar.cc),
                     NVME_CAP_MPSMIN(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_MPS(n->bar.cc) >
                  NVME_CAP_MPSMAX(n->bar.cap))) {
-        trace_nvme_err_startfail_page_too_large(
+        trace_nvme_dev_err_startfail_page_too_large(
                     NVME_CC_MPS(n->bar.cc),
                     NVME_CAP_MPSMAX(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOCQES(n->bar.cc) <
                  NVME_CTRL_CQES_MIN(n->id_ctrl.cqes))) {
-        trace_nvme_err_startfail_cqent_too_small(
+        trace_nvme_dev_err_startfail_cqent_too_small(
                     NVME_CC_IOCQES(n->bar.cc),
                     NVME_CTRL_CQES_MIN(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOCQES(n->bar.cc) >
                  NVME_CTRL_CQES_MAX(n->id_ctrl.cqes))) {
-        trace_nvme_err_startfail_cqent_too_large(
+        trace_nvme_dev_err_startfail_cqent_too_large(
                     NVME_CC_IOCQES(n->bar.cc),
                     NVME_CTRL_CQES_MAX(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOSQES(n->bar.cc) <
                  NVME_CTRL_SQES_MIN(n->id_ctrl.sqes))) {
-        trace_nvme_err_startfail_sqent_too_small(
+        trace_nvme_dev_err_startfail_sqent_too_small(
                     NVME_CC_IOSQES(n->bar.cc),
                     NVME_CTRL_SQES_MIN(n->bar.cap));
         return -1;
     }
     if (unlikely(NVME_CC_IOSQES(n->bar.cc) >
                  NVME_CTRL_SQES_MAX(n->id_ctrl.sqes))) {
-        trace_nvme_err_startfail_sqent_too_large(
+        trace_nvme_dev_err_startfail_sqent_too_large(
                     NVME_CC_IOSQES(n->bar.cc),
                     NVME_CTRL_SQES_MAX(n->bar.cap));
         return -1;
     }
     if (unlikely(!NVME_AQA_ASQS(n->bar.aqa))) {
-        trace_nvme_err_startfail_asqent_sz_zero();
+        trace_nvme_dev_err_startfail_asqent_sz_zero();
         return -1;
     }
     if (unlikely(!NVME_AQA_ACQS(n->bar.aqa))) {
-        trace_nvme_err_startfail_acqent_sz_zero();
+        trace_nvme_dev_err_startfail_acqent_sz_zero();
         return -1;
     }
 
@@ -1018,14 +1017,14 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
     unsigned size)
 {
     if (unlikely(offset & (sizeof(uint32_t) - 1))) {
-        NVME_GUEST_ERR(nvme_ub_mmiowr_misaligned32,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_misaligned32,
                        "MMIO write not 32-bit aligned,"
                        " offset=0x%"PRIx64"", offset);
         /* should be ignored, fall through for now */
     }
 
     if (unlikely(size < sizeof(uint32_t))) {
-        NVME_GUEST_ERR(nvme_ub_mmiowr_toosmall,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_toosmall,
                        "MMIO write smaller than 32-bits,"
                        " offset=0x%"PRIx64", size=%u",
                        offset, size);
@@ -1035,32 +1034,32 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
     switch (offset) {
     case 0xc:   /* INTMS */
         if (unlikely(msix_enabled(&(n->parent_obj)))) {
-            NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
+            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
                            "undefined access to interrupt mask set"
                            " when MSI-X is enabled");
             /* should be ignored, fall through for now */
         }
         n->bar.intms |= data & 0xffffffff;
         n->bar.intmc = n->bar.intms;
-        trace_nvme_mmio_intm_set(data & 0xffffffff,
+        trace_nvme_dev_mmio_intm_set(data & 0xffffffff,
                                  n->bar.intmc);
         nvme_irq_check(n);
         break;
     case 0x10:  /* INTMC */
         if (unlikely(msix_enabled(&(n->parent_obj)))) {
-            NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
+            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
                            "undefined access to interrupt mask clr"
                            " when MSI-X is enabled");
             /* should be ignored, fall through for now */
         }
         n->bar.intms &= ~(data & 0xffffffff);
         n->bar.intmc = n->bar.intms;
-        trace_nvme_mmio_intm_clr(data & 0xffffffff,
+        trace_nvme_dev_mmio_intm_clr(data & 0xffffffff,
                                  n->bar.intmc);
         nvme_irq_check(n);
         break;
     case 0x14:  /* CC */
-        trace_nvme_mmio_cfg(data & 0xffffffff);
+        trace_nvme_dev_mmio_cfg(data & 0xffffffff);
         /* Windows first sends data, then sends enable bit */
         if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
             !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
@@ -1071,42 +1070,42 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
         if (NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc)) {
             n->bar.cc = data;
             if (unlikely(nvme_start_ctrl(n))) {
-                trace_nvme_err_startfail();
+                trace_nvme_dev_err_startfail();
                 n->bar.csts = NVME_CSTS_FAILED;
             } else {
-                trace_nvme_mmio_start_success();
+                trace_nvme_dev_mmio_start_success();
                 n->bar.csts = NVME_CSTS_READY;
             }
         } else if (!NVME_CC_EN(data) && NVME_CC_EN(n->bar.cc)) {
-            trace_nvme_mmio_stopped();
+            trace_nvme_dev_mmio_stopped();
             nvme_clear_ctrl(n);
             n->bar.csts &= ~NVME_CSTS_READY;
         }
         if (NVME_CC_SHN(data) && !(NVME_CC_SHN(n->bar.cc))) {
-            trace_nvme_mmio_shutdown_set();
+            trace_nvme_dev_mmio_shutdown_set();
             nvme_clear_ctrl(n);
             n->bar.cc = data;
             n->bar.csts |= NVME_CSTS_SHST_COMPLETE;
         } else if (!NVME_CC_SHN(data) && NVME_CC_SHN(n->bar.cc)) {
-            trace_nvme_mmio_shutdown_cleared();
+            trace_nvme_dev_mmio_shutdown_cleared();
             n->bar.csts &= ~NVME_CSTS_SHST_COMPLETE;
             n->bar.cc = data;
         }
         break;
     case 0x1C:  /* CSTS */
         if (data & (1 << 4)) {
-            NVME_GUEST_ERR(nvme_ub_mmiowr_ssreset_w1c_unsupported,
+            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_ssreset_w1c_unsupported,
                            "attempted to W1C CSTS.NSSRO"
                            " but CAP.NSSRS is zero (not supported)");
         } else if (data != 0) {
-            NVME_GUEST_ERR(nvme_ub_mmiowr_ro_csts,
+            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_ro_csts,
                            "attempted to set a read only bit"
                            " of controller status");
         }
         break;
     case 0x20:  /* NSSR */
         if (data == 0x4E564D65) {
-            trace_nvme_ub_mmiowr_ssreset_unsupported();
+            trace_nvme_dev_ub_mmiowr_ssreset_unsupported();
         } else {
             /* The spec says that writes of other values have no effect */
             return;
@@ -1114,35 +1113,35 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
         break;
     case 0x24:  /* AQA */
         n->bar.aqa = data & 0xffffffff;
-        trace_nvme_mmio_aqattr(data & 0xffffffff);
+        trace_nvme_dev_mmio_aqattr(data & 0xffffffff);
         break;
     case 0x28:  /* ASQ */
         n->bar.asq = data;
-        trace_nvme_mmio_asqaddr(data);
+        trace_nvme_dev_mmio_asqaddr(data);
         break;
     case 0x2c:  /* ASQ hi */
         n->bar.asq |= data << 32;
-        trace_nvme_mmio_asqaddr_hi(data, n->bar.asq);
+        trace_nvme_dev_mmio_asqaddr_hi(data, n->bar.asq);
         break;
     case 0x30:  /* ACQ */
-        trace_nvme_mmio_acqaddr(data);
+        trace_nvme_dev_mmio_acqaddr(data);
         n->bar.acq = data;
         break;
     case 0x34:  /* ACQ hi */
         n->bar.acq |= data << 32;
-        trace_nvme_mmio_acqaddr_hi(data, n->bar.acq);
+        trace_nvme_dev_mmio_acqaddr_hi(data, n->bar.acq);
         break;
     case 0x38:  /* CMBLOC */
-        NVME_GUEST_ERR(nvme_ub_mmiowr_cmbloc_reserved,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_cmbloc_reserved,
                        "invalid write to reserved CMBLOC"
                        " when CMBSZ is zero, ignored");
         return;
     case 0x3C:  /* CMBSZ */
-        NVME_GUEST_ERR(nvme_ub_mmiowr_cmbsz_readonly,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_cmbsz_readonly,
                        "invalid write to read only CMBSZ, ignored");
         return;
     default:
-        NVME_GUEST_ERR(nvme_ub_mmiowr_invalid,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_invalid,
                        "invalid MMIO write,"
                        " offset=0x%"PRIx64", data=%"PRIx64"",
                        offset, data);
@@ -1157,12 +1156,12 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
     uint64_t val = 0;
 
     if (unlikely(addr & (sizeof(uint32_t) - 1))) {
-        NVME_GUEST_ERR(nvme_ub_mmiord_misaligned32,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiord_misaligned32,
                        "MMIO read not 32-bit aligned,"
                        " offset=0x%"PRIx64"", addr);
         /* should RAZ, fall through for now */
     } else if (unlikely(size < sizeof(uint32_t))) {
-        NVME_GUEST_ERR(nvme_ub_mmiord_toosmall,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiord_toosmall,
                        "MMIO read smaller than 32-bits,"
                        " offset=0x%"PRIx64"", addr);
         /* should RAZ, fall through for now */
@@ -1171,7 +1170,7 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
     if (addr < sizeof(n->bar)) {
         memcpy(&val, ptr + addr, size);
     } else {
-        NVME_GUEST_ERR(nvme_ub_mmiord_invalid_ofs,
+        NVME_GUEST_ERR(nvme_dev_ub_mmiord_invalid_ofs,
                        "MMIO read beyond last register,"
                        " offset=0x%"PRIx64", returning 0", addr);
     }
@@ -1184,7 +1183,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
     uint32_t qid;
 
     if (unlikely(addr & ((1 << 2) - 1))) {
-        NVME_GUEST_ERR(nvme_ub_db_wr_misaligned,
+        NVME_GUEST_ERR(nvme_dev_ub_db_wr_misaligned,
                        "doorbell write not 32-bit aligned,"
                        " offset=0x%"PRIx64", ignoring", addr);
         return;
@@ -1199,7 +1198,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         qid = (addr - (0x1000 + (1 << 2))) >> 3;
         if (unlikely(nvme_check_cqid(n, qid))) {
-            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_cq,
+            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_cq,
                            "completion queue doorbell write"
                            " for nonexistent queue,"
                            " sqid=%"PRIu32", ignoring", qid);
@@ -1208,7 +1207,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         cq = n->cq[qid];
         if (unlikely(new_head >= cq->size)) {
-            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_cqhead,
+            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_cqhead,
                            "completion queue doorbell write value"
                            " beyond queue size, sqid=%"PRIu32","
                            " new_head=%"PRIu16", ignoring",
@@ -1237,7 +1236,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         qid = (addr - 0x1000) >> 3;
         if (unlikely(nvme_check_sqid(n, qid))) {
-            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_sq,
+            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_sq,
                            "submission queue doorbell write"
                            " for nonexistent queue,"
                            " sqid=%"PRIu32", ignoring", qid);
@@ -1246,7 +1245,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         sq = n->sq[qid];
         if (unlikely(new_tail >= sq->size)) {
-            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_sqtail,
+            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_sqtail,
                            "submission queue doorbell write value"
                            " beyond queue size, sqid=%"PRIu32","
                            " new_tail=%"PRIu16", ignoring",
diff --git a/hw/block/trace-events b/hw/block/trace-events
index c03e80c2c9c9..ade506ea2bb2 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -29,96 +29,96 @@ hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int t
 
 # nvme.c
 # nvme traces for successful events
-nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
-nvme_irq_pin(void) "pulsing IRQ pin"
-nvme_irq_masked(void) "IRQ is masked"
-nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
-nvme_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
-nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
-nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
-nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
-nvme_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
-nvme_identify_ctrl(void) "identify controller"
-nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
-nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
-nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
-nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
-nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
-nvme_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
-nvme_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
-nvme_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
-nvme_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
-nvme_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
-nvme_mmio_aqattr(uint64_t data) "wrote MMIO, admin queue attributes=0x%"PRIx64""
-nvme_mmio_asqaddr(uint64_t data) "wrote MMIO, admin submission queue address=0x%"PRIx64""
-nvme_mmio_acqaddr(uint64_t data) "wrote MMIO, admin completion queue address=0x%"PRIx64""
-nvme_mmio_asqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin submission queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
-nvme_mmio_acqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin completion queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
-nvme_mmio_start_success(void) "setting controller enable bit succeeded"
-nvme_mmio_stopped(void) "cleared controller enable bit"
-nvme_mmio_shutdown_set(void) "shutdown bit set"
-nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
+nvme_dev_irq_pin(void) "pulsing IRQ pin"
+nvme_dev_irq_masked(void) "IRQ is masked"
+nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
+nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
+nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
+nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
+nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
+nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
+nvme_dev_identify_ctrl(void) "identify controller"
+nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
+nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
+nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
+nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
+nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
+nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
+nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
+nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
+nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
+nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
+nvme_dev_mmio_aqattr(uint64_t data) "wrote MMIO, admin queue attributes=0x%"PRIx64""
+nvme_dev_mmio_asqaddr(uint64_t data) "wrote MMIO, admin submission queue address=0x%"PRIx64""
+nvme_dev_mmio_acqaddr(uint64_t data) "wrote MMIO, admin completion queue address=0x%"PRIx64""
+nvme_dev_mmio_asqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin submission queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
+nvme_dev_mmio_acqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin completion queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
+nvme_dev_mmio_start_success(void) "setting controller enable bit succeeded"
+nvme_dev_mmio_stopped(void) "cleared controller enable bit"
+nvme_dev_mmio_shutdown_set(void) "shutdown bit set"
+nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
 
 # nvme traces for error conditions
-nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
-nvme_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
-nvme_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
-nvme_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
-nvme_err_invalid_prp(void) "invalid PRP"
-nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
-nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
-nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
-nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
-nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
-nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
-nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
-nvme_err_invalid_create_sq_size(uint16_t qsize) "failed creating submission queue, invalid qsize=%"PRIu16""
-nvme_err_invalid_create_sq_addr(uint64_t addr) "failed creating submission queue, addr=0x%"PRIx64""
-nvme_err_invalid_create_sq_qflags(uint16_t qflags) "failed creating submission queue, qflags=%"PRIu16""
-nvme_err_invalid_del_cq_cqid(uint16_t cqid) "failed deleting completion queue, cqid=%"PRIu16""
-nvme_err_invalid_del_cq_notempty(uint16_t cqid) "failed deleting completion queue, it is not empty, cqid=%"PRIu16""
-nvme_err_invalid_create_cq_cqid(uint16_t cqid) "failed creating completion queue, cqid=%"PRIu16""
-nvme_err_invalid_create_cq_size(uint16_t size) "failed creating completion queue, size=%"PRIu16""
-nvme_err_invalid_create_cq_addr(uint64_t addr) "failed creating completion queue, addr=0x%"PRIx64""
-nvme_err_invalid_create_cq_vector(uint16_t vector) "failed creating completion queue, vector=%"PRIu16""
-nvme_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completion queue, qflags=%"PRIu16""
-nvme_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
-nvme_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
-nvme_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
-nvme_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
-nvme_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
-nvme_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
-nvme_err_startfail_nbaracq(void) "nvme_start_ctrl failed because the admin completion queue address is null"
-nvme_err_startfail_asq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin submission queue address is misaligned: 0x%"PRIx64""
-nvme_err_startfail_acq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin completion queue address is misaligned: 0x%"PRIx64""
-nvme_err_startfail_page_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too small: log2size=%u, min=%u"
-nvme_err_startfail_page_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too large: log2size=%u, max=%u"
-nvme_err_startfail_cqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too small: log2size=%u, min=%u"
-nvme_err_startfail_cqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too large: log2size=%u, max=%u"
-nvme_err_startfail_sqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too small: log2size=%u, min=%u"
-nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too large: log2size=%u, max=%u"
-nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
-nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
-nvme_err_startfail(void) "setting controller enable bit failed"
+nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
+nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
+nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
+nvme_dev_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
+nvme_dev_err_invalid_prp(void) "invalid PRP"
+nvme_dev_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
+nvme_dev_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
+nvme_dev_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
+nvme_dev_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
+nvme_dev_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
+nvme_dev_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
+nvme_dev_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
+nvme_dev_err_invalid_create_sq_size(uint16_t qsize) "failed creating submission queue, invalid qsize=%"PRIu16""
+nvme_dev_err_invalid_create_sq_addr(uint64_t addr) "failed creating submission queue, addr=0x%"PRIx64""
+nvme_dev_err_invalid_create_sq_qflags(uint16_t qflags) "failed creating submission queue, qflags=%"PRIu16""
+nvme_dev_err_invalid_del_cq_cqid(uint16_t cqid) "failed deleting completion queue, cqid=%"PRIu16""
+nvme_dev_err_invalid_del_cq_notempty(uint16_t cqid) "failed deleting completion queue, it is not empty, cqid=%"PRIu16""
+nvme_dev_err_invalid_create_cq_cqid(uint16_t cqid) "failed creating completion queue, cqid=%"PRIu16""
+nvme_dev_err_invalid_create_cq_size(uint16_t size) "failed creating completion queue, size=%"PRIu16""
+nvme_dev_err_invalid_create_cq_addr(uint64_t addr) "failed creating completion queue, addr=0x%"PRIx64""
+nvme_dev_err_invalid_create_cq_vector(uint16_t vector) "failed creating completion queue, vector=%"PRIu16""
+nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completion queue, qflags=%"PRIu16""
+nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
+nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
+nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
+nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
+nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
+nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
+nvme_dev_err_startfail_nbaracq(void) "nvme_start_ctrl failed because the admin completion queue address is null"
+nvme_dev_err_startfail_asq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin submission queue address is misaligned: 0x%"PRIx64""
+nvme_dev_err_startfail_acq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin completion queue address is misaligned: 0x%"PRIx64""
+nvme_dev_err_startfail_page_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too small: log2size=%u, min=%u"
+nvme_dev_err_startfail_page_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too large: log2size=%u, max=%u"
+nvme_dev_err_startfail_cqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too small: log2size=%u, min=%u"
+nvme_dev_err_startfail_cqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too large: log2size=%u, max=%u"
+nvme_dev_err_startfail_sqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too small: log2size=%u, min=%u"
+nvme_dev_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too large: log2size=%u, max=%u"
+nvme_dev_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
+nvme_dev_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
+nvme_dev_err_startfail(void) "setting controller enable bit failed"
 
 # Traces for undefined behavior
-nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
-nvme_ub_mmiowr_toosmall(uint64_t offset, unsigned size) "MMIO write smaller than 32 bits, offset=0x%"PRIx64", size=%u"
-nvme_ub_mmiowr_intmask_with_msix(void) "undefined access to interrupt mask set when MSI-X is enabled"
-nvme_ub_mmiowr_ro_csts(void) "attempted to set a read only bit of controller status"
-nvme_ub_mmiowr_ssreset_w1c_unsupported(void) "attempted to W1C CSTS.NSSRO but CAP.NSSRS is zero (not supported)"
-nvme_ub_mmiowr_ssreset_unsupported(void) "attempted NVM subsystem reset but CAP.NSSRS is zero (not supported)"
-nvme_ub_mmiowr_cmbloc_reserved(void) "invalid write to reserved CMBLOC when CMBSZ is zero, ignored"
-nvme_ub_mmiowr_cmbsz_readonly(void) "invalid write to read only CMBSZ, ignored"
-nvme_ub_mmiowr_invalid(uint64_t offset, uint64_t data) "invalid MMIO write, offset=0x%"PRIx64", data=0x%"PRIx64""
-nvme_ub_mmiord_misaligned32(uint64_t offset) "MMIO read not 32-bit aligned, offset=0x%"PRIx64""
-nvme_ub_mmiord_toosmall(uint64_t offset) "MMIO read smaller than 32-bits, offset=0x%"PRIx64""
-nvme_ub_mmiord_invalid_ofs(uint64_t offset) "MMIO read beyond last register, offset=0x%"PRIx64", returning 0"
-nvme_ub_db_wr_misaligned(uint64_t offset) "doorbell write not 32-bit aligned, offset=0x%"PRIx64", ignoring"
-nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for nonexistent queue, cqid=%"PRIu32", ignoring"
-nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
-nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
-nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
+nvme_dev_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
+nvme_dev_ub_mmiowr_toosmall(uint64_t offset, unsigned size) "MMIO write smaller than 32 bits, offset=0x%"PRIx64", size=%u"
+nvme_dev_ub_mmiowr_intmask_with_msix(void) "undefined access to interrupt mask set when MSI-X is enabled"
+nvme_dev_ub_mmiowr_ro_csts(void) "attempted to set a read only bit of controller status"
+nvme_dev_ub_mmiowr_ssreset_w1c_unsupported(void) "attempted to W1C CSTS.NSSRO but CAP.NSSRS is zero (not supported)"
+nvme_dev_ub_mmiowr_ssreset_unsupported(void) "attempted NVM subsystem reset but CAP.NSSRS is zero (not supported)"
+nvme_dev_ub_mmiowr_cmbloc_reserved(void) "invalid write to reserved CMBLOC when CMBSZ is zero, ignored"
+nvme_dev_ub_mmiowr_cmbsz_readonly(void) "invalid write to read only CMBSZ, ignored"
+nvme_dev_ub_mmiowr_invalid(uint64_t offset, uint64_t data) "invalid MMIO write, offset=0x%"PRIx64", data=0x%"PRIx64""
+nvme_dev_ub_mmiord_misaligned32(uint64_t offset) "MMIO read not 32-bit aligned, offset=0x%"PRIx64""
+nvme_dev_ub_mmiord_toosmall(uint64_t offset) "MMIO read smaller than 32-bits, offset=0x%"PRIx64""
+nvme_dev_ub_mmiord_invalid_ofs(uint64_t offset) "MMIO read beyond last register, offset=0x%"PRIx64", returning 0"
+nvme_dev_ub_db_wr_misaligned(uint64_t offset) "doorbell write not 32-bit aligned, offset=0x%"PRIx64", ignoring"
+nvme_dev_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for nonexistent queue, cqid=%"PRIu32", ignoring"
+nvme_dev_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
+nvme_dev_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
+nvme_dev_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
 
 # xen-block.c
 xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 02/26] nvme: remove superfluous breaks
       [not found]   ` <CGME20200204095216eucas1p137a2adf666e82d490aefca96a269acd9@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:09       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

These break statements were left over when commit 3036a626e9ef ("nvme:
add Get/Set Feature Timestamp support") was merged.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index dd548d9b6605..c9ad6aaa5f95 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -788,7 +788,6 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_get_feature_timestamp(n, cmd);
-        break;
     default:
         trace_nvme_dev_err_invalid_getfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -831,11 +830,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         req->cqe.result =
             cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
         break;
-
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, cmd);
-        break;
-
     default:
         trace_nvme_dev_err_invalid_setfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 03/26] nvme: move device parameters to separate struct
       [not found]   ` <CGME20200204095217eucas1p1f3e1d113d5eaad4327de0158d1e480cb@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:12       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Move the device configuration parameters to a separate struct to make it
explicit what is configurable and what is set internally.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c | 44 ++++++++++++++++++++++----------------------
 hw/block/nvme.h | 16 +++++++++++++---
 2 files changed, 35 insertions(+), 25 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index c9ad6aaa5f95..f05ebcce3f53 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -64,12 +64,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
-    return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
+    return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
 }
 
 static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
 {
-    return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
+    return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
 }
 
 static void nvme_inc_cq_tail(NvmeCQueue *cq)
@@ -631,7 +631,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
         trace_nvme_dev_err_invalid_create_cq_addr(prp1);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
-    if (unlikely(vector > n->num_queues)) {
+    if (unlikely(vector > n->params.num_queues)) {
         trace_nvme_dev_err_invalid_create_cq_vector(vector);
         return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
     }
@@ -783,7 +783,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
         break;
     case NVME_NUMBER_OF_QUEUES:
-        result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+        result = cpu_to_le32((n->params.num_queues - 2) |
+            ((n->params.num_queues - 2) << 16));
         trace_nvme_dev_getfeat_numq(result);
         break;
     case NVME_TIMESTAMP:
@@ -826,9 +827,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_NUMBER_OF_QUEUES:
         trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
-            ((dw11 >> 16) & 0xFFFF) + 1, n->num_queues - 1, n->num_queues - 1);
-        req->cqe.result =
-            cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
+            ((dw11 >> 16) & 0xFFFF) + 1, n->params.num_queues - 1,
+            n->params.num_queues - 1);
+        req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
+            ((n->params.num_queues - 2) << 16));
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, cmd);
@@ -899,12 +901,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
     blk_drain(n->conf.blk);
 
-    for (i = 0; i < n->num_queues; i++) {
+    for (i = 0; i < n->params.num_queues; i++) {
         if (n->sq[i] != NULL) {
             nvme_free_sq(n->sq[i], n);
         }
     }
-    for (i = 0; i < n->num_queues; i++) {
+    for (i = 0; i < n->params.num_queues; i++) {
         if (n->cq[i] != NULL) {
             nvme_free_cq(n->cq[i], n);
         }
@@ -1307,7 +1309,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     int64_t bs_size;
     uint8_t *pci_conf;
 
-    if (!n->num_queues) {
+    if (!n->params.num_queues) {
         error_setg(errp, "num_queues can't be zero");
         return;
     }
@@ -1323,7 +1325,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
         return;
     }
 
-    if (!n->serial) {
+    if (!n->params.serial) {
         error_setg(errp, "serial property not set");
         return;
     }
@@ -1340,25 +1342,25 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     pcie_endpoint_cap_init(pci_dev, 0x80);
 
     n->num_namespaces = 1;
-    n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
+    n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
     n->ns_size = bs_size / (uint64_t)n->num_namespaces;
 
     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
-    n->sq = g_new0(NvmeSQueue *, n->num_queues);
-    n->cq = g_new0(NvmeCQueue *, n->num_queues);
+    n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
+    n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
 
     memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
                           "nvme", n->reg_size);
     pci_register_bar(pci_dev, 0,
         PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
         &n->iomem);
-    msix_init_exclusive_bar(pci_dev, n->num_queues, 4, NULL);
+    msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
     strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
     strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
-    strpadcpy((char *)id->sn, sizeof(id->sn), n->serial, ' ');
+    strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
     id->rab = 6;
     id->ieee[0] = 0x00;
     id->ieee[1] = 0x02;
@@ -1387,7 +1389,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     n->bar.vs = 0x00010200;
     n->bar.intmc = n->bar.intms = 0;
 
-    if (n->cmb_size_mb) {
+    if (n->params.cmb_size_mb) {
 
         NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
         NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
@@ -1398,7 +1400,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
         NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
         NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
         NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
-        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->cmb_size_mb);
+        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
 
         n->cmbloc = n->bar.cmbloc;
         n->cmbsz = n->bar.cmbsz;
@@ -1437,7 +1439,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     g_free(n->cq);
     g_free(n->sq);
 
-    if (n->cmb_size_mb) {
+    if (n->params.cmb_size_mb) {
         g_free(n->cmbuf);
     }
     msix_uninit_exclusive_bar(pci_dev);
@@ -1445,9 +1447,7 @@ static void nvme_exit(PCIDevice *pci_dev)
 
 static Property nvme_props[] = {
     DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
-    DEFINE_PROP_STRING("serial", NvmeCtrl, serial),
-    DEFINE_PROP_UINT32("cmb_size_mb", NvmeCtrl, cmb_size_mb, 0),
-    DEFINE_PROP_UINT32("num_queues", NvmeCtrl, num_queues, 64),
+    DEFINE_NVME_PROPERTIES(NvmeCtrl, params),
     DEFINE_PROP_END_OF_LIST(),
 };
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 557194ee1954..9957c4a200e2 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -1,7 +1,19 @@
 #ifndef HW_NVME_H
 #define HW_NVME_H
+
 #include "block/nvme.h"
 
+#define DEFINE_NVME_PROPERTIES(_state, _props) \
+    DEFINE_PROP_STRING("serial", _state, _props.serial), \
+    DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
+    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)
+
+typedef struct NvmeParams {
+    char     *serial;
+    uint32_t num_queues;
+    uint32_t cmb_size_mb;
+} NvmeParams;
+
 typedef struct NvmeAsyncEvent {
     QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
     NvmeAerResult result;
@@ -63,6 +75,7 @@ typedef struct NvmeCtrl {
     MemoryRegion ctrl_mem;
     NvmeBar      bar;
     BlockConf    conf;
+    NvmeParams   params;
 
     uint32_t    page_size;
     uint16_t    page_bits;
@@ -71,10 +84,8 @@ typedef struct NvmeCtrl {
     uint16_t    sqe_size;
     uint32_t    reg_size;
     uint32_t    num_namespaces;
-    uint32_t    num_queues;
     uint32_t    max_q_ents;
     uint64_t    ns_size;
-    uint32_t    cmb_size_mb;
     uint32_t    cmbsz;
     uint32_t    cmbloc;
     uint8_t     *cmbuf;
@@ -82,7 +93,6 @@ typedef struct NvmeCtrl {
     uint64_t    host_timestamp;                 /* Timestamp sent by the host */
     uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
 
-    char            *serial;
     NvmeNamespace   *namespaces;
     NvmeSQueue      **sq;
     NvmeCQueue      **cq;
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 04/26] nvme: add missing fields in the identify data structures
       [not found]   ` <CGME20200204095218eucas1p25d4623d82b1b7db3e555f3b27ca19763@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:15       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

These fields are not used by the device model but are added for
completeness. See NVM Express 1.2.1, Section 5.11 ("Identify command"),
Figure 90 and Figure 93.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 include/block/nvme.h | 48 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/include/block/nvme.h b/include/block/nvme.h
index 8fb941c6537c..d2f65e8fe496 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -543,7 +543,13 @@ typedef struct NvmeIdCtrl {
     uint8_t     ieee[3];
     uint8_t     cmic;
     uint8_t     mdts;
-    uint8_t     rsvd255[178];
+    uint16_t    cntlid;
+    uint32_t    ver;
+    uint32_t    rtd3r;
+    uint32_t    rtd3e;
+    uint32_t    oaes;
+    uint32_t    ctratt;
+    uint8_t     rsvd100[156];
     uint16_t    oacs;
     uint8_t     acl;
     uint8_t     aerl;
@@ -551,10 +557,22 @@ typedef struct NvmeIdCtrl {
     uint8_t     lpa;
     uint8_t     elpe;
     uint8_t     npss;
-    uint8_t     rsvd511[248];
+    uint8_t     avscc;
+    uint8_t     apsta;
+    uint16_t    wctemp;
+    uint16_t    cctemp;
+    uint16_t    mtfa;
+    uint32_t    hmpre;
+    uint32_t    hmmin;
+    uint8_t     tnvmcap[16];
+    uint8_t     unvmcap[16];
+    uint32_t    rpmbs;
+    uint8_t     rsvd316[4];
+    uint16_t    kas;
+    uint8_t     rsvd322[190];
     uint8_t     sqes;
     uint8_t     cqes;
-    uint16_t    rsvd515;
+    uint16_t    maxcmd;
     uint32_t    nn;
     uint16_t    oncs;
     uint16_t    fuses;
@@ -562,8 +580,14 @@ typedef struct NvmeIdCtrl {
     uint8_t     vwc;
     uint16_t    awun;
     uint16_t    awupf;
-    uint8_t     rsvd703[174];
-    uint8_t     rsvd2047[1344];
+    uint8_t     nvscc;
+    uint8_t     rsvd531;
+    uint16_t    acwu;
+    uint8_t     rsvd534[2];
+    uint32_t    sgls;
+    uint8_t     rsvd540[228];
+    uint8_t     subnqn[256];
+    uint8_t     rsvd1024[1024];
     NvmePSD     psd[32];
     uint8_t     vs[1024];
 } NvmeIdCtrl;
@@ -653,13 +677,21 @@ typedef struct NvmeIdNs {
     uint8_t     mc;
     uint8_t     dpc;
     uint8_t     dps;
-
     uint8_t     nmic;
     uint8_t     rescap;
     uint8_t     fpi;
     uint8_t     dlfeat;
-
-    uint8_t     res34[94];
+    uint8_t     rsvd33;
+    uint16_t    nawun;
+    uint16_t    nawupf;
+    uint16_t    nabsn;
+    uint16_t    nabo;
+    uint16_t    nabspf;
+    uint8_t     rsvd46[2];
+    uint8_t     nvmcap[16];
+    uint8_t     rsvd64[40];
+    uint8_t     nguid[16];
+    uint64_t    eui64;
     NvmeLBAF    lbaf[16];
     uint8_t     res192[192];
     uint8_t     vs[3712];
-- 
2.25.0
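The reserved fields added in this patch are named after their starting byte offset (rsvd100, rsvd316, ...), so any sizing mistake shifts every later field. One way to keep such a layout honest is a compile-time offset check; below is a minimal, standalone sketch against a cut-down head of the structure (not the full 4096-byte NvmeIdCtrl, and the struct name is invented for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down sketch of the start of NvmeIdCtrl as laid out in the patch;
 * the real structure continues on to a total of 4096 bytes. */
typedef struct IdCtrlHead {
    uint16_t vid;          /* offset 0 */
    uint16_t ssvid;        /* offset 2 */
    uint8_t  sn[20];       /* offset 4 */
    uint8_t  mn[40];       /* offset 24 */
    uint8_t  fr[8];        /* offset 64 */
    uint8_t  rab;          /* offset 72 */
    uint8_t  ieee[3];      /* offset 73 */
    uint8_t  cmic;         /* offset 76 */
    uint8_t  mdts;         /* offset 77 */
    uint16_t cntlid;       /* offset 78 */
    uint32_t ver;          /* offset 80 */
    uint32_t rtd3r;        /* offset 84 */
    uint32_t rtd3e;        /* offset 88 */
    uint32_t oaes;         /* offset 92 */
    uint32_t ctratt;       /* offset 96 */
    uint8_t  rsvd100[156]; /* reserved bytes 100..255 */
    uint16_t oacs;         /* offset 256 */
} IdCtrlHead;

/* The reserved-field names encode their starting offsets */
_Static_assert(offsetof(IdCtrlHead, rsvd100) == 100, "rsvd100 offset");
_Static_assert(offsetof(IdCtrlHead, oacs) == 256, "oacs offset");
```

The same trick applied to the full structure would catch, for example, a reserved array that is one byte short.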



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 05/26] nvme: populate the mandatory subnqn and ver fields
       [not found]   ` <CGME20200204095218eucas1p2400645e2400b3d4450386a46e71b9e9a@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:18       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Required for compliance with NVMe revision 1.2.1 or later. See NVM
Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
7.9 ("NVMe Qualified Names").

This also bumps the supported version to 1.2.1.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f05ebcce3f53..9abf74da20f2 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,9 +9,9 @@
  */
 
 /**
- * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
+ * Reference Specification: NVM Express 1.2.1
  *
- *  http://www.nvmexpress.org/resources/
+ *   https://nvmexpress.org/resources/specifications/
  */
 
 /**
@@ -43,6 +43,8 @@
 #include "trace.h"
 #include "nvme.h"
 
+#define NVME_SPEC_VER 0x00010201
+
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
         (trace_##trace)(__VA_ARGS__); \
@@ -1365,6 +1367,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     id->ieee[0] = 0x00;
     id->ieee[1] = 0x02;
     id->ieee[2] = 0xb3;
+    id->ver = cpu_to_le32(NVME_SPEC_VER);
     id->oacs = cpu_to_le16(0);
     id->frmw = 7 << 1;
     id->lpa = 1 << 0;
@@ -1372,6 +1375,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     id->cqes = (0x4 << 4) | 0x4;
     id->nn = cpu_to_le32(n->num_namespaces);
     id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
+
+    strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
+    pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
+
     id->psd[0].mp = cpu_to_le16(0x9c4);
     id->psd[0].enlat = cpu_to_le32(0x10);
     id->psd[0].exlat = cpu_to_le32(0x4);
@@ -1386,7 +1393,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     NVME_CAP_SET_CSS(n->bar.cap, 1);
     NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
 
-    n->bar.vs = 0x00010200;
+    n->bar.vs = NVME_SPEC_VER;
     n->bar.intmc = n->bar.intms = 0;
 
     if (n->params.cmb_size_mb) {
-- 
2.25.0
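The subnqn is built by copying a fixed NQN prefix and then appending the serial with pstrcat(), which bounds the concatenation to the 256-byte field. A standalone sketch of the same construction, using snprintf in place of QEMU's strcpy()/pstrcat() pair (it likewise bounds and NUL-terminates):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Sketch of how the patch assembles the Subsystem NVMe Qualified Name:
 * a fixed "nqn.2019-08.org.qemu:" prefix followed by the controller's
 * serial, never overrunning the destination buffer. */
static void build_subnqn(char *subnqn, size_t size, const char *serial)
{
    /* snprintf always NUL-terminates and truncates instead of overflowing */
    snprintf(subnqn, size, "nqn.2019-08.org.qemu:%s", serial);
}
```

With a 256-byte destination (the size of the subnqn field in the Identify data structure) any reasonable serial fits untruncated.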




* [PATCH v5 06/26] nvme: refactor nvme_addr_read
       [not found]   ` <CGME20200204095219eucas1p1a7d44c741e119939c60ff60b96c7652e@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:23       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Pull the controller memory buffer check to its own function. The check
will be used on its own in later patches.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 9abf74da20f2..ba5089df9ece 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -54,14 +54,22 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
+{
+    hwaddr low = n->ctrl_mem.addr;
+    hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);
+
+    return addr >= low && addr < hi;
+}
+
 static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-    if (n->cmbsz && addr >= n->ctrl_mem.addr &&
-                addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
-        memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
-    } else {
-        pci_dma_read(&n->parent_obj, addr, buf, size);
+    if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
+        memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
+        return;
     }
+
+    pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
-- 
2.25.0
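The refactor pulls the half-open interval test [base, base + size) into its own predicate. A standalone sketch of the same check (the hwaddr typedef is an assumption mirroring QEMU's 64-bit hwaddr; the real function reads the bounds from n->ctrl_mem):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr; /* assumption: QEMU's hwaddr on a 64-bit build */

/* An address hits the controller memory buffer iff it lies in the
 * half-open interval [base, base + size). */
static bool addr_is_cmb(hwaddr base, uint64_t size, hwaddr addr)
{
    hwaddr low = base;
    hwaddr hi  = base + size;

    return addr >= low && addr < hi;
}
```

The half-open interval matters: the first byte past the buffer must fall through to pci_dma_read().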




* [PATCH v5 07/26] nvme: add support for the abort command
       [not found]   ` <CGME20200204095219eucas1p1a7e88f8f4090988b3dee34d4d4bcc239@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:25       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.1 ("Abort command").

The Abort command is a best-effort command; for now, the device always
fails to abort the given command.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index ba5089df9ece..e1810260d40b 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -731,6 +731,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
     }
 }
 
+static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
+
+    req->cqe.result = 1;
+    if (nvme_check_sqid(n, sqid)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
 static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
 {
     trace_nvme_dev_setfeat_timestamp(ts);
@@ -848,6 +860,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         trace_nvme_dev_err_invalid_setfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
+
     return NVME_SUCCESS;
 }
 
@@ -864,6 +877,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_create_cq(n, cmd);
     case NVME_ADM_CMD_IDENTIFY:
         return nvme_identify(n, cmd);
+    case NVME_ADM_CMD_ABORT:
+        return nvme_abort(n, cmd, req);
     case NVME_ADM_CMD_SET_FEATURES:
         return nvme_set_feature(n, cmd, req);
     case NVME_ADM_CMD_GET_FEATURES:
@@ -1377,6 +1392,19 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     id->ieee[2] = 0xb3;
     id->ver = cpu_to_le32(NVME_SPEC_VER);
     id->oacs = cpu_to_le16(0);
+
+    /*
+     * Because the controller always completes the Abort command immediately,
+     * there can never be more than one concurrently executing Abort command,
+     * so this value is never used for anything. Note that there can easily be
+     * many Abort commands in the queues, but they are not considered
+     * "executing" until processed by nvme_abort.
+     *
+     * The specification recommends a value of 3 for Abort Command Limit (four
+     * concurrently outstanding Abort commands), so let's use that, though
+     * it is inconsequential.
+     */
+    id->acl = 3;
     id->frmw = 7 << 1;
     id->lpa = 1 << 0;
     id->sqes = (0x6 << 4) | 0x6;
-- 
2.25.0
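Abort carries the submission queue id in the low 16 bits of CDW10; the command identifier to abort sits in the upper 16 bits, which this best-effort implementation never consults. A standalone sketch of the field extraction as it would look after the le32_to_cpu byte swap:

```c
#include <assert.h>
#include <stdint.h>

/* CDW10 layout for Abort (NVMe 1.2.1, Section 5.1):
 * bits 15:0  - SQID, the queue the command to abort was submitted to
 * bits 31:16 - CID, the command identifier to abort */
static uint16_t abort_sqid(uint32_t cdw10)
{
    return cdw10 & 0xffff;
}

static uint16_t abort_cid(uint32_t cdw10)
{
    return (cdw10 >> 16) & 0xffff;
}
```

nvme_abort only validates the SQID (returning Invalid Field for a nonexistent queue) and sets bit 0 of the completion result, meaning the command was not aborted.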




* [PATCH v5 08/26] nvme: refactor device realization
       [not found]   ` <CGME20200204095220eucas1p186b0de598359750d49278e0226ae45fb@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:27       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

This patch splits up nvme_realize into multiple individual functions,
each initializing a different subset of the device.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c | 175 +++++++++++++++++++++++++++++++-----------------
 hw/block/nvme.h |  21 ++++++
 2 files changed, 133 insertions(+), 63 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index e1810260d40b..81514eaef63a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -44,6 +44,7 @@
 #include "nvme.h"
 
 #define NVME_SPEC_VER 0x00010201
+#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
     },
 };
 
-static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
-    NvmeCtrl *n = NVME(pci_dev);
-    NvmeIdCtrl *id = &n->id_ctrl;
-
-    int i;
-    int64_t bs_size;
-    uint8_t *pci_conf;
-
-    if (!n->params.num_queues) {
-        error_setg(errp, "num_queues can't be zero");
-        return;
-    }
+    NvmeParams *params = &n->params;
 
     if (!n->conf.blk) {
-        error_setg(errp, "drive property not set");
-        return;
+        error_setg(errp, "nvme: block backend not configured");
+        return 1;
     }
 
-    bs_size = blk_getlength(n->conf.blk);
-    if (bs_size < 0) {
-        error_setg(errp, "could not get backing file size");
-        return;
+    if (!params->serial) {
+        error_setg(errp, "nvme: serial not configured");
+        return 1;
     }
 
-    if (!n->params.serial) {
-        error_setg(errp, "serial property not set");
-        return;
+    if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
+        error_setg(errp, "nvme: invalid queue configuration");
+        return 1;
     }
+
+    return 0;
+}
+
+static int nvme_init_blk(NvmeCtrl *n, Error **errp)
+{
     blkconf_blocksizes(&n->conf);
     if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-                                       false, errp)) {
-        return;
+        false, errp)) {
+        return 1;
     }
 
-    pci_conf = pci_dev->config;
-    pci_conf[PCI_INTERRUPT_PIN] = 1;
-    pci_config_set_prog_interface(pci_dev->config, 0x2);
-    pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
-    pcie_endpoint_cap_init(pci_dev, 0x80);
+    return 0;
+}
 
+static void nvme_init_state(NvmeCtrl *n)
+{
     n->num_namespaces = 1;
     n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-    n->ns_size = bs_size / (uint64_t)n->num_namespaces;
-
     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
     n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
     n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+}
 
-    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
-                          "nvme", n->reg_size);
+static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+    NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
+    NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+
+    NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
+    NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
+    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+    NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
+    NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
+    NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
+    NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+
+    n->cmbloc = n->bar.cmbloc;
+    n->cmbsz = n->bar.cmbsz;
+
+    n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
+                            "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
+    pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
+        PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
+        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+}
+
+static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
+{
+    uint8_t *pci_conf = pci_dev->config;
+
+    pci_conf[PCI_INTERRUPT_PIN] = 1;
+    pci_config_set_prog_interface(pci_conf, 0x2);
+    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+    pci_config_set_device_id(pci_conf, 0x5845);
+    pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
+    pcie_endpoint_cap_init(pci_dev, 0x80);
+
+    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
+        n->reg_size);
     pci_register_bar(pci_dev, 0,
         PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
         &n->iomem);
     msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
 
+    if (n->params.cmb_size_mb) {
+        nvme_init_cmb(n, pci_dev);
+    }
+}
+
+static void nvme_init_ctrl(NvmeCtrl *n)
+{
+    NvmeIdCtrl *id = &n->id_ctrl;
+    NvmeParams *params = &n->params;
+    uint8_t *pci_conf = n->parent_obj.config;
+
     id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
     id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
     strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
     strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
-    strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
+    strpadcpy((char *)id->sn, sizeof(id->sn), params->serial, ' ');
     id->rab = 6;
     id->ieee[0] = 0x00;
     id->ieee[1] = 0x02;
@@ -1431,46 +1471,55 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 
     n->bar.vs = NVME_SPEC_VER;
     n->bar.intmc = n->bar.intms = 0;
+}
 
-    if (n->params.cmb_size_mb) {
+static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+    int64_t bs_size;
+    NvmeIdNs *id_ns = &ns->id_ns;
 
-        NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
-        NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
+    bs_size = blk_getlength(n->conf.blk);
+    if (bs_size < 0) {
+        error_setg_errno(errp, -bs_size, "blk_getlength");
+        return 1;
+    }
 
-        NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
-        NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
-        NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
-        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
-        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
-        NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
-        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
+    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+    n->ns_size = bs_size;
 
-        n->cmbloc = n->bar.cmbloc;
-        n->cmbsz = n->bar.cmbsz;
+    id_ns->ncap = id_ns->nuse = id_ns->nsze =
+        cpu_to_le64(nvme_ns_nlbas(n, ns));
 
-        n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-        memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
-                              "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
-        pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
-            PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
-            PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
+    return 0;
+}
 
+static void nvme_realize(PCIDevice *pci_dev, Error **errp)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+    Error *local_err = NULL;
+    int i;
+
+    if (nvme_check_constraints(n, &local_err)) {
+        error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
+        return;
+    }
+
+    nvme_init_state(n);
+
+    if (nvme_init_blk(n, &local_err)) {
+        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
+        return;
     }
 
     for (i = 0; i < n->num_namespaces; i++) {
-        NvmeNamespace *ns = &n->namespaces[i];
-        NvmeIdNs *id_ns = &ns->id_ns;
-        id_ns->nsfeat = 0;
-        id_ns->nlbaf = 0;
-        id_ns->flbas = 0;
-        id_ns->mc = 0;
-        id_ns->dpc = 0;
-        id_ns->dps = 0;
-        id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
-        id_ns->ncap  = id_ns->nuse = id_ns->nsze =
-            cpu_to_le64(n->ns_size >>
-                id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
+        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
+            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
+            return;
+        }
     }
+
+    nvme_init_pci(n, pci_dev);
+    nvme_init_ctrl(n);
 }
 
 static void nvme_exit(PCIDevice *pci_dev)
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 9957c4a200e2..a867bdfabafd 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -65,6 +65,22 @@ typedef struct NvmeNamespace {
     NvmeIdNs        id_ns;
 } NvmeNamespace;
 
+static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
+{
+    NvmeIdNs *id_ns = &ns->id_ns;
+    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
+}
+
+static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
+{
+    return nvme_ns_lbaf(ns).ds;
+}
+
+static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
+{
+    return 1 << nvme_ns_lbads(ns);
+}
+
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
         OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -101,4 +117,9 @@ typedef struct NvmeCtrl {
     NvmeIdCtrl      id_ctrl;
 } NvmeCtrl;
 
+static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
+{
+    return n->ns_size >> nvme_ns_lbads(ns);
+}
+
 #endif /* HW_NVME_H */
-- 
2.25.0
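The new nvme.h helpers encode the convention that `ds` in the LBA format descriptor is the log2 of the data size: the block size is 1 << ds and the namespace holds ns_size >> ds logical blocks. A standalone sketch of the same arithmetic, with the struct plumbing stripped away:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ds is LBADS from the active LBA format: log2 of the LBA data size */
static size_t lbads_bytes(uint8_t ds)
{
    return (size_t)1 << ds;
}

/* Number of logical blocks backing a namespace of ns_size bytes */
static uint64_t nlbas(uint64_t ns_size, uint8_t ds)
{
    return ns_size >> ds;
}
```

The patch sets ds to BDRV_SECTOR_BITS (9), so a 1 MiB backing file yields 2048 512-byte blocks for nsze/ncap/nuse.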




* [PATCH v5 09/26] nvme: add temperature threshold feature
       [not found]   ` <CGME20200204095221eucas1p1d5b1c9578d79e6bcc5714976bbe7dc11@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:31       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

It might seem weird to implement this feature for an emulated device,
but it is mandatory to support, and it is useful for testing
asynchronous event request support, which will be added in a later
patch.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c      | 50 ++++++++++++++++++++++++++++++++++++++++++++
 hw/block/nvme.h      |  2 ++
 include/block/nvme.h |  7 ++++++-
 3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 81514eaef63a..f72348344832 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -45,6 +45,9 @@
 
 #define NVME_SPEC_VER 0x00010201
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
+#define NVME_TEMPERATURE 0x143
+#define NVME_TEMPERATURE_WARNING 0x157
+#define NVME_TEMPERATURE_CRITICAL 0x175
 
 #define NVME_GUEST_ERR(trace, fmt, ...) \
     do { \
@@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
     uint32_t result;
 
     switch (dw10) {
+    case NVME_TEMPERATURE_THRESHOLD:
+        result = 0;
+
+        /*
+         * The controller only implements the Composite Temperature sensor, so
+         * return 0 for all other sensors.
+         */
+        if (NVME_TEMP_TMPSEL(dw11)) {
+            break;
+        }
+
+        switch (NVME_TEMP_THSEL(dw11)) {
+        case 0x0:
+            result = cpu_to_le16(n->features.temp_thresh_hi);
+            break;
+        case 0x1:
+            result = cpu_to_le16(n->features.temp_thresh_low);
+            break;
+        }
+
+        break;
     case NVME_VOLATILE_WRITE_CACHE:
         result = blk_enable_write_cache(n->conf.blk);
         trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
@@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
     switch (dw10) {
+    case NVME_TEMPERATURE_THRESHOLD:
+        if (NVME_TEMP_TMPSEL(dw11)) {
+            break;
+        }
+
+        switch (NVME_TEMP_THSEL(dw11)) {
+        case 0x0:
+            n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
+            break;
+        case 0x1:
+            n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
+            break;
+        default:
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        break;
     case NVME_VOLATILE_WRITE_CACHE:
         blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
         break;
@@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
     n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
     n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
     n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
+
+    n->temperature = NVME_TEMPERATURE;
+    n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
     id->acl = 3;
     id->frmw = 7 << 1;
     id->lpa = 1 << 0;
+
+    /* recommended default value (~70 C) */
+    id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
+    id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
+
     id->sqes = (0x6 << 4) | 0x6;
     id->cqes = (0x4 << 4) | 0x4;
     id->nn = cpu_to_le32(n->num_namespaces);
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index a867bdfabafd..1518f32557a3 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
     uint64_t    irq_status;
     uint64_t    host_timestamp;                 /* Timestamp sent by the host */
     uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
+    uint16_t    temperature;
 
     NvmeNamespace   *namespaces;
     NvmeSQueue      **sq;
@@ -115,6 +116,7 @@ typedef struct NvmeCtrl {
     NvmeSQueue      admin_sq;
     NvmeCQueue      admin_cq;
     NvmeIdCtrl      id_ctrl;
+    NvmeFeatureVal  features;
 } NvmeCtrl;
 
 static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
diff --git a/include/block/nvme.h b/include/block/nvme.h
index d2f65e8fe496..ff31cb32117c 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -616,7 +616,8 @@ enum NvmeIdCtrlOncs {
 typedef struct NvmeFeatureVal {
     uint32_t    arbitration;
     uint32_t    power_mgmt;
-    uint32_t    temp_thresh;
+    uint16_t    temp_thresh_hi;
+    uint16_t    temp_thresh_low;
     uint32_t    err_rec;
     uint32_t    volatile_wc;
     uint32_t    num_queues;
@@ -635,6 +636,10 @@ typedef struct NvmeFeatureVal {
 #define NVME_INTC_THR(intc)     (intc & 0xff)
 #define NVME_INTC_TIME(intc)    ((intc >> 8) & 0xff)
 
+#define NVME_TEMP_THSEL(temp)  ((temp >> 20) & 0x3)
+#define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
+#define NVME_TEMP_TMPTH(temp)  (temp & 0xffff)
+
 enum NvmeFeatureIds {
     NVME_ARBITRATION                = 0x1,
     NVME_POWER_MANAGEMENT           = 0x2,
-- 
2.25.0
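The new NVME_TEMP_* macros decode CDW11 of the Temperature Threshold feature: TMPTH occupies bits 15:0, TMPSEL bits 19:16, and THSEL bits 21:20. Plain-function equivalents as a standalone sketch:

```c
#include <assert.h>
#include <stdint.h>

/* THSEL: 0x0 selects the over temperature threshold, 0x1 the under
 * temperature threshold */
static uint32_t temp_thsel(uint32_t dw11)
{
    return (dw11 >> 20) & 0x3;
}

/* TMPSEL: which temperature sensor; 0 is Composite Temperature, the
 * only sensor this controller implements */
static uint32_t temp_tmpsel(uint32_t dw11)
{
    return (dw11 >> 16) & 0xf;
}

/* TMPTH: the threshold value itself, in Kelvin */
static uint32_t temp_tmpth(uint32_t dw11)
{
    return dw11 & 0xffff;
}
```

With these, the Set Features handler routes TMPTH into temp_thresh_hi or temp_thresh_low depending on THSEL, and ignores any nonzero TMPSEL.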




* [PATCH v5 10/26] nvme: add support for the get log page command
       [not found]   ` <CGME20200204095221eucas1p216ca2452c4184eb06bff85cff3c6a82b@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12  9:35       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Add support for the Get Log Page command and basic implementations of
the mandatory Error Information, SMART / Health Information and Firmware
Slot Information log pages.

In violation of the specification, the SMART / Health Information log
page does not persist information over the lifetime of the controller
because the device has no place to store such persistent state.

Note that the LPA field in the Identify Controller data structure
intentionally has bit 0 cleared because there is no namespace specific
information in the SMART / Health information log page.

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.10 ("Get Log Page command").

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c       | 122 +++++++++++++++++++++++++++++++++++++++++-
 hw/block/nvme.h       |  10 ++++
 hw/block/trace-events |   2 +
 include/block/nvme.h  |   2 +-
 4 files changed, 134 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f72348344832..468c36918042 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
+    uint64_t off, NvmeRequest *req)
+{
+    uint64_t prp1 = le64_to_cpu(cmd->prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->prp2);
+    uint32_t nsid = le32_to_cpu(cmd->nsid);
+
+    uint32_t trans_len;
+    time_t current_ms;
+    uint64_t units_read = 0, units_written = 0, read_commands = 0,
+        write_commands = 0;
+    NvmeSmartLog smart;
+    BlockAcctStats *s;
+
+    if (nsid && nsid != 0xffffffff) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    s = blk_get_stats(n->conf.blk);
+
+    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
+    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
+    read_commands = s->nr_ops[BLOCK_ACCT_READ];
+    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
+
+    if (off > sizeof(smart)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    trans_len = MIN(sizeof(smart) - off, buf_len);
+
+    memset(&smart, 0x0, sizeof(smart));
+
+    smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
+    smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
+    smart.host_read_commands[0] = cpu_to_le64(read_commands);
+    smart.host_write_commands[0] = cpu_to_le64(write_commands);
+
+    smart.temperature[0] = n->temperature & 0xff;
+    smart.temperature[1] = (n->temperature >> 8) & 0xff;
+
+    if ((n->temperature > n->features.temp_thresh_hi) ||
+        (n->temperature < n->features.temp_thresh_low)) {
+        smart.critical_warning |= NVME_SMART_TEMPERATURE;
+    }
+
+    current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+    smart.power_on_hours[0] = cpu_to_le64(
+        (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
+
+    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
+        prp2);
+}
+
+static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
+    uint64_t off, NvmeRequest *req)
+{
+    uint32_t trans_len;
+    uint64_t prp1 = le64_to_cpu(cmd->prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->prp2);
+    NvmeFwSlotInfoLog fw_log;
+
+    if (off > sizeof(fw_log)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
+
+    trans_len = MIN(sizeof(fw_log) - off, buf_len);
+
+    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
+        prp2);
+}
+
+static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
+    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
+    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
+    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint8_t  lid = dw10 & 0xff;
+    uint8_t  rae = (dw10 >> 15) & 0x1;
+    uint32_t numdl, numdu;
+    uint64_t off, lpol, lpou;
+    size_t   len;
+
+    numdl = (dw10 >> 16);
+    numdu = (dw11 & 0xffff);
+    lpol = dw12;
+    lpou = dw13;
+
+    len = (((numdu << 16) | numdl) + 1) << 2;
+    off = (lpou << 32ULL) | lpol;
+
+    if (off & 0x3) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
+
+    switch (lid) {
+    case NVME_LOG_ERROR_INFO:
+        if (off) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        return NVME_SUCCESS;
+    case NVME_LOG_SMART_INFO:
+        return nvme_smart_info(n, cmd, len, off, req);
+    case NVME_LOG_FW_SLOT_INFO:
+        return nvme_fw_log_info(n, cmd, len, off, req);
+    default:
+        trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+}
+
 static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 {
     n->cq[cq->cqid] = NULL;
@@ -914,6 +1031,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_del_sq(n, cmd);
     case NVME_ADM_CMD_CREATE_SQ:
         return nvme_create_sq(n, cmd);
+    case NVME_ADM_CMD_GET_LOG_PAGE:
+        return nvme_get_log(n, cmd, req);
     case NVME_ADM_CMD_DELETE_CQ:
         return nvme_del_cq(n, cmd);
     case NVME_ADM_CMD_CREATE_CQ:
@@ -1411,6 +1530,7 @@ static void nvme_init_state(NvmeCtrl *n)
 
     n->temperature = NVME_TEMPERATURE;
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
+    n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1491,7 +1611,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
      */
     id->acl = 3;
     id->frmw = 7 << 1;
-    id->lpa = 1 << 0;
+    id->lpa = 1 << 2;
 
     /* recommended default value (~70 C) */
     id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1518f32557a3..89b0aafa02a2 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -109,6 +109,7 @@ typedef struct NvmeCtrl {
     uint64_t    host_timestamp;                 /* Timestamp sent by the host */
     uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
     uint16_t    temperature;
+    uint64_t    starttime_ms;
 
     NvmeNamespace   *namespaces;
     NvmeSQueue      **sq;
@@ -124,4 +125,13 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
     return n->ns_size >> nvme_ns_lbads(ns);
 }
 
+static inline uint16_t nvme_cid(NvmeRequest *req)
+{
+    if (req) {
+        return le16_to_cpu(req->cqe.cid);
+    }
+
+    return 0xffff;
+}
+
 #endif /* HW_NVME_H */
diff --git a/hw/block/trace-events b/hw/block/trace-events
index ade506ea2bb2..7da088479f39 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -46,6 +46,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
 nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
 nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
 nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
+nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
 nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
@@ -85,6 +86,7 @@ nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completi
 nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
 nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
 nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
+nvme_dev_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
 nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
 nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
 nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index ff31cb32117c..9a6055adeb61 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -515,7 +515,7 @@ enum NvmeSmartWarn {
     NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
 };
 
-enum LogIdentifier {
+enum NvmeLogIdentifier {
     NVME_LOG_ERROR_INFO     = 0x01,
     NVME_LOG_SMART_INFO     = 0x02,
     NVME_LOG_FW_SLOT_INFO   = 0x03,
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 11/26] nvme: add support for the asynchronous event request command
       [not found]   ` <CGME20200204095222eucas1p2a2351bfc0930b3939927e485f1417e29@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 10:21       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
Section 5.2 ("Asynchronous Event Request command").

Mostly imported from Keith's qemu-nvme tree. Modified with a maximum
number of queued events (controllable with the aer_max_queued device
parameter). The spec states that the controller *should* retain
events, so we make a best effort here.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c       | 167 +++++++++++++++++++++++++++++++++++++++++-
 hw/block/nvme.h       |  14 +++-
 hw/block/trace-events |   9 +++
 include/block/nvme.h  |   8 +-
 4 files changed, 191 insertions(+), 7 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 468c36918042..a186d95df020 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -325,6 +325,85 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
 }
 
+static void nvme_process_aers(void *opaque)
+{
+    NvmeCtrl *n = opaque;
+    NvmeAsyncEvent *event, *next;
+
+    trace_nvme_dev_process_aers(n->aer_queued);
+
+    QTAILQ_FOREACH_SAFE(event, &n->aer_queue, entry, next) {
+        NvmeRequest *req;
+        NvmeAerResult *result;
+
+        /* can't post cqe if there is nothing to complete */
+        if (!n->outstanding_aers) {
+            trace_nvme_dev_no_outstanding_aers();
+            break;
+        }
+
+        /* ignore if masked (cqe posted, but event not cleared) */
+        if (n->aer_mask & (1 << event->result.event_type)) {
+            trace_nvme_dev_aer_masked(event->result.event_type, n->aer_mask);
+            continue;
+        }
+
+        QTAILQ_REMOVE(&n->aer_queue, event, entry);
+        n->aer_queued--;
+
+        n->aer_mask |= 1 << event->result.event_type;
+        n->outstanding_aers--;
+
+        req = n->aer_reqs[n->outstanding_aers];
+
+        result = (NvmeAerResult *) &req->cqe.result;
+        result->event_type = event->result.event_type;
+        result->event_info = event->result.event_info;
+        result->log_page = event->result.log_page;
+        g_free(event);
+
+        req->status = NVME_SUCCESS;
+
+        trace_nvme_dev_aer_post_cqe(result->event_type, result->event_info,
+            result->log_page);
+
+        nvme_enqueue_req_completion(&n->admin_cq, req);
+    }
+}
+
+static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
+    uint8_t event_info, uint8_t log_page)
+{
+    NvmeAsyncEvent *event;
+
+    trace_nvme_dev_enqueue_event(event_type, event_info, log_page);
+
+    if (n->aer_queued == n->params.aer_max_queued) {
+        trace_nvme_dev_enqueue_event_noqueue(n->aer_queued);
+        return;
+    }
+
+    event = g_new(NvmeAsyncEvent, 1);
+    event->result = (NvmeAerResult) {
+        .event_type = event_type,
+        .event_info = event_info,
+        .log_page   = log_page,
+    };
+
+    QTAILQ_INSERT_TAIL(&n->aer_queue, event, entry);
+    n->aer_queued++;
+
+    nvme_process_aers(n);
+}
+
+static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
+{
+    n->aer_mask &= ~(1 << event_type);
+    if (!QTAILQ_EMPTY(&n->aer_queue)) {
+        nvme_process_aers(n);
+    }
+}
+
 static void nvme_rw_cb(void *opaque, int ret)
 {
     NvmeRequest *req = opaque;
@@ -569,8 +648,8 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
-    uint64_t off, NvmeRequest *req)
+static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
+    uint32_t buf_len, uint64_t off, NvmeRequest *req)
 {
     uint64_t prp1 = le64_to_cpu(cmd->prp1);
     uint64_t prp2 = le64_to_cpu(cmd->prp2);
@@ -619,6 +698,10 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
     smart.power_on_hours[0] = cpu_to_le64(
         (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
 
+    if (!rae) {
+        nvme_clear_events(n, NVME_AER_TYPE_SMART);
+    }
+
     return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
         prp2);
 }
@@ -671,13 +754,17 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 
     switch (lid) {
     case NVME_LOG_ERROR_INFO:
+        if (!rae) {
+            nvme_clear_events(n, NVME_AER_TYPE_ERROR);
+        }
+
         if (off) {
             return NVME_INVALID_FIELD | NVME_DNR;
         }
 
         return NVME_SUCCESS;
     case NVME_LOG_SMART_INFO:
-        return nvme_smart_info(n, cmd, len, off, req);
+        return nvme_smart_info(n, cmd, rae, len, off, req);
     case NVME_LOG_FW_SLOT_INFO:
         return nvme_fw_log_info(n, cmd, len, off, req);
     default:
@@ -954,6 +1041,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_get_feature_timestamp(n, cmd);
+    case NVME_ASYNCHRONOUS_EVENT_CONF:
+        result = cpu_to_le32(n->features.async_config);
+        break;
     default:
         trace_nvme_dev_err_invalid_getfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1003,6 +1093,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
             return NVME_INVALID_FIELD | NVME_DNR;
         }
 
+        if (((n->temperature > n->features.temp_thresh_hi) ||
+            (n->temperature < n->features.temp_thresh_low)) &&
+            NVME_AEC_SMART(n->features.async_config) & NVME_SMART_TEMPERATURE) {
+            nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
+                NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
+        }
+
         break;
     case NVME_VOLATILE_WRITE_CACHE:
         blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
@@ -1016,6 +1113,9 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_set_feature_timestamp(n, cmd);
+    case NVME_ASYNCHRONOUS_EVENT_CONF:
+        n->features.async_config = dw11;
+        break;
     default:
         trace_nvme_dev_err_invalid_setfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1024,6 +1124,25 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
+static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    trace_nvme_dev_aer(nvme_cid(req));
+
+    if (n->outstanding_aers > n->params.aerl) {
+        trace_nvme_dev_aer_aerl_exceeded();
+        return NVME_AER_LIMIT_EXCEEDED;
+    }
+
+    n->aer_reqs[n->outstanding_aers] = req;
+    n->outstanding_aers++;
+
+    if (!QTAILQ_EMPTY(&n->aer_queue)) {
+        nvme_process_aers(n);
+    }
+
+    return NVME_NO_COMPLETE;
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
     switch (cmd->opcode) {
@@ -1045,6 +1164,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_set_feature(n, cmd, req);
     case NVME_ADM_CMD_GET_FEATURES:
         return nvme_get_feature(n, cmd, req);
+    case NVME_ADM_CMD_ASYNC_EV_REQ:
+        return nvme_aer(n, cmd, req);
     default:
         trace_nvme_dev_err_invalid_admin_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -1099,6 +1220,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
         }
     }
 
+    while (!QTAILQ_EMPTY(&n->aer_queue)) {
+        NvmeAsyncEvent *event = QTAILQ_FIRST(&n->aer_queue);
+        QTAILQ_REMOVE(&n->aer_queue, event, entry);
+        g_free(event);
+    }
+
+    n->aer_queued = 0;
+    n->outstanding_aers = 0;
+
     blk_flush(n->conf.blk);
     n->bar.cc = 0;
 }
@@ -1195,6 +1325,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
     nvme_set_timestamp(n, 0ULL);
 
+    QTAILQ_INIT(&n->aer_queue);
+
     return 0;
 }
 
@@ -1387,6 +1519,13 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            "completion queue doorbell write"
                            " for nonexistent queue,"
                            " sqid=%"PRIu32", ignoring", qid);
+
+            if (n->outstanding_aers) {
+                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                    NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
+                    NVME_LOG_ERROR_INFO);
+            }
+
             return;
         }
 
@@ -1397,6 +1536,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            " beyond queue size, sqid=%"PRIu32","
                            " new_head=%"PRIu16", ignoring",
                            qid, new_head);
+
+            if (n->outstanding_aers) {
+                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                    NVME_AER_INFO_ERR_INVALID_DB_VALUE, NVME_LOG_ERROR_INFO);
+            }
+
             return;
         }
 
@@ -1425,6 +1570,13 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            "submission queue doorbell write"
                            " for nonexistent queue,"
                            " sqid=%"PRIu32", ignoring", qid);
+
+            if (n->outstanding_aers) {
+                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                    NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
+                    NVME_LOG_ERROR_INFO);
+            }
+
             return;
         }
 
@@ -1435,6 +1587,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
                            " beyond queue size, sqid=%"PRIu32","
                            " new_tail=%"PRIu16", ignoring",
                            qid, new_tail);
+
+            if (n->outstanding_aers) {
+                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
+                    NVME_AER_INFO_ERR_INVALID_DB_VALUE, NVME_LOG_ERROR_INFO);
+            }
+
             return;
         }
 
@@ -1531,6 +1689,7 @@ static void nvme_init_state(NvmeCtrl *n)
     n->temperature = NVME_TEMPERATURE;
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+    n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 }
 
 static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
@@ -1610,6 +1769,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
      * inconsequential.
      */
     id->acl = 3;
+    id->aerl = n->params.aerl;
     id->frmw = 7 << 1;
     id->lpa = 1 << 2;
 
@@ -1700,6 +1860,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     g_free(n->namespaces);
     g_free(n->cq);
     g_free(n->sq);
+    g_free(n->aer_reqs);
 
     if (n->params.cmb_size_mb) {
         g_free(n->cmbuf);
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 89b0aafa02a2..1e715ab1d75c 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -6,16 +6,20 @@
 #define DEFINE_NVME_PROPERTIES(_state, _props) \
     DEFINE_PROP_STRING("serial", _state, _props.serial), \
     DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
-    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)
+    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
+    DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
+    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64)
 
 typedef struct NvmeParams {
     char     *serial;
     uint32_t num_queues;
     uint32_t cmb_size_mb;
+    uint8_t  aerl;
+    uint32_t aer_max_queued;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
-    QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
+    QTAILQ_ENTRY(NvmeAsyncEvent) entry;
     NvmeAerResult result;
 } NvmeAsyncEvent;
 
@@ -102,6 +106,7 @@ typedef struct NvmeCtrl {
     uint32_t    num_namespaces;
     uint32_t    max_q_ents;
     uint64_t    ns_size;
+    uint8_t     outstanding_aers;
     uint32_t    cmbsz;
     uint32_t    cmbloc;
     uint8_t     *cmbuf;
@@ -111,6 +116,11 @@ typedef struct NvmeCtrl {
     uint16_t    temperature;
     uint64_t    starttime_ms;
 
+    uint8_t     aer_mask;
+    NvmeRequest **aer_reqs;
+    QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
+    int         aer_queued;
+
     NvmeNamespace   *namespaces;
     NvmeSQueue      **sq;
     NvmeCQueue      **cq;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 7da088479f39..3952c36774cf 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -47,6 +47,15 @@ nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_
 nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
 nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
 nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
+nvme_dev_process_aers(int queued) "queued %d"
+nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
+nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
+nvme_dev_aer_masked(uint8_t type, uint8_t mask) "type 0x%"PRIx8" mask 0x%"PRIx8""
+nvme_dev_aer_post_cqe(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
+nvme_dev_enqueue_event(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
+nvme_dev_enqueue_event_noqueue(int queued) "queued %d"
+nvme_dev_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""
+nvme_dev_no_outstanding_aers(void) "ignoring event; no outstanding AERs"
 nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
 nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 9a6055adeb61..a24be047a311 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -386,8 +386,8 @@ enum NvmeAsyncEventRequest {
     NVME_AER_TYPE_SMART                     = 1,
     NVME_AER_TYPE_IO_SPECIFIC               = 6,
     NVME_AER_TYPE_VENDOR_SPECIFIC           = 7,
-    NVME_AER_INFO_ERR_INVALID_SQ            = 0,
-    NVME_AER_INFO_ERR_INVALID_DB            = 1,
+    NVME_AER_INFO_ERR_INVALID_DB_REGISTER   = 0,
+    NVME_AER_INFO_ERR_INVALID_DB_VALUE      = 1,
     NVME_AER_INFO_ERR_DIAG_FAIL             = 2,
     NVME_AER_INFO_ERR_PERS_INTERNAL_ERR     = 3,
     NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR    = 4,
@@ -640,6 +640,10 @@ typedef struct NvmeFeatureVal {
 #define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
 #define NVME_TEMP_TMPTH(temp)  (temp & 0xffff)
 
+#define NVME_AEC_SMART(aec)         (aec & 0xff)
+#define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
+#define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
+
 enum NvmeFeatureIds {
     NVME_ARBITRATION                = 0x1,
     NVME_POWER_MANAGEMENT           = 0x2,
-- 
2.25.0




* [PATCH v5 12/26] nvme: add missing mandatory features
       [not found]   ` <CGME20200204095223eucas1p281b4ef7c8f4170d8a42da3b4aea9e166@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 10:27       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Add support for returning a reasonable response to Get/Set Features for
mandatory features.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  3 ++-
 3 files changed, 58 insertions(+), 4 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a186d95df020..3267ee2de47a 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1008,7 +1008,15 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
     uint32_t result;
 
+    trace_nvme_dev_getfeat(nvme_cid(req), dw10);
+
     switch (dw10) {
+    case NVME_ARBITRATION:
+        result = cpu_to_le32(n->features.arbitration);
+        break;
+    case NVME_POWER_MANAGEMENT:
+        result = cpu_to_le32(n->features.power_mgmt);
+        break;
     case NVME_TEMPERATURE_THRESHOLD:
         result = 0;
 
@@ -1029,6 +1037,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
             break;
         }
 
+        break;
+    case NVME_ERROR_RECOVERY:
+        result = cpu_to_le32(n->features.err_rec);
         break;
     case NVME_VOLATILE_WRITE_CACHE:
         result = blk_enable_write_cache(n->conf.blk);
@@ -1041,6 +1052,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         break;
     case NVME_TIMESTAMP:
         return nvme_get_feature_timestamp(n, cmd);
+    case NVME_INTERRUPT_COALESCING:
+        result = cpu_to_le32(n->features.int_coalescing);
+        break;
+    case NVME_INTERRUPT_VECTOR_CONF:
+        if ((dw11 & 0xffff) > n->params.num_queues) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
+        result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
+        break;
+    case NVME_WRITE_ATOMICITY:
+        result = cpu_to_le32(n->features.write_atomicity);
+        break;
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         result = cpu_to_le32(n->features.async_config);
         break;
@@ -1076,6 +1100,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
+    trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
+
     switch (dw10) {
     case NVME_TEMPERATURE_THRESHOLD:
         if (NVME_TEMP_TMPSEL(dw11)) {
@@ -1116,6 +1142,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
         break;
+    case NVME_ARBITRATION:
+    case NVME_POWER_MANAGEMENT:
+    case NVME_ERROR_RECOVERY:
+    case NVME_INTERRUPT_COALESCING:
+    case NVME_INTERRUPT_VECTOR_CONF:
+    case NVME_WRITE_ATOMICITY:
+        return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
     default:
         trace_nvme_dev_err_invalid_setfeat(dw10);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1689,6 +1722,21 @@ static void nvme_init_state(NvmeCtrl *n)
     n->temperature = NVME_TEMPERATURE;
     n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
     n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+
+    /*
+     * There is no limit on the number of commands that the controller may
+     * launch at one time from a particular Submission Queue.
+     */
+    n->features.arbitration = 0x7;
+
+    n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
+        sizeof(*n->features.int_vector_config));
+
+    /* disable coalescing (not supported) */
+    for (int i = 0; i < n->params.num_queues; i++) {
+        n->features.int_vector_config[i] = i | (1 << 16);
+    }
+
     n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
 }
 
@@ -1782,15 +1830,17 @@ static void nvme_init_ctrl(NvmeCtrl *n)
     id->nn = cpu_to_le32(n->num_namespaces);
     id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
 
+
+    if (blk_enable_write_cache(n->conf.blk)) {
+        id->vwc = 1;
+    }
+
     strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
     pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
 
     id->psd[0].mp = cpu_to_le16(0x9c4);
     id->psd[0].enlat = cpu_to_le32(0x10);
     id->psd[0].exlat = cpu_to_le32(0x4);
-    if (blk_enable_write_cache(n->conf.blk)) {
-        id->vwc = 1;
-    }
 
     n->bar.cap = 0;
     NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
@@ -1861,6 +1911,7 @@ static void nvme_exit(PCIDevice *pci_dev)
     g_free(n->cq);
     g_free(n->sq);
     g_free(n->aer_reqs);
+    g_free(n->features.int_vector_config);
 
     if (n->params.cmb_size_mb) {
         g_free(n->cmbuf);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 3952c36774cf..4cf39961989d 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,8 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
 nvme_dev_identify_ctrl(void) "identify controller"
 nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
 nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
+nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
+nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
 nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
 nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
 nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index a24be047a311..09419ed499d0 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -445,7 +445,8 @@ enum NvmeStatusCodes {
     NVME_FW_REQ_RESET           = 0x010b,
     NVME_INVALID_QUEUE_DEL      = 0x010c,
     NVME_FID_NOT_SAVEABLE       = 0x010d,
-    NVME_FID_NOT_NSID_SPEC      = 0x010f,
+    NVME_FEAT_NOT_CHANGABLE     = 0x010e,
+    NVME_FEAT_NOT_NSID_SPEC     = 0x010f,
     NVME_FW_REQ_SUSYSTEM_RESET  = 0x0110,
     NVME_CONFLICTING_ATTRS      = 0x0180,
     NVME_INVALID_PROT_INFO      = 0x0181,
-- 
2.25.0




* [PATCH v5 13/26] nvme: additional tracing
       [not found]   ` <CGME20200204095223eucas1p2b24d674e4b201c13a5fffc6853520d9b@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 10:28       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Add a trace call for nvme_enqueue_req_completion.

Also, streamline nvme_identify_ns and nvme_identify_ns_list. They do not
need to repeat the command; it is already in the trace name.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 8 +++++---
 hw/block/trace-events | 5 +++--
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3267ee2de47a..30c5b3e7a67d 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -320,6 +320,8 @@ static void nvme_post_cqes(void *opaque)
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
     assert(cq->cqid == req->sq->cqid);
+    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid,
+        req->status);
     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
@@ -895,7 +897,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
         prp1, prp2);
 }
 
-static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
 {
     static const int data_len = 4 * KiB;
     uint32_t min_nsid = le32_to_cpu(c->nsid);
@@ -905,7 +907,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
     uint16_t ret;
     int i, j = 0;
 
-    trace_nvme_dev_identify_nslist(min_nsid);
+    trace_nvme_dev_identify_ns_list(min_nsid);
 
     list = g_malloc0(data_len);
     for (i = 0; i < n->num_namespaces; i++) {
@@ -932,7 +934,7 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
     case 0x01:
         return nvme_identify_ctrl(n, c);
     case 0x02:
-        return nvme_identify_nslist(n, c);
+        return nvme_identify_ns_list(n, c);
     default:
         trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 4cf39961989d..f982ec1a3221 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -39,8 +39,8 @@ nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
 nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
 nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
 nvme_dev_identify_ctrl(void) "identify controller"
-nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
-nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
+nvme_dev_identify_ns(uint32_t ns) "nsid %"PRIu32""
+nvme_dev_identify_ns_list(uint32_t ns) "nsid %"PRIu32""
 nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
 nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
 nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
@@ -54,6 +54,7 @@ nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
 nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
 nvme_dev_aer_masked(uint8_t type, uint8_t mask) "type 0x%"PRIx8" mask 0x%"PRIx8""
 nvme_dev_aer_post_cqe(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
+nvme_dev_enqueue_req_completion(uint16_t cid, uint16_t cqid, uint16_t status) "cid %"PRIu16" cqid %"PRIu16" status 0x%"PRIx16""
 nvme_dev_enqueue_event(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
 nvme_dev_enqueue_event_noqueue(int queued) "queued %d"
 nvme_dev_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""
-- 
2.25.0




* [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid
       [not found]   ` <CGME20200204095224eucas1p10807239f5dc4aa809650c85186c426a8@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 10:30       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

0xffff is not an allowed value for NCQR and NSQR in Set Features on
Number of Queues.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 30c5b3e7a67d..900732bb2f38 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -1133,6 +1133,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
         break;
     case NVME_NUMBER_OF_QUEUES:
+        if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
+            return NVME_INVALID_FIELD | NVME_DNR;
+        }
+
         trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
             ((dw11 >> 16) & 0xFFFF) + 1, n->params.num_queues - 1,
             n->params.num_queues - 1);
-- 
2.25.0




* [PATCH v5 15/26] nvme: bump supported specification to 1.3
       [not found]   ` <CGME20200204095225eucas1p1e44b4de86afdf936e3c7f61359d529ce@eucas1p1.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 10:35       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Add new fields to the Identify Controller and Identify Namespace data
structures according to NVM Express 1.3d.

NVM Express 1.3d requires the following additional features:
  - addition of the Namespace Identification Descriptor List (CNS 03h)
    for the Identify command
  - support for returning Command Sequence Error if a Set Features
    command is submitted for the Number of Queues feature after any I/O
    queues have been created
  - addition of the Log Specific Field (LSP) in the Get Log Page
    command

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
---
 hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
 hw/block/nvme.h       |  1 +
 hw/block/trace-events |  3 ++-
 include/block/nvme.h  | 20 ++++++++++-----
 4 files changed, 71 insertions(+), 10 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 900732bb2f38..4acfc85b56a2 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -9,7 +9,7 @@
  */
 
 /**
- * Reference Specification: NVM Express 1.2.1
+ * Reference Specification: NVM Express 1.3d
  *
  *   https://nvmexpress.org/resources/specifications/
  */
@@ -43,7 +43,7 @@
 #include "trace.h"
 #include "nvme.h"
 
-#define NVME_SPEC_VER 0x00010201
+#define NVME_SPEC_VER 0x00010300
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
 #define NVME_TEMPERATURE 0x143
 #define NVME_TEMPERATURE_WARNING 0x157
@@ -735,6 +735,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     uint32_t dw12 = le32_to_cpu(cmd->cdw12);
     uint32_t dw13 = le32_to_cpu(cmd->cdw13);
     uint8_t  lid = dw10 & 0xff;
+    uint8_t  lsp = (dw10 >> 8) & 0xf;
     uint8_t  rae = (dw10 >> 15) & 0x1;
     uint32_t numdl, numdu;
     uint64_t off, lpol, lpou;
@@ -752,7 +753,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
+    trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
 
     switch (lid) {
     case NVME_LOG_ERROR_INFO:
@@ -863,6 +864,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
     cq = g_malloc0(sizeof(*cq));
     nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
         NVME_CQ_FLAGS_IEN(qflags));
+
+    n->qs_created = true;
     return NVME_SUCCESS;
 }
 
@@ -924,6 +927,47 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
     return ret;
 }
 
+static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
+{
+    static const int len = 4096;
+
+    struct ns_descr {
+        uint8_t nidt;
+        uint8_t nidl;
+        uint8_t rsvd2[2];
+        uint8_t nid[16];
+    };
+
+    uint32_t nsid = le32_to_cpu(c->nsid);
+    uint64_t prp1 = le64_to_cpu(c->prp1);
+    uint64_t prp2 = le64_to_cpu(c->prp2);
+
+    struct ns_descr *list;
+    uint16_t ret;
+
+    trace_nvme_dev_identify_ns_descr_list(nsid);
+
+    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
+        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
+        return NVME_INVALID_NSID | NVME_DNR;
+    }
+
+    /*
+     * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
+     * structure, a Namespace UUID (nidt = 0x3) must be reported in the
+     * Namespace Identification Descriptor. Add a very basic Namespace UUID
+     * here.
+     */
+    list = g_malloc0(len);
+    list->nidt = 0x3;
+    list->nidl = 0x10;
+    *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
+
+    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
+    g_free(list);
+    return ret;
+}
+
 static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
 {
     NvmeIdentify *c = (NvmeIdentify *)cmd;
@@ -935,6 +979,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
         return nvme_identify_ctrl(n, c);
     case 0x02:
         return nvme_identify_ns_list(n, c);
+    case 0x03:
+        return nvme_identify_ns_descr_list(n, cmd);
     default:
         trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1133,6 +1179,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
         break;
     case NVME_NUMBER_OF_QUEUES:
+        if (n->qs_created) {
+            return NVME_CMD_SEQ_ERROR | NVME_DNR;
+        }
+
         if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
             return NVME_INVALID_FIELD | NVME_DNR;
         }
@@ -1267,6 +1317,7 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
 
     n->aer_queued = 0;
     n->outstanding_aers = 0;
+    n->qs_created = false;
 
     blk_flush(n->conf.blk);
     n->bar.cc = 0;
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 1e715ab1d75c..7ced5fd485a9 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -97,6 +97,7 @@ typedef struct NvmeCtrl {
     BlockConf    conf;
     NvmeParams   params;
 
+    bool        qs_created;
     uint32_t    page_size;
     uint16_t    page_bits;
     uint16_t    max_prp_ents;
diff --git a/hw/block/trace-events b/hw/block/trace-events
index f982ec1a3221..9e5a4548bde0 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -41,6 +41,7 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
 nvme_dev_identify_ctrl(void) "identify controller"
 nvme_dev_identify_ns(uint32_t ns) "nsid %"PRIu32""
 nvme_dev_identify_ns_list(uint32_t ns) "nsid %"PRIu32""
+nvme_dev_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
 nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
 nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
 nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
@@ -48,7 +49,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
 nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
 nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
 nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
-nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
+nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
 nvme_dev_process_aers(int queued) "queued %d"
 nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
 nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 09419ed499d0..31eb9397d8c6 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -550,7 +550,9 @@ typedef struct NvmeIdCtrl {
     uint32_t    rtd3e;
     uint32_t    oaes;
     uint32_t    ctratt;
-    uint8_t     rsvd100[156];
+    uint8_t     rsvd100[12];
+    uint8_t     fguid[16];
+    uint8_t     rsvd128[128];
     uint16_t    oacs;
     uint8_t     acl;
     uint8_t     aerl;
@@ -568,9 +570,15 @@ typedef struct NvmeIdCtrl {
     uint8_t     tnvmcap[16];
     uint8_t     unvmcap[16];
     uint32_t    rpmbs;
-    uint8_t     rsvd316[4];
+    uint16_t    edstt;
+    uint8_t     dsto;
+    uint8_t     fwug;
     uint16_t    kas;
-    uint8_t     rsvd322[190];
+    uint16_t    hctma;
+    uint16_t    mntmt;
+    uint16_t    mxtmt;
+    uint32_t    sanicap;
+    uint8_t     rsvd332[180];
     uint8_t     sqes;
     uint8_t     cqes;
     uint16_t    maxcmd;
@@ -691,19 +699,19 @@ typedef struct NvmeIdNs {
     uint8_t     rescap;
     uint8_t     fpi;
     uint8_t     dlfeat;
-    uint8_t     rsvd33;
     uint16_t    nawun;
     uint16_t    nawupf;
+    uint16_t    nacwu;
     uint16_t    nabsn;
     uint16_t    nabo;
     uint16_t    nabspf;
-    uint8_t     rsvd46[2];
+    uint16_t    noiob;
     uint8_t     nvmcap[16];
     uint8_t     rsvd64[40];
     uint8_t     nguid[16];
     uint64_t    eui64;
     NvmeLBAF    lbaf[16];
-    uint8_t     res192[192];
+    uint8_t     rsvd192[192];
     uint8_t     vs[3712];
 } NvmeIdNs;
 
-- 
2.25.0




* [PATCH v5 16/26] nvme: refactor prp mapping
       [not found]   ` <CGME20200204095225eucas1p226336a91fb5460dddae5caa85964279f@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 11:44       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Refactor nvme_map_prp and allow PRPs to be located in the CMB. The logic
ensures that if any of the PRPs are located in the CMB, all of them must
be, as required by the specification.

Also combine nvme_dma_{read,write}_prp into a single nvme_dma_prp that
takes an additional DMADirection parameter.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 245 +++++++++++++++++++++++++++---------------
 hw/block/nvme.h       |   2 +-
 hw/block/trace-events |   1 +
 include/block/nvme.h  |   1 +
 4 files changed, 160 insertions(+), 89 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 4acfc85b56a2..334265efb21e 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -58,6 +58,11 @@
 
 static void nvme_process_sq(void *opaque);
 
+static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
+{
+    return &n->cmbuf[addr - n->ctrl_mem.addr];
+}
+
 static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 {
     hwaddr low = n->ctrl_mem.addr;
@@ -152,138 +157,187 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
     }
 }
 
-static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
-                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
+static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
+    uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
     hwaddr trans_len = n->page_size - (prp1 % n->page_size);
     trans_len = MIN(len, trans_len);
     int num_prps = (len >> n->page_bits) + 1;
+    uint16_t status = NVME_SUCCESS;
+    bool is_cmb = false;
+    bool prp_list_in_cmb = false;
+
+    trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
+        prp1, prp2, num_prps);
 
     if (unlikely(!prp1)) {
         trace_nvme_dev_err_invalid_prp();
         return NVME_INVALID_FIELD | NVME_DNR;
-    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
-               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
-        qsg->nsg = 0;
+    }
+
+    if (nvme_addr_is_cmb(n, prp1)) {
+        is_cmb = true;
+
         qemu_iovec_init(iov, num_prps);
-        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
+
+        /*
+         * PRPs do not cross page boundaries, so if the start address (here,
+         * prp1) is within the CMB, it cannot cross outside the controller
+         * memory buffer range. This is ensured by
+         *
+         *   len = n->page_size - (addr % n->page_size)
+         *
+         * Thus, we can directly add to the iovec without risking an out of
+         * bounds access. This also holds for the remaining qemu_iovec_add
+         * calls.
+         */
+        qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp1), trans_len);
     } else {
         pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
         qemu_sglist_add(qsg, prp1, trans_len);
     }
+
     len -= trans_len;
     if (len) {
         if (unlikely(!prp2)) {
             trace_nvme_dev_err_invalid_prp2_missing();
+            status = NVME_INVALID_FIELD | NVME_DNR;
             goto unmap;
         }
+
         if (len > n->page_size) {
             uint64_t prp_list[n->max_prp_ents];
             uint32_t nents, prp_trans;
             int i = 0;
 
+            if (nvme_addr_is_cmb(n, prp2)) {
+                prp_list_in_cmb = true;
+            }
+
             nents = (len + n->page_size - 1) >> n->page_bits;
             prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
+            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
             while (len != 0) {
                 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
                 if (i == n->max_prp_ents - 1 && len > n->page_size) {
                     if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
                         trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
+                        status = NVME_INVALID_FIELD | NVME_DNR;
+                        goto unmap;
+                    }
+
+                    if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) {
+                        status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
                         goto unmap;
                     }
 
                     i = 0;
                     nents = (len + n->page_size - 1) >> n->page_bits;
                     prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-                    nvme_addr_read(n, prp_ent, (void *)prp_list,
-                        prp_trans);
+                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
                     prp_ent = le64_to_cpu(prp_list[i]);
                 }
 
                 if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
                     trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
+                    status = NVME_INVALID_FIELD | NVME_DNR;
+                    goto unmap;
+                }
+
+                if (is_cmb != nvme_addr_is_cmb(n, prp_ent)) {
+                    status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
                     goto unmap;
                 }
 
                 trans_len = MIN(len, n->page_size);
-                if (qsg->nsg){
-                    qemu_sglist_add(qsg, prp_ent, trans_len);
+                if (is_cmb) {
+                    qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp_ent),
+                        trans_len);
                 } else {
-                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
+                    qemu_sglist_add(qsg, prp_ent, trans_len);
                 }
+
                 len -= trans_len;
                 i++;
             }
         } else {
+            if (is_cmb != nvme_addr_is_cmb(n, prp2)) {
+                status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+                goto unmap;
+            }
+
             if (unlikely(prp2 & (n->page_size - 1))) {
                 trace_nvme_dev_err_invalid_prp2_align(prp2);
+                status = NVME_INVALID_FIELD | NVME_DNR;
                 goto unmap;
             }
-            if (qsg->nsg) {
+
+            if (is_cmb) {
+                qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp2), len);
+            } else {
                 qemu_sglist_add(qsg, prp2, len);
-            } else {
-                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
             }
         }
     }
+
     return NVME_SUCCESS;
 
- unmap:
-    qemu_sglist_destroy(qsg);
-    return NVME_INVALID_FIELD | NVME_DNR;
-}
-
-static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-                                   uint64_t prp1, uint64_t prp2)
-{
-    QEMUSGList qsg;
-    QEMUIOVector iov;
-    uint16_t status = NVME_SUCCESS;
-
-    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
-        return NVME_INVALID_FIELD | NVME_DNR;
-    }
-    if (qsg.nsg > 0) {
-        if (dma_buf_write(ptr, len, &qsg)) {
-            status = NVME_INVALID_FIELD | NVME_DNR;
-        }
-        qemu_sglist_destroy(&qsg);
+unmap:
+    if (is_cmb) {
+        qemu_iovec_destroy(iov);
     } else {
-        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
-            status = NVME_INVALID_FIELD | NVME_DNR;
-        }
-        qemu_iovec_destroy(&iov);
+        qemu_sglist_destroy(qsg);
     }
+
     return status;
 }
 
-static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-    uint64_t prp1, uint64_t prp2)
+static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
+    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
 {
     QEMUSGList qsg;
     QEMUIOVector iov;
     uint16_t status = NVME_SUCCESS;
+    size_t bytes;
 
-    trace_nvme_dev_dma_read(prp1, prp2);
-
-    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
-        return NVME_INVALID_FIELD | NVME_DNR;
+    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
+    if (status) {
+        return status;
     }
+
     if (qsg.nsg > 0) {
-        if (unlikely(dma_buf_read(ptr, len, &qsg))) {
+        uint64_t residual;
+
+        if (dir == DMA_DIRECTION_TO_DEVICE) {
+            residual = dma_buf_write(ptr, len, &qsg);
+        } else {
+            residual = dma_buf_read(ptr, len, &qsg);
+        }
+
+        if (unlikely(residual)) {
             trace_nvme_dev_err_invalid_dma();
             status = NVME_INVALID_FIELD | NVME_DNR;
         }
+
         qemu_sglist_destroy(&qsg);
+
+        return status;
+    }
+
+    if (dir == DMA_DIRECTION_TO_DEVICE) {
+        bytes = qemu_iovec_to_buf(&iov, 0, ptr, len);
     } else {
-        if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
-            trace_nvme_dev_err_invalid_dma();
-            status = NVME_INVALID_FIELD | NVME_DNR;
-        }
-        qemu_iovec_destroy(&iov);
+        bytes = qemu_iovec_from_buf(&iov, 0, ptr, len);
     }
+
+    if (unlikely(bytes != len)) {
+        trace_nvme_dev_err_invalid_dma();
+        status = NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    qemu_iovec_destroy(&iov);
+
     return status;
 }
 
@@ -420,16 +474,20 @@ static void nvme_rw_cb(void *opaque, int ret)
         block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
         req->status = NVME_INTERNAL_DEV_ERROR;
     }
-    if (req->has_sg) {
+
+    if (req->qsg.nalloc) {
         qemu_sglist_destroy(&req->qsg);
     }
+    if (req->iov.nalloc) {
+        qemu_iovec_destroy(&req->iov);
+    }
+
     nvme_enqueue_req_completion(cq, req);
 }
 
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
     NvmeRequest *req)
 {
-    req->has_sg = false;
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
          BLOCK_ACCT_FLUSH);
     req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
@@ -453,7 +511,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
-    req->has_sg = false;
     block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
                      BLOCK_ACCT_WRITE);
     req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
@@ -485,21 +542,24 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
         return NVME_LBA_RANGE | NVME_DNR;
     }
 
-    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
+    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
         block_acct_invalid(blk_get_stats(n->conf.blk), acct);
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
     if (req->qsg.nsg > 0) {
-        req->has_sg = true;
+        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
+            acct);
+
         req->aiocb = is_write ?
             dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
                           nvme_rw_cb, req) :
             dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
                          nvme_rw_cb, req);
     } else {
-        req->has_sg = false;
+        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
+            acct);
+
         req->aiocb = is_write ?
             blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
                             req) :
@@ -596,7 +656,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     sq->size = size;
     sq->cqid = cqid;
     sq->head = sq->tail = 0;
-    sq->io_req = g_new(NvmeRequest, sq->size);
+    sq->io_req = g_new0(NvmeRequest, sq->size);
 
     QTAILQ_INIT(&sq->req_list);
     QTAILQ_INIT(&sq->out_req_list);
@@ -704,8 +764,8 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
         nvme_clear_events(n, NVME_AER_TYPE_SMART);
     }
 
-    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
-        prp2);
+    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
+        prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
@@ -724,8 +784,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
 
     trans_len = MIN(sizeof(fw_log) - off, buf_len);
 
-    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
-        prp2);
+    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
+        prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
@@ -869,18 +929,20 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
+    NvmeRequest *req)
 {
     uint64_t prp1 = le64_to_cpu(c->prp1);
     uint64_t prp2 = le64_to_cpu(c->prp2);
 
     trace_nvme_dev_identify_ctrl();
 
-    return nvme_dma_read_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
-        prp1, prp2);
+    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
+        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
+    NvmeRequest *req)
 {
     NvmeNamespace *ns;
     uint32_t nsid = le32_to_cpu(c->nsid);
@@ -896,11 +958,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
 
     ns = &n->namespaces[nsid - 1];
 
-    return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
-        prp1, prp2);
+    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
+        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
+    NvmeRequest *req)
 {
     static const int data_len = 4 * KiB;
     uint32_t min_nsid = le32_to_cpu(c->nsid);
@@ -922,12 +985,14 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
             break;
         }
     }
-    ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
+    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
+        DMA_DIRECTION_FROM_DEVICE, req);
     g_free(list);
     return ret;
 }
 
-static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
+static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
+    NvmeRequest *req)
 {
     static const int len = 4096;
 
@@ -963,24 +1028,25 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
     list->nidl = 0x10;
     *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
 
-    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
+    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
+        DMA_DIRECTION_FROM_DEVICE, req);
     g_free(list);
     return ret;
 }
 
-static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
     NvmeIdentify *c = (NvmeIdentify *)cmd;
 
     switch (le32_to_cpu(c->cns)) {
     case 0x00:
-        return nvme_identify_ns(n, c);
+        return nvme_identify_ns(n, c, req);
     case 0x01:
-        return nvme_identify_ctrl(n, c);
+        return nvme_identify_ctrl(n, c, req);
     case 0x02:
-        return nvme_identify_ns_list(n, c);
+        return nvme_identify_ns_list(n, c, req);
     case 0x03:
-        return nvme_identify_ns_descr_list(n, cmd);
+        return nvme_identify_ns_descr_list(n, c, req);
     default:
         trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1039,15 +1105,16 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
     return cpu_to_le64(ts.all);
 }
 
-static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
+    NvmeRequest *req)
 {
     uint64_t prp1 = le64_to_cpu(cmd->prp1);
     uint64_t prp2 = le64_to_cpu(cmd->prp2);
 
     uint64_t timestamp = nvme_get_timestamp(n);
 
-    return nvme_dma_read_prp(n, (uint8_t *)&timestamp,
-                                 sizeof(timestamp), prp1, prp2);
+    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
+        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
@@ -1099,7 +1166,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         trace_nvme_dev_getfeat_numq(result);
         break;
     case NVME_TIMESTAMP:
-        return nvme_get_feature_timestamp(n, cmd);
+        return nvme_get_feature_timestamp(n, cmd, req);
     case NVME_INTERRUPT_COALESCING:
         result = cpu_to_le32(n->features.int_coalescing);
         break;
@@ -1125,15 +1192,16 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
+    NvmeRequest *req)
 {
     uint16_t ret;
     uint64_t timestamp;
     uint64_t prp1 = le64_to_cpu(cmd->prp1);
     uint64_t prp2 = le64_to_cpu(cmd->prp2);
 
-    ret = nvme_dma_write_prp(n, (uint8_t *)&timestamp,
-                                sizeof(timestamp), prp1, prp2);
+    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
+        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
     if (ret != NVME_SUCCESS) {
         return ret;
     }
@@ -1194,7 +1262,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
             ((n->params.num_queues - 2) << 16));
         break;
     case NVME_TIMESTAMP:
-        return nvme_set_feature_timestamp(n, cmd);
+        return nvme_set_feature_timestamp(n, cmd, req);
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
         break;
@@ -1246,7 +1314,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     case NVME_ADM_CMD_CREATE_CQ:
         return nvme_create_cq(n, cmd);
     case NVME_ADM_CMD_IDENTIFY:
-        return nvme_identify(n, cmd);
+        return nvme_identify(n, cmd, req);
     case NVME_ADM_CMD_ABORT:
         return nvme_abort(n, cmd, req);
     case NVME_ADM_CMD_SET_FEATURES:
@@ -1282,6 +1350,7 @@ static void nvme_process_sq(void *opaque)
         QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
         memset(&req->cqe, 0, sizeof(req->cqe));
         req->cqe.cid = cmd.cid;
+        memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
 
         status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
             nvme_admin_cmd(n, &cmd, req);
@@ -1804,7 +1873,7 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
 
     NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
     NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
-    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
+    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 1);
     NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
     NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
     NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 7ced5fd485a9..d27baa9d5391 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -27,11 +27,11 @@ typedef struct NvmeRequest {
     struct NvmeSQueue       *sq;
     BlockAIOCB              *aiocb;
     uint16_t                status;
-    bool                    has_sg;
     NvmeCqe                 cqe;
     BlockAcctCookie         acct;
     QEMUSGList              qsg;
     QEMUIOVector            iov;
+    NvmeCmd                 cmd;
     QTAILQ_ENTRY(NvmeRequest)entry;
 } NvmeRequest;
 
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 9e5a4548bde0..77aa0da99ee0 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -33,6 +33,7 @@ nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
 nvme_dev_irq_pin(void) "pulsing IRQ pin"
 nvme_dev_irq_masked(void) "IRQ is masked"
 nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
+nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
 nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
 nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 31eb9397d8c6..c1de92179596 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -427,6 +427,7 @@ enum NvmeStatusCodes {
     NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
     NVME_INVALID_NSID           = 0x000b,
     NVME_CMD_SEQ_ERROR          = 0x000c,
+    NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
     NVME_NS_NOT_READY           = 0x0082,
-- 
2.25.0




* [PATCH v5 17/26] nvme: allow multiple aios per command
       [not found]   ` <CGME20200204095226eucas1p2429f45a5e23fe6ed57dee293be5e1b44@eucas1p2.samsung.com>
@ 2020-02-04  9:51     ` Klaus Jensen
  2020-02-12 11:48       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:51 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

This refactors how the device issues asynchronous block backend
requests. The NvmeRequest now holds a queue of NvmeAIOs that are
associated with the command. This allows multiple aios to be issued for
a command. Only when all of its aios have completed will the device post
a completion queue entry.

Because the device is currently guaranteed to only issue a single aio
request per command, the benefit is not immediately obvious. But this
functionality is required to support metadata, the dataset management
command and other features.

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 449 +++++++++++++++++++++++++++++++++---------
 hw/block/nvme.h       | 134 +++++++++++--
 hw/block/trace-events |   8 +
 3 files changed, 480 insertions(+), 111 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 334265efb21e..e97da35c4ca1 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -19,7 +19,8 @@
  *      -drive file=<file>,if=none,id=<drive_id>
  *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
  *              cmb_size_mb=<cmb_size_mb[optional]>, \
- *              num_queues=<N[optional]>
+ *              num_queues=<N[optional]>, \
+ *              mdts=<mdts[optional]>
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -57,6 +58,7 @@
     } while (0)
 
 static void nvme_process_sq(void *opaque);
+static void nvme_aio_cb(void *opaque, int ret);
 
 static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
 {
@@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     return status;
 }
 
+static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    NvmeNamespace *ns = req->ns;
+
+    uint32_t len = req->nlb << nvme_ns_lbads(ns);
+    uint64_t prp1 = le64_to_cpu(cmd->prp1);
+    uint64_t prp2 = le64_to_cpu(cmd->prp2);
+
+    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
+}
+
+static void nvme_aio_destroy(NvmeAIO *aio)
+{
+    g_free(aio);
+}
+
+static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
+    NvmeAIOOp opc)
+{
+    aio->opc = opc;
+
+    trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
+        aio->offset, aio->len, nvme_aio_opc_str(aio), req);
+
+    if (req) {
+        QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
+    }
+}
+
+static void nvme_aio(NvmeAIO *aio)
+{
+    BlockBackend *blk = aio->blk;
+    BlockAcctCookie *acct = &aio->acct;
+    BlockAcctStats *stats = blk_get_stats(blk);
+
+    bool is_write, dma;
+
+    switch (aio->opc) {
+    case NVME_AIO_OPC_NONE:
+        break;
+
+    case NVME_AIO_OPC_FLUSH:
+        block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH);
+        aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio);
+        break;
+
+    case NVME_AIO_OPC_WRITE_ZEROES:
+        block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE);
+        aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len,
+            BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
+        break;
+
+    case NVME_AIO_OPC_READ:
+    case NVME_AIO_OPC_WRITE:
+        dma = aio->qsg != NULL;
+        is_write = (aio->opc == NVME_AIO_OPC_WRITE);
+
+        block_acct_start(stats, acct, aio->len,
+            is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
+
+        if (dma) {
+            aio->aiocb = is_write ?
+                dma_blk_write(blk, aio->qsg, aio->offset,
+                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio) :
+                dma_blk_read(blk, aio->qsg, aio->offset,
+                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio);
+
+            return;
+        }
+
+        aio->aiocb = is_write ?
+            blk_aio_pwritev(blk, aio->offset, aio->iov, 0,
+                nvme_aio_cb, aio) :
+            blk_aio_preadv(blk, aio->offset, aio->iov, 0,
+                nvme_aio_cb, aio);
+
+        break;
+    }
+}
+
+static void nvme_rw_aio(BlockBackend *blk, uint64_t offset, NvmeRequest *req)
+{
+    NvmeAIO *aio;
+    size_t len = req->qsg.nsg > 0 ? req->qsg.size : req->iov.size;
+
+    aio = g_new0(NvmeAIO, 1);
+
+    *aio = (NvmeAIO) {
+        .blk = blk,
+        .offset = offset,
+        .len = len,
+        .req = req,
+        .qsg = &req->qsg,
+        .iov = &req->iov,
+    };
+
+    nvme_req_register_aio(req, aio, nvme_req_is_write(req) ?
+        NVME_AIO_OPC_WRITE : NVME_AIO_OPC_READ);
+    nvme_aio(aio);
+}
+
 static void nvme_post_cqes(void *opaque)
 {
     NvmeCQueue *cq = opaque;
@@ -364,6 +467,7 @@ static void nvme_post_cqes(void *opaque)
         nvme_inc_cq_tail(cq);
         pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
             sizeof(req->cqe));
+        nvme_req_clear(req);
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
@@ -374,8 +478,8 @@ static void nvme_post_cqes(void *opaque)
 static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
 {
     assert(cq->cqid == req->sq->cqid);
-    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid,
-        req->status);
+    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid, req->status);
+
     QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
     QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
     timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
@@ -460,135 +564,272 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
     }
 }
 
-static void nvme_rw_cb(void *opaque, int ret)
+static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len,
+    NvmeRequest *req)
+{
+    uint8_t mdts = n->params.mdts;
+
+    if (mdts && len > n->page_size << mdts) {
+        trace_nvme_dev_err_mdts(nvme_cid(req), n->page_size << mdts, len);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
+    NvmeNamespace *ns = req->ns;
+
+    uint16_t ctrl = le16_to_cpu(rw->control);
+
+    if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) {
+        trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
+    uint32_t nlb, NvmeRequest *req)
+{
+    NvmeNamespace *ns = req->ns;
+    uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
+
+    if (unlikely((slba + nlb) > nsze)) {
+        block_acct_invalid(blk_get_stats(n->conf.blk),
+            nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
+        trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
+        return NVME_LBA_RANGE | NVME_DNR;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
+{
+    NvmeNamespace *ns = req->ns;
+    size_t len = req->nlb << nvme_ns_lbads(ns);
+    uint16_t status;
+
+    status = nvme_check_mdts(n, len, req);
+    if (status) {
+        return status;
+    }
+
+    status = nvme_check_prinfo(n, req);
+    if (status) {
+        return status;
+    }
+
+    status = nvme_check_bounds(n, req->slba, req->nlb, req);
+    if (status) {
+        return status;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static void nvme_rw_cb(NvmeRequest *req, void *opaque)
 {
-    NvmeRequest *req = opaque;
     NvmeSQueue *sq = req->sq;
     NvmeCtrl *n = sq->ctrl;
     NvmeCQueue *cq = n->cq[sq->cqid];
 
-    if (!ret) {
-        block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
-        req->status = NVME_SUCCESS;
-    } else {
-        block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
-        req->status = NVME_INTERNAL_DEV_ERROR;
-    }
-
-    if (req->qsg.nalloc) {
-        qemu_sglist_destroy(&req->qsg);
-    }
-    if (req->iov.nalloc) {
-        qemu_iovec_destroy(&req->iov);
-    }
+    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
 
     nvme_enqueue_req_completion(cq, req);
 }
 
-static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
-    NvmeRequest *req)
+static void nvme_aio_cb(void *opaque, int ret)
 {
-    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
-         BLOCK_ACCT_FLUSH);
-    req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
+    NvmeAIO *aio = opaque;
+    NvmeRequest *req = aio->req;
 
-    return NVME_NO_COMPLETE;
-}
+    BlockBackend *blk = aio->blk;
+    BlockAcctCookie *acct = &aio->acct;
+    BlockAcctStats *stats = blk_get_stats(blk);
 
-static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
-    NvmeRequest *req)
-{
-    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
-    const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
-    const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
-    uint64_t slba = le64_to_cpu(rw->slba);
-    uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
-    uint64_t offset = slba << data_shift;
-    uint32_t count = nlb << data_shift;
-
-    if (unlikely(slba + nlb > ns->id_ns.nsze)) {
-        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
-        return NVME_LBA_RANGE | NVME_DNR;
-    }
-
-    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
-                     BLOCK_ACCT_WRITE);
-    req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
-                                        BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
-    return NVME_NO_COMPLETE;
-}
-
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
-    NvmeRequest *req)
-{
-    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
-    uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
-    uint64_t slba = le64_to_cpu(rw->slba);
-    uint64_t prp1 = le64_to_cpu(rw->prp1);
-    uint64_t prp2 = le64_to_cpu(rw->prp2);
-
-    uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
-    uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
-    uint64_t data_size = (uint64_t)nlb << data_shift;
-    uint64_t data_offset = slba << data_shift;
-    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
-    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
+    Error *local_err = NULL;
 
-    trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
+    trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
+        nvme_aio_opc_str(aio), req);
 
-    if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
-        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
-        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
-        return NVME_LBA_RANGE | NVME_DNR;
+    if (req) {
+        QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
     }
 
-    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
-        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
-        return NVME_INVALID_FIELD | NVME_DNR;
-    }
-
-    if (req->qsg.nsg > 0) {
-        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
-            acct);
-
-        req->aiocb = is_write ?
-            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
-                          nvme_rw_cb, req) :
-            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
-                         nvme_rw_cb, req);
+    if (!ret) {
+        block_acct_done(stats, acct);
     } else {
-        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
-            acct);
+        block_acct_failed(stats, acct);
 
-        req->aiocb = is_write ?
-            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
-                            req) :
-            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
-                           req);
+        if (req) {
+            uint16_t status;
+
+            switch (aio->opc) {
+            case NVME_AIO_OPC_READ:
+                status = NVME_UNRECOVERED_READ;
+                break;
+            case NVME_AIO_OPC_WRITE:
+            case NVME_AIO_OPC_WRITE_ZEROES:
+                status = NVME_WRITE_FAULT;
+                break;
+            default:
+                status = NVME_INTERNAL_DEV_ERROR;
+                break;
+            }
+
+            trace_nvme_dev_err_aio(nvme_cid(req), aio, blk_name(blk),
+                aio->offset, nvme_aio_opc_str(aio), req, status);
+
+            error_setg_errno(&local_err, -ret, "aio failed");
+            error_report_err(local_err);
+
+            /*
+             * An Internal Error trumps all other errors. For other errors,
+             * only set the first error encountered. Any additional errors will
+             * be recorded in the error information log page.
+             */
+            if (!req->status ||
+                nvme_status_is_error(status, NVME_INTERNAL_DEV_ERROR)) {
+                req->status = status;
+            }
+        }
+    }
+
+    if (aio->cb) {
+        aio->cb(aio, aio->cb_arg, ret);
+    }
+
+    if (req && QTAILQ_EMPTY(&req->aio_tailq)) {
+        if (req->cb) {
+            req->cb(req, req->cb_arg);
+        } else {
+            NvmeSQueue *sq = req->sq;
+            NvmeCtrl *n = sq->ctrl;
+            NvmeCQueue *cq = n->cq[sq->cqid];
+
+            nvme_enqueue_req_completion(cq, req);
+        }
     }
 
+    nvme_aio_destroy(aio);
+}
+
+static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    NvmeAIO *aio = g_new0(NvmeAIO, 1);
+
+    *aio = (NvmeAIO) {
+        .blk = n->conf.blk,
+        .req = req,
+    };
+
+    nvme_req_register_aio(req, aio, NVME_AIO_OPC_FLUSH);
+    nvme_aio(aio);
+
+    return NVME_NO_COMPLETE;
+}
+
+static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    NvmeAIO *aio;
+
+    NvmeNamespace *ns = req->ns;
+    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
+
+    int64_t offset;
+    size_t count;
+    uint16_t status;
+
+    req->slba = le64_to_cpu(rw->slba);
+    req->nlb  = le16_to_cpu(rw->nlb) + 1;
+
+    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
+        req->slba, req->nlb);
+
+    status = nvme_check_bounds(n, req->slba, req->nlb, req);
+    if (unlikely(status)) {
+        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
+        return status;
+    }
+
+    offset = req->slba << nvme_ns_lbads(ns);
+    count = req->nlb << nvme_ns_lbads(ns);
+
+    aio = g_new0(NvmeAIO, 1);
+
+    *aio = (NvmeAIO) {
+        .blk = n->conf.blk,
+        .offset = offset,
+        .len = count,
+        .req = req,
+    };
+
+    nvme_req_register_aio(req, aio, NVME_AIO_OPC_WRITE_ZEROES);
+    nvme_aio(aio);
+
+    return NVME_NO_COMPLETE;
+}
+
+static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+{
+    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
+    NvmeNamespace *ns = req->ns;
+    int status;
+
+    enum BlockAcctType acct =
+        nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
+
+    req->nlb  = le16_to_cpu(rw->nlb) + 1;
+    req->slba = le64_to_cpu(rw->slba);
+
+    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
+        req->nlb << nvme_ns_lbads(req->ns), req->slba);
+
+    status = nvme_check_rw(n, req);
+    if (status) {
+        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
+        return status;
+    }
+
+    status = nvme_map(n, cmd, req);
+    if (status) {
+        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
+        return status;
+    }
+
+    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
+    nvme_req_set_cb(req, nvme_rw_cb, NULL);
+
     return NVME_NO_COMPLETE;
 }
 
 static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
-    NvmeNamespace *ns;
     uint32_t nsid = le32_to_cpu(cmd->nsid);
 
+    trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
+        cmd->opcode);
+
     if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
         trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
         return NVME_INVALID_NSID | NVME_DNR;
     }
 
-    ns = &n->namespaces[nsid - 1];
+    req->ns = &n->namespaces[nsid - 1];
+
     switch (cmd->opcode) {
     case NVME_CMD_FLUSH:
-        return nvme_flush(n, ns, cmd, req);
+        return nvme_flush(n, cmd, req);
     case NVME_CMD_WRITE_ZEROS:
-        return nvme_write_zeros(n, ns, cmd, req);
+        return nvme_write_zeros(n, cmd, req);
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
-        return nvme_rw(n, ns, cmd, req);
+        return nvme_rw(n, cmd, req);
     default:
         trace_nvme_dev_err_invalid_opc(cmd->opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
@@ -612,6 +853,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
     NvmeRequest *req, *next;
     NvmeSQueue *sq;
     NvmeCQueue *cq;
+    NvmeAIO *aio;
     uint16_t qid = le16_to_cpu(c->qid);
 
     if (unlikely(!qid || nvme_check_sqid(n, qid))) {
@@ -624,8 +866,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
     sq = n->sq[qid];
     while (!QTAILQ_EMPTY(&sq->out_req_list)) {
         req = QTAILQ_FIRST(&sq->out_req_list);
-        assert(req->aiocb);
-        blk_aio_cancel(req->aiocb);
+        while (!QTAILQ_EMPTY(&req->aio_tailq)) {
+            aio = QTAILQ_FIRST(&req->aio_tailq);
+            assert(aio->aiocb);
+            blk_aio_cancel(aio->aiocb);
+        }
     }
     if (!nvme_check_cqid(n, sq->cqid)) {
         cq = n->cq[sq->cqid];
@@ -662,6 +907,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     QTAILQ_INIT(&sq->out_req_list);
     for (i = 0; i < sq->size; i++) {
         sq->io_req[i].sq = sq;
+        QTAILQ_INIT(&(sq->io_req[i].aio_tailq));
         QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
     }
     sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
@@ -800,6 +1046,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     uint32_t numdl, numdu;
     uint64_t off, lpol, lpou;
     size_t   len;
+    uint16_t status;
 
     numdl = (dw10 >> 16);
     numdu = (dw11 & 0xffff);
@@ -815,6 +1062,11 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 
     trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
 
+    status = nvme_check_mdts(n, len, req);
+    if (status) {
+        return status;
+    }
+
     switch (lid) {
     case NVME_LOG_ERROR_INFO:
         if (!rae) {
@@ -1348,7 +1600,7 @@ static void nvme_process_sq(void *opaque)
         req = QTAILQ_FIRST(&sq->req_list);
         QTAILQ_REMOVE(&sq->req_list, req, entry);
         QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
-        memset(&req->cqe, 0, sizeof(req->cqe));
+
         req->cqe.cid = cmd.cid;
         memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
 
@@ -1928,6 +2180,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
     id->ieee[0] = 0x00;
     id->ieee[1] = 0x02;
     id->ieee[2] = 0xb3;
+    id->mdts = params->mdts;
     id->ver = cpu_to_le32(NVME_SPEC_VER);
     id->oacs = cpu_to_le16(0);
 
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index d27baa9d5391..3319f8edd7e1 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -8,7 +8,8 @@
     DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
     DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
     DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
-    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64)
+    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \
+    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
 
 typedef struct NvmeParams {
     char     *serial;
@@ -16,6 +17,7 @@ typedef struct NvmeParams {
     uint32_t cmb_size_mb;
     uint8_t  aerl;
     uint32_t aer_max_queued;
+    uint8_t  mdts;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
@@ -23,17 +25,58 @@ typedef struct NvmeAsyncEvent {
     NvmeAerResult result;
 } NvmeAsyncEvent;
 
-typedef struct NvmeRequest {
-    struct NvmeSQueue       *sq;
-    BlockAIOCB              *aiocb;
-    uint16_t                status;
-    NvmeCqe                 cqe;
-    BlockAcctCookie         acct;
-    QEMUSGList              qsg;
-    QEMUIOVector            iov;
-    NvmeCmd                 cmd;
-    QTAILQ_ENTRY(NvmeRequest)entry;
-} NvmeRequest;
+typedef struct NvmeRequest NvmeRequest;
+typedef void NvmeRequestCompletionFunc(NvmeRequest *req, void *opaque);
+
+struct NvmeRequest {
+    struct NvmeSQueue    *sq;
+    struct NvmeNamespace *ns;
+
+    NvmeCqe  cqe;
+    NvmeCmd  cmd;
+    uint16_t status;
+
+    uint64_t slba;
+    uint32_t nlb;
+
+    QEMUSGList   qsg;
+    QEMUIOVector iov;
+
+    NvmeRequestCompletionFunc *cb;
+    void                      *cb_arg;
+
+    QTAILQ_HEAD(, NvmeAIO)    aio_tailq;
+    QTAILQ_ENTRY(NvmeRequest) entry;
+};
+
+static inline void nvme_req_clear(NvmeRequest *req)
+{
+    req->ns = NULL;
+    memset(&req->cqe, 0, sizeof(req->cqe));
+    req->status = NVME_SUCCESS;
+    req->slba = req->nlb = 0x0;
+    req->cb = req->cb_arg = NULL;
+
+    if (req->qsg.sg) {
+        qemu_sglist_destroy(&req->qsg);
+    }
+
+    if (req->iov.iov) {
+        qemu_iovec_destroy(&req->iov);
+    }
+}
+
+static inline void nvme_req_set_cb(NvmeRequest *req,
+    NvmeRequestCompletionFunc *cb, void *cb_arg)
+{
+    req->cb = cb;
+    req->cb_arg = cb_arg;
+}
+
+static inline void nvme_req_clear_cb(NvmeRequest *req)
+{
+    req->cb = req->cb_arg = NULL;
+}
 
 typedef struct NvmeSQueue {
     struct NvmeCtrl *ctrl;
@@ -85,6 +128,60 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
     return 1 << nvme_ns_lbads(ns);
 }
 
+typedef enum NvmeAIOOp {
+    NVME_AIO_OPC_NONE         = 0x0,
+    NVME_AIO_OPC_FLUSH        = 0x1,
+    NVME_AIO_OPC_READ         = 0x2,
+    NVME_AIO_OPC_WRITE        = 0x3,
+    NVME_AIO_OPC_WRITE_ZEROES = 0x4,
+} NvmeAIOOp;
+
+typedef struct NvmeAIO NvmeAIO;
+typedef void NvmeAIOCompletionFunc(NvmeAIO *aio, void *opaque, int ret);
+
+struct NvmeAIO {
+    NvmeRequest *req;
+
+    NvmeAIOOp       opc;
+    int64_t         offset;
+    size_t          len;
+    BlockBackend    *blk;
+    BlockAIOCB      *aiocb;
+    BlockAcctCookie acct;
+
+    NvmeAIOCompletionFunc *cb;
+    void                  *cb_arg;
+
+    QEMUSGList   *qsg;
+    QEMUIOVector *iov;
+
+    QTAILQ_ENTRY(NvmeAIO) tailq_entry;
+};
+
+static inline const char *nvme_aio_opc_str(NvmeAIO *aio)
+{
+    switch (aio->opc) {
+    case NVME_AIO_OPC_NONE:         return "NVME_AIO_OP_NONE";
+    case NVME_AIO_OPC_FLUSH:        return "NVME_AIO_OP_FLUSH";
+    case NVME_AIO_OPC_READ:         return "NVME_AIO_OP_READ";
+    case NVME_AIO_OPC_WRITE:        return "NVME_AIO_OP_WRITE";
+    case NVME_AIO_OPC_WRITE_ZEROES: return "NVME_AIO_OP_WRITE_ZEROES";
+    default:                        return "NVME_AIO_OP_UNKNOWN";
+    }
+}
+
+static inline bool nvme_req_is_write(NvmeRequest *req)
+{
+    switch (req->cmd.opcode) {
+    case NVME_CMD_WRITE:
+    case NVME_CMD_WRITE_UNCOR:
+    case NVME_CMD_WRITE_ZEROS:
+        return true;
+    default:
+        return false;
+    }
+}
+
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
         OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -139,10 +236,21 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
 static inline uint16_t nvme_cid(NvmeRequest *req)
 {
     if (req) {
-        return le16_to_cpu(req->cqe.cid);
+        return le16_to_cpu(req->cmd.cid);
     }
 
     return 0xffff;
 }
 
+static inline bool nvme_status_is_error(uint16_t status, uint16_t err)
+{
+    /* strip DNR and MORE */
+    return (status & 0xfff) == err;
+}
+
+static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
+{
+    return req->sq->ctrl;
+}
+
 #endif /* HW_NVME_H */
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 77aa0da99ee0..90a57fb6099a 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -34,7 +34,12 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
 nvme_dev_irq_masked(void) "IRQ is masked"
 nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
 nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
+nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p"
+nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
+nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
 nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
+nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
+nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
 nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
 nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
 nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
@@ -75,6 +80,9 @@ nvme_dev_mmio_shutdown_set(void) "shutdown bit set"
 nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
 
 # nvme traces for error conditions
+nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
+nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
+nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
 nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
 nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
 nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 18/26] nvme: use preallocated qsg/iov in nvme_dma_prp
       [not found]   ` <CGME20200204095227eucas1p2f23061d391e67f4d3bde8bab74d1e44b@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-12 11:49       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Since cleanup of the request qsg/iov has been moved to the common
nvme_enqueue_req_completion function, there is no need to use a
stack-allocated qsg/iov in nvme_dma_prp.
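
The ownership change can be sketched in isolation. The types below are
hypothetical stand-ins for the QEMU structures, not the actual device code;
the point is only that the DMA helper fills a request-owned buffer and a
single central clear function frees it, instead of each helper managing a
stack-allocated copy:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for QEMUIOVector: only what the sketch needs. */
typedef struct {
    void  *base;   /* non-NULL once initialized */
    size_t size;
} IOVec;

typedef struct {
    IOVec iov;     /* preallocated per request, reused across commands */
} Request;

/* DMA helper fills the request-owned iovec; it no longer frees it. */
static void dma_fill(Request *req, size_t len)
{
    req->iov.base = malloc(len);
    req->iov.size = len;
}

/* Central completion path owns the cleanup, mirroring nvme_req_clear(). */
static void req_clear(Request *req)
{
    if (req->iov.base) {
        free(req->iov.base);
        req->iov.base = NULL;
        req->iov.size = 0;
    }
}
```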

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index e97da35c4ca1..f8c81b9e2202 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -298,23 +298,21 @@ unmap:
 static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
 {
-    QEMUSGList qsg;
-    QEMUIOVector iov;
     uint16_t status = NVME_SUCCESS;
     size_t bytes;
 
-    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
+    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
     if (status) {
         return status;
     }
 
-    if (qsg.nsg > 0) {
+    if (req->qsg.nsg > 0) {
         uint64_t residual;
 
         if (dir == DMA_DIRECTION_TO_DEVICE) {
-            residual = dma_buf_write(ptr, len, &qsg);
+            residual = dma_buf_write(ptr, len, &req->qsg);
         } else {
-            residual = dma_buf_read(ptr, len, &qsg);
+            residual = dma_buf_read(ptr, len, &req->qsg);
         }
 
         if (unlikely(residual)) {
@@ -322,15 +320,13 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
             status = NVME_INVALID_FIELD | NVME_DNR;
         }
 
-        qemu_sglist_destroy(&qsg);
-
         return status;
     }
 
     if (dir == DMA_DIRECTION_TO_DEVICE) {
-        bytes = qemu_iovec_to_buf(&iov, 0, ptr, len);
+        bytes = qemu_iovec_to_buf(&req->iov, 0, ptr, len);
     } else {
-        bytes = qemu_iovec_from_buf(&iov, 0, ptr, len);
+        bytes = qemu_iovec_from_buf(&req->iov, 0, ptr, len);
     }
 
     if (unlikely(bytes != len)) {
@@ -338,8 +334,6 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
         status = NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    qemu_iovec_destroy(&iov);
-
     return status;
 }
 
-- 
2.25.0




* [PATCH v5 19/26] pci: pass along the return value of dma_memory_rw
       [not found]   ` <CGME20200204095227eucas1p2d86cd6abcb66327dc112d58c83664139@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  0 siblings, 0 replies; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

The nvme device needs to know the return value of dma_memory_rw to pass
block/011 from blktests. So pass it along instead of ignoring it.

There are no existing users of the return value, so this patch should be
safe.
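
The behavioral difference is small but worth spelling out: before the
patch the wrapper always reported success, so a DMA failure was invisible
to the caller. A minimal standalone sketch (stub functions, not the real
pci/dma API):

```c
#include <assert.h>

/* Hypothetical stand-in for dma_memory_rw(): returns nonzero on failure,
 * e.g. when bus mastering has been disabled mid-transfer. */
static int dma_memory_rw_stub(int fail)
{
    return fail ? -1 : 0;
}

/* Before the patch: the result was discarded and 0 returned always. */
static int pci_dma_rw_old(int fail)
{
    dma_memory_rw_stub(fail);
    return 0;
}

/* After the patch: the result is passed along to the caller. */
static int pci_dma_rw_new(int fail)
{
    return dma_memory_rw_stub(fail);
}
```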

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@redhat.com>
Reviewed-by: Michael S. Tsirkin <mst@redhat.com>
---
 include/hw/pci/pci.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index 2acd8321af2f..b5013b834b20 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -783,8 +783,7 @@ static inline AddressSpace *pci_get_address_space(PCIDevice *dev)
 static inline int pci_dma_rw(PCIDevice *dev, dma_addr_t addr,
                              void *buf, dma_addr_t len, DMADirection dir)
 {
-    dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
-    return 0;
+    return dma_memory_rw(pci_get_address_space(dev), addr, buf, len, dir);
 }
 
 static inline int pci_dma_read(PCIDevice *dev, dma_addr_t addr,
-- 
2.25.0




* [PATCH v5 20/26] nvme: handle dma errors
       [not found]   ` <CGME20200204095228eucas1p2878eb150a933bb196fe5ca10a0b76eaf@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-12 11:52       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Handling DMA errors gracefully is required for the device to pass the
block/011 test ("disable PCI device while doing I/O") in the blktests
suite.

With this patch the device passes the test by retrying "critical"
transfers (posting of completion entries and processing of submission
queue entries).

If DMA errors occur at any other point in the execution of the command
(say, while mapping the PRPs), the command is aborted with a Data
Transfer Error status code.
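
The retry idea for the critical transfers can be sketched as follows.
This is a hypothetical standalone model, not the device code: on a failed
DMA write of a completion entry, the queue state is left untouched and the
work is re-armed (standing in for timer_mod with a 100 ms delay), so the
same entry is retried on the next pass.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int  head;          /* next entry to post */
    int  posted;        /* entries successfully written to guest memory */
    bool timer_armed;   /* stands in for timer_mod(cq->timer, ...) */
} Queue;

/* Stand-in for pci_dma_write(): 0 on success, nonzero on DMA error. */
static int dma_write(bool ok)
{
    return ok ? 0 : -1;
}

static void post_one(Queue *q, bool dma_ok)
{
    if (dma_write(dma_ok)) {
        /* Do not consume the entry; re-arm and retry later. */
        q->timer_armed = true;
        return;
    }
    q->head++;
    q->posted++;
}
```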

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c       | 42 +++++++++++++++++++++++++++++++++---------
 hw/block/trace-events |  2 ++
 include/block/nvme.h  |  2 +-
 3 files changed, 36 insertions(+), 10 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index f8c81b9e2202..204ae1d33234 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -73,14 +73,14 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
     return addr >= low && addr < hi;
 }
 
-static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
+static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
     if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
         memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
-        return;
+        return 0;
     }
 
-    pci_dma_read(&n->parent_obj, addr, buf, size);
+    return pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
@@ -168,6 +168,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
     uint16_t status = NVME_SUCCESS;
     bool is_cmb = false;
     bool prp_list_in_cmb = false;
+    int ret;
 
     trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
         prp1, prp2, num_prps);
@@ -218,7 +219,12 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
 
             nents = (len + n->page_size - 1) >> n->page_bits;
             prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
+            ret = nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
+            if (ret) {
+                trace_nvme_dev_err_addr_read(prp2);
+                status = NVME_DATA_TRANSFER_ERROR;
+                goto unmap;
+            }
             while (len != 0) {
                 uint64_t prp_ent = le64_to_cpu(prp_list[i]);
 
@@ -237,7 +243,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
                     i = 0;
                     nents = (len + n->page_size - 1) >> n->page_bits;
                     prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
-                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
+                    ret = nvme_addr_read(n, prp_ent, (void *) prp_list,
+                        prp_trans);
+                    if (ret) {
+                        trace_nvme_dev_err_addr_read(prp_ent);
+                        status = NVME_DATA_TRANSFER_ERROR;
+                        goto unmap;
+                    }
                     prp_ent = le64_to_cpu(prp_list[i]);
                 }
 
@@ -443,6 +455,7 @@ static void nvme_post_cqes(void *opaque)
     NvmeCQueue *cq = opaque;
     NvmeCtrl *n = cq->ctrl;
     NvmeRequest *req, *next;
+    int ret;
 
     QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
         NvmeSQueue *sq;
@@ -452,15 +465,21 @@ static void nvme_post_cqes(void *opaque)
             break;
         }
 
-        QTAILQ_REMOVE(&cq->req_list, req, entry);
         sq = req->sq;
         req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
         req->cqe.sq_id = cpu_to_le16(sq->sqid);
         req->cqe.sq_head = cpu_to_le16(sq->head);
         addr = cq->dma_addr + cq->tail * n->cqe_size;
-        nvme_inc_cq_tail(cq);
-        pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
+        ret = pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
             sizeof(req->cqe));
+        if (ret) {
+            trace_nvme_dev_err_addr_write(addr);
+            timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
+                100 * SCALE_MS);
+            break;
+        }
+        QTAILQ_REMOVE(&cq->req_list, req, entry);
+        nvme_inc_cq_tail(cq);
         nvme_req_clear(req);
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
@@ -1588,7 +1607,12 @@ static void nvme_process_sq(void *opaque)
 
     while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
         addr = sq->dma_addr + sq->head * n->sqe_size;
-        nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
+        if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
+            trace_nvme_dev_err_addr_read(addr);
+            timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
+                100 * SCALE_MS);
+            break;
+        }
         nvme_inc_sq_head(sq);
 
         req = QTAILQ_FIRST(&sq->req_list);
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 90a57fb6099a..09bfb3782dd0 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -83,6 +83,8 @@ nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
 nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
 nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
 nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
+nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
+nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
 nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
 nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
 nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index c1de92179596..a873776d98b8 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -418,7 +418,7 @@ enum NvmeStatusCodes {
     NVME_INVALID_OPCODE         = 0x0001,
     NVME_INVALID_FIELD          = 0x0002,
     NVME_CID_CONFLICT           = 0x0003,
-    NVME_DATA_TRAS_ERROR        = 0x0004,
+    NVME_DATA_TRANSFER_ERROR    = 0x0004,
     NVME_POWER_LOSS_ABORT       = 0x0005,
     NVME_INTERNAL_DEV_ERROR     = 0x0006,
     NVME_CMD_ABORT_REQ          = 0x0007,
-- 
2.25.0




* [PATCH v5 21/26] nvme: add support for scatter gather lists
       [not found]   ` <CGME20200204095229eucas1p2b290e3603d73c129a4f6149805273705@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-12 12:07       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

For now, support the Data Block, Segment and Last Segment descriptor
types.

See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Acked-by: Fam Zheng <fam@euphon.net>
---
 block/nvme.c          |  18 +-
 hw/block/nvme.c       | 375 +++++++++++++++++++++++++++++++++++-------
 hw/block/trace-events |   4 +
 include/block/nvme.h  |  62 ++++++-
 4 files changed, 389 insertions(+), 70 deletions(-)

diff --git a/block/nvme.c b/block/nvme.c
index d41c4bda6e39..521f521054d5 100644
--- a/block/nvme.c
+++ b/block/nvme.c
@@ -446,7 +446,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
         error_setg(errp, "Cannot map buffer for DMA");
         goto out;
     }
-    cmd.prp1 = cpu_to_le64(iova);
+    cmd.dptr.prp.prp1 = cpu_to_le64(iova);
 
     if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
         error_setg(errp, "Failed to identify controller");
@@ -545,7 +545,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
     }
     cmd = (NvmeCmd) {
         .opcode = NVME_ADM_CMD_CREATE_CQ,
-        .prp1 = cpu_to_le64(q->cq.iova),
+        .dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
         .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
         .cdw11 = cpu_to_le32(0x3),
     };
@@ -556,7 +556,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
     }
     cmd = (NvmeCmd) {
         .opcode = NVME_ADM_CMD_CREATE_SQ,
-        .prp1 = cpu_to_le64(q->sq.iova),
+        .dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
         .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
         .cdw11 = cpu_to_le32(0x1 | (n << 16)),
     };
@@ -906,16 +906,16 @@ try_map:
     case 0:
         abort();
     case 1:
-        cmd->prp1 = pagelist[0];
-        cmd->prp2 = 0;
+        cmd->dptr.prp.prp1 = pagelist[0];
+        cmd->dptr.prp.prp2 = 0;
         break;
     case 2:
-        cmd->prp1 = pagelist[0];
-        cmd->prp2 = pagelist[1];
+        cmd->dptr.prp.prp1 = pagelist[0];
+        cmd->dptr.prp.prp2 = pagelist[1];
         break;
     default:
-        cmd->prp1 = pagelist[0];
-        cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
+        cmd->dptr.prp.prp1 = pagelist[0];
+        cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
         break;
     }
     trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 204ae1d33234..a91c60fdc111 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -75,8 +75,10 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
 
 static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
 {
-    if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
-        memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
+    hwaddr hi = addr + size;
+
+    if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
+        memcpy(buf, nvme_addr_to_cmb(n, addr), size);
         return 0;
     }
 
@@ -159,6 +161,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
     }
 }
 
+static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
+    size_t len)
+{
+    if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
+        return NVME_DATA_TRANSFER_ERROR;
+    }
+
+    qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
+    hwaddr addr, size_t len)
+{
+    bool addr_is_cmb = nvme_addr_is_cmb(n, addr);
+
+    if (addr_is_cmb) {
+        if (qsg->sg) {
+            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+        }
+
+        if (!iov->iov) {
+            qemu_iovec_init(iov, 1);
+        }
+
+        return nvme_map_addr_cmb(n, iov, addr, len);
+    }
+
+    if (iov->iov) {
+        return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+    }
+
+    if (!qsg->sg) {
+        pci_dma_sglist_init(qsg, &n->parent_obj, 1);
+    }
+
+    qemu_sglist_add(qsg, addr, len);
+
+    return NVME_SUCCESS;
+}
+
 static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
     uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
 {
@@ -307,15 +351,240 @@ unmap:
     return status;
 }
 
-static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
+static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
+    QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
+    uint32_t *len, NvmeRequest *req)
+{
+    dma_addr_t addr, trans_len;
+    uint32_t length;
+    uint16_t status;
+
+    for (int i = 0; i < nsgld; i++) {
+        uint8_t type = NVME_SGL_TYPE(segment[i].type);
+
+        if (type != NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
+            switch (type) {
+            case NVME_SGL_DESCR_TYPE_BIT_BUCKET:
+            case NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK:
+                return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
+            default:
+                break;
+            }
+
+            return NVME_INVALID_NUM_SGL_DESCRIPTORS | NVME_DNR;
+        }
+
+        if (*len == 0) {
+            if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
+                trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
+                return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+            }
+
+            break;
+        }
+
+        addr = le64_to_cpu(segment[i].addr);
+        length = le32_to_cpu(segment[i].len);
+
+        if (!length) {
+            continue;
+        }
+
+        if (UINT64_MAX - addr < length) {
+            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+        }
+
+        trans_len = MIN(*len, length);
+
+        status = nvme_map_addr(n, qsg, iov, addr, trans_len);
+        if (status) {
+            return status;
+        }
+
+        *len -= trans_len;
+    }
+
+    return NVME_SUCCESS;
+}
+
+static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
+    NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
+{
+    const int MAX_NSGLD = 256;
+
+    NvmeSglDescriptor segment[MAX_NSGLD], *sgld, *last_sgld;
+    uint64_t nsgld;
+    uint32_t length;
+    uint16_t status;
+    bool sgl_in_cmb = false;
+    hwaddr addr;
+    int ret;
+
+    sgld = &sgl;
+    addr = le64_to_cpu(sgl.addr);
+
+    trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), req->nlb,
+        len);
+
+    /*
+     * If the entire transfer can be described with a single data block it can
+     * be mapped directly.
+     */
+    if (NVME_SGL_TYPE(sgl.type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
+        status = nvme_map_sgl_data(n, qsg, iov, sgld, 1, &len, req);
+        if (status) {
+            goto unmap;
+        }
+
+        goto out;
+    }
+
+    /*
+     * If the segment is located in the CMB, the submission queue of the
+     * request must also reside there.
+     */
+    if (nvme_addr_is_cmb(n, addr)) {
+        if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
+            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
+        }
+
+        sgl_in_cmb = true;
+    }
+
+    for (;;) {
+        length = le32_to_cpu(sgld->len);
+
+        if (!length || length & 0xf) {
+            return NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
+        }
+
+        if (UINT64_MAX - addr < length) {
+            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+        }
+
+        nsgld = length / sizeof(NvmeSglDescriptor);
+
+        /* read the segment in chunks of 256 descriptors (4k) */
+        while (nsgld > MAX_NSGLD) {
+            if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
+                trace_nvme_dev_err_addr_read(addr);
+                status = NVME_DATA_TRANSFER_ERROR;
+                goto unmap;
+            }
+
+            status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, &len,
+                req);
+            if (status) {
+                goto unmap;
+            }
+
+            nsgld -= MAX_NSGLD;
+            addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
+        }
+
+        ret = nvme_addr_read(n, addr, segment, nsgld *
+            sizeof(NvmeSglDescriptor));
+        if (ret) {
+            trace_nvme_dev_err_addr_read(addr);
+            status = NVME_DATA_TRANSFER_ERROR;
+            goto unmap;
+        }
+
+        last_sgld = &segment[nsgld - 1];
+
+        /* if the segment ends with a Data Block, then we are done */
+        if (NVME_SGL_TYPE(last_sgld->type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
+            status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld, &len, req);
+            if (status) {
+                goto unmap;
+            }
+
+            break;
+        }
+
+        /* a Last Segment must end with a Data Block descriptor */
+        if (NVME_SGL_TYPE(sgld->type) == NVME_SGL_DESCR_TYPE_LAST_SEGMENT) {
+            status = NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
+            goto unmap;
+        }
+
+        sgld = last_sgld;
+        addr = le64_to_cpu(sgld->addr);
+
+        /*
+         * Do not map the last descriptor; it will be a Segment or Last Segment
+         * descriptor instead and handled by the next iteration.
+         */
+        status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld - 1, &len, req);
+        if (status) {
+            goto unmap;
+        }
+
+        /*
+         * If the next segment is in the CMB, make sure that the sgl was
+         * already located there.
+         */
+        if (sgl_in_cmb != nvme_addr_is_cmb(n, addr)) {
+            status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
+            goto unmap;
+        }
+    }
+
+out:
+    /* if there is any residual left in len, the SGL was too short */
+    if (len) {
+        status = NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
+        goto unmap;
+    }
+
+    return NVME_SUCCESS;
+
+unmap:
+    if (iov->iov) {
+        qemu_iovec_destroy(iov);
+    }
+
+    if (qsg->sg) {
+        qemu_sglist_destroy(qsg);
+    }
+
+    return status;
+}
+
+static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
+    NvmeCmd *cmd, DMADirection dir, NvmeRequest *req)
 {
     uint16_t status = NVME_SUCCESS;
     size_t bytes;
 
-    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
-    if (status) {
-        return status;
+    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
+    case PSDT_PRP:
+        status = nvme_map_prp(n, &req->qsg, &req->iov,
+            le64_to_cpu(cmd->dptr.prp.prp1), le64_to_cpu(cmd->dptr.prp.prp2),
+            len, req);
+        if (status) {
+            return status;
+        }
+
+        break;
+
+    case PSDT_SGL_MPTR_CONTIGUOUS:
+    case PSDT_SGL_MPTR_SGL:
+        if (!req->sq->sqid) {
+            /* SGLs shall not be used for Admin commands in NVMe over PCIe */
+            return NVME_INVALID_FIELD;
+        }
+
+        status = nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len,
+            req);
+        if (status) {
+            return status;
+        }
+
+        break;
+
+    default:
+        return NVME_INVALID_FIELD;
     }
 
     if (req->qsg.nsg > 0) {
@@ -351,13 +620,21 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
 
 static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
-    NvmeNamespace *ns = req->ns;
+    uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
+    uint64_t prp1, prp2;
 
-    uint32_t len = req->nlb << nvme_ns_lbads(ns);
-    uint64_t prp1 = le64_to_cpu(cmd->prp1);
-    uint64_t prp2 = le64_to_cpu(cmd->prp2);
+    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
+    case PSDT_PRP:
+        prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
+        prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
 
-    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
+        return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
+    case PSDT_SGL_MPTR_CONTIGUOUS:
+    case PSDT_SGL_MPTR_SGL:
+        return nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len, req);
+    default:
+        return NVME_INVALID_FIELD;
+    }
 }
 
 static void nvme_aio_destroy(NvmeAIO *aio)
@@ -972,8 +1249,6 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
 static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
     uint32_t buf_len, uint64_t off, NvmeRequest *req)
 {
-    uint64_t prp1 = le64_to_cpu(cmd->prp1);
-    uint64_t prp2 = le64_to_cpu(cmd->prp2);
     uint32_t nsid = le32_to_cpu(cmd->nsid);
 
     uint32_t trans_len;
@@ -1023,16 +1298,14 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
         nvme_clear_events(n, NVME_AER_TYPE_SMART);
     }
 
-    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
-        prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    return nvme_dma(n, (uint8_t *) &smart + off, trans_len, cmd,
+        DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
     uint64_t off, NvmeRequest *req)
 {
     uint32_t trans_len;
-    uint64_t prp1 = le64_to_cpu(cmd->prp1);
-    uint64_t prp2 = le64_to_cpu(cmd->prp2);
     NvmeFwSlotInfoLog fw_log;
 
     if (off > sizeof(fw_log)) {
@@ -1043,8 +1316,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
 
     trans_len = MIN(sizeof(fw_log) - off, buf_len);
 
-    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
-        prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len, cmd,
+        DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
@@ -1194,25 +1467,18 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
-    NvmeRequest *req)
+static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
-    uint64_t prp1 = le64_to_cpu(c->prp1);
-    uint64_t prp2 = le64_to_cpu(c->prp2);
-
     trace_nvme_dev_identify_ctrl();
 
-    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
-        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl), cmd,
+        DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
-    NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
     NvmeNamespace *ns;
-    uint32_t nsid = le32_to_cpu(c->nsid);
-    uint64_t prp1 = le64_to_cpu(c->prp1);
-    uint64_t prp2 = le64_to_cpu(c->prp2);
+    uint32_t nsid = le32_to_cpu(cmd->nsid);
 
     trace_nvme_dev_identify_ns(nsid);
 
@@ -1223,17 +1489,15 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
 
     ns = &n->namespaces[nsid - 1];
 
-    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
-        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
+        DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
     NvmeRequest *req)
 {
     static const int data_len = 4 * KiB;
-    uint32_t min_nsid = le32_to_cpu(c->nsid);
-    uint64_t prp1 = le64_to_cpu(c->prp1);
-    uint64_t prp2 = le64_to_cpu(c->prp2);
+    uint32_t min_nsid = le32_to_cpu(cmd->nsid);
     uint32_t *list;
     uint16_t ret;
     int i, j = 0;
@@ -1250,13 +1514,13 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
             break;
         }
     }
-    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
+    ret = nvme_dma(n, (uint8_t *) list, data_len, cmd,
         DMA_DIRECTION_FROM_DEVICE, req);
     g_free(list);
     return ret;
 }
 
-static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
+static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
     NvmeRequest *req)
 {
     static const int len = 4096;
@@ -1268,9 +1532,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
         uint8_t nid[16];
     };
 
-    uint32_t nsid = le32_to_cpu(c->nsid);
-    uint64_t prp1 = le64_to_cpu(c->prp1);
-    uint64_t prp2 = le64_to_cpu(c->prp2);
+    uint32_t nsid = le32_to_cpu(cmd->nsid);
 
     struct ns_descr *list;
     uint16_t ret;
@@ -1293,8 +1555,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
     list->nidl = 0x10;
     *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
 
-    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
-        DMA_DIRECTION_FROM_DEVICE, req);
+    ret = nvme_dma(n, (uint8_t *) list, len, cmd, DMA_DIRECTION_FROM_DEVICE,
+        req);
     g_free(list);
     return ret;
 }
@@ -1305,13 +1567,13 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 
     switch (le32_to_cpu(c->cns)) {
     case 0x00:
-        return nvme_identify_ns(n, c, req);
+        return nvme_identify_ns(n, cmd, req);
     case 0x01:
-        return nvme_identify_ctrl(n, c, req);
+        return nvme_identify_ctrl(n, cmd, req);
     case 0x02:
-        return nvme_identify_ns_list(n, c, req);
+        return nvme_identify_ns_list(n, cmd, req);
     case 0x03:
-        return nvme_identify_ns_descr_list(n, c, req);
+        return nvme_identify_ns_descr_list(n, cmd, req);
     default:
         trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1373,13 +1635,10 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
 static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
     NvmeRequest *req)
 {
-    uint64_t prp1 = le64_to_cpu(cmd->prp1);
-    uint64_t prp2 = le64_to_cpu(cmd->prp2);
-
     uint64_t timestamp = nvme_get_timestamp(n);
 
-    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
-        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
+    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp), cmd,
+        DMA_DIRECTION_FROM_DEVICE, req);
 }
 
 static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
@@ -1462,11 +1721,9 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
 {
     uint16_t ret;
     uint64_t timestamp;
-    uint64_t prp1 = le64_to_cpu(cmd->prp1);
-    uint64_t prp2 = le64_to_cpu(cmd->prp2);
 
-    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
-        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
+    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp), cmd,
+        DMA_DIRECTION_TO_DEVICE, req);
     if (ret != NVME_SUCCESS) {
         return ret;
     }
@@ -2232,6 +2489,8 @@ static void nvme_init_ctrl(NvmeCtrl *n)
         id->vwc = 1;
     }
 
+    id->sgls = cpu_to_le32(0x1);
+
     strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
     pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
 
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 09bfb3782dd0..81d69e15fc32 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -34,6 +34,7 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
 nvme_dev_irq_masked(void) "IRQ is masked"
 nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
 nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
+nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"PRIu16" type 0x%"PRIx8" nlb %"PRIu32" len %"PRIu64""
 nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p"
 nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
 nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
@@ -85,6 +86,9 @@ nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
 nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
 nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
 nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
+nvme_dev_err_invalid_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
+nvme_dev_err_invalid_num_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
+nvme_dev_err_invalid_sgl_excess_length(uint16_t cid) "cid %"PRIu16""
 nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
 nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
 nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
diff --git a/include/block/nvme.h b/include/block/nvme.h
index a873776d98b8..dbdeecf82358 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -205,15 +205,53 @@ enum NvmeCmbszMask {
 #define NVME_CMBSZ_GETSIZE(cmbsz) \
     (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz))))
 
+enum NvmeSglDescriptorType {
+    NVME_SGL_DESCR_TYPE_DATA_BLOCK           = 0x0,
+    NVME_SGL_DESCR_TYPE_BIT_BUCKET           = 0x1,
+    NVME_SGL_DESCR_TYPE_SEGMENT              = 0x2,
+    NVME_SGL_DESCR_TYPE_LAST_SEGMENT         = 0x3,
+    NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK     = 0x4,
+
+    NVME_SGL_DESCR_TYPE_VENDOR_SPECIFIC      = 0xf,
+};
+
+enum NvmeSglDescriptorSubtype {
+    NVME_SGL_DESCR_SUBTYPE_ADDRESS = 0x0,
+};
+
+typedef struct NvmeSglDescriptor {
+    uint64_t addr;
+    uint32_t len;
+    uint8_t  rsvd[3];
+    uint8_t  type;
+} NvmeSglDescriptor;
+
+#define NVME_SGL_TYPE(type)     ((type >> 4) & 0xf)
+#define NVME_SGL_SUBTYPE(type)  (type & 0xf)
+
+typedef union NvmeCmdDptr {
+    struct {
+        uint64_t    prp1;
+        uint64_t    prp2;
+    } prp;
+
+    NvmeSglDescriptor sgl;
+} NvmeCmdDptr;
+
+enum NvmePsdt {
+    PSDT_PRP                 = 0x0,
+    PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
+    PSDT_SGL_MPTR_SGL        = 0x2,
+};
+
 typedef struct NvmeCmd {
     uint8_t     opcode;
-    uint8_t     fuse;
+    uint8_t     flags;
     uint16_t    cid;
     uint32_t    nsid;
     uint64_t    res1;
     uint64_t    mptr;
-    uint64_t    prp1;
-    uint64_t    prp2;
+    NvmeCmdDptr dptr;
     uint32_t    cdw10;
     uint32_t    cdw11;
     uint32_t    cdw12;
@@ -222,6 +260,9 @@ typedef struct NvmeCmd {
     uint32_t    cdw15;
 } NvmeCmd;
 
+#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
+#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
+
 enum NvmeAdminCommands {
     NVME_ADM_CMD_DELETE_SQ      = 0x00,
     NVME_ADM_CMD_CREATE_SQ      = 0x01,
@@ -427,6 +468,11 @@ enum NvmeStatusCodes {
     NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
     NVME_INVALID_NSID           = 0x000b,
     NVME_CMD_SEQ_ERROR          = 0x000c,
+    NVME_INVALID_SGL_SEG_DESCRIPTOR  = 0x000d,
+    NVME_INVALID_NUM_SGL_DESCRIPTORS = 0x000e,
+    NVME_DATA_SGL_LENGTH_INVALID     = 0x000f,
+    NVME_METADATA_SGL_LENGTH_INVALID = 0x0010,
+    NVME_SGL_DESCRIPTOR_TYPE_INVALID = 0x0011,
     NVME_INVALID_USE_OF_CMB     = 0x0012,
     NVME_LBA_RANGE              = 0x0080,
     NVME_CAP_EXCEEDED           = 0x0081,
@@ -623,6 +669,16 @@ enum NvmeIdCtrlOncs {
 #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
 #define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
 
+#define NVME_CTRL_SGLS_SUPPORTED(sgls)                 ((sgls) & 0x3)
+#define NVME_CTRL_SGLS_SUPPORTED_NO_ALIGNMENT(sgls)    ((sgls) & (0x1 <<  0))
+#define NVME_CTRL_SGLS_SUPPORTED_DWORD_ALIGNMENT(sgls) ((sgls) & (0x1 <<  1))
+#define NVME_CTRL_SGLS_KEYED(sgls)                     ((sgls) & (0x1 <<  2))
+#define NVME_CTRL_SGLS_BITBUCKET(sgls)                 ((sgls) & (0x1 << 16))
+#define NVME_CTRL_SGLS_MPTR_CONTIGUOUS(sgls)           ((sgls) & (0x1 << 17))
+#define NVME_CTRL_SGLS_EXCESS_LENGTH(sgls)             ((sgls) & (0x1 << 18))
+#define NVME_CTRL_SGLS_MPTR_SGL(sgls)                  ((sgls) & (0x1 << 19))
+#define NVME_CTRL_SGLS_ADDR_OFFSET(sgls)               ((sgls) & (0x1 << 20))
+
 typedef struct NvmeFeatureVal {
     uint32_t    arbitration;
     uint32_t    power_mgmt;
-- 
2.25.0




* [PATCH v5 22/26] nvme: support multiple namespaces
       [not found]   ` <CGME20200204095230eucas1p27456c6c0ab3b688d2f891d0dff098821@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-04 16:31       ` Keith Busch
  2020-02-12 12:34       ` Maxim Levitsky
  0 siblings, 2 replies; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

This adds support for multiple namespaces by introducing a new 'nvme-ns'
device model. The nvme device creates a bus named after the device id. The
nvme-ns devices then connect to this bus and register themselves with the
nvme device.

This changes how an nvme device is created. Example with two namespaces:

  -drive file=nvme0n1.img,if=none,id=disk1
  -drive file=nvme0n2.img,if=none,id=disk2
  -device nvme,serial=deadbeef,id=nvme0
  -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
  -device nvme-ns,drive=disk2,bus=nvme0,nsid=2

The drive property is kept on the nvme device to keep the change
backward compatible, but the property is now optional. Specifying a
drive for the nvme device will always create the namespace with nsid 1.
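For reference, the backward-compatible form described above (a single
namespace created from the controller's own drive property, always as nsid 1)
would look something like this; the image name is illustrative:

```shell
  -drive file=nvme.img,if=none,id=disk0
  -device nvme,drive=disk0,serial=deadbeef
```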

Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/Makefile.objs |   2 +-
 hw/block/nvme-ns.c     | 158 +++++++++++++++++++++++++++
 hw/block/nvme-ns.h     |  60 +++++++++++
 hw/block/nvme.c        | 235 +++++++++++++++++++++++++----------------
 hw/block/nvme.h        |  47 ++++-----
 hw/block/trace-events  |   6 +-
 6 files changed, 389 insertions(+), 119 deletions(-)
 create mode 100644 hw/block/nvme-ns.c
 create mode 100644 hw/block/nvme-ns.h

diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
index 28c2495a00dc..45f463462f1e 100644
--- a/hw/block/Makefile.objs
+++ b/hw/block/Makefile.objs
@@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
 common-obj-$(CONFIG_XEN) += xen-block.o
 common-obj-$(CONFIG_ECC) += ecc.o
 common-obj-$(CONFIG_ONENAND) += onenand.o
-common-obj-$(CONFIG_NVME_PCI) += nvme.o
+common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
 common-obj-$(CONFIG_SWIM) += swim.o
 
 obj-$(CONFIG_SH4) += tc58128.o
diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
new file mode 100644
index 000000000000..0e5be44486f4
--- /dev/null
+++ b/hw/block/nvme-ns.c
@@ -0,0 +1,158 @@
+#include "qemu/osdep.h"
+#include "qemu/units.h"
+#include "qemu/cutils.h"
+#include "qemu/log.h"
+#include "hw/block/block.h"
+#include "hw/pci/msix.h"
+#include "sysemu/sysemu.h"
+#include "sysemu/block-backend.h"
+#include "qapi/error.h"
+
+#include "hw/qdev-properties.h"
+#include "hw/qdev-core.h"
+
+#include "nvme.h"
+#include "nvme-ns.h"
+
+static int nvme_ns_init(NvmeNamespace *ns)
+{
+    NvmeIdNs *id_ns = &ns->id_ns;
+
+    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+    id_ns->nuse = id_ns->ncap = id_ns->nsze =
+        cpu_to_le64(nvme_ns_nlbas(ns));
+
+    return 0;
+}
+
+static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
+    Error **errp)
+{
+    uint64_t perm, shared_perm;
+
+    Error *local_err = NULL;
+    int ret;
+
+    perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
+    shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
+        BLK_PERM_GRAPH_MOD;
+
+    ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
+    if (ret) {
+        error_propagate_prepend(errp, local_err, "blk_set_perm: ");
+        return ret;
+    }
+
+    ns->size = blk_getlength(ns->blk);
+    if (ns->size < 0) {
+        error_setg_errno(errp, -ns->size, "blk_getlength");
+        return 1;
+    }
+
+    switch (n->conf.wce) {
+    case ON_OFF_AUTO_ON:
+        n->features.volatile_wc = 1;
+        break;
+    case ON_OFF_AUTO_OFF:
+        n->features.volatile_wc = 0;
+        break;
+    case ON_OFF_AUTO_AUTO:
+        n->features.volatile_wc = blk_enable_write_cache(ns->blk);
+        break;
+    default:
+        abort();
+    }
+
+    blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
+
+    return 0;
+}
+
+static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
+{
+    if (!ns->blk) {
+        error_setg(errp, "block backend not configured");
+        return 1;
+    }
+
+    return 0;
+}
+
+int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+{
+    Error *local_err = NULL;
+
+    if (nvme_ns_check_constraints(ns, &local_err)) {
+        error_propagate_prepend(errp, local_err,
+            "nvme_ns_check_constraints: ");
+        return 1;
+    }
+
+    if (nvme_ns_init_blk(n, ns, &n->id_ctrl, &local_err)) {
+        error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
+        return 1;
+    }
+
+    nvme_ns_init(ns);
+    if (nvme_register_namespace(n, ns, &local_err)) {
+        error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
+        return 1;
+    }
+
+    return 0;
+}
+
+static void nvme_ns_realize(DeviceState *dev, Error **errp)
+{
+    NvmeNamespace *ns = NVME_NS(dev);
+    BusState *s = qdev_get_parent_bus(dev);
+    NvmeCtrl *n = NVME(s->parent);
+    Error *local_err = NULL;
+
+    if (nvme_ns_setup(n, ns, &local_err)) {
+        error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
+        return;
+    }
+}
+
+static Property nvme_ns_props[] = {
+    DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void nvme_ns_class_init(ObjectClass *oc, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(oc);
+
+    set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
+
+    dc->bus_type = TYPE_NVME_BUS;
+    dc->realize = nvme_ns_realize;
+    device_class_set_props(dc, nvme_ns_props);
+    dc->desc = "virtual nvme namespace";
+}
+
+static void nvme_ns_instance_init(Object *obj)
+{
+    NvmeNamespace *ns = NVME_NS(obj);
+    char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
+
+    device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
+        bootindex, DEVICE(obj), &error_abort);
+
+    g_free(bootindex);
+}
+
+static const TypeInfo nvme_ns_info = {
+    .name = TYPE_NVME_NS,
+    .parent = TYPE_DEVICE,
+    .class_init = nvme_ns_class_init,
+    .instance_size = sizeof(NvmeNamespace),
+    .instance_init = nvme_ns_instance_init,
+};
+
+static void nvme_ns_register_types(void)
+{
+    type_register_static(&nvme_ns_info);
+}
+
+type_init(nvme_ns_register_types)
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
new file mode 100644
index 000000000000..b564bac25f6d
--- /dev/null
+++ b/hw/block/nvme-ns.h
@@ -0,0 +1,60 @@
+#ifndef NVME_NS_H
+#define NVME_NS_H
+
+#define TYPE_NVME_NS "nvme-ns"
+#define NVME_NS(obj) \
+    OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
+
+#define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
+    DEFINE_PROP_DRIVE("drive", _state, blk), \
+    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
+
+typedef struct NvmeNamespaceParams {
+    uint32_t nsid;
+} NvmeNamespaceParams;
+
+typedef struct NvmeNamespace {
+    DeviceState  parent_obj;
+    BlockBackend *blk;
+    int32_t      bootindex;
+    int64_t      size;
+
+    NvmeIdNs            id_ns;
+    NvmeNamespaceParams params;
+} NvmeNamespace;
+
+static inline uint32_t nvme_nsid(NvmeNamespace *ns)
+{
+    if (ns) {
+        return ns->params.nsid;
+    }
+
+    return -1;
+}
+
+static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
+{
+    NvmeIdNs *id_ns = &ns->id_ns;
+    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
+}
+
+static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
+{
+    return nvme_ns_lbaf(ns).ds;
+}
+
+static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
+{
+    return 1 << nvme_ns_lbads(ns);
+}
+
+static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
+{
+    return ns->size >> nvme_ns_lbads(ns);
+}
+
+typedef struct NvmeCtrl NvmeCtrl;
+
+int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
+
+#endif /* NVME_NS_H */
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index a91c60fdc111..3a377bc56734 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -17,10 +17,11 @@
 /**
  * Usage: add options:
  *      -drive file=<file>,if=none,id=<drive_id>
- *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
+ *      -device nvme,serial=<serial>,id=<bus_name>, \
  *              cmb_size_mb=<cmb_size_mb[optional]>, \
  *              num_queues=<N[optional]>, \
  *              mdts=<mdts[optional]>
+ *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=1
  *
  * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
  * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
@@ -28,6 +29,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/units.h"
+#include "qemu/error-report.h"
 #include "hw/block/block.h"
 #include "hw/pci/msix.h"
 #include "hw/pci/pci.h"
@@ -43,6 +45,7 @@
 #include "qemu/cutils.h"
 #include "trace.h"
 #include "nvme.h"
+#include "nvme-ns.h"
 
 #define NVME_SPEC_VER 0x00010300
 #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
@@ -85,6 +88,17 @@ static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
     return pci_dma_read(&n->parent_obj, addr, buf, size);
 }
 
+static uint16_t nvme_nsid_err(NvmeCtrl *n, uint32_t nsid)
+{
+    if (nsid && nsid < n->num_namespaces) {
+        trace_nvme_dev_err_inactive_ns(nsid, n->num_namespaces);
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
+    return NVME_INVALID_NSID | NVME_DNR;
+}
+
 static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
 {
     return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
@@ -889,7 +903,7 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
     uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
 
     if (unlikely((slba + nlb) > nsze)) {
-        block_acct_invalid(blk_get_stats(n->conf.blk),
+        block_acct_invalid(blk_get_stats(ns->blk),
             nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
         trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
         return NVME_LBA_RANGE | NVME_DNR;
@@ -924,11 +938,12 @@ static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
 
 static void nvme_rw_cb(NvmeRequest *req, void *opaque)
 {
+    NvmeNamespace *ns = req->ns;
     NvmeSQueue *sq = req->sq;
     NvmeCtrl *n = sq->ctrl;
     NvmeCQueue *cq = n->cq[sq->cqid];
 
-    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
+    trace_nvme_dev_rw_cb(nvme_cid(req), nvme_nsid(ns));
 
     nvme_enqueue_req_completion(cq, req);
 }
@@ -1011,10 +1026,11 @@ static void nvme_aio_cb(void *opaque, int ret)
 
 static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
+    NvmeNamespace *ns = req->ns;
     NvmeAIO *aio = g_new0(NvmeAIO, 1);
 
     *aio = (NvmeAIO) {
-        .blk = n->conf.blk,
+        .blk = ns->blk,
         .req = req,
     };
 
@@ -1038,12 +1054,12 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     req->slba = le64_to_cpu(rw->slba);
     req->nlb  = le16_to_cpu(rw->nlb) + 1;
 
-    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
-        req->slba, req->nlb);
+    trace_nvme_dev_write_zeros(nvme_cid(req), nvme_nsid(ns), req->slba,
+        req->nlb);
 
     status = nvme_check_bounds(n, req->slba, req->nlb, req);
     if (unlikely(status)) {
-        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
+        block_acct_invalid(blk_get_stats(ns->blk), BLOCK_ACCT_WRITE);
         return status;
     }
 
@@ -1053,7 +1069,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     aio = g_new0(NvmeAIO, 1);
 
     *aio = (NvmeAIO) {
-        .blk = n->conf.blk,
+        .blk = ns->blk,
         .offset = offset,
         .len = count,
         .req = req,
@@ -1077,22 +1093,23 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     req->nlb  = le16_to_cpu(rw->nlb) + 1;
     req->slba = le64_to_cpu(rw->slba);
 
-    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
-        req->nlb << nvme_ns_lbads(req->ns), req->slba);
+    trace_nvme_dev_rw(nvme_cid(req), nvme_req_is_write(req) ? "write" : "read",
+        nvme_nsid(ns), req->nlb, req->nlb << nvme_ns_lbads(ns),
+        req->slba);
 
     status = nvme_check_rw(n, req);
     if (status) {
-        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
+        block_acct_invalid(blk_get_stats(ns->blk), acct);
         return status;
     }
 
     status = nvme_map(n, cmd, req);
     if (status) {
-        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
+        block_acct_invalid(blk_get_stats(ns->blk), acct);
         return status;
     }
 
-    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
+    nvme_rw_aio(ns->blk, req->slba << nvme_ns_lbads(ns), req);
     nvme_req_set_cb(req, nvme_rw_cb, NULL);
 
     return NVME_NO_COMPLETE;
@@ -1105,12 +1122,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
         cmd->opcode);
 
-    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
-        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
-        return NVME_INVALID_NSID | NVME_DNR;
-    }
+    req->ns = nvme_ns(n, nsid);
 
-    req->ns = &n->namespaces[nsid - 1];
+    if (unlikely(!req->ns)) {
+        return nvme_nsid_err(n, nsid);
+    }
 
     switch (cmd->opcode) {
     case NVME_CMD_FLUSH:
@@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
     uint64_t units_read = 0, units_written = 0, read_commands = 0,
         write_commands = 0;
     NvmeSmartLog smart;
-    BlockAcctStats *s;
 
     if (nsid && nsid != 0xffffffff) {
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 
-    s = blk_get_stats(n->conf.blk);
+    for (int i = 1; i <= n->num_namespaces; i++) {
+        NvmeNamespace *ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
 
-    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
-    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
-    read_commands = s->nr_ops[BLOCK_ACCT_READ];
-    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
+        BlockAcctStats *s = blk_get_stats(ns->blk);
+
+        units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
+        units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
+        read_commands += s->nr_ops[BLOCK_ACCT_READ];
+        write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
+    }
 
     if (off > sizeof(smart)) {
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1477,19 +1499,25 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 
 static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
-    NvmeNamespace *ns;
+    NvmeIdNs *id_ns, inactive = { 0 };
     uint32_t nsid = le32_to_cpu(cmd->nsid);
+    NvmeNamespace *ns = nvme_ns(n, nsid);
 
     trace_nvme_dev_identify_ns(nsid);
 
-    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
-        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
-        return NVME_INVALID_NSID | NVME_DNR;
+    if (unlikely(!ns)) {
+        uint16_t status = nvme_nsid_err(n, nsid);
+
+        if (!nvme_status_is_error(status, NVME_INVALID_FIELD)) {
+            return status;
+        }
+
+        id_ns = &inactive;
+    } else {
+        id_ns = &ns->id_ns;
     }
 
-    ns = &n->namespaces[nsid - 1];
-
-    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
+    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs), cmd,
         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
@@ -1505,11 +1533,11 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
     trace_nvme_dev_identify_ns_list(min_nsid);
 
     list = g_malloc0(data_len);
-    for (i = 0; i < n->num_namespaces; i++) {
-        if (i < min_nsid) {
+    for (i = 1; i <= n->num_namespaces; i++) {
+        if (i <= min_nsid || !nvme_ns(n, i)) {
             continue;
         }
-        list[j++] = cpu_to_le32(i + 1);
+        list[j++] = cpu_to_le32(i);
         if (j == data_len / sizeof(uint32_t)) {
             break;
         }
@@ -1539,9 +1567,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
 
     trace_nvme_dev_identify_ns_descr_list(nsid);
 
-    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
-        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
-        return NVME_INVALID_NSID | NVME_DNR;
+    if (unlikely(!nvme_ns(n, nsid))) {
+        return nvme_nsid_err(n, nsid);
     }
 
     /*
@@ -1681,7 +1708,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         result = cpu_to_le32(n->features.err_rec);
         break;
     case NVME_VOLATILE_WRITE_CACHE:
-        result = blk_enable_write_cache(n->conf.blk);
+        result = cpu_to_le32(n->features.volatile_wc);
         trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
         break;
     case NVME_NUMBER_OF_QUEUES:
@@ -1735,6 +1762,8 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
 
 static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 {
+    NvmeNamespace *ns;
+
     uint32_t dw10 = le32_to_cpu(cmd->cdw10);
     uint32_t dw11 = le32_to_cpu(cmd->cdw11);
 
@@ -1766,8 +1795,19 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 
         break;
     case NVME_VOLATILE_WRITE_CACHE:
-        blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
+        n->features.volatile_wc = dw11;
+
+        for (int i = 1; i <= n->num_namespaces; i++) {
+            ns = nvme_ns(n, i);
+            if (!ns) {
+                continue;
+            }
+
+            blk_set_enable_write_cache(ns->blk, dw11 & 1);
+        }
+
         break;
+
     case NVME_NUMBER_OF_QUEUES:
         if (n->qs_created) {
             return NVME_CMD_SEQ_ERROR | NVME_DNR;
@@ -1890,9 +1930,17 @@ static void nvme_process_sq(void *opaque)
 
 static void nvme_clear_ctrl(NvmeCtrl *n)
 {
+    NvmeNamespace *ns;
     int i;
 
-    blk_drain(n->conf.blk);
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+
+        blk_drain(ns->blk);
+    }
 
     for (i = 0; i < n->params.num_queues; i++) {
         if (n->sq[i] != NULL) {
@@ -1915,7 +1963,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
     n->outstanding_aers = 0;
     n->qs_created = false;
 
-    blk_flush(n->conf.blk);
+    for (i = 1; i <= n->num_namespaces; i++) {
+        ns = nvme_ns(n, i);
+        if (!ns) {
+            continue;
+        }
+
+        blk_flush(ns->blk);
+    }
+
     n->bar.cc = 0;
 }
 
@@ -2335,8 +2391,8 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
 {
     NvmeParams *params = &n->params;
 
-    if (!n->conf.blk) {
-        error_setg(errp, "nvme: block backend not configured");
+    if (!n->namespace.blk && !n->parent_obj.qdev.id) {
+        error_setg(errp, "nvme: invalid 'id' parameter");
         return 1;
     }
 
@@ -2353,22 +2409,10 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
     return 0;
 }
 
-static int nvme_init_blk(NvmeCtrl *n, Error **errp)
-{
-    blkconf_blocksizes(&n->conf);
-    if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
-        false, errp)) {
-        return 1;
-    }
-
-    return 0;
-}
-
 static void nvme_init_state(NvmeCtrl *n)
 {
-    n->num_namespaces = 1;
+    n->num_namespaces = 0;
     n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
-    n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
     n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
     n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
 
@@ -2483,12 +2527,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
     id->cqes = (0x4 << 4) | 0x4;
     id->nn = cpu_to_le32(n->num_namespaces);
     id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
-
-
-    if (blk_enable_write_cache(n->conf.blk)) {
-        id->vwc = 1;
-    }
-
+    id->vwc = 1;
     id->sgls = cpu_to_le32(0x1);
 
     strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
@@ -2509,22 +2548,25 @@ static void nvme_init_ctrl(NvmeCtrl *n)
     n->bar.intmc = n->bar.intms = 0;
 }
 
-static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
+int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 {
-    int64_t bs_size;
-    NvmeIdNs *id_ns = &ns->id_ns;
+    uint32_t nsid = nvme_nsid(ns);
 
-    bs_size = blk_getlength(n->conf.blk);
-    if (bs_size < 0) {
-        error_setg_errno(errp, -bs_size, "blk_getlength");
+    if (nsid == 0 || nsid > NVME_MAX_NAMESPACES) {
+        error_setg(errp, "invalid nsid");
         return 1;
     }
 
-    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
-    n->ns_size = bs_size;
+    if (n->namespaces[nsid - 1]) {
+        error_setg(errp, "nsid must be unique");
+        return 1;
+    }
+
+    trace_nvme_dev_register_namespace(nsid);
 
-    id_ns->ncap = id_ns->nuse = id_ns->nsze =
-        cpu_to_le64(nvme_ns_nlbas(n, ns));
+    n->namespaces[nsid - 1] = ns;
+    n->num_namespaces = MAX(n->num_namespaces, nsid);
+    n->id_ctrl.nn = cpu_to_le32(n->num_namespaces);
 
     return 0;
 }
@@ -2532,30 +2574,31 @@ static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
 static void nvme_realize(PCIDevice *pci_dev, Error **errp)
 {
     NvmeCtrl *n = NVME(pci_dev);
+    NvmeNamespace *ns;
     Error *local_err = NULL;
-    int i;
 
     if (nvme_check_constraints(n, &local_err)) {
         error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
         return;
     }
 
+    qbus_create_inplace(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
+        &pci_dev->qdev, n->parent_obj.qdev.id);
+
     nvme_init_state(n);
-
-    if (nvme_init_blk(n, &local_err)) {
-        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
-        return;
-    }
-
-    for (i = 0; i < n->num_namespaces; i++) {
-        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
-            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
-            return;
-        }
-    }
-
     nvme_init_pci(n, pci_dev);
     nvme_init_ctrl(n);
+
+    /* set up a namespace if the controller drive property was given */
+    if (n->namespace.blk) {
+        ns = &n->namespace;
+        ns->params.nsid = 1;
+
+        if (nvme_ns_setup(n, ns, &local_err)) {
+            error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
+            return;
+        }
+    }
 }
 
 static void nvme_exit(PCIDevice *pci_dev)
@@ -2576,7 +2619,8 @@ static void nvme_exit(PCIDevice *pci_dev)
 }
 
 static Property nvme_props[] = {
-    DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
+    DEFINE_BLOCK_PROPERTIES_BASE(NvmeCtrl, conf), \
+    DEFINE_PROP_DRIVE("drive", NvmeCtrl, namespace.blk), \
     DEFINE_NVME_PROPERTIES(NvmeCtrl, params),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -2608,26 +2652,35 @@ static void nvme_instance_init(Object *obj)
 {
     NvmeCtrl *s = NVME(obj);
 
-    device_add_bootindex_property(obj, &s->conf.bootindex,
-                                  "bootindex", "/namespace@1,0",
-                                  DEVICE(obj), &error_abort);
+    if (s->namespace.blk) {
+        device_add_bootindex_property(obj, &s->conf.bootindex,
+                                      "bootindex", "/namespace@1,0",
+                                      DEVICE(obj), &error_abort);
+    }
 }
 
 static const TypeInfo nvme_info = {
     .name          = TYPE_NVME,
     .parent        = TYPE_PCI_DEVICE,
     .instance_size = sizeof(NvmeCtrl),
-    .class_init    = nvme_class_init,
     .instance_init = nvme_instance_init,
+    .class_init    = nvme_class_init,
     .interfaces = (InterfaceInfo[]) {
         { INTERFACE_PCIE_DEVICE },
         { }
     },
 };
 
+static const TypeInfo nvme_bus_info = {
+    .name = TYPE_NVME_BUS,
+    .parent = TYPE_BUS,
+    .instance_size = sizeof(NvmeBus),
+};
+
 static void nvme_register_types(void)
 {
     type_register_static(&nvme_info);
+    type_register_static(&nvme_bus_info);
 }
 
 type_init(nvme_register_types)
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index 3319f8edd7e1..c3cef0f024da 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -2,6 +2,9 @@
 #define HW_NVME_H
 
 #include "block/nvme.h"
+#include "nvme-ns.h"
+
+#define NVME_MAX_NAMESPACES 256
 
 #define DEFINE_NVME_PROPERTIES(_state, _props) \
     DEFINE_PROP_STRING("serial", _state, _props.serial), \
@@ -108,26 +111,6 @@ typedef struct NvmeCQueue {
     QTAILQ_HEAD(, NvmeRequest) req_list;
 } NvmeCQueue;
 
-typedef struct NvmeNamespace {
-    NvmeIdNs        id_ns;
-} NvmeNamespace;
-
-static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
-{
-    NvmeIdNs *id_ns = &ns->id_ns;
-    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
-}
-
-static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
-{
-    return nvme_ns_lbaf(ns).ds;
-}
-
-static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
-{
-    return 1 << nvme_ns_lbads(ns);
-}
-
 typedef enum NvmeAIOOp {
     NVME_AIO_OPC_NONE         = 0x0,
     NVME_AIO_OPC_FLUSH        = 0x1,
@@ -182,6 +165,13 @@ static inline bool nvme_req_is_write(NvmeRequest *req)
     }
 }
 
+#define TYPE_NVME_BUS "nvme-bus"
+#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
+
+typedef struct NvmeBus {
+    BusState parent_bus;
+} NvmeBus;
+
 #define TYPE_NVME "nvme"
 #define NVME(obj) \
         OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
@@ -191,8 +181,9 @@ typedef struct NvmeCtrl {
     MemoryRegion iomem;
     MemoryRegion ctrl_mem;
     NvmeBar      bar;
-    BlockConf    conf;
     NvmeParams   params;
+    NvmeBus      bus;
+    BlockConf    conf;
 
     bool        qs_created;
     uint32_t    page_size;
@@ -203,7 +194,6 @@ typedef struct NvmeCtrl {
     uint32_t    reg_size;
     uint32_t    num_namespaces;
     uint32_t    max_q_ents;
-    uint64_t    ns_size;
     uint8_t     outstanding_aers;
     uint32_t    cmbsz;
     uint32_t    cmbloc;
@@ -219,7 +209,8 @@ typedef struct NvmeCtrl {
     QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
     int         aer_queued;
 
-    NvmeNamespace   *namespaces;
+    NvmeNamespace   namespace;
+    NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
     NvmeSQueue      **sq;
     NvmeCQueue      **cq;
     NvmeSQueue      admin_sq;
@@ -228,9 +219,13 @@ typedef struct NvmeCtrl {
     NvmeFeatureVal  features;
 } NvmeCtrl;
 
-static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
+static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
 {
-    return n->ns_size >> nvme_ns_lbads(ns);
+    if (!nsid || nsid > n->num_namespaces) {
+        return NULL;
+    }
+
+    return n->namespaces[nsid - 1];
 }
 
 static inline uint16_t nvme_cid(NvmeRequest *req)
@@ -253,4 +248,6 @@ static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
     return req->sq->ctrl;
 }
 
+int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
+
 #endif /* HW_NVME_H */
diff --git a/hw/block/trace-events b/hw/block/trace-events
index 81d69e15fc32..aaf1fcda7923 100644
--- a/hw/block/trace-events
+++ b/hw/block/trace-events
@@ -29,6 +29,7 @@ hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int t
 
 # nvme.c
 # nvme traces for successful events
+nvme_dev_register_namespace(uint32_t nsid) "nsid %"PRIu32""
 nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
 nvme_dev_irq_pin(void) "pulsing IRQ pin"
 nvme_dev_irq_masked(void) "IRQ is masked"
@@ -38,7 +39,7 @@ nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"P
 nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p"
 nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
 nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
-nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
+nvme_dev_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" %s nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
 nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
 nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
 nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
@@ -94,7 +95,8 @@ nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or no
 nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
 nvme_dev_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
 nvme_dev_err_invalid_prp(void) "invalid PRP"
-nvme_dev_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
+nvme_dev_err_invalid_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
+nvme_dev_err_inactive_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
 nvme_dev_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
 nvme_dev_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
 nvme_dev_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
-- 
2.25.0



^ permalink raw reply related	[flat|nested] 86+ messages in thread

* [PATCH v5 23/26] pci: allocate pci id for nvme
       [not found]   ` <CGME20200204095230eucas1p23f3105c4cab4aaec77a3dd42b8158c10@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-12 12:36       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

The emulated nvme device (hw/block/nvme.c) is currently using an
internal Intel device id.

Prepare to change that by allocating a device id under the 1b36 (Red
Hat, Inc.) vendor id.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 MAINTAINERS            |  1 +
 docs/specs/nvme.txt    | 10 ++++++++++
 docs/specs/pci-ids.txt |  1 +
 include/hw/pci/pci.h   |  1 +
 4 files changed, 13 insertions(+)
 create mode 100644 docs/specs/nvme.txt

diff --git a/MAINTAINERS b/MAINTAINERS
index 1f0bc72f2189..14a018e9c0ae 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1645,6 +1645,7 @@ L: qemu-block@nongnu.org
 S: Supported
 F: hw/block/nvme*
 F: tests/qtest/nvme-test.c
+F: docs/specs/nvme.txt
 
 megasas
 M: Hannes Reinecke <hare@suse.com>
diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
new file mode 100644
index 000000000000..6ec7ddbc7ee0
--- /dev/null
+++ b/docs/specs/nvme.txt
@@ -0,0 +1,10 @@
+NVM Express Controller
+======================
+
+The nvme device (-device nvme) emulates an NVM Express Controller.
+
+
+Reference Specifications
+------------------------
+
+  https://nvmexpress.org/resources/specifications/
diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
index 4d53e5c7d9d5..abbdbca6be38 100644
--- a/docs/specs/pci-ids.txt
+++ b/docs/specs/pci-ids.txt
@@ -63,6 +63,7 @@ PCI devices (other than virtio):
 1b36:000b  PCIe Expander Bridge (-device pxb-pcie)
 1b36:000d  PCI xhci usb host adapter
 1b36:000f  mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
+1b36:0010  PCIe NVMe device (-device nvme)
 
 All these devices are documented in docs/specs.
 
diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
index b5013b834b20..9a20c309d0f2 100644
--- a/include/hw/pci/pci.h
+++ b/include/hw/pci/pci.h
@@ -103,6 +103,7 @@ extern bool pci_available;
 #define PCI_DEVICE_ID_REDHAT_XHCI        0x000d
 #define PCI_DEVICE_ID_REDHAT_PCIE_BRIDGE 0x000e
 #define PCI_DEVICE_ID_REDHAT_MDPY        0x000f
+#define PCI_DEVICE_ID_REDHAT_NVME        0x0010
 #define PCI_DEVICE_ID_REDHAT_QXL         0x0100
 
 #define FMT_PCIBUS                      PRIx64
-- 
2.25.0




* [PATCH v5 24/26] nvme: change controller pci id
       [not found]   ` <CGME20200204095231eucas1p21019b1d857fcda9d67950e7d01de6b6a@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-04 16:35       ` Keith Busch
  2020-02-12 12:37       ` Maxim Levitsky
  0 siblings, 2 replies; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

There are two reasons for changing this:

  1. The nvme device currently uses an internal Intel device id.

  2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
     support multiple namespaces" the controller device no longer has
     the quirks that the Linux kernel thinks it has.

     As the quirks are applied based on pci vendor and device id, change
     them to get rid of the quirks.

To keep backward compatibility, add a new 'x-use-intel-id' parameter to
the nvme device to force use of the Intel vendor and device id. It is
off by default, but a machine compat property turns it on for machine
types 4.2 and older.
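
For manual testing, the parameter can also be set explicitly on the
command line (a hypothetical invocation; image, serial and id names are
placeholders):

```shell
# Default with newer machine types: Red Hat vendor/device id (1b36:0010)
qemu-system-x86_64 -drive file=nvme.img,if=none,id=nvm \
    -device nvme,serial=deadbeef,drive=nvm

# Force the legacy Intel id, as the 4.2 and older machine types do via
# the compat property
qemu-system-x86_64 -drive file=nvme.img,if=none,id=nvm \
    -device nvme,serial=deadbeef,drive=nvm,x-use-intel-id=on
```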

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c   | 13 +++++++++----
 hw/block/nvme.h   |  4 +++-
 hw/core/machine.c |  1 +
 3 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 3a377bc56734..bdef53a590b0 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2467,8 +2467,15 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
 
     pci_conf[PCI_INTERRUPT_PIN] = 1;
     pci_config_set_prog_interface(pci_conf, 0x2);
-    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
-    pci_config_set_device_id(pci_conf, 0x5845);
+
+    if (n->params.use_intel_id) {
+        pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
+        pci_config_set_device_id(pci_conf, 0x5845);
+    } else {
+        pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT);
+        pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME);
+    }
+
     pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
     pcie_endpoint_cap_init(pci_dev, 0x80);
 
@@ -2638,8 +2645,6 @@ static void nvme_class_init(ObjectClass *oc, void *data)
     pc->realize = nvme_realize;
     pc->exit = nvme_exit;
     pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
-    pc->vendor_id = PCI_VENDOR_ID_INTEL;
-    pc->device_id = 0x5845;
     pc->revision = 2;
 
     set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
diff --git a/hw/block/nvme.h b/hw/block/nvme.h
index c3cef0f024da..6b584f53ed64 100644
--- a/hw/block/nvme.h
+++ b/hw/block/nvme.h
@@ -12,7 +12,8 @@
     DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
     DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
     DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \
-    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
+    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7), \
+    DEFINE_PROP_BOOL("x-use-intel-id", _state, _props.use_intel_id, false)
 
 typedef struct NvmeParams {
     char     *serial;
@@ -21,6 +22,7 @@ typedef struct NvmeParams {
     uint8_t  aerl;
     uint32_t aer_max_queued;
     uint8_t  mdts;
+    bool     use_intel_id;
 } NvmeParams;
 
 typedef struct NvmeAsyncEvent {
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 3e288bfceb7f..984412d98c9d 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -34,6 +34,7 @@ GlobalProperty hw_compat_4_2[] = {
     { "vhost-blk-device", "seg_max_adjust", "off"},
     { "usb-host", "suppress-remote-wake", "off" },
     { "usb-redir", "suppress-remote-wake", "off" },
+    { "nvme", "x-use-intel-id", "on" },
 };
 const size_t hw_compat_4_2_len = G_N_ELEMENTS(hw_compat_4_2);
 
-- 
2.25.0




* [PATCH v5 25/26] nvme: remove redundant NvmeCmd pointer parameter
       [not found]   ` <CGME20200204095231eucas1p1f2b78a655b1a217fe4f7006f79e37f86@eucas1p1.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-12 12:37       ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

The command struct is available in the NvmeRequest that we generally
pass around anyway.

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme.c | 198 ++++++++++++++++++++++++------------------------
 1 file changed, 98 insertions(+), 100 deletions(-)

diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index bdef53a590b0..5fe2e2fe1fa9 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -566,16 +566,18 @@ unmap:
 }
 
 static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
-    NvmeCmd *cmd, DMADirection dir, NvmeRequest *req)
+    DMADirection dir, NvmeRequest *req)
 {
     uint16_t status = NVME_SUCCESS;
     size_t bytes;
+    uint64_t prp1, prp2;
 
-    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
+    switch (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
     case PSDT_PRP:
-        status = nvme_map_prp(n, &req->qsg, &req->iov,
-            le64_to_cpu(cmd->dptr.prp.prp1), le64_to_cpu(cmd->dptr.prp.prp2),
-            len, req);
+        prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
+        prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
+
+        status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
         if (status) {
             return status;
         }
@@ -589,7 +591,7 @@ static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
             return NVME_INVALID_FIELD;
         }
 
-        status = nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len,
+        status = nvme_map_sgl(n, &req->qsg, &req->iov, req->cmd.dptr.sgl, len,
             req);
         if (status) {
             return status;
@@ -632,20 +634,21 @@ static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
     return status;
 }
 
-static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_map(NvmeCtrl *n, NvmeRequest *req)
 {
     uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
     uint64_t prp1, prp2;
 
-    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
+    switch (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
     case PSDT_PRP:
-        prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
-        prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
+        prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
+        prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
 
         return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
     case PSDT_SGL_MPTR_CONTIGUOUS:
     case PSDT_SGL_MPTR_SGL:
-        return nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len, req);
+        return nvme_map_sgl(n, &req->qsg, &req->iov, req->cmd.dptr.sgl, len,
+            req);
     default:
         return NVME_INVALID_FIELD;
     }
@@ -1024,7 +1027,7 @@ static void nvme_aio_cb(void *opaque, int ret)
     nvme_aio_destroy(aio);
 }
 
-static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns = req->ns;
     NvmeAIO *aio = g_new0(NvmeAIO, 1);
@@ -1040,12 +1043,12 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeAIO *aio;
 
     NvmeNamespace *ns = req->ns;
-    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
+    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
 
     int64_t offset;
     size_t count;
@@ -1081,9 +1084,9 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
+    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
     NvmeNamespace *ns = req->ns;
     int status;
 
@@ -1103,7 +1106,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return status;
     }
 
-    status = nvme_map(n, cmd, req);
+    status = nvme_map(n, req);
     if (status) {
         block_acct_invalid(blk_get_stats(ns->blk), acct);
         return status;
@@ -1115,12 +1118,12 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
-    uint32_t nsid = le32_to_cpu(cmd->nsid);
+    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
 
     trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
-        cmd->opcode);
+        req->cmd.opcode);
 
     req->ns = nvme_ns(n, nsid);
 
@@ -1128,16 +1131,16 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         return nvme_nsid_err(n, nsid);
     }
 
-    switch (cmd->opcode) {
+    switch (req->cmd.opcode) {
     case NVME_CMD_FLUSH:
-        return nvme_flush(n, cmd, req);
+        return nvme_flush(n, req);
     case NVME_CMD_WRITE_ZEROS:
-        return nvme_write_zeros(n, cmd, req);
+        return nvme_write_zeros(n, req);
     case NVME_CMD_WRITE:
     case NVME_CMD_READ:
-        return nvme_rw(n, cmd, req);
+        return nvme_rw(n, req);
     default:
-        trace_nvme_dev_err_invalid_opc(cmd->opcode);
+        trace_nvme_dev_err_invalid_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
     }
 }
@@ -1153,10 +1156,10 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
     }
 }
 
-static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeDeleteQ *c = (NvmeDeleteQ *)cmd;
-    NvmeRequest *req, *next;
+    NvmeDeleteQ *c = (NvmeDeleteQ *) &req->cmd;
+    NvmeRequest *next;
     NvmeSQueue *sq;
     NvmeCQueue *cq;
     NvmeAIO *aio;
@@ -1224,10 +1227,10 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     n->sq[sqid] = sq;
 }
 
-static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeSQueue *sq;
-    NvmeCreateSq *c = (NvmeCreateSq *)cmd;
+    NvmeCreateSq *c = (NvmeCreateSq *) &req->cmd;
 
     uint16_t cqid = le16_to_cpu(c->cqid);
     uint16_t sqid = le16_to_cpu(c->sqid);
@@ -1262,10 +1265,10 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
-    uint32_t buf_len, uint64_t off, NvmeRequest *req)
+static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
+    uint64_t off, NvmeRequest *req)
 {
-    uint32_t nsid = le32_to_cpu(cmd->nsid);
+    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
 
     uint32_t trans_len;
     time_t current_ms;
@@ -1320,12 +1323,12 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
         nvme_clear_events(n, NVME_AER_TYPE_SMART);
     }
 
-    return nvme_dma(n, (uint8_t *) &smart + off, trans_len, cmd,
+    return nvme_dma(n, (uint8_t *) &smart + off, trans_len,
         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
-    uint64_t off, NvmeRequest *req)
+static uint16_t nvme_fw_log_info(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
+    NvmeRequest *req)
 {
     uint32_t trans_len;
     NvmeFwSlotInfoLog fw_log;
@@ -1338,16 +1341,16 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
 
     trans_len = MIN(sizeof(fw_log) - off, buf_len);
 
-    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len, cmd,
+    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len,
         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 {
-    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
-    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
-    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
-    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
+    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
+    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
+    uint32_t dw12 = le32_to_cpu(req->cmd.cdw12);
+    uint32_t dw13 = le32_to_cpu(req->cmd.cdw13);
     uint8_t  lid = dw10 & 0xff;
     uint8_t  lsp = (dw10 >> 8) & 0xf;
     uint8_t  rae = (dw10 >> 15) & 0x1;
@@ -1387,9 +1390,9 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
 
         return NVME_SUCCESS;
     case NVME_LOG_SMART_INFO:
-        return nvme_smart_info(n, cmd, rae, len, off, req);
+        return nvme_smart_info(n, rae, len, off, req);
     case NVME_LOG_FW_SLOT_INFO:
-        return nvme_fw_log_info(n, cmd, len, off, req);
+        return nvme_fw_log_info(n, len, off, req);
     default:
         trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
         return NVME_INVALID_FIELD | NVME_DNR;
@@ -1407,9 +1410,9 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
     }
 }
 
-static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeDeleteQ *c = (NvmeDeleteQ *)cmd;
+    NvmeDeleteQ *c = (NvmeDeleteQ *) &req->cmd;
     NvmeCQueue *cq;
     uint16_t qid = le16_to_cpu(c->qid);
 
@@ -1447,10 +1450,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
 }
 
-static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
+static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeCQueue *cq;
-    NvmeCreateCq *c = (NvmeCreateCq *)cmd;
+    NvmeCreateCq *c = (NvmeCreateCq *) &req->cmd;
     uint16_t cqid = le16_to_cpu(c->cqid);
     uint16_t vector = le16_to_cpu(c->irq_vector);
     uint16_t qsize = le16_to_cpu(c->qsize);
@@ -1489,18 +1492,18 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_nvme_dev_identify_ctrl();
 
-    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl), cmd,
+    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl),
         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeIdNs *id_ns, inactive = { 0 };
-    uint32_t nsid = le32_to_cpu(cmd->nsid);
+    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
     NvmeNamespace *ns = nvme_ns(n, nsid);
 
     trace_nvme_dev_identify_ns(nsid);
@@ -1517,15 +1520,14 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         id_ns = &ns->id_ns;
     }
 
-    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs), cmd,
+    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs),
         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
-    NvmeRequest *req)
+static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeRequest *req)
 {
     static const int data_len = 4 * KiB;
-    uint32_t min_nsid = le32_to_cpu(cmd->nsid);
+    uint32_t min_nsid = le32_to_cpu(req->cmd.nsid);
     uint32_t *list;
     uint16_t ret;
     int i, j = 0;
@@ -1542,14 +1544,13 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
             break;
         }
     }
-    ret = nvme_dma(n, (uint8_t *) list, data_len, cmd,
+    ret = nvme_dma(n, (uint8_t *) list, data_len,
         DMA_DIRECTION_FROM_DEVICE, req);
     g_free(list);
     return ret;
 }
 
-static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
-    NvmeRequest *req)
+static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
 {
     static const int len = 4096;
 
@@ -1560,7 +1561,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
         uint8_t nid[16];
     };
 
-    uint32_t nsid = le32_to_cpu(cmd->nsid);
+    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
 
     struct ns_descr *list;
     uint16_t ret;
@@ -1582,34 +1583,33 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
     list->nidl = 0x10;
     *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
 
-    ret = nvme_dma(n, (uint8_t *) list, len, cmd, DMA_DIRECTION_FROM_DEVICE,
-        req);
+    ret = nvme_dma(n, (uint8_t *) list, len, DMA_DIRECTION_FROM_DEVICE, req);
     g_free(list);
     return ret;
 }
 
-static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
 {
-    NvmeIdentify *c = (NvmeIdentify *)cmd;
+    NvmeIdentify *c = (NvmeIdentify *) &req->cmd;
 
     switch (le32_to_cpu(c->cns)) {
     case 0x00:
-        return nvme_identify_ns(n, cmd, req);
+        return nvme_identify_ns(n, req);
     case 0x01:
-        return nvme_identify_ctrl(n, cmd, req);
+        return nvme_identify_ctrl(n, req);
     case 0x02:
-        return nvme_identify_ns_list(n, cmd, req);
+        return nvme_identify_ns_list(n, req);
     case 0x03:
-        return nvme_identify_ns_descr_list(n, cmd, req);
+        return nvme_identify_ns_descr_list(n, req);
     default:
         trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
         return NVME_INVALID_FIELD | NVME_DNR;
     }
 }
 
-static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
 {
-    uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
+    uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
 
     req->cqe.result = 1;
     if (nvme_check_sqid(n, sqid)) {
@@ -1659,19 +1659,18 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
     return cpu_to_le64(ts.all);
 }
 
-static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
-    NvmeRequest *req)
+static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 {
     uint64_t timestamp = nvme_get_timestamp(n);
 
-    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp), cmd,
+    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp),
         DMA_DIRECTION_FROM_DEVICE, req);
 }
 
-static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
 {
-    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
-    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
+    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
+    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
     uint32_t result;
 
     trace_nvme_dev_getfeat(nvme_cid(req), dw10);
@@ -1717,7 +1716,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
         trace_nvme_dev_getfeat_numq(result);
         break;
     case NVME_TIMESTAMP:
-        return nvme_get_feature_timestamp(n, cmd, req);
+        return nvme_get_feature_timestamp(n, req);
     case NVME_INTERRUPT_COALESCING:
         result = cpu_to_le32(n->features.int_coalescing);
         break;
@@ -1743,13 +1742,12 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
-    NvmeRequest *req)
+static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
 {
     uint16_t ret;
     uint64_t timestamp;
 
-    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp), cmd,
+    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp),
         DMA_DIRECTION_TO_DEVICE, req);
     if (ret != NVME_SUCCESS) {
         return ret;
@@ -1760,12 +1758,12 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
 {
     NvmeNamespace *ns;
 
-    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
-    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
+    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
+    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
 
     trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
 
@@ -1824,7 +1822,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
             ((n->params.num_queues - 2) << 16));
         break;
     case NVME_TIMESTAMP:
-        return nvme_set_feature_timestamp(n, cmd, req);
+        return nvme_set_feature_timestamp(n, req);
     case NVME_ASYNCHRONOUS_EVENT_CONF:
         n->features.async_config = dw11;
         break;
@@ -1843,7 +1841,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_SUCCESS;
 }
 
-static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_nvme_dev_aer(nvme_cid(req));
 
@@ -1862,31 +1860,31 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
     return NVME_NO_COMPLETE;
 }
 
-static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
+static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
-    switch (cmd->opcode) {
+    switch (req->cmd.opcode) {
     case NVME_ADM_CMD_DELETE_SQ:
-        return nvme_del_sq(n, cmd);
+        return nvme_del_sq(n, req);
     case NVME_ADM_CMD_CREATE_SQ:
-        return nvme_create_sq(n, cmd);
+        return nvme_create_sq(n, req);
     case NVME_ADM_CMD_GET_LOG_PAGE:
-        return nvme_get_log(n, cmd, req);
+        return nvme_get_log(n, req);
     case NVME_ADM_CMD_DELETE_CQ:
-        return nvme_del_cq(n, cmd);
+        return nvme_del_cq(n, req);
     case NVME_ADM_CMD_CREATE_CQ:
-        return nvme_create_cq(n, cmd);
+        return nvme_create_cq(n, req);
     case NVME_ADM_CMD_IDENTIFY:
-        return nvme_identify(n, cmd, req);
+        return nvme_identify(n, req);
     case NVME_ADM_CMD_ABORT:
-        return nvme_abort(n, cmd, req);
+        return nvme_abort(n, req);
     case NVME_ADM_CMD_SET_FEATURES:
-        return nvme_set_feature(n, cmd, req);
+        return nvme_set_feature(n, req);
     case NVME_ADM_CMD_GET_FEATURES:
-        return nvme_get_feature(n, cmd, req);
+        return nvme_get_feature(n, req);
     case NVME_ADM_CMD_ASYNC_EV_REQ:
-        return nvme_aer(n, cmd, req);
+        return nvme_aer(n, req);
     default:
-        trace_nvme_dev_err_invalid_admin_opc(cmd->opcode);
+        trace_nvme_dev_err_invalid_admin_opc(req->cmd.opcode);
         return NVME_INVALID_OPCODE | NVME_DNR;
     }
 }
@@ -1919,8 +1917,8 @@ static void nvme_process_sq(void *opaque)
         req->cqe.cid = cmd.cid;
         memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
 
-        status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
-            nvme_admin_cmd(n, &cmd, req);
+        status = sq->sqid ? nvme_io_cmd(n, req) :
+            nvme_admin_cmd(n, req);
         if (status != NVME_NO_COMPLETE) {
             req->status = status;
             nvme_enqueue_req_completion(cq, req);
-- 
2.25.0




* [PATCH v5 26/26] nvme: make lba data size configurable
       [not found]   ` <CGME20200204095232eucas1p2b3264104447a42882f10edb06608ece5@eucas1p2.samsung.com>
@ 2020-02-04  9:52     ` Klaus Jensen
  2020-02-04 16:43       ` Keith Busch
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Jensen @ 2020-02-04  9:52 UTC (permalink / raw)
  To: qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Klaus Jensen,
	Keith Busch, Javier Gonzalez

Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/block/nvme-ns.c | 2 +-
 hw/block/nvme-ns.h | 4 +++-
 hw/block/nvme.c    | 1 +
 3 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
index 0e5be44486f4..981d7101b8f2 100644
--- a/hw/block/nvme-ns.c
+++ b/hw/block/nvme-ns.c
@@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
 {
     NvmeIdNs *id_ns = &ns->id_ns;
 
-    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
+    id_ns->lbaf[0].ds = ns->params.lbads;
     id_ns->nuse = id_ns->ncap = id_ns->nsze =
         cpu_to_le64(nvme_ns_nlbas(ns));
 
diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
index b564bac25f6d..f1fe4db78b41 100644
--- a/hw/block/nvme-ns.h
+++ b/hw/block/nvme-ns.h
@@ -7,10 +7,12 @@
 
 #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
     DEFINE_PROP_DRIVE("drive", _state, blk), \
-    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
+    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
+    DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
 
 typedef struct NvmeNamespaceParams {
     uint32_t nsid;
+    uint8_t  lbads;
 } NvmeNamespaceParams;
 
 typedef struct NvmeNamespace {
diff --git a/hw/block/nvme.c b/hw/block/nvme.c
index 5fe2e2fe1fa9..67cd8d9d65fe 100644
--- a/hw/block/nvme.c
+++ b/hw/block/nvme.c
@@ -2598,6 +2598,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
     if (n->namespace.blk) {
         ns = &n->namespace;
         ns->params.nsid = 1;
+        ns->params.lbads = BDRV_SECTOR_BITS;
 
         if (nvme_ns_setup(n, ns, &local_err)) {
             error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
-- 
2.25.0




* Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces
  2020-02-04  9:51 ` [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces Klaus Jensen
                     ` (25 preceding siblings ...)
       [not found]   ` <CGME20200204095232eucas1p2b3264104447a42882f10edb06608ece5@eucas1p2.samsung.com>
@ 2020-02-04 10:34   ` no-reply
  2020-02-04 16:47   ` Keith Busch
  27 siblings, 0 replies; 86+ messages in thread
From: no-reply @ 2020-02-04 10:34 UTC (permalink / raw)
  To: k.jensen
  Cc: kwolf, beata.michalska, qemu-block, qemu-devel, mreitz, kbusch,
	its, javier.gonz

Patchew URL: https://patchew.org/QEMU/20200204095208.269131-1-k.jensen@samsung.com/



Hi,

This series seems to have some coding style problems. See output below for
more information:

Subject: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces
Message-id: 20200204095208.269131-1-k.jensen@samsung.com
Type: series

=== TEST SCRIPT BEGIN ===
#!/bin/bash
git rev-parse base > /dev/null || exit 0
git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram
./scripts/checkpatch.pl --mailback base..
=== TEST SCRIPT END ===

From https://github.com/patchew-project/qemu
 * [new tag]         patchew/20200204095208.269131-1-k.jensen@samsung.com -> patchew/20200204095208.269131-1-k.jensen@samsung.com
Switched to a new branch 'test'
a7128db nvme: make lba data size configurable
a0eb0b5 nvme: remove redundant NvmeCmd pointer parameter
0e1bd8c nvme: change controller pci id
f046db4 pci: allocate pci id for nvme
a59d563 nvme: support multiple namespaces
f6b57ba nvme: add support for scatter gather lists
4f78005 nvme: handle dma errors
4e75b11 pci: pass along the return value of dma_memory_rw
c08065d nvme: use preallocated qsg/iov in nvme_dma_prp
3ec0d9f nvme: allow multiple aios per command
540b98f nvme: refactor prp mapping
8fd4e4c nvme: bump supported specification to 1.3
13fceab nvme: make sure ncqr and nsqr is valid
66bf321 nvme: additional tracing
bbb3c58 nvme: add missing mandatory features
77b9455 nvme: add support for the asynchronous event request command
8cdc15c nvme: add support for the get log page command
e612e83 nvme: add temperature threshold feature
ffc039c nvme: refactor device realization
0623024 nvme: add support for the abort command
11b89df nvme: refactor nvme_addr_read
d9f7bf0 nvme: populate the mandatory subnqn and ver fields
f8716d6 nvme: add missing fields in the identify data structures
67d91b0 nvme: move device parameters to separate struct
5f71397 nvme: remove superfluous breaks
f83d65c nvme: rename trace events to nvme_dev

=== OUTPUT BEGIN ===
1/26 Checking commit f83d65c36a14 (nvme: rename trace events to nvme_dev)
2/26 Checking commit 5f71397a1057 (nvme: remove superfluous breaks)
3/26 Checking commit 67d91b03edce (nvme: move device parameters to separate struct)
ERROR: Macros with complex values should be enclosed in parenthesis
#177: FILE: hw/block/nvme.h:6:
+#define DEFINE_NVME_PROPERTIES(_state, _props) \
+    DEFINE_PROP_STRING("serial", _state, _props.serial), \
+    DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
+    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)

total: 1 errors, 0 warnings, 181 lines checked

Patch 3/26 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

4/26 Checking commit f8716d6d577c (nvme: add missing fields in the identify data structures)
5/26 Checking commit d9f7bf0bea10 (nvme: populate the mandatory subnqn and ver fields)
6/26 Checking commit 11b89df75991 (nvme: refactor nvme_addr_read)
7/26 Checking commit 06230241911a (nvme: add support for the abort command)
8/26 Checking commit ffc039c6a990 (nvme: refactor device realization)
9/26 Checking commit e612e83d5189 (nvme: add temperature threshold feature)
10/26 Checking commit 8cdc15c88d53 (nvme: add support for the get log page command)
11/26 Checking commit 77b945573421 (nvme: add support for the asynchronous event request command)
12/26 Checking commit bbb3c586241a (nvme: add missing mandatory features)
13/26 Checking commit 66bf3218e7f3 (nvme: additional tracing)
14/26 Checking commit 13fceab275cc (nvme: make sure ncqr and nsqr is valid)
15/26 Checking commit 8fd4e4c6fb73 (nvme: bump supported specification to 1.3)
16/26 Checking commit 540b98f3c98d (nvme: refactor prp mapping)
17/26 Checking commit 3ec0d9f718ea (nvme: allow multiple aios per command)
18/26 Checking commit c08065deefa3 (nvme: use preallocated qsg/iov in nvme_dma_prp)
19/26 Checking commit 4e75b1170a2f (pci: pass along the return value of dma_memory_rw)
20/26 Checking commit 4f78005aa73f (nvme: handle dma errors)
21/26 Checking commit f6b57ba3f3f8 (nvme: add support for scatter gather lists)
22/26 Checking commit a59d5630a44f (nvme: support multiple namespaces)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#42: 
new file mode 100644

ERROR: Macros with complex values should be enclosed in parenthesis
#218: FILE: hw/block/nvme-ns.h:8:
+#define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
+    DEFINE_PROP_DRIVE("drive", _state, blk), \
+    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)

total: 1 errors, 1 warnings, 816 lines checked

Patch 22/26 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

23/26 Checking commit f046db41f34b (pci: allocate pci id for nvme)
WARNING: added, moved or deleted file(s), does MAINTAINERS need updating?
#29: 
new file mode 100644

total: 0 errors, 1 warnings, 31 lines checked

Patch 23/26 has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.
24/26 Checking commit 0e1bd8c8281e (nvme: change controller pci id)
25/26 Checking commit a0eb0b55ad5d (nvme: remove redundant NvmeCmd pointer parameter)
26/26 Checking commit a7128db3f7bf (nvme: make lba data size configurable)
=== OUTPUT END ===

Test command exited with code: 1


The full log is available at
http://patchew.org/logs/20200204095208.269131-1-k.jensen@samsung.com/testing.checkpatch/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [PATCH v5 22/26] nvme: support multiple namespaces
  2020-02-04  9:52     ` [PATCH v5 22/26] nvme: support multiple namespaces Klaus Jensen
@ 2020-02-04 16:31       ` Keith Busch
  2020-02-06  7:27         ` Klaus Birkelund Jensen
  2020-02-12 12:34       ` Maxim Levitsky
  1 sibling, 1 reply; 86+ messages in thread
From: Keith Busch @ 2020-02-04 16:31 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

On Tue, Feb 04, 2020 at 10:52:04AM +0100, Klaus Jensen wrote:
> This adds support for multiple namespaces by introducing a new 'nvme-ns'
> device model. The nvme device creates a bus named from the device name
> ('id'). The nvme-ns devices then connect to this and register
> themselves with the nvme device.
> 
> This changes how an nvme device is created. Example with two namespaces:
> 
>   -drive file=nvme0n1.img,if=none,id=disk1
>   -drive file=nvme0n2.img,if=none,id=disk2
>   -device nvme,serial=deadbeef,id=nvme0
>   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
>   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> 
> The drive property is kept on the nvme device to keep the change
> backward compatible, but the property is now optional. Specifying a
> drive for the nvme device will always create the namespace with nsid 1.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

I like this feature a lot, thanks for doing it.

Reviewed-by: Keith Busch <kbusch@kernel.org>

> @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>      uint64_t units_read = 0, units_written = 0, read_commands = 0,
>          write_commands = 0;
>      NvmeSmartLog smart;
> -    BlockAcctStats *s;
>  
>      if (nsid && nsid != 0xffffffff) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }

This is totally optional, but worth mentioning: this patch makes it
possible to remove this check and allow per-namespace smart logs. The
ID_CTRL.LPA would need to be updated to reflect that if you wanted to
go that route.
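
Something like this untested sketch is what I have in mind for the
per-namespace case (the types below are simplified stand-ins for the
real QEMU structures, and the helper name is made up):

```c
#include <stdint.h>

#define NVME_MAX_NS 4
#define NVME_NSID_BROADCAST 0xffffffffu

/* Simplified stand-ins for the QEMU accounting structures. */
typedef struct { uint64_t units_read, units_written; } AcctStats;
typedef struct { AcctStats stats; } Namespace;
typedef struct { Namespace *ns[NVME_MAX_NS + 1]; } Ctrl;

/*
 * Accumulate stats either for a single namespace or, for the broadcast
 * nsid, across all of them. The single-namespace branch is the part
 * this patch makes possible (advertised via ID_CTRL.LPA bit 0).
 */
static int smart_accumulate(Ctrl *n, uint32_t nsid, AcctStats *out)
{
    out->units_read = 0;
    out->units_written = 0;

    if (nsid == NVME_NSID_BROADCAST) {
        for (uint32_t i = 1; i <= NVME_MAX_NS; i++) {
            if (n->ns[i]) {
                out->units_read += n->ns[i]->stats.units_read;
                out->units_written += n->ns[i]->stats.units_written;
            }
        }
        return 0;
    }

    if (nsid == 0 || nsid > NVME_MAX_NS || !n->ns[nsid]) {
        return -1; /* invalid namespace */
    }

    out->units_read = n->ns[nsid]->stats.units_read;
    out->units_written = n->ns[nsid]->stats.units_written;
    return 0;
}
```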


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 24/26] nvme: change controller pci id
  2020-02-04  9:52     ` [PATCH v5 24/26] nvme: change controller pci id Klaus Jensen
@ 2020-02-04 16:35       ` Keith Busch
  2020-02-06  7:28         ` Klaus Birkelund Jensen
  2020-02-12 12:37       ` Maxim Levitsky
  1 sibling, 1 reply; 86+ messages in thread
From: Keith Busch @ 2020-02-04 16:35 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

On Tue, Feb 04, 2020 at 10:52:06AM +0100, Klaus Jensen wrote:
> There are two reasons for changing this:
> 
>   1. The nvme device currently uses an internal Intel device id.
> 
>   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
>      support multiple namespaces" the controller device no longer has
>      the quirks that the Linux kernel thinks it has.
> 
>      As the quirks are applied based on pci vendor and device id, change
>      them to get rid of the quirks.
> 
> To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> the nvme device to force use of the Intel vendor and device id. This is
> off by default, but a compat property is added to set it for machine
> types 4.2 and older.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

Yay, thank you for following through on getting this identifier assigned.

Reviewed-by: Keith Busch <kbusch@kernel.org>



* Re: [PATCH v5 26/26] nvme: make lba data size configurable
  2020-02-04  9:52     ` [PATCH v5 26/26] nvme: make lba data size configurable Klaus Jensen
@ 2020-02-04 16:43       ` Keith Busch
  2020-02-06  7:24         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Keith Busch @ 2020-02-04 16:43 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme-ns.c | 2 +-
>  hw/block/nvme-ns.h | 4 +++-
>  hw/block/nvme.c    | 1 +
>  3 files changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> index 0e5be44486f4..981d7101b8f2 100644
> --- a/hw/block/nvme-ns.c
> +++ b/hw/block/nvme-ns.c
> @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
>  {
>      NvmeIdNs *id_ns = &ns->id_ns;
>  
> -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> +    id_ns->lbaf[0].ds = ns->params.lbads;
>      id_ns->nuse = id_ns->ncap = id_ns->nsze =
>          cpu_to_le64(nvme_ns_nlbas(ns));
>  
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> index b564bac25f6d..f1fe4db78b41 100644
> --- a/hw/block/nvme-ns.h
> +++ b/hw/block/nvme-ns.h
> @@ -7,10 +7,12 @@
>  
>  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
>      DEFINE_PROP_DRIVE("drive", _state, blk), \
> -    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> +    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> +    DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)

I think we need to validate the parameter is between 9 and 12 before
trusting it can be used safely.

Alternatively, add supported formats to the lbaf array and let the host
decide on a live system with the 'format' command.



* Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces
  2020-02-04  9:51 ` [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces Klaus Jensen
                     ` (26 preceding siblings ...)
  2020-02-04 10:34   ` [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces no-reply
@ 2020-02-04 16:47   ` Keith Busch
  2020-02-06  7:29     ` Klaus Birkelund Jensen
  27 siblings, 1 reply; 86+ messages in thread
From: Keith Busch @ 2020-02-04 16:47 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

On Tue, Feb 04, 2020 at 10:51:42AM +0100, Klaus Jensen wrote:
> Hi,
> 
> 
> Changes since v4
>  - Changed vendor and device id to use a Red Hat allocated one. For
>    backwards compatibility add the 'x-use-intel-id' nvme device
>    parameter. This is off by default but is added as a machine compat
>    property to be true for machine types <= 4.2.
> 
>  - SGL mapping code has been refactored.

Looking pretty good to me. For the series beyond the individually
reviewed patches:

Acked-by: Keith Busch <kbusch@kernel.org>

If you need to send a v5, you may add my tag to the patches that are not
substantially modified if you like.



* Re: [PATCH v5 26/26] nvme: make lba data size configurable
  2020-02-04 16:43       ` Keith Busch
@ 2020-02-06  7:24         ` Klaus Birkelund Jensen
  2020-02-12 12:39           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-02-06  7:24 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

[-- Attachment #1: Type: text/plain, Size: 1869 bytes --]

On Feb  5 01:43, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme-ns.c | 2 +-
> >  hw/block/nvme-ns.h | 4 +++-
> >  hw/block/nvme.c    | 1 +
> >  3 files changed, 5 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > index 0e5be44486f4..981d7101b8f2 100644
> > --- a/hw/block/nvme-ns.c
> > +++ b/hw/block/nvme-ns.c
> > @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
> >  {
> >      NvmeIdNs *id_ns = &ns->id_ns;
> >  
> > -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +    id_ns->lbaf[0].ds = ns->params.lbads;
> >      id_ns->nuse = id_ns->ncap = id_ns->nsze =
> >          cpu_to_le64(nvme_ns_nlbas(ns));
> >  
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > index b564bac25f6d..f1fe4db78b41 100644
> > --- a/hw/block/nvme-ns.h
> > +++ b/hw/block/nvme-ns.h
> > @@ -7,10 +7,12 @@
> >  
> >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> >      DEFINE_PROP_DRIVE("drive", _state, blk), \
> > -    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > +    DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
> 
> I think we need to validate the parameter is between 9 and 12 before
> trusting it can be used safely.
> 
> Alternatively, add supported formats to the lbaf array and let the host
> decide on a live system with the 'format' command.

The device does not yet support Format NVM, but we have a patch ready
for that to be submitted with a new series when this is merged.

For now, while it does not support Format, I will change this patch such
that it defaults to 9 (BDRV_SECTOR_BITS) and only accepts 12 as an
alternative (while always keeping the number of formats available to 1).





* Re: [PATCH v5 22/26] nvme: support multiple namespaces
  2020-02-04 16:31       ` Keith Busch
@ 2020-02-06  7:27         ` Klaus Birkelund Jensen
  0 siblings, 0 replies; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-02-06  7:27 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

[-- Attachment #1: Type: text/plain, Size: 2111 bytes --]

On Feb  5 01:31, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:04AM +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> 
> I like this feature a lot, thanks for doing it.
> 
> Reviewed-by: Keith Busch <kbusch@kernel.org>
> 
> > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> >      uint64_t units_read = 0, units_written = 0, read_commands = 0,
> >          write_commands = 0;
> >      NvmeSmartLog smart;
> > -    BlockAcctStats *s;
> >  
> >      if (nsid && nsid != 0xffffffff) {
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> 
> This is totally optional, but worth mentioning: this patch makes it
> possible to remove this check and allow per-namespace smart logs. The
> ID_CTRL.LPA would need to be updated to reflect that if you wanted to
> go that route.

Yeah, I thought about that, but with NVMe v1.4 support arriving in a
later series, there is no longer any namespace-specific stuff in the
log page anyway.

The spec isn't really clear on what the preferred behavior for a 1.4
compliant device is. Either

  1. LPA bit 0 set and just return the same page for each namespace or,
  2. LPA bit 0 unset and fail when NSID is set






* Re: [PATCH v5 24/26] nvme: change controller pci id
  2020-02-04 16:35       ` Keith Busch
@ 2020-02-06  7:28         ` Klaus Birkelund Jensen
  0 siblings, 0 replies; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-02-06  7:28 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

[-- Attachment #1: Type: text/plain, Size: 1115 bytes --]

On Feb  5 01:35, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:52:06AM +0100, Klaus Jensen wrote:
> > There are two reasons for changing this:
> > 
> >   1. The nvme device currently uses an internal Intel device id.
> > 
> >   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
> >      support multiple namespaces" the controller device no longer has
> >      the quirks that the Linux kernel thinks it has.
> > 
> >      As the quirks are applied based on pci vendor and device id, change
> >      them to get rid of the quirks.
> > 
> > To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> > the nvme device to force use of the Intel vendor and device id. This is
> > off by default but add a compat property to set this for machines 4.2
> > and older.
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> 
> Yay, thank you for following through on getting this identifier assigned.
> 
> Reviewed-by: Keith Busch <kbusch@kernel.org>

This is technically not "officially" sanctioned yet, but I got an
indication from Gerd that we are good to proceed with this.






* Re: [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces
  2020-02-04 16:47   ` Keith Busch
@ 2020-02-06  7:29     ` Klaus Birkelund Jensen
  0 siblings, 0 replies; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-02-06  7:29 UTC (permalink / raw)
  To: Keith Busch
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

[-- Attachment #1: Type: text/plain, Size: 932 bytes --]

On Feb  5 01:47, Keith Busch wrote:
> On Tue, Feb 04, 2020 at 10:51:42AM +0100, Klaus Jensen wrote:
> > Hi,
> > 
> > 
> > Changes since v4
> >  - Changed vendor and device id to use a Red Hat allocated one. For
> >    backwards compatibility add the 'x-use-intel-id' nvme device
> >    parameter. This is off by default but is added as a machine compat
> >    property to be true for machine types <= 4.2.
> > 
> >  - SGL mapping code has been refactored.
> 
> Looking pretty good to me. For the series beyond the individually
> reviewed patches:
> 
> Acked-by: Keith Busch <kbusch@kernel.org>
> 
> If you need to send a v5, you may add my tag to the patches that are not
> > substantially modified if you like.

I'll send a v6 with the changes to "nvme: make lba data size
configurable". It won't be substantially changed, I will just only
accept 9 and 12 as valid values for lbads.

Thanks for the Ack's and Reviews Keith!


Klaus





* Re: [PATCH v5 01/26] nvme: rename trace events to nvme_dev
  2020-02-04  9:51     ` [PATCH v5 01/26] nvme: rename trace events to nvme_dev Klaus Jensen
@ 2020-02-12  9:08       ` Maxim Levitsky
  2020-02-12 13:08         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:08 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Change the prefix of all nvme device related trace events to 'nvme_dev'
> to not clash with trace events from the nvme block driver.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c       | 185 +++++++++++++++++++++---------------------
>  hw/block/trace-events | 172 +++++++++++++++++++--------------------
>  2 files changed, 178 insertions(+), 179 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index d28335cbf377..dd548d9b6605 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -112,16 +112,16 @@ static void nvme_irq_assert(NvmeCtrl *n, NvmeCQueue *cq)
>  {
>      if (cq->irq_enabled) {
>          if (msix_enabled(&(n->parent_obj))) {
> -            trace_nvme_irq_msix(cq->vector);
> +            trace_nvme_dev_irq_msix(cq->vector);
>              msix_notify(&(n->parent_obj), cq->vector);
>          } else {
> -            trace_nvme_irq_pin();
> +            trace_nvme_dev_irq_pin();
>              assert(cq->cqid < 64);
>              n->irq_status |= 1 << cq->cqid;
>              nvme_irq_check(n);
>          }
>      } else {
> -        trace_nvme_irq_masked();
> +        trace_nvme_dev_irq_masked();
>      }
>  }
>  
> @@ -146,7 +146,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
>      int num_prps = (len >> n->page_bits) + 1;
>  
>      if (unlikely(!prp1)) {
> -        trace_nvme_err_invalid_prp();
> +        trace_nvme_dev_err_invalid_prp();
>          return NVME_INVALID_FIELD | NVME_DNR;
>      } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
>                 prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> @@ -160,7 +160,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
>      len -= trans_len;
>      if (len) {
>          if (unlikely(!prp2)) {
> -            trace_nvme_err_invalid_prp2_missing();
> +            trace_nvme_dev_err_invalid_prp2_missing();
>              goto unmap;
>          }
>          if (len > n->page_size) {
> @@ -176,7 +176,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
>  
>                  if (i == n->max_prp_ents - 1 && len > n->page_size) {
>                      if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> -                        trace_nvme_err_invalid_prplist_ent(prp_ent);
> +                        trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
>                          goto unmap;
>                      }
>  
> @@ -189,7 +189,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
>                  }
>  
>                  if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> -                    trace_nvme_err_invalid_prplist_ent(prp_ent);
> +                    trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
>                      goto unmap;
>                  }
>  
> @@ -204,7 +204,7 @@ static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
>              }
>          } else {
>              if (unlikely(prp2 & (n->page_size - 1))) {
> -                trace_nvme_err_invalid_prp2_align(prp2);
> +                trace_nvme_dev_err_invalid_prp2_align(prp2);
>                  goto unmap;
>              }
>              if (qsg->nsg) {
> @@ -252,20 +252,20 @@ static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>      QEMUIOVector iov;
>      uint16_t status = NVME_SUCCESS;
>  
> -    trace_nvme_dma_read(prp1, prp2);
> +    trace_nvme_dev_dma_read(prp1, prp2);
>  
>      if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>      if (qsg.nsg > 0) {
>          if (unlikely(dma_buf_read(ptr, len, &qsg))) {
> -            trace_nvme_err_invalid_dma();
> +            trace_nvme_dev_err_invalid_dma();
>              status = NVME_INVALID_FIELD | NVME_DNR;
>          }
>          qemu_sglist_destroy(&qsg);
>      } else {
>          if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
> -            trace_nvme_err_invalid_dma();
> +            trace_nvme_dev_err_invalid_dma();
>              status = NVME_INVALID_FIELD | NVME_DNR;
>          }
>          qemu_iovec_destroy(&iov);
> @@ -354,7 +354,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      uint32_t count = nlb << data_shift;
>  
>      if (unlikely(slba + nlb > ns->id_ns.nsze)) {
> -        trace_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> +        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> @@ -382,11 +382,11 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
>      enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
>  
> -    trace_nvme_rw(is_write ? "write" : "read", nlb, data_size, slba);
> +    trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
>  
>      if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
>          block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> -        trace_nvme_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> +        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> @@ -421,7 +421,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      uint32_t nsid = le32_to_cpu(cmd->nsid);
>  
>      if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> -        trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> +        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
>  
> @@ -435,7 +435,7 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      case NVME_CMD_READ:
>          return nvme_rw(n, ns, cmd, req);
>      default:
> -        trace_nvme_err_invalid_opc(cmd->opcode);
> +        trace_nvme_dev_err_invalid_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
>      }
>  }
> @@ -460,11 +460,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      uint16_t qid = le16_to_cpu(c->qid);
>  
>      if (unlikely(!qid || nvme_check_sqid(n, qid))) {
> -        trace_nvme_err_invalid_del_sq(qid);
> +        trace_nvme_dev_err_invalid_del_sq(qid);
>          return NVME_INVALID_QID | NVME_DNR;
>      }
>  
> -    trace_nvme_del_sq(qid);
> +    trace_nvme_dev_del_sq(qid);
>  
>      sq = n->sq[qid];
>      while (!QTAILQ_EMPTY(&sq->out_req_list)) {
> @@ -528,26 +528,26 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      uint16_t qflags = le16_to_cpu(c->sq_flags);
>      uint64_t prp1 = le64_to_cpu(c->prp1);
>  
> -    trace_nvme_create_sq(prp1, sqid, cqid, qsize, qflags);
> +    trace_nvme_dev_create_sq(prp1, sqid, cqid, qsize, qflags);
>  
>      if (unlikely(!cqid || nvme_check_cqid(n, cqid))) {
> -        trace_nvme_err_invalid_create_sq_cqid(cqid);
> +        trace_nvme_dev_err_invalid_create_sq_cqid(cqid);
>          return NVME_INVALID_CQID | NVME_DNR;
>      }
>      if (unlikely(!sqid || !nvme_check_sqid(n, sqid))) {
> -        trace_nvme_err_invalid_create_sq_sqid(sqid);
> +        trace_nvme_dev_err_invalid_create_sq_sqid(sqid);
>          return NVME_INVALID_QID | NVME_DNR;
>      }
>      if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
> -        trace_nvme_err_invalid_create_sq_size(qsize);
> +        trace_nvme_dev_err_invalid_create_sq_size(qsize);
>          return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
>      }
>      if (unlikely(!prp1 || prp1 & (n->page_size - 1))) {
> -        trace_nvme_err_invalid_create_sq_addr(prp1);
> +        trace_nvme_dev_err_invalid_create_sq_addr(prp1);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>      if (unlikely(!(NVME_SQ_FLAGS_PC(qflags)))) {
> -        trace_nvme_err_invalid_create_sq_qflags(NVME_SQ_FLAGS_PC(qflags));
> +        trace_nvme_dev_err_invalid_create_sq_qflags(NVME_SQ_FLAGS_PC(qflags));
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>      sq = g_malloc0(sizeof(*sq));
> @@ -573,17 +573,17 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      uint16_t qid = le16_to_cpu(c->qid);
>  
>      if (unlikely(!qid || nvme_check_cqid(n, qid))) {
> -        trace_nvme_err_invalid_del_cq_cqid(qid);
> +        trace_nvme_dev_err_invalid_del_cq_cqid(qid);
>          return NVME_INVALID_CQID | NVME_DNR;
>      }
>  
>      cq = n->cq[qid];
>      if (unlikely(!QTAILQ_EMPTY(&cq->sq_list))) {
> -        trace_nvme_err_invalid_del_cq_notempty(qid);
> +        trace_nvme_dev_err_invalid_del_cq_notempty(qid);
>          return NVME_INVALID_QUEUE_DEL;
>      }
>      nvme_irq_deassert(n, cq);
> -    trace_nvme_del_cq(qid);
> +    trace_nvme_dev_del_cq(qid);
>      nvme_free_cq(cq, n);
>      return NVME_SUCCESS;
>  }
> @@ -616,27 +616,27 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      uint16_t qflags = le16_to_cpu(c->cq_flags);
>      uint64_t prp1 = le64_to_cpu(c->prp1);
>  
> -    trace_nvme_create_cq(prp1, cqid, vector, qsize, qflags,
> +    trace_nvme_dev_create_cq(prp1, cqid, vector, qsize, qflags,
>                           NVME_CQ_FLAGS_IEN(qflags) != 0);
>  
>      if (unlikely(!cqid || !nvme_check_cqid(n, cqid))) {
> -        trace_nvme_err_invalid_create_cq_cqid(cqid);
> +        trace_nvme_dev_err_invalid_create_cq_cqid(cqid);
>          return NVME_INVALID_CQID | NVME_DNR;
>      }
>      if (unlikely(!qsize || qsize > NVME_CAP_MQES(n->bar.cap))) {
> -        trace_nvme_err_invalid_create_cq_size(qsize);
> +        trace_nvme_dev_err_invalid_create_cq_size(qsize);
>          return NVME_MAX_QSIZE_EXCEEDED | NVME_DNR;
>      }
>      if (unlikely(!prp1)) {
> -        trace_nvme_err_invalid_create_cq_addr(prp1);
> +        trace_nvme_dev_err_invalid_create_cq_addr(prp1);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>      if (unlikely(vector > n->num_queues)) {
> -        trace_nvme_err_invalid_create_cq_vector(vector);
> +        trace_nvme_dev_err_invalid_create_cq_vector(vector);
>          return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
>      }
>      if (unlikely(!(NVME_CQ_FLAGS_PC(qflags)))) {
> -        trace_nvme_err_invalid_create_cq_qflags(NVME_CQ_FLAGS_PC(qflags));
> +        trace_nvme_dev_err_invalid_create_cq_qflags(NVME_CQ_FLAGS_PC(qflags));
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> @@ -651,7 +651,7 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
>      uint64_t prp1 = le64_to_cpu(c->prp1);
>      uint64_t prp2 = le64_to_cpu(c->prp2);
>  
> -    trace_nvme_identify_ctrl();
> +    trace_nvme_dev_identify_ctrl();
>  
>      return nvme_dma_read_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
>          prp1, prp2);
> @@ -664,10 +664,10 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
>      uint64_t prp1 = le64_to_cpu(c->prp1);
>      uint64_t prp2 = le64_to_cpu(c->prp2);
>  
> -    trace_nvme_identify_ns(nsid);
> +    trace_nvme_dev_identify_ns(nsid);
>  
>      if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> -        trace_nvme_err_invalid_ns(nsid, n->num_namespaces);
> +        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
>  
> @@ -687,7 +687,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
>      uint16_t ret;
>      int i, j = 0;
>  
> -    trace_nvme_identify_nslist(min_nsid);
> +    trace_nvme_dev_identify_nslist(min_nsid);
>  
>      list = g_malloc0(data_len);
>      for (i = 0; i < n->num_namespaces; i++) {
> @@ -716,14 +716,14 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>      case 0x02:
>          return nvme_identify_nslist(n, c);
>      default:
> -        trace_nvme_err_invalid_identify_cns(le32_to_cpu(c->cns));
> +        trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  }
>  
>  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
>  {
> -    trace_nvme_setfeat_timestamp(ts);
> +    trace_nvme_dev_setfeat_timestamp(ts);
>  
>      n->host_timestamp = le64_to_cpu(ts);
>      n->timestamp_set_qemu_clock_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> @@ -756,7 +756,7 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
>      /* If the host timestamp is non-zero, set the timestamp origin */
>      ts.origin = n->host_timestamp ? 0x01 : 0x00;
>  
> -    trace_nvme_getfeat_timestamp(ts.all);
> +    trace_nvme_dev_getfeat_timestamp(ts.all);
>  
>      return cpu_to_le64(ts.all);
>  }
> @@ -780,17 +780,17 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      switch (dw10) {
>      case NVME_VOLATILE_WRITE_CACHE:
>          result = blk_enable_write_cache(n->conf.blk);
> -        trace_nvme_getfeat_vwcache(result ? "enabled" : "disabled");
> +        trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
>          break;
>      case NVME_NUMBER_OF_QUEUES:
>          result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
> -        trace_nvme_getfeat_numq(result);
> +        trace_nvme_dev_getfeat_numq(result);
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_get_feature_timestamp(n, cmd);
>          break;
>      default:
> -        trace_nvme_err_invalid_getfeat(dw10);
> +        trace_nvme_dev_err_invalid_getfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> @@ -826,9 +826,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
>          break;
>      case NVME_NUMBER_OF_QUEUES:
> -        trace_nvme_setfeat_numq((dw11 & 0xFFFF) + 1,
> -                                ((dw11 >> 16) & 0xFFFF) + 1,
> -                                n->num_queues - 1, n->num_queues - 1);
> +        trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
> +            ((dw11 >> 16) & 0xFFFF) + 1, n->num_queues - 1, n->num_queues - 1);
>          req->cqe.result =
>              cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
>          break;
> @@ -838,7 +837,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>  
>      default:
> -        trace_nvme_err_invalid_setfeat(dw10);
> +        trace_nvme_dev_err_invalid_setfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>      return NVME_SUCCESS;
> @@ -862,7 +861,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      case NVME_ADM_CMD_GET_FEATURES:
>          return nvme_get_feature(n, cmd, req);
>      default:
> -        trace_nvme_err_invalid_admin_opc(cmd->opcode);
> +        trace_nvme_dev_err_invalid_admin_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
>      }
>  }
> @@ -925,77 +924,77 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>      uint32_t page_size = 1 << page_bits;
>  
>      if (unlikely(n->cq[0])) {
> -        trace_nvme_err_startfail_cq();
> +        trace_nvme_dev_err_startfail_cq();
>          return -1;
>      }
>      if (unlikely(n->sq[0])) {
> -        trace_nvme_err_startfail_sq();
> +        trace_nvme_dev_err_startfail_sq();
>          return -1;
>      }
>      if (unlikely(!n->bar.asq)) {
> -        trace_nvme_err_startfail_nbarasq();
> +        trace_nvme_dev_err_startfail_nbarasq();
>          return -1;
>      }
>      if (unlikely(!n->bar.acq)) {
> -        trace_nvme_err_startfail_nbaracq();
> +        trace_nvme_dev_err_startfail_nbaracq();
>          return -1;
>      }
>      if (unlikely(n->bar.asq & (page_size - 1))) {
> -        trace_nvme_err_startfail_asq_misaligned(n->bar.asq);
> +        trace_nvme_dev_err_startfail_asq_misaligned(n->bar.asq);
>          return -1;
>      }
>      if (unlikely(n->bar.acq & (page_size - 1))) {
> -        trace_nvme_err_startfail_acq_misaligned(n->bar.acq);
> +        trace_nvme_dev_err_startfail_acq_misaligned(n->bar.acq);
>          return -1;
>      }
>      if (unlikely(NVME_CC_MPS(n->bar.cc) <
>                   NVME_CAP_MPSMIN(n->bar.cap))) {
> -        trace_nvme_err_startfail_page_too_small(
> +        trace_nvme_dev_err_startfail_page_too_small(
>                      NVME_CC_MPS(n->bar.cc),
>                      NVME_CAP_MPSMIN(n->bar.cap));
>          return -1;
>      }
>      if (unlikely(NVME_CC_MPS(n->bar.cc) >
>                   NVME_CAP_MPSMAX(n->bar.cap))) {
> -        trace_nvme_err_startfail_page_too_large(
> +        trace_nvme_dev_err_startfail_page_too_large(
>                      NVME_CC_MPS(n->bar.cc),
>                      NVME_CAP_MPSMAX(n->bar.cap));
>          return -1;
>      }
>      if (unlikely(NVME_CC_IOCQES(n->bar.cc) <
>                   NVME_CTRL_CQES_MIN(n->id_ctrl.cqes))) {
> -        trace_nvme_err_startfail_cqent_too_small(
> +        trace_nvme_dev_err_startfail_cqent_too_small(
>                      NVME_CC_IOCQES(n->bar.cc),
>                      NVME_CTRL_CQES_MIN(n->bar.cap));
>          return -1;
>      }
>      if (unlikely(NVME_CC_IOCQES(n->bar.cc) >
>                   NVME_CTRL_CQES_MAX(n->id_ctrl.cqes))) {
> -        trace_nvme_err_startfail_cqent_too_large(
> +        trace_nvme_dev_err_startfail_cqent_too_large(
>                      NVME_CC_IOCQES(n->bar.cc),
>                      NVME_CTRL_CQES_MAX(n->bar.cap));
>          return -1;
>      }
>      if (unlikely(NVME_CC_IOSQES(n->bar.cc) <
>                   NVME_CTRL_SQES_MIN(n->id_ctrl.sqes))) {
> -        trace_nvme_err_startfail_sqent_too_small(
> +        trace_nvme_dev_err_startfail_sqent_too_small(
>                      NVME_CC_IOSQES(n->bar.cc),
>                      NVME_CTRL_SQES_MIN(n->bar.cap));
>          return -1;
>      }
>      if (unlikely(NVME_CC_IOSQES(n->bar.cc) >
>                   NVME_CTRL_SQES_MAX(n->id_ctrl.sqes))) {
> -        trace_nvme_err_startfail_sqent_too_large(
> +        trace_nvme_dev_err_startfail_sqent_too_large(
>                      NVME_CC_IOSQES(n->bar.cc),
>                      NVME_CTRL_SQES_MAX(n->bar.cap));
>          return -1;
>      }
>      if (unlikely(!NVME_AQA_ASQS(n->bar.aqa))) {
> -        trace_nvme_err_startfail_asqent_sz_zero();
> +        trace_nvme_dev_err_startfail_asqent_sz_zero();
>          return -1;
>      }
>      if (unlikely(!NVME_AQA_ACQS(n->bar.aqa))) {
> -        trace_nvme_err_startfail_acqent_sz_zero();
> +        trace_nvme_dev_err_startfail_acqent_sz_zero();
>          return -1;
>      }
>  
> @@ -1018,14 +1017,14 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>      unsigned size)
>  {
>      if (unlikely(offset & (sizeof(uint32_t) - 1))) {
> -        NVME_GUEST_ERR(nvme_ub_mmiowr_misaligned32,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_misaligned32,
>                         "MMIO write not 32-bit aligned,"
>                         " offset=0x%"PRIx64"", offset);
>          /* should be ignored, fall through for now */
>      }
>  
>      if (unlikely(size < sizeof(uint32_t))) {
> -        NVME_GUEST_ERR(nvme_ub_mmiowr_toosmall,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_toosmall,
>                         "MMIO write smaller than 32-bits,"
>                         " offset=0x%"PRIx64", size=%u",
>                         offset, size);
> @@ -1035,32 +1034,32 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>      switch (offset) {
>      case 0xc:   /* INTMS */
>          if (unlikely(msix_enabled(&(n->parent_obj)))) {
> -            NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
> +            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
>                             "undefined access to interrupt mask set"
>                             " when MSI-X is enabled");
>              /* should be ignored, fall through for now */
>          }
>          n->bar.intms |= data & 0xffffffff;
>          n->bar.intmc = n->bar.intms;
> -        trace_nvme_mmio_intm_set(data & 0xffffffff,
> +        trace_nvme_dev_mmio_intm_set(data & 0xffffffff,
>                                   n->bar.intmc);
>          nvme_irq_check(n);
>          break;
>      case 0x10:  /* INTMC */
>          if (unlikely(msix_enabled(&(n->parent_obj)))) {
> -            NVME_GUEST_ERR(nvme_ub_mmiowr_intmask_with_msix,
> +            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_intmask_with_msix,
>                             "undefined access to interrupt mask clr"
>                             " when MSI-X is enabled");
>              /* should be ignored, fall through for now */
>          }
>          n->bar.intms &= ~(data & 0xffffffff);
>          n->bar.intmc = n->bar.intms;
> -        trace_nvme_mmio_intm_clr(data & 0xffffffff,
> +        trace_nvme_dev_mmio_intm_clr(data & 0xffffffff,
>                                   n->bar.intmc);
>          nvme_irq_check(n);
>          break;
>      case 0x14:  /* CC */
> -        trace_nvme_mmio_cfg(data & 0xffffffff);
> +        trace_nvme_dev_mmio_cfg(data & 0xffffffff);
>          /* Windows first sends data, then sends enable bit */
>          if (!NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc) &&
>              !NVME_CC_SHN(data) && !NVME_CC_SHN(n->bar.cc))
> @@ -1071,42 +1070,42 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>          if (NVME_CC_EN(data) && !NVME_CC_EN(n->bar.cc)) {
>              n->bar.cc = data;
>              if (unlikely(nvme_start_ctrl(n))) {
> -                trace_nvme_err_startfail();
> +                trace_nvme_dev_err_startfail();
>                  n->bar.csts = NVME_CSTS_FAILED;
>              } else {
> -                trace_nvme_mmio_start_success();
> +                trace_nvme_dev_mmio_start_success();
>                  n->bar.csts = NVME_CSTS_READY;
>              }
>          } else if (!NVME_CC_EN(data) && NVME_CC_EN(n->bar.cc)) {
> -            trace_nvme_mmio_stopped();
> +            trace_nvme_dev_mmio_stopped();
>              nvme_clear_ctrl(n);
>              n->bar.csts &= ~NVME_CSTS_READY;
>          }
>          if (NVME_CC_SHN(data) && !(NVME_CC_SHN(n->bar.cc))) {
> -            trace_nvme_mmio_shutdown_set();
> +            trace_nvme_dev_mmio_shutdown_set();
>              nvme_clear_ctrl(n);
>              n->bar.cc = data;
>              n->bar.csts |= NVME_CSTS_SHST_COMPLETE;
>          } else if (!NVME_CC_SHN(data) && NVME_CC_SHN(n->bar.cc)) {
> -            trace_nvme_mmio_shutdown_cleared();
> +            trace_nvme_dev_mmio_shutdown_cleared();
>              n->bar.csts &= ~NVME_CSTS_SHST_COMPLETE;
>              n->bar.cc = data;
>          }
>          break;
>      case 0x1C:  /* CSTS */
>          if (data & (1 << 4)) {
> -            NVME_GUEST_ERR(nvme_ub_mmiowr_ssreset_w1c_unsupported,
> +            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_ssreset_w1c_unsupported,
>                             "attempted to W1C CSTS.NSSRO"
>                             " but CAP.NSSRS is zero (not supported)");
>          } else if (data != 0) {
> -            NVME_GUEST_ERR(nvme_ub_mmiowr_ro_csts,
> +            NVME_GUEST_ERR(nvme_dev_ub_mmiowr_ro_csts,
>                             "attempted to set a read only bit"
>                             " of controller status");
>          }
>          break;
>      case 0x20:  /* NSSR */
>          if (data == 0x4E564D65) {
> -            trace_nvme_ub_mmiowr_ssreset_unsupported();
> +            trace_nvme_dev_ub_mmiowr_ssreset_unsupported();
>          } else {
>              /* The spec says that writes of other values have no effect */
>              return;
> @@ -1114,35 +1113,35 @@ static void nvme_write_bar(NvmeCtrl *n, hwaddr offset, uint64_t data,
>          break;
>      case 0x24:  /* AQA */
>          n->bar.aqa = data & 0xffffffff;
> -        trace_nvme_mmio_aqattr(data & 0xffffffff);
> +        trace_nvme_dev_mmio_aqattr(data & 0xffffffff);
>          break;
>      case 0x28:  /* ASQ */
>          n->bar.asq = data;
> -        trace_nvme_mmio_asqaddr(data);
> +        trace_nvme_dev_mmio_asqaddr(data);
>          break;
>      case 0x2c:  /* ASQ hi */
>          n->bar.asq |= data << 32;
> -        trace_nvme_mmio_asqaddr_hi(data, n->bar.asq);
> +        trace_nvme_dev_mmio_asqaddr_hi(data, n->bar.asq);
>          break;
>      case 0x30:  /* ACQ */
> -        trace_nvme_mmio_acqaddr(data);
> +        trace_nvme_dev_mmio_acqaddr(data);
>          n->bar.acq = data;
>          break;
>      case 0x34:  /* ACQ hi */
>          n->bar.acq |= data << 32;
> -        trace_nvme_mmio_acqaddr_hi(data, n->bar.acq);
> +        trace_nvme_dev_mmio_acqaddr_hi(data, n->bar.acq);
>          break;
>      case 0x38:  /* CMBLOC */
> -        NVME_GUEST_ERR(nvme_ub_mmiowr_cmbloc_reserved,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_cmbloc_reserved,
>                         "invalid write to reserved CMBLOC"
>                         " when CMBSZ is zero, ignored");
>          return;
>      case 0x3C:  /* CMBSZ */
> -        NVME_GUEST_ERR(nvme_ub_mmiowr_cmbsz_readonly,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_cmbsz_readonly,
>                         "invalid write to read only CMBSZ, ignored");
>          return;
>      default:
> -        NVME_GUEST_ERR(nvme_ub_mmiowr_invalid,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiowr_invalid,
>                         "invalid MMIO write,"
>                         " offset=0x%"PRIx64", data=%"PRIx64"",
>                         offset, data);
> @@ -1157,12 +1156,12 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
>      uint64_t val = 0;
>  
>      if (unlikely(addr & (sizeof(uint32_t) - 1))) {
> -        NVME_GUEST_ERR(nvme_ub_mmiord_misaligned32,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiord_misaligned32,
>                         "MMIO read not 32-bit aligned,"
>                         " offset=0x%"PRIx64"", addr);
>          /* should RAZ, fall through for now */
>      } else if (unlikely(size < sizeof(uint32_t))) {
> -        NVME_GUEST_ERR(nvme_ub_mmiord_toosmall,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiord_toosmall,
>                         "MMIO read smaller than 32-bits,"
>                         " offset=0x%"PRIx64"", addr);
>          /* should RAZ, fall through for now */
> @@ -1171,7 +1170,7 @@ static uint64_t nvme_mmio_read(void *opaque, hwaddr addr, unsigned size)
>      if (addr < sizeof(n->bar)) {
>          memcpy(&val, ptr + addr, size);
>      } else {
> -        NVME_GUEST_ERR(nvme_ub_mmiord_invalid_ofs,
> +        NVME_GUEST_ERR(nvme_dev_ub_mmiord_invalid_ofs,
>                         "MMIO read beyond last register,"
>                         " offset=0x%"PRIx64", returning 0", addr);
>      }
> @@ -1184,7 +1183,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>      uint32_t qid;
>  
>      if (unlikely(addr & ((1 << 2) - 1))) {
> -        NVME_GUEST_ERR(nvme_ub_db_wr_misaligned,
> +        NVME_GUEST_ERR(nvme_dev_ub_db_wr_misaligned,
>                         "doorbell write not 32-bit aligned,"
>                         " offset=0x%"PRIx64", ignoring", addr);
>          return;
> @@ -1199,7 +1198,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>  
>          qid = (addr - (0x1000 + (1 << 2))) >> 3;
>          if (unlikely(nvme_check_cqid(n, qid))) {
> -            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_cq,
> +            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_cq,
>                             "completion queue doorbell write"
>                             " for nonexistent queue,"
>                             " sqid=%"PRIu32", ignoring", qid);
> @@ -1208,7 +1207,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>  
>          cq = n->cq[qid];
>          if (unlikely(new_head >= cq->size)) {
> -            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_cqhead,
> +            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_cqhead,
>                             "completion queue doorbell write value"
>                             " beyond queue size, sqid=%"PRIu32","
>                             " new_head=%"PRIu16", ignoring",
> @@ -1237,7 +1236,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>  
>          qid = (addr - 0x1000) >> 3;
>          if (unlikely(nvme_check_sqid(n, qid))) {
> -            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_sq,
> +            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_sq,
>                             "submission queue doorbell write"
>                             " for nonexistent queue,"
>                             " sqid=%"PRIu32", ignoring", qid);
> @@ -1246,7 +1245,7 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>  
>          sq = n->sq[qid];
>          if (unlikely(new_tail >= sq->size)) {
> -            NVME_GUEST_ERR(nvme_ub_db_wr_invalid_sqtail,
> +            NVME_GUEST_ERR(nvme_dev_ub_db_wr_invalid_sqtail,
>                             "submission queue doorbell write value"
>                             " beyond queue size, sqid=%"PRIu32","
>                             " new_tail=%"PRIu16", ignoring",
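As a side note on the doorbell decoding in the hunks above: the offsets follow the standard NVMe doorbell layout (base 0x1000, 4-byte registers, CAP.DSTRD == 0), so each queue pair owns 8 bytes, with the SQ tail doorbell first and the CQ head doorbell second. A small sketch of the same arithmetic (the helper name and return shape are illustrative, not taken from the QEMU source):

```python
DB_BASE = 0x1000
STRIDE = 1 << 2  # 4-byte doorbells, since CAP.DSTRD == 0

def decode_doorbell(addr: int):
    """Mirror the qid computations in nvme_process_db()."""
    off = addr - DB_BASE
    qid = off >> 3       # each queue owns an 8-byte SQ/CQ doorbell pair
    if off & STRIDE:     # odd 4-byte slot: completion queue head doorbell
        return ('cq', qid)
    return ('sq', qid)   # even 4-byte slot: submission queue tail doorbell

assert decode_doorbell(0x1000) == ('sq', 0)  # admin SQ tail
assert decode_doorbell(0x1004) == ('cq', 0)  # admin CQ head
assert decode_doorbell(0x1008) == ('sq', 1)  # I/O queue 1 SQ tail
```

This matches the `(addr - 0x1000) >> 3` and `(addr - (0x1000 + (1 << 2))) >> 3` expressions in the patch.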
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index c03e80c2c9c9..ade506ea2bb2 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -29,96 +29,96 @@ hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int t
>  
>  # nvme.c
>  # nvme traces for successful events
> -nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
> -nvme_irq_pin(void) "pulsing IRQ pin"
> -nvme_irq_masked(void) "IRQ is masked"
> -nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> -nvme_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> -nvme_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
> -nvme_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> -nvme_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
> -nvme_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
> -nvme_identify_ctrl(void) "identify controller"
> -nvme_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> -nvme_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> -nvme_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> -nvme_getfeat_numq(int result) "get feature number of queues, result=%d"
> -nvme_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> -nvme_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> -nvme_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> -nvme_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> -nvme_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> -nvme_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> -nvme_mmio_aqattr(uint64_t data) "wrote MMIO, admin queue attributes=0x%"PRIx64""
> -nvme_mmio_asqaddr(uint64_t data) "wrote MMIO, admin submission queue address=0x%"PRIx64""
> -nvme_mmio_acqaddr(uint64_t data) "wrote MMIO, admin completion queue address=0x%"PRIx64""
> -nvme_mmio_asqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin submission queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
> -nvme_mmio_acqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin completion queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
> -nvme_mmio_start_success(void) "setting controller enable bit succeeded"
> -nvme_mmio_stopped(void) "cleared controller enable bit"
> -nvme_mmio_shutdown_set(void) "shutdown bit set"
> -nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
> +nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
> +nvme_dev_irq_pin(void) "pulsing IRQ pin"
> +nvme_dev_irq_masked(void) "IRQ is masked"
> +nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> +nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> +nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
> +nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> +nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
> +nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
> +nvme_dev_identify_ctrl(void) "identify controller"
> +nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> +nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> +nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> +nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> +nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> +nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> +nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> +nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> +nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> +nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> +nvme_dev_mmio_aqattr(uint64_t data) "wrote MMIO, admin queue attributes=0x%"PRIx64""
> +nvme_dev_mmio_asqaddr(uint64_t data) "wrote MMIO, admin submission queue address=0x%"PRIx64""
> +nvme_dev_mmio_acqaddr(uint64_t data) "wrote MMIO, admin completion queue address=0x%"PRIx64""
> +nvme_dev_mmio_asqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin submission queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
> +nvme_dev_mmio_acqaddr_hi(uint64_t data, uint64_t new_addr) "wrote MMIO, admin completion queue high half=0x%"PRIx64", new_address=0x%"PRIx64""
> +nvme_dev_mmio_start_success(void) "setting controller enable bit succeeded"
> +nvme_dev_mmio_stopped(void) "cleared controller enable bit"
> +nvme_dev_mmio_shutdown_set(void) "shutdown bit set"
> +nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
>  
>  # nvme traces for error conditions
> -nvme_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> -nvme_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> -nvme_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> -nvme_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
> -nvme_err_invalid_prp(void) "invalid PRP"
> -nvme_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
> -nvme_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
> -nvme_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
> -nvme_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> -nvme_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
> -nvme_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
> -nvme_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> -nvme_err_invalid_create_sq_size(uint16_t qsize) "failed creating submission queue, invalid qsize=%"PRIu16""
> -nvme_err_invalid_create_sq_addr(uint64_t addr) "failed creating submission queue, addr=0x%"PRIx64""
> -nvme_err_invalid_create_sq_qflags(uint16_t qflags) "failed creating submission queue, qflags=%"PRIu16""
> -nvme_err_invalid_del_cq_cqid(uint16_t cqid) "failed deleting completion queue, cqid=%"PRIu16""
> -nvme_err_invalid_del_cq_notempty(uint16_t cqid) "failed deleting completion queue, it is not empty, cqid=%"PRIu16""
> -nvme_err_invalid_create_cq_cqid(uint16_t cqid) "failed creating completion queue, cqid=%"PRIu16""
> -nvme_err_invalid_create_cq_size(uint16_t size) "failed creating completion queue, size=%"PRIu16""
> -nvme_err_invalid_create_cq_addr(uint64_t addr) "failed creating completion queue, addr=0x%"PRIx64""
> -nvme_err_invalid_create_cq_vector(uint16_t vector) "failed creating completion queue, vector=%"PRIu16""
> -nvme_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completion queue, qflags=%"PRIu16""
> -nvme_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
> -nvme_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
> -nvme_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
> -nvme_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
> -nvme_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
> -nvme_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
> -nvme_err_startfail_nbaracq(void) "nvme_start_ctrl failed because the admin completion queue address is null"
> -nvme_err_startfail_asq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin submission queue address is misaligned: 0x%"PRIx64""
> -nvme_err_startfail_acq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin completion queue address is misaligned: 0x%"PRIx64""
> -nvme_err_startfail_page_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too small: log2size=%u, min=%u"
> -nvme_err_startfail_page_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too large: log2size=%u, max=%u"
> -nvme_err_startfail_cqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too small: log2size=%u, min=%u"
> -nvme_err_startfail_cqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too large: log2size=%u, max=%u"
> -nvme_err_startfail_sqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too small: log2size=%u, min=%u"
> -nvme_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too large: log2size=%u, max=%u"
> -nvme_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
> -nvme_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
> -nvme_err_startfail(void) "setting controller enable bit failed"
> +nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> +nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> +nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> +nvme_dev_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
> +nvme_dev_err_invalid_prp(void) "invalid PRP"
> +nvme_dev_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
> +nvme_dev_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
> +nvme_dev_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
> +nvme_dev_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> +nvme_dev_err_invalid_del_sq(uint16_t qid) "invalid submission queue deletion, sid=%"PRIu16""
> +nvme_dev_err_invalid_create_sq_cqid(uint16_t cqid) "failed creating submission queue, invalid cqid=%"PRIu16""
> +nvme_dev_err_invalid_create_sq_sqid(uint16_t sqid) "failed creating submission queue, invalid sqid=%"PRIu16""
> +nvme_dev_err_invalid_create_sq_size(uint16_t qsize) "failed creating submission queue, invalid qsize=%"PRIu16""
> +nvme_dev_err_invalid_create_sq_addr(uint64_t addr) "failed creating submission queue, addr=0x%"PRIx64""
> +nvme_dev_err_invalid_create_sq_qflags(uint16_t qflags) "failed creating submission queue, qflags=%"PRIu16""
> +nvme_dev_err_invalid_del_cq_cqid(uint16_t cqid) "failed deleting completion queue, cqid=%"PRIu16""
> +nvme_dev_err_invalid_del_cq_notempty(uint16_t cqid) "failed deleting completion queue, it is not empty, cqid=%"PRIu16""
> +nvme_dev_err_invalid_create_cq_cqid(uint16_t cqid) "failed creating completion queue, cqid=%"PRIu16""
> +nvme_dev_err_invalid_create_cq_size(uint16_t size) "failed creating completion queue, size=%"PRIu16""
> +nvme_dev_err_invalid_create_cq_addr(uint64_t addr) "failed creating completion queue, addr=0x%"PRIx64""
> +nvme_dev_err_invalid_create_cq_vector(uint16_t vector) "failed creating completion queue, vector=%"PRIu16""
> +nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completion queue, qflags=%"PRIu16""
> +nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
> +nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
> +nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
> +nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
> +nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
> +nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
> +nvme_dev_err_startfail_nbaracq(void) "nvme_start_ctrl failed because the admin completion queue address is null"
> +nvme_dev_err_startfail_asq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin submission queue address is misaligned: 0x%"PRIx64""
> +nvme_dev_err_startfail_acq_misaligned(uint64_t addr) "nvme_start_ctrl failed because the admin completion queue address is misaligned: 0x%"PRIx64""
> +nvme_dev_err_startfail_page_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too small: log2size=%u, min=%u"
> +nvme_dev_err_startfail_page_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the page size is too large: log2size=%u, max=%u"
> +nvme_dev_err_startfail_cqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too small: log2size=%u, min=%u"
> +nvme_dev_err_startfail_cqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the completion queue entry size is too large: log2size=%u, max=%u"
> +nvme_dev_err_startfail_sqent_too_small(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too small: log2size=%u, min=%u"
> +nvme_dev_err_startfail_sqent_too_large(uint8_t log2ps, uint8_t maxlog2ps) "nvme_start_ctrl failed because the submission queue entry size is too large: log2size=%u, max=%u"
> +nvme_dev_err_startfail_asqent_sz_zero(void) "nvme_start_ctrl failed because the admin submission queue size is zero"
> +nvme_dev_err_startfail_acqent_sz_zero(void) "nvme_start_ctrl failed because the admin completion queue size is zero"
> +nvme_dev_err_startfail(void) "setting controller enable bit failed"
>  
>  # Traces for undefined behavior
> -nvme_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
> -nvme_ub_mmiowr_toosmall(uint64_t offset, unsigned size) "MMIO write smaller than 32 bits, offset=0x%"PRIx64", size=%u"
> -nvme_ub_mmiowr_intmask_with_msix(void) "undefined access to interrupt mask set when MSI-X is enabled"
> -nvme_ub_mmiowr_ro_csts(void) "attempted to set a read only bit of controller status"
> -nvme_ub_mmiowr_ssreset_w1c_unsupported(void) "attempted to W1C CSTS.NSSRO but CAP.NSSRS is zero (not supported)"
> -nvme_ub_mmiowr_ssreset_unsupported(void) "attempted NVM subsystem reset but CAP.NSSRS is zero (not supported)"
> -nvme_ub_mmiowr_cmbloc_reserved(void) "invalid write to reserved CMBLOC when CMBSZ is zero, ignored"
> -nvme_ub_mmiowr_cmbsz_readonly(void) "invalid write to read only CMBSZ, ignored"
> -nvme_ub_mmiowr_invalid(uint64_t offset, uint64_t data) "invalid MMIO write, offset=0x%"PRIx64", data=0x%"PRIx64""
> -nvme_ub_mmiord_misaligned32(uint64_t offset) "MMIO read not 32-bit aligned, offset=0x%"PRIx64""
> -nvme_ub_mmiord_toosmall(uint64_t offset) "MMIO read smaller than 32-bits, offset=0x%"PRIx64""
> -nvme_ub_mmiord_invalid_ofs(uint64_t offset) "MMIO read beyond last register, offset=0x%"PRIx64", returning 0"
> -nvme_ub_db_wr_misaligned(uint64_t offset) "doorbell write not 32-bit aligned, offset=0x%"PRIx64", ignoring"
> -nvme_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for nonexistent queue, cqid=%"PRIu32", ignoring"
> -nvme_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
> -nvme_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
> -nvme_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
> +nvme_dev_ub_mmiowr_misaligned32(uint64_t offset) "MMIO write not 32-bit aligned, offset=0x%"PRIx64""
> +nvme_dev_ub_mmiowr_toosmall(uint64_t offset, unsigned size) "MMIO write smaller than 32 bits, offset=0x%"PRIx64", size=%u"
> +nvme_dev_ub_mmiowr_intmask_with_msix(void) "undefined access to interrupt mask set when MSI-X is enabled"
> +nvme_dev_ub_mmiowr_ro_csts(void) "attempted to set a read only bit of controller status"
> +nvme_dev_ub_mmiowr_ssreset_w1c_unsupported(void) "attempted to W1C CSTS.NSSRO but CAP.NSSRS is zero (not supported)"
> +nvme_dev_ub_mmiowr_ssreset_unsupported(void) "attempted NVM subsystem reset but CAP.NSSRS is zero (not supported)"
> +nvme_dev_ub_mmiowr_cmbloc_reserved(void) "invalid write to reserved CMBLOC when CMBSZ is zero, ignored"
> +nvme_dev_ub_mmiowr_cmbsz_readonly(void) "invalid write to read only CMBSZ, ignored"
> +nvme_dev_ub_mmiowr_invalid(uint64_t offset, uint64_t data) "invalid MMIO write, offset=0x%"PRIx64", data=0x%"PRIx64""
> +nvme_dev_ub_mmiord_misaligned32(uint64_t offset) "MMIO read not 32-bit aligned, offset=0x%"PRIx64""
> +nvme_dev_ub_mmiord_toosmall(uint64_t offset) "MMIO read smaller than 32-bits, offset=0x%"PRIx64""
> +nvme_dev_ub_mmiord_invalid_ofs(uint64_t offset) "MMIO read beyond last register, offset=0x%"PRIx64", returning 0"
> +nvme_dev_ub_db_wr_misaligned(uint64_t offset) "doorbell write not 32-bit aligned, offset=0x%"PRIx64", ignoring"
> +nvme_dev_ub_db_wr_invalid_cq(uint32_t qid) "completion queue doorbell write for nonexistent queue, cqid=%"PRIu32", ignoring"
> +nvme_dev_ub_db_wr_invalid_cqhead(uint32_t qid, uint16_t new_head) "completion queue doorbell write value beyond queue size, cqid=%"PRIu32", new_head=%"PRIu16", ignoring"
> +nvme_dev_ub_db_wr_invalid_sq(uint32_t qid) "submission queue doorbell write for nonexistent queue, sqid=%"PRIu32", ignoring"
> +nvme_dev_ub_db_wr_invalid_sqtail(uint32_t qid, uint16_t new_tail) "submission queue doorbell write value beyond queue size, sqid=%"PRIu32", new_head=%"PRIu16", ignoring"
>  
>  # xen-block.c
>  xen_block_realize(const char *type, uint32_t disk, uint32_t partition) "%s d%up%u"

I reviewed this by writing a script that performs roughly the same rename and checking the result for differences.
Looks OK to me.
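For the curious, a minimal sketch of what such a check could look like (this is an assumption about the approach, not the actual script used): apply the same mechanical prefix rename to each line of the pre-patch tree and diff against the patched files.

```python
import re

def rename_trace_events(line: str) -> str:
    """Apply the mechanical nvme_ -> nvme_dev_ trace rename to one line.

    Two cases: call sites in nvme.c (trace_nvme_* / NVME_GUEST_ERR event
    names carry the trace_ prefix implicitly via the first argument in
    some macros, not handled here), and definitions in trace-events.
    The negative lookahead keeps the rename idempotent.
    """
    line = re.sub(r'\btrace_nvme_(?!dev_)', 'trace_nvme_dev_', line)
    line = re.sub(r'^nvme_(?!dev_)', 'nvme_dev_', line)
    return line

assert rename_trace_events('trace_nvme_irq_pin();') == 'trace_nvme_dev_irq_pin();'
assert (rename_trace_events('nvme_irq_pin(void) "pulsing IRQ pin"')
        == 'nvme_dev_irq_pin(void) "pulsing IRQ pin"')
```

Running this over the old nvme.c and trace-events and diffing against the patched versions should leave only the bare event names passed to NVME_GUEST_ERR.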

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 02/26] nvme: remove superfluous breaks
  2020-02-04  9:51     ` [PATCH v5 02/26] nvme: remove superfluous breaks Klaus Jensen
@ 2020-02-12  9:09       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:09 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> These break statements were left over when commit 3036a626e9ef ("nvme:
> add Get/Set Feature Timestamp support") was merged.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c | 4 ----
>  1 file changed, 4 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index dd548d9b6605..c9ad6aaa5f95 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -788,7 +788,6 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_get_feature_timestamp(n, cmd);
> -        break;
>      default:
>          trace_nvme_dev_err_invalid_getfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -831,11 +830,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          req->cqe.result =
>              cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
>          break;
> -
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, cmd);
> -        break;
> -
>      default:
>          trace_nvme_dev_err_invalid_setfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 03/26] nvme: move device parameters to separate struct
  2020-02-04  9:51     ` [PATCH v5 03/26] nvme: move device parameters to separate struct Klaus Jensen
@ 2020-02-12  9:12       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:12 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Move device configuration parameters to separate struct to make it
> explicit what is configurable and what is set internally.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c | 44 ++++++++++++++++++++++----------------------
>  hw/block/nvme.h | 16 +++++++++++++---
>  2 files changed, 35 insertions(+), 25 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index c9ad6aaa5f95..f05ebcce3f53 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -64,12 +64,12 @@ static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
>  
>  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
>  {
> -    return sqid < n->num_queues && n->sq[sqid] != NULL ? 0 : -1;
> +    return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
>  }
>  
>  static int nvme_check_cqid(NvmeCtrl *n, uint16_t cqid)
>  {
> -    return cqid < n->num_queues && n->cq[cqid] != NULL ? 0 : -1;
> +    return cqid < n->params.num_queues && n->cq[cqid] != NULL ? 0 : -1;
>  }
>  
>  static void nvme_inc_cq_tail(NvmeCQueue *cq)
> @@ -631,7 +631,7 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>          trace_nvme_dev_err_invalid_create_cq_addr(prp1);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
> -    if (unlikely(vector > n->num_queues)) {
> +    if (unlikely(vector > n->params.num_queues)) {
>          trace_nvme_dev_err_invalid_create_cq_vector(vector);
>          return NVME_INVALID_IRQ_VECTOR | NVME_DNR;
>      }
> @@ -783,7 +783,8 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
>          break;
>      case NVME_NUMBER_OF_QUEUES:
> -        result = cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
> +        result = cpu_to_le32((n->params.num_queues - 2) |
> +            ((n->params.num_queues - 2) << 16));
Line wrapping issue.

>          trace_nvme_dev_getfeat_numq(result);
>          break;
>      case NVME_TIMESTAMP:
> @@ -826,9 +827,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_NUMBER_OF_QUEUES:
>          trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
> -            ((dw11 >> 16) & 0xFFFF) + 1, n->num_queues - 1, n->num_queues - 1);
> -        req->cqe.result =
> -            cpu_to_le32((n->num_queues - 2) | ((n->num_queues - 2) << 16));
> +            ((dw11 >> 16) & 0xFFFF) + 1, n->params.num_queues - 1,
> +            n->params.num_queues - 1);
> +        req->cqe.result = cpu_to_le32((n->params.num_queues - 2) |
> +            ((n->params.num_queues - 2) << 16));
Here as well, and there are probably more.
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, cmd);
> @@ -899,12 +901,12 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>  
>      blk_drain(n->conf.blk);
>  
> -    for (i = 0; i < n->num_queues; i++) {
> +    for (i = 0; i < n->params.num_queues; i++) {
>          if (n->sq[i] != NULL) {
>              nvme_free_sq(n->sq[i], n);
>          }
>      }
> -    for (i = 0; i < n->num_queues; i++) {
> +    for (i = 0; i < n->params.num_queues; i++) {
>          if (n->cq[i] != NULL) {
>              nvme_free_cq(n->cq[i], n);
>          }
> @@ -1307,7 +1309,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      int64_t bs_size;
>      uint8_t *pci_conf;
>  
> -    if (!n->num_queues) {
> +    if (!n->params.num_queues) {
>          error_setg(errp, "num_queues can't be zero");
>          return;
>      }
> @@ -1323,7 +1325,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>          return;
>      }
>  
> -    if (!n->serial) {
> +    if (!n->params.serial) {
>          error_setg(errp, "serial property not set");
>          return;
>      }
> @@ -1340,25 +1342,25 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      pcie_endpoint_cap_init(pci_dev, 0x80);
>  
>      n->num_namespaces = 1;
> -    n->reg_size = pow2ceil(0x1004 + 2 * (n->num_queues + 1) * 4);
> +    n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
>      n->ns_size = bs_size / (uint64_t)n->num_namespaces;
>  
>      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> -    n->sq = g_new0(NvmeSQueue *, n->num_queues);
> -    n->cq = g_new0(NvmeCQueue *, n->num_queues);
> +    n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> +    n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
>  
>      memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
>                            "nvme", n->reg_size);
>      pci_register_bar(pci_dev, 0,
>          PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>          &n->iomem);
> -    msix_init_exclusive_bar(pci_dev, n->num_queues, 4, NULL);
> +    msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
>  
>      id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
>      id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
>      strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
>      strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
> -    strpadcpy((char *)id->sn, sizeof(id->sn), n->serial, ' ');
> +    strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
>      id->rab = 6;
>      id->ieee[0] = 0x00;
>      id->ieee[1] = 0x02;
> @@ -1387,7 +1389,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      n->bar.vs = 0x00010200;
>      n->bar.intmc = n->bar.intms = 0;
>  
> -    if (n->cmb_size_mb) {
> +    if (n->params.cmb_size_mb) {
>  
>          NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
>          NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> @@ -1398,7 +1400,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>          NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
>          NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
>          NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
> -        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->cmb_size_mb);
> +        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
>  
>          n->cmbloc = n->bar.cmbloc;
>          n->cmbsz = n->bar.cmbsz;
> @@ -1437,7 +1439,7 @@ static void nvme_exit(PCIDevice *pci_dev)
>      g_free(n->cq);
>      g_free(n->sq);
>  
> -    if (n->cmb_size_mb) {
> +    if (n->params.cmb_size_mb) {
>          g_free(n->cmbuf);
>      }
>      msix_uninit_exclusive_bar(pci_dev);
> @@ -1445,9 +1447,7 @@ static void nvme_exit(PCIDevice *pci_dev)
>  
>  static Property nvme_props[] = {
>      DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
> -    DEFINE_PROP_STRING("serial", NvmeCtrl, serial),
> -    DEFINE_PROP_UINT32("cmb_size_mb", NvmeCtrl, cmb_size_mb, 0),
> -    DEFINE_PROP_UINT32("num_queues", NvmeCtrl, num_queues, 64),
> +    DEFINE_NVME_PROPERTIES(NvmeCtrl, params),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 557194ee1954..9957c4a200e2 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -1,7 +1,19 @@
>  #ifndef HW_NVME_H
>  #define HW_NVME_H
> +
>  #include "block/nvme.h"
>  
> +#define DEFINE_NVME_PROPERTIES(_state, _props) \
> +    DEFINE_PROP_STRING("serial", _state, _props.serial), \
> +    DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
> +    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)
> +
> +typedef struct NvmeParams {
> +    char     *serial;
> +    uint32_t num_queues;
> +    uint32_t cmb_size_mb;
> +} NvmeParams;
> +
>  typedef struct NvmeAsyncEvent {
>      QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
>      NvmeAerResult result;
> @@ -63,6 +75,7 @@ typedef struct NvmeCtrl {
>      MemoryRegion ctrl_mem;
>      NvmeBar      bar;
>      BlockConf    conf;
> +    NvmeParams   params;
>  
>      uint32_t    page_size;
>      uint16_t    page_bits;
> @@ -71,10 +84,8 @@ typedef struct NvmeCtrl {
>      uint16_t    sqe_size;
>      uint32_t    reg_size;
>      uint32_t    num_namespaces;
> -    uint32_t    num_queues;
>      uint32_t    max_q_ents;
>      uint64_t    ns_size;
> -    uint32_t    cmb_size_mb;
>      uint32_t    cmbsz;
>      uint32_t    cmbloc;
>      uint8_t     *cmbuf;
> @@ -82,7 +93,6 @@ typedef struct NvmeCtrl {
>      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
>      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
>  
> -    char            *serial;
>      NvmeNamespace   *namespaces;
>      NvmeSQueue      **sq;
>      NvmeCQueue      **cq;

With the line wrapping issues fixed (this is an issue in all the patches),

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 04/26] nvme: add missing fields in the identify data structures
  2020-02-04  9:51     ` [PATCH v5 04/26] nvme: add missing fields in the identify data structures Klaus Jensen
@ 2020-02-12  9:15       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:15 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Not used by the device model but added for completeness. See NVM Express
> 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Figure 93.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  include/block/nvme.h | 48 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 8 deletions(-)
> 
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 8fb941c6537c..d2f65e8fe496 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -543,7 +543,13 @@ typedef struct NvmeIdCtrl {
>      uint8_t     ieee[3];
>      uint8_t     cmic;
>      uint8_t     mdts;
> -    uint8_t     rsvd255[178];
> +    uint16_t    cntlid;
> +    uint32_t    ver;
> +    uint32_t    rtd3r;
> +    uint32_t    rtd3e;
> +    uint32_t    oaes;
> +    uint32_t    ctratt;
> +    uint8_t     rsvd100[156];
>      uint16_t    oacs;
>      uint8_t     acl;
>      uint8_t     aerl;
> @@ -551,10 +557,22 @@ typedef struct NvmeIdCtrl {
>      uint8_t     lpa;
>      uint8_t     elpe;
>      uint8_t     npss;
> -    uint8_t     rsvd511[248];
> +    uint8_t     avscc;
> +    uint8_t     apsta;
> +    uint16_t    wctemp;
> +    uint16_t    cctemp;
> +    uint16_t    mtfa;
> +    uint32_t    hmpre;
> +    uint32_t    hmmin;
> +    uint8_t     tnvmcap[16];
> +    uint8_t     unvmcap[16];
> +    uint32_t    rpmbs;
> +    uint8_t     rsvd316[4];
> +    uint16_t    kas;
> +    uint8_t     rsvd322[190];
>      uint8_t     sqes;
>      uint8_t     cqes;
> -    uint16_t    rsvd515;
> +    uint16_t    maxcmd;
>      uint32_t    nn;
>      uint16_t    oncs;
>      uint16_t    fuses;
> @@ -562,8 +580,14 @@ typedef struct NvmeIdCtrl {
>      uint8_t     vwc;
>      uint16_t    awun;
>      uint16_t    awupf;
> -    uint8_t     rsvd703[174];
> -    uint8_t     rsvd2047[1344];
> +    uint8_t     nvscc;
> +    uint8_t     rsvd531;
> +    uint16_t    acwu;
> +    uint8_t     rsvd534[2];
> +    uint32_t    sgls;
> +    uint8_t     rsvd540[228];
> +    uint8_t     subnqn[256];
> +    uint8_t     rsvd1024[1024];
>      NvmePSD     psd[32];
>      uint8_t     vs[1024];
>  } NvmeIdCtrl;
> @@ -653,13 +677,21 @@ typedef struct NvmeIdNs {
>      uint8_t     mc;
>      uint8_t     dpc;
>      uint8_t     dps;
> -
>      uint8_t     nmic;
>      uint8_t     rescap;
>      uint8_t     fpi;
>      uint8_t     dlfeat;
> -
> -    uint8_t     res34[94];
> +    uint8_t     rsvd33;
This is wrong: nawun comes right after dlfeat.
> +    uint16_t    nawun;
> +    uint16_t    nawupf;
And here the error cancels out, since there should be a 'nacwu' field here.
> +    uint16_t    nabsn;
> +    uint16_t    nabo;
> +    uint16_t    nabspf;
> +    uint8_t     rsvd46[2];
> +    uint8_t     nvmcap[16];
> +    uint8_t     rsvd64[40];
> +    uint8_t     nguid[16];
> +    uint64_t    eui64;
>      NvmeLBAF    lbaf[16];
>      uint8_t     res192[192];
Not related to the patch, but maybe rename this to rsvd192 for the sake of consistency?
>      uint8_t     vs[3712];


I reviewed this patch by cross-referencing the nvme structures as defined
in the kernel against the spec.

I would prefer to merge this patch with all the other spec updates you do
in the following patches, bringing nvme.h up to date with 1.3d; that would
make this easier to review and remove some noise from the other patches.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 05/26] nvme: populate the mandatory subnqn and ver fields
  2020-02-04  9:51     ` [PATCH v5 05/26] nvme: populate the mandatory subnqn and ver fields Klaus Jensen
@ 2020-02-12  9:18       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:18 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Required for compliance with NVMe revision 1.2.1 or later. See NVM
> Express 1.2.1, Section 5.11 ("Identify command"), Figure 90 and Section
> 7.9 ("NVMe Qualified Names").
> 
> This also bumps the supported version to 1.2.1.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c | 13 ++++++++++---
>  1 file changed, 10 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index f05ebcce3f53..9abf74da20f2 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -9,9 +9,9 @@
>   */
>  
>  /**
> - * Reference Specs: http://www.nvmexpress.org, 1.2, 1.1, 1.0e
> + * Reference Specification: NVM Express 1.2.1
>   *
> - *  http://www.nvmexpress.org/resources/
> + *   https://nvmexpress.org/resources/specifications/
To be honest, that redirects to https://nvmexpress.org/specifications/
Not a problem, though.
>   */
>  
>  /**
> @@ -43,6 +43,8 @@
>  #include "trace.h"
>  #include "nvme.h"
>  
> +#define NVME_SPEC_VER 0x00010201
> +
>  #define NVME_GUEST_ERR(trace, fmt, ...) \
>      do { \
>          (trace_##trace)(__VA_ARGS__); \
> @@ -1365,6 +1367,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      id->ieee[0] = 0x00;
>      id->ieee[1] = 0x02;
>      id->ieee[2] = 0xb3;
> +    id->ver = cpu_to_le32(NVME_SPEC_VER);
This is indeed a 1.2 addition.
>      id->oacs = cpu_to_le16(0);
>      id->frmw = 7 << 1;
>      id->lpa = 1 << 0;
> @@ -1372,6 +1375,10 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      id->cqes = (0x4 << 4) | 0x4;
>      id->nn = cpu_to_le32(n->num_namespaces);
>      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> +
> +    strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> +    pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
Looks OK; this is the first format according to the spec.
> +
>      id->psd[0].mp = cpu_to_le16(0x9c4);
>      id->psd[0].enlat = cpu_to_le32(0x10);
>      id->psd[0].exlat = cpu_to_le32(0x4);
> @@ -1386,7 +1393,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      NVME_CAP_SET_CSS(n->bar.cap, 1);
>      NVME_CAP_SET_MPSMAX(n->bar.cap, 4);
>  
> -    n->bar.vs = 0x00010200;
> +    n->bar.vs = NVME_SPEC_VER;
>      n->bar.intmc = n->bar.intms = 0;
>  
>      if (n->params.cmb_size_mb) {

To be really pedantic, the 1.2 spec also requires at least:
  * wctemp and cctemp to be nonzero in Identify Controller (yes, it is silly to report a temperature for a virtual controller)
  * NVME_ADM_CMD_GET_LOG_PAGE, with some mandatory log pages
  * NVME_ADM_CMD_SET_FEATURES/NVME_ADM_CMD_GET_FEATURES - The device currently doesn't implement some mandatory features.

And there are probably more. This is what I can recall from my nvme-mdev.

However, I see that you implemented these in the following patches, so I
suggest first putting in the patches that implement all those features and
then bumping the NVMe version.
Most of the features I mentioned were mandatory even in version 1.0 of the
spec, so the current version is not even compliant with 1.0 IMHO.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 06/26] nvme: refactor nvme_addr_read
  2020-02-04  9:51     ` [PATCH v5 06/26] nvme: refactor nvme_addr_read Klaus Jensen
@ 2020-02-12  9:23       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:23 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Pull the controller memory buffer check to its own function. The check
> will be used on its own in later patches.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c | 18 +++++++++++++-----
>  1 file changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 9abf74da20f2..ba5089df9ece 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -54,14 +54,22 @@
>  
>  static void nvme_process_sq(void *opaque);
>  
> +static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> +{
> +    hwaddr low = n->ctrl_mem.addr;
> +    hwaddr hi  = n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size);
> +
> +    return addr >= low && addr < hi;
> +}
> +
>  static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
>  {
> -    if (n->cmbsz && addr >= n->ctrl_mem.addr &&
> -                addr < (n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size))) {
> -        memcpy(buf, (void *)&n->cmbuf[addr - n->ctrl_mem.addr], size);
> -    } else {
> -        pci_dma_read(&n->parent_obj, addr, buf, size);
> +    if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> +        memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
Nitpick:
I am not an expert on QEMU coding style, but I suspect there is an extra space after that (void *) cast.

Also note that in the following patches you fix a serious bug in this
function: it doesn't check that the whole range is inside the CMB, only
that the start of the area is.
I would move that fix here, or even into a separate patch.

> +        return;
>      }
> +
> +    pci_dma_read(&n->parent_obj, addr, buf, size);
>  }
>  
>  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 07/26] nvme: add support for the abort command
  2020-02-04  9:51     ` [PATCH v5 07/26] nvme: add support for the abort command Klaus Jensen
@ 2020-02-12  9:25       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:25 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> Section 5.1 ("Abort command").
> 
> The Abort command is a best effort command; for now, the device always
> fails to abort the given command.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c | 28 ++++++++++++++++++++++++++++
>  1 file changed, 28 insertions(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index ba5089df9ece..e1810260d40b 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -731,6 +731,18 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>      }
>  }
>  
> +static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> +
> +    req->cqe.result = 1;
> +    if (nvme_check_sqid(n, sqid)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}

Looks 100% up to spec.

In my nvme-mdev it looks like I implemented this wrongly by failing it with
NVME_SC_ABORT_MISSING (which is defined in the kernel sources but looks
like a reserved error code in the spec). Not that it matters that much.

Also, unrelated to this, but something I would like to point out (this
applies not only to this command but to all admin and I/O commands): the
device should check the various reserved fields in the command descriptor,
which it currently doesn't.

This is what I do:
https://gitlab.com/maximlevitsky/linux/blob/mdev-work-5.2/drivers/nvme/mdev/adm.c#L783

> +
>  static inline void nvme_set_timestamp(NvmeCtrl *n, uint64_t ts)
>  {
>      trace_nvme_dev_setfeat_timestamp(ts);
> @@ -848,6 +860,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          trace_nvme_dev_err_invalid_setfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
> +
Nitpick: Unrelated whitespace change.
>      return NVME_SUCCESS;
>  }
>  
> @@ -864,6 +877,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_create_cq(n, cmd);
>      case NVME_ADM_CMD_IDENTIFY:
>          return nvme_identify(n, cmd);
> +    case NVME_ADM_CMD_ABORT:
> +        return nvme_abort(n, cmd, req);
>      case NVME_ADM_CMD_SET_FEATURES:
>          return nvme_set_feature(n, cmd, req);
>      case NVME_ADM_CMD_GET_FEATURES:
> @@ -1377,6 +1392,19 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>      id->ieee[2] = 0xb3;
>      id->ver = cpu_to_le32(NVME_SPEC_VER);
>      id->oacs = cpu_to_le16(0);
> +
> +    /*
> +     * Because the controller always completes the Abort command immediately,
> +     * there can never be more than one concurrently executing Abort command,
> +     * so this value is never used for anything. Note that there can easily be
> +     * many Abort commands in the queues, but they are not considered
> +     * "executing" until processed by nvme_abort.
> +     *
> +     * The specification recommends a value of 3 for Abort Command Limit (four
> +     * concurrently outstanding Abort commands), so lets use that though it is
> +     * inconsequential.
> +     */
> +    id->acl = 3;
Yep.
>      id->frmw = 7 << 1;
>      id->lpa = 1 << 0;
>      id->sqes = (0x6 << 4) | 0x6;


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 08/26] nvme: refactor device realization
  2020-02-04  9:51     ` [PATCH v5 08/26] nvme: refactor device realization Klaus Jensen
@ 2020-02-12  9:27       ` Maxim Levitsky
  2020-03-16  7:43         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:27 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> This patch splits up nvme_realize into multiple individual functions,
> each initializing a different subset of the device.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c | 175 +++++++++++++++++++++++++++++++-----------------
>  hw/block/nvme.h |  21 ++++++
>  2 files changed, 133 insertions(+), 63 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index e1810260d40b..81514eaef63a 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -44,6 +44,7 @@
>  #include "nvme.h"
>  
>  #define NVME_SPEC_VER 0x00010201
> +#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
>  
>  #define NVME_GUEST_ERR(trace, fmt, ...) \
>      do { \
> @@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
>      },
>  };
>  
> -static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> +static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
>  {
> -    NvmeCtrl *n = NVME(pci_dev);
> -    NvmeIdCtrl *id = &n->id_ctrl;
> -
> -    int i;
> -    int64_t bs_size;
> -    uint8_t *pci_conf;
> -
> -    if (!n->params.num_queues) {
> -        error_setg(errp, "num_queues can't be zero");
> -        return;
> -    }
> +    NvmeParams *params = &n->params;
>  
>      if (!n->conf.blk) {
> -        error_setg(errp, "drive property not set");
> -        return;
> +        error_setg(errp, "nvme: block backend not configured");
> +        return 1;
As a matter of taste, negative values indicate error and 0 is the success
value. In the Linux kernel this is even an official rule.
>      }
>  
> -    bs_size = blk_getlength(n->conf.blk);
> -    if (bs_size < 0) {
> -        error_setg(errp, "could not get backing file size");
> -        return;
> +    if (!params->serial) {
> +        error_setg(errp, "nvme: serial not configured");
> +        return 1;
>      }
>  
> -    if (!n->params.serial) {
> -        error_setg(errp, "serial property not set");
> -        return;
> +    if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
> +        error_setg(errp, "nvme: invalid queue configuration");
Maybe something like "nvme: invalid queue count specified, should be between 1 and ..."?
> +        return 1;
>      }
> +
> +    return 0;
> +}
> +
> +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> +{
>      blkconf_blocksizes(&n->conf);
>      if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
> -                                       false, errp)) {
> -        return;
> +        false, errp)) {
> +        return 1;
>      }
>  
> -    pci_conf = pci_dev->config;
> -    pci_conf[PCI_INTERRUPT_PIN] = 1;
> -    pci_config_set_prog_interface(pci_dev->config, 0x2);
> -    pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> -    pcie_endpoint_cap_init(pci_dev, 0x80);
> +    return 0;
> +}
>  
> +static void nvme_init_state(NvmeCtrl *n)
> +{
>      n->num_namespaces = 1;
>      n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);

Isn't that wrong?
The first 4K of MMIO (0x1000) is the registers, followed by the doorbells,
and each queue's pair of doorbells takes 8 bytes (assuming the regular
doorbell stride), so n->params.num_queues + 1 should be the total number
of queues; thus the 0x1004 should be 0x1000 IMHO.
I might be missing some rounding magic here though.

> -    n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> -
>      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
>      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
>      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> +}
>  
> -    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
> -                          "nvme", n->reg_size);
> +static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> +{
> +    NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
It would be nice to have a #define for the CMB BAR number.
> +    NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> +
> +    NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> +    NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> +    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> +    NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> +    NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> +    NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
> +    NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
> +
> +    n->cmbloc = n->bar.cmbloc;
> +    n->cmbsz = n->bar.cmbsz;
> +
> +    n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> +    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
> +                            "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> +    pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
Same here, although since you read it from the controller register here,
maybe leave it as is. For this kind of thing, though, I prefer to have a
#define and use it everywhere.

> +        PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
> +        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
> +}
> +
> +static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
> +{
> +    uint8_t *pci_conf = pci_dev->config;
> +
> +    pci_conf[PCI_INTERRUPT_PIN] = 1;
> +    pci_config_set_prog_interface(pci_conf, 0x2);
Nitpick: how about adding a #define for that as well?
(I know this code is copied as-is, but still.)
> +    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> +    pci_config_set_device_id(pci_conf, 0x5845);
> +    pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> +    pcie_endpoint_cap_init(pci_dev, 0x80);
> +
> +    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
> +        n->reg_size);

Code on split lines should start at the column right after the '('.
Now it's my turn to notice this - our checkpatch.pl doesn't check it, and
I can't count how often I get burnt on this myself.

There are a *lot* of these issues; I pointed out some of them, but you
should check all the patches for this.


>      pci_register_bar(pci_dev, 0,
>          PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
>          &n->iomem);
Split line alignment issue here as well.
>      msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
>  
> +    if (n->params.cmb_size_mb) {
> +        nvme_init_cmb(n, pci_dev);
> +    }
> +}
> +
> +static void nvme_init_ctrl(NvmeCtrl *n)
> +{
> +    NvmeIdCtrl *id = &n->id_ctrl;
> +    NvmeParams *params = &n->params;
> +    uint8_t *pci_conf = n->parent_obj.config;
> +
>      id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
>      id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
>      strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
>      strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
> -    strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
> +    strpadcpy((char *)id->sn, sizeof(id->sn), params->serial, ' ');
>      id->rab = 6;
>      id->ieee[0] = 0x00;
>      id->ieee[1] = 0x02;
> @@ -1431,46 +1471,55 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  
>      n->bar.vs = NVME_SPEC_VER;
>      n->bar.intmc = n->bar.intms = 0;
> +}
>  
> -    if (n->params.cmb_size_mb) {
> +static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> +{
> +    int64_t bs_size;
> +    NvmeIdNs *id_ns = &ns->id_ns;
>  
> -        NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
> -        NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> +    bs_size = blk_getlength(n->conf.blk);
> +    if (bs_size < 0) {
> +        error_setg_errno(errp, -bs_size, "blk_getlength");
> +        return 1;
> +    }
>  
> -        NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> -        NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> -        NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> -        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> -        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> -        NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
> -        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
> +    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> +    n->ns_size = bs_size;
>  
> -        n->cmbloc = n->bar.cmbloc;
> -        n->cmbsz = n->bar.cmbsz;
> +    id_ns->ncap = id_ns->nuse = id_ns->nsze =
> +        cpu_to_le64(nvme_ns_nlbas(n, ns));
To be honest, I don't know how to align these splits myself.
I would just break this into multiple statements.
>  
> -        n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> -        memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
> -                              "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> -        pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
> -            PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
> -            PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
> +    return 0;
> +}
>  
> +static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> +{
> +    NvmeCtrl *n = NVME(pci_dev);
> +    Error *local_err = NULL;
> +    int i;
> +
> +    if (nvme_check_constraints(n, &local_err)) {
> +        error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
Do we need that hint for the end user?
> +        return;
> +    }
> +
> +    nvme_init_state(n);
> +
> +    if (nvme_init_blk(n, &local_err)) {
> +        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
Same here
> +        return;
>      }
>  
>      for (i = 0; i < n->num_namespaces; i++) {
> -        NvmeNamespace *ns = &n->namespaces[i];
> -        NvmeIdNs *id_ns = &ns->id_ns;
> -        id_ns->nsfeat = 0;
> -        id_ns->nlbaf = 0;
> -        id_ns->flbas = 0;
> -        id_ns->mc = 0;
> -        id_ns->dpc = 0;
> -        id_ns->dps = 0;
> -        id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> -        id_ns->ncap  = id_ns->nuse = id_ns->nsze =
> -            cpu_to_le64(n->ns_size >>
> -                id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
> +        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
> +            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
And here
> +            return;
> +        }
>      }
> +
> +    nvme_init_pci(n, pci_dev);
> +    nvme_init_ctrl(n);
>  }
>  
>  static void nvme_exit(PCIDevice *pci_dev)
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 9957c4a200e2..a867bdfabafd 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -65,6 +65,22 @@ typedef struct NvmeNamespace {
>      NvmeIdNs        id_ns;
>  } NvmeNamespace;
>  
> +static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> +{
It's not common to return a structure by value in C; usually a pointer is
returned to avoid copying. In this case it doesn't matter that much though.
> +    NvmeIdNs *id_ns = &ns->id_ns;
> +    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> +}
> +
> +static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> +{
> +    return nvme_ns_lbaf(ns).ds;
> +}
> +
> +static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> +{
> +    return 1 << nvme_ns_lbads(ns);
> +}
> +
>  #define TYPE_NVME "nvme"
>  #define NVME(obj) \
>          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> @@ -101,4 +117,9 @@ typedef struct NvmeCtrl {
>      NvmeIdCtrl      id_ctrl;
>  } NvmeCtrl;
>  
> +static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> +{
> +    return n->ns_size >> nvme_ns_lbads(ns);
> +}
Unless you need all these functions in the future, this feels a bit
verbose.

> +
>  #endif /* HW_NVME_H */


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 09/26] nvme: add temperature threshold feature
  2020-02-04  9:51     ` [PATCH v5 09/26] nvme: add temperature threshold feature Klaus Jensen
@ 2020-02-12  9:31       ` Maxim Levitsky
  2020-03-16  7:44         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:31 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> It might seem weird to implement this feature for an emulated device,
> but it is mandatory to support and the feature is useful for testing
> asynchronous event request support, which will be added in a later
> patch.

Absolutely, but as the old saying goes, rules are rules.
At least, in defense of the spec, making this mandatory
forced the vendors to actually report some statistics about
the device in a neutral format, as opposed to yet another
vendor-proprietary thing (I am talking about the SMART log page).

> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>

I noticed that you sign off some patches with your @samsung.com email
and some with @cnexlabs.com.
Is there a reason for that?


> ---
>  hw/block/nvme.c      | 50 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/block/nvme.h      |  2 ++
>  include/block/nvme.h |  7 ++++++-
>  3 files changed, 58 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 81514eaef63a..f72348344832 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -45,6 +45,9 @@
>  
>  #define NVME_SPEC_VER 0x00010201
>  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> +#define NVME_TEMPERATURE 0x143
> +#define NVME_TEMPERATURE_WARNING 0x157
> +#define NVME_TEMPERATURE_CRITICAL 0x175
>  
>  #define NVME_GUEST_ERR(trace, fmt, ...) \
>      do { \
> @@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
>  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>      uint32_t result;
>  
>      switch (dw10) {
> +    case NVME_TEMPERATURE_THRESHOLD:
> +        result = 0;
> +
> +        /*
> +         * The controller only implements the Composite Temperature sensor, so
> +         * return 0 for all other sensors.
> +         */
> +        if (NVME_TEMP_TMPSEL(dw11)) {
> +            break;
> +        }
> +
> +        switch (NVME_TEMP_THSEL(dw11)) {
> +        case 0x0:
> +            result = cpu_to_le16(n->features.temp_thresh_hi);
> +            break;
> +        case 0x1:
> +            result = cpu_to_le16(n->features.temp_thresh_low);
> +            break;
> +        }
> +
> +        break;
>      case NVME_VOLATILE_WRITE_CACHE:
>          result = blk_enable_write_cache(n->conf.blk);
>          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> @@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>  
>      switch (dw10) {
> +    case NVME_TEMPERATURE_THRESHOLD:
> +        if (NVME_TEMP_TMPSEL(dw11)) {
> +            break;
> +        }
> +
> +        switch (NVME_TEMP_THSEL(dw11)) {
> +        case 0x0:
> +            n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
> +            break;
> +        case 0x1:
> +            n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
> +            break;
> +        default:
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }
> +
> +        break;
>      case NVME_VOLATILE_WRITE_CACHE:
>          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
>          break;
> @@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
>      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
>      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
>      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> +
> +    n->temperature = NVME_TEMPERATURE;

This appears to be unused in this patch.
I think you should move it to the next patch, which
adds the Get Log Page support.

> +    n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
>  }
>  
>  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> @@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>      id->acl = 3;
>      id->frmw = 7 << 1;
>      id->lpa = 1 << 0;
> +
> +    /* recommended default value (~70 C) */
> +    id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> +    id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
> +
>      id->sqes = (0x6 << 4) | 0x6;
>      id->cqes = (0x4 << 4) | 0x4;
>      id->nn = cpu_to_le32(n->num_namespaces);
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index a867bdfabafd..1518f32557a3 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
>      uint64_t    irq_status;
>      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
>      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
> +    uint16_t    temperature;
>  
>      NvmeNamespace   *namespaces;
>      NvmeSQueue      **sq;
> @@ -115,6 +116,7 @@ typedef struct NvmeCtrl {
>      NvmeSQueue      admin_sq;
>      NvmeCQueue      admin_cq;
>      NvmeIdCtrl      id_ctrl;
> +    NvmeFeatureVal  features;
>  } NvmeCtrl;
>  
>  static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index d2f65e8fe496..ff31cb32117c 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -616,7 +616,8 @@ enum NvmeIdCtrlOncs {
>  typedef struct NvmeFeatureVal {
>      uint32_t    arbitration;
>      uint32_t    power_mgmt;
> -    uint32_t    temp_thresh;
> +    uint16_t    temp_thresh_hi;
> +    uint16_t    temp_thresh_low;
>      uint32_t    err_rec;
>      uint32_t    volatile_wc;
>      uint32_t    num_queues;
> @@ -635,6 +636,10 @@ typedef struct NvmeFeatureVal {
>  #define NVME_INTC_THR(intc)     (intc & 0xff)
>  #define NVME_INTC_TIME(intc)    ((intc >> 8) & 0xff)
>  
> +#define NVME_TEMP_THSEL(temp)  ((temp >> 20) & 0x3)
> +#define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
> +#define NVME_TEMP_TMPTH(temp)  (temp & 0xffff)
> +
>  enum NvmeFeatureIds {
>      NVME_ARBITRATION                = 0x1,
>      NVME_POWER_MANAGEMENT           = 0x2,


Best regards,
	Maxim Levitsky




* Re: [PATCH v5 10/26] nvme: add support for the get log page command
  2020-02-04  9:51     ` [PATCH v5 10/26] nvme: add support for the get log page command Klaus Jensen
@ 2020-02-12  9:35       ` Maxim Levitsky
  2020-03-16  7:45         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12  9:35 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Add support for the Get Log Page command and basic implementations of
> the mandatory Error Information, SMART / Health Information and Firmware
> Slot Information log pages.
> 
> In violation of the specification, the SMART / Health Information log
> page does not persist information over the lifetime of the controller
> because the device has no place to store such persistent state.
Yeah, not the end of the world.
> 
> Note that the LPA field in the Identify Controller data structure
> intentionally has bit 0 cleared because there is no namespace specific
> information in the SMART / Health information log page.
Makes sense.
> 
> Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> Section 5.10 ("Get Log Page command").
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c       | 122 +++++++++++++++++++++++++++++++++++++++++-
>  hw/block/nvme.h       |  10 ++++
>  hw/block/trace-events |   2 +
>  include/block/nvme.h  |   2 +-
>  4 files changed, 134 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index f72348344832..468c36918042 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> +    uint64_t off, NvmeRequest *req)
> +{
> +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> +
> +    uint32_t trans_len;
> +    time_t current_ms;
> +    uint64_t units_read = 0, units_written = 0, read_commands = 0,
> +        write_commands = 0;
> +    NvmeSmartLog smart;
> +    BlockAcctStats *s;
> +
> +    if (nsid && nsid != 0xffffffff) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    s = blk_get_stats(n->conf.blk);
> +
> +    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> +    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> +    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> +    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> +
> +    if (off > sizeof(smart)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    trans_len = MIN(sizeof(smart) - off, buf_len);
> +
> +    memset(&smart, 0x0, sizeof(smart));
> +
> +    smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> +    smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> +    smart.host_read_commands[0] = cpu_to_le64(read_commands);
> +    smart.host_write_commands[0] = cpu_to_le64(write_commands);
> +
> +    smart.temperature[0] = n->temperature & 0xff;
> +    smart.temperature[1] = (n->temperature >> 8) & 0xff;
> +
> +    if ((n->temperature > n->features.temp_thresh_hi) ||
> +        (n->temperature < n->features.temp_thresh_low)) {
> +        smart.critical_warning |= NVME_SMART_TEMPERATURE;
> +    }
> +
> +    current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> +    smart.power_on_hours[0] = cpu_to_le64(
> +        (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> +
> +    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> +        prp2);
> +}
Looks OK.
> +
> +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> +    uint64_t off, NvmeRequest *req)
> +{
> +    uint32_t trans_len;
> +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +    NvmeFwSlotInfoLog fw_log;
> +
> +    if (off > sizeof(fw_log)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> +
> +    trans_len = MIN(sizeof(fw_log) - off, buf_len);
> +
> +    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> +        prp2);
> +}
Looks OK.
> +
> +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> +    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint8_t  lid = dw10 & 0xff;
> +    uint8_t  rae = (dw10 >> 15) & 0x1;
> +    uint32_t numdl, numdu;
> +    uint64_t off, lpol, lpou;
> +    size_t   len;
> +
> +    numdl = (dw10 >> 16);
> +    numdu = (dw11 & 0xffff);
> +    lpol = dw12;
> +    lpou = dw13;
> +
> +    len = (((numdu << 16) | numdl) + 1) << 2;
> +    off = (lpou << 32ULL) | lpol;
> +
> +    if (off & 0x3) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }

Good.
Note that there are plenty of other places in the driver that don't honor
such minor formal bits of the spec, for instance checking the reserved
bits in commands.
> +
> +    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> +
> +    switch (lid) {
> +    case NVME_LOG_ERROR_INFO:
> +        if (off) {
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }

I think you might want to memset the user-supplied buffer to zero:

"This is a 64-bit incrementing error count, indicating a unique identifier for this error.
The error count starts at 1h, is incremented for each unique error log entry, and is retained across
power off conditions. A value of 0h indicates an invalid entry; this value is used when there are
lost entries or when there are fewer errors than the maximum number of entries the controller
supports."
> +
> +        return NVME_SUCCESS;
> +    case NVME_LOG_SMART_INFO:
> +        return nvme_smart_info(n, cmd, len, off, req);
> +    case NVME_LOG_FW_SLOT_INFO:
> +        return nvme_fw_log_info(n, cmd, len, off, req);
> +    default:
> +        trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +}


> +
>  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>  {
>      n->cq[cq->cqid] = NULL;
> @@ -914,6 +1031,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_del_sq(n, cmd);
>      case NVME_ADM_CMD_CREATE_SQ:
>          return nvme_create_sq(n, cmd);
> +    case NVME_ADM_CMD_GET_LOG_PAGE:
> +        return nvme_get_log(n, cmd, req);
>      case NVME_ADM_CMD_DELETE_CQ:
>          return nvme_del_cq(n, cmd);
>      case NVME_ADM_CMD_CREATE_CQ:
> @@ -1411,6 +1530,7 @@ static void nvme_init_state(NvmeCtrl *n)
>  
>      n->temperature = NVME_TEMPERATURE;
>      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> +    n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
>  }
>  
>  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> @@ -1491,7 +1611,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>       */
>      id->acl = 3;
>      id->frmw = 7 << 1;
> -    id->lpa = 1 << 0;
> +    id->lpa = 1 << 2;
>  
>      /* recommended default value (~70 C) */
>      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 1518f32557a3..89b0aafa02a2 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -109,6 +109,7 @@ typedef struct NvmeCtrl {
>      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
>      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
>      uint16_t    temperature;
> +    uint64_t    starttime_ms;
>  
>      NvmeNamespace   *namespaces;
>      NvmeSQueue      **sq;
> @@ -124,4 +125,13 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
>      return n->ns_size >> nvme_ns_lbads(ns);
>  }
>  
> +static inline uint16_t nvme_cid(NvmeRequest *req)
> +{
> +    if (req) {
> +        return le16_to_cpu(req->cqe.cid);
> +    }
> +
> +    return 0xffff;
> +}

I see that you added command ID reporting to the trace events you added,
which makes sense.
It would be nice to later add it to the existing trace events where
appropriate.


> +
>  #endif /* HW_NVME_H */
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index ade506ea2bb2..7da088479f39 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -46,6 +46,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
>  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
>  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
>  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
>  nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
>  nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
>  nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> @@ -85,6 +86,7 @@ nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completi
>  nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
>  nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
>  nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
> +nvme_dev_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
>  nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
>  nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
>  nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index ff31cb32117c..9a6055adeb61 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -515,7 +515,7 @@ enum NvmeSmartWarn {
>      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
>  };
>  
> -enum LogIdentifier {
> +enum NvmeLogIdentifier {
>      NVME_LOG_ERROR_INFO     = 0x01,
>      NVME_LOG_SMART_INFO     = 0x02,
>      NVME_LOG_FW_SLOT_INFO   = 0x03,

Best regards,
	Maxim Levitsky




* Re: [PATCH v5 11/26] nvme: add support for the asynchronous event request command
  2020-02-04  9:51     ` [PATCH v5 11/26] nvme: add support for the asynchronous event request command Klaus Jensen
@ 2020-02-12 10:21       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 10:21 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> Section 5.2 ("Asynchronous Event Request command").
> 
> Mostly imported from Keith's qemu-nvme tree. Modified with a max number
> of queued events (controllable with the aer_max_queued device
> parameter). The spec states that the controller *should* retain
> events, so we do best effort here.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c       | 167 +++++++++++++++++++++++++++++++++++++++++-
>  hw/block/nvme.h       |  14 +++-
>  hw/block/trace-events |   9 +++
>  include/block/nvme.h  |   8 +-
>  4 files changed, 191 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 468c36918042..a186d95df020 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -325,6 +325,85 @@ static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>  }
>  
> +static void nvme_process_aers(void *opaque)
> +{
> +    NvmeCtrl *n = opaque;
> +    NvmeAsyncEvent *event, *next;
> +
> +    trace_nvme_dev_process_aers(n->aer_queued);
> +
> +    QTAILQ_FOREACH_SAFE(event, &n->aer_queue, entry, next) {
> +        NvmeRequest *req;
> +        NvmeAerResult *result;
> +
> +        /* can't post cqe if there is nothing to complete */
> +        if (!n->outstanding_aers) {
> +            trace_nvme_dev_no_outstanding_aers();
> +            break;
> +        }
> +
> +        /* ignore if masked (cqe posted, but event not cleared) */
> +        if (n->aer_mask & (1 << event->result.event_type)) {
> +            trace_nvme_dev_aer_masked(event->result.event_type, n->aer_mask);
> +            continue;
> +        }
> +
> +        QTAILQ_REMOVE(&n->aer_queue, event, entry);
> +        n->aer_queued--;
> +
> +        n->aer_mask |= 1 << event->result.event_type;
> +        n->outstanding_aers--;
> +
> +        req = n->aer_reqs[n->outstanding_aers];
> +
> +        result = (NvmeAerResult *) &req->cqe.result;
> +        result->event_type = event->result.event_type;
> +        result->event_info = event->result.event_info;
> +        result->log_page = event->result.log_page;
> +        g_free(event);
> +
> +        req->status = NVME_SUCCESS;
> +
> +        trace_nvme_dev_aer_post_cqe(result->event_type, result->event_info,
> +            result->log_page);
> +
> +        nvme_enqueue_req_completion(&n->admin_cq, req);
> +    }
> +}
> +
> +static void nvme_enqueue_event(NvmeCtrl *n, uint8_t event_type,
> +    uint8_t event_info, uint8_t log_page)
> +{
> +    NvmeAsyncEvent *event;
> +
> +    trace_nvme_dev_enqueue_event(event_type, event_info, log_page);
> +
> +    if (n->aer_queued == n->params.aer_max_queued) {
> +        trace_nvme_dev_enqueue_event_noqueue(n->aer_queued);
> +        return;
> +    }
> +
> +    event = g_new(NvmeAsyncEvent, 1);
> +    event->result = (NvmeAerResult) {
> +        .event_type = event_type,
> +        .event_info = event_info,
> +        .log_page   = log_page,
> +    };
> +
> +    QTAILQ_INSERT_TAIL(&n->aer_queue, event, entry);
> +    n->aer_queued++;
> +
> +    nvme_process_aers(n);
> +}
> +
> +static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
> +{
> +    n->aer_mask &= ~(1 << event_type);
> +    if (!QTAILQ_EMPTY(&n->aer_queue)) {
> +        nvme_process_aers(n);
> +    }
> +}
> +
>  static void nvme_rw_cb(void *opaque, int ret)
>  {
>      NvmeRequest *req = opaque;
> @@ -569,8 +648,8 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> -    uint64_t off, NvmeRequest *req)
> +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> +    uint32_t buf_len, uint64_t off, NvmeRequest *req)
>  {
>      uint64_t prp1 = le64_to_cpu(cmd->prp1);
>      uint64_t prp2 = le64_to_cpu(cmd->prp2);
> @@ -619,6 +698,10 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
>      smart.power_on_hours[0] = cpu_to_le64(
>          (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
>  
> +    if (!rae) {
> +        nvme_clear_events(n, NVME_AER_TYPE_SMART);
> +    }
> +
>      return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
>          prp2);
>  }
> @@ -671,13 +754,17 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  
>      switch (lid) {
>      case NVME_LOG_ERROR_INFO:
> +        if (!rae) {
> +            nvme_clear_events(n, NVME_AER_TYPE_ERROR);
> +        }
> +
>          if (off) {
>              return NVME_INVALID_FIELD | NVME_DNR;
>          }
>  
>          return NVME_SUCCESS;
>      case NVME_LOG_SMART_INFO:
> -        return nvme_smart_info(n, cmd, len, off, req);
> +        return nvme_smart_info(n, cmd, rae, len, off, req);
>      case NVME_LOG_FW_SLOT_INFO:
>          return nvme_fw_log_info(n, cmd, len, off, req);
>      default:
> @@ -954,6 +1041,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_get_feature_timestamp(n, cmd);
> +    case NVME_ASYNCHRONOUS_EVENT_CONF:
> +        result = cpu_to_le32(n->features.async_config);
> +        break;
>      default:
>          trace_nvme_dev_err_invalid_getfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1003,6 +1093,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>              return NVME_INVALID_FIELD | NVME_DNR;
>          }
>  
> +        if (((n->temperature > n->features.temp_thresh_hi) ||
> +            (n->temperature < n->features.temp_thresh_low)) &&
> +            NVME_AEC_SMART(n->features.async_config) & NVME_SMART_TEMPERATURE) {
> +            nvme_enqueue_event(n, NVME_AER_TYPE_SMART,
> +                NVME_AER_INFO_SMART_TEMP_THRESH, NVME_LOG_SMART_INFO);
> +        }
> +
>          break;
>      case NVME_VOLATILE_WRITE_CACHE:
>          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> @@ -1016,6 +1113,9 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_set_feature_timestamp(n, cmd);
> +    case NVME_ASYNCHRONOUS_EVENT_CONF:
> +        n->features.async_config = dw11;
> +        break;
>      default:
>          trace_nvme_dev_err_invalid_setfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1024,6 +1124,25 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> +static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    trace_nvme_dev_aer(nvme_cid(req));
> +
> +    if (n->outstanding_aers > n->params.aerl) {
> +        trace_nvme_dev_aer_aerl_exceeded();
> +        return NVME_AER_LIMIT_EXCEEDED;
> +    }
> +
> +    n->aer_reqs[n->outstanding_aers] = req;
> +    n->outstanding_aers++;
> +
> +    if (!QTAILQ_EMPTY(&n->aer_queue)) {
> +        nvme_process_aers(n);
> +    }
> +
> +    return NVME_NO_COMPLETE;
> +}
> +
>  static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      switch (cmd->opcode) {
> @@ -1045,6 +1164,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_set_feature(n, cmd, req);
>      case NVME_ADM_CMD_GET_FEATURES:
>          return nvme_get_feature(n, cmd, req);
> +    case NVME_ADM_CMD_ASYNC_EV_REQ:
> +        return nvme_aer(n, cmd, req);
>      default:
>          trace_nvme_dev_err_invalid_admin_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> @@ -1099,6 +1220,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>          }
>      }
>  
> +    while (!QTAILQ_EMPTY(&n->aer_queue)) {
> +        NvmeAsyncEvent *event = QTAILQ_FIRST(&n->aer_queue);
> +        QTAILQ_REMOVE(&n->aer_queue, event, entry);
> +        g_free(event);
> +    }
> +
> +    n->aer_queued = 0;
> +    n->outstanding_aers = 0;
> +
>      blk_flush(n->conf.blk);
>      n->bar.cc = 0;
>  }
> @@ -1195,6 +1325,8 @@ static int nvme_start_ctrl(NvmeCtrl *n)
>  
>      nvme_set_timestamp(n, 0ULL);
>  
> +    QTAILQ_INIT(&n->aer_queue);
> +
>      return 0;
>  }
>  
> @@ -1387,6 +1519,13 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>                             "completion queue doorbell write"
>                             " for nonexistent queue,"
>                             " sqid=%"PRIu32", ignoring", qid);
> +
> +            if (n->outstanding_aers) {
> +                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
> +                    NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
> +                    NVME_LOG_ERROR_INFO);
> +            }
> +
>              return;
>          }
>  
> @@ -1397,6 +1536,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>                             " beyond queue size, sqid=%"PRIu32","
>                             " new_head=%"PRIu16", ignoring",
>                             qid, new_head);
> +
> +            if (n->outstanding_aers) {
> +                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
> +                    NVME_AER_INFO_ERR_INVALID_DB_VALUE, NVME_LOG_ERROR_INFO);
> +            }
> +
>              return;
>          }
>  
> @@ -1425,6 +1570,13 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>                             "submission queue doorbell write"
>                             " for nonexistent queue,"
>                             " sqid=%"PRIu32", ignoring", qid);
> +
> +            if (n->outstanding_aers) {
> +                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
> +                    NVME_AER_INFO_ERR_INVALID_DB_REGISTER,
> +                    NVME_LOG_ERROR_INFO);
> +            }
> +
>              return;
>          }
>  
> @@ -1435,6 +1587,12 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>                             " beyond queue size, sqid=%"PRIu32","
>                             " new_tail=%"PRIu16", ignoring",
>                             qid, new_tail);
> +
> +            if (n->outstanding_aers) {
> +                nvme_enqueue_event(n, NVME_AER_TYPE_ERROR,
> +                    NVME_AER_INFO_ERR_INVALID_DB_VALUE, NVME_LOG_ERROR_INFO);
> +            }
> +
>              return;
>          }
>  
> @@ -1531,6 +1689,7 @@ static void nvme_init_state(NvmeCtrl *n)
>      n->temperature = NVME_TEMPERATURE;
>      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
>      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> +    n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
>  }
>  
>  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> @@ -1610,6 +1769,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>       * inconsequential.
>       */
>      id->acl = 3;
> +    id->aerl = n->params.aerl;
>      id->frmw = 7 << 1;
>      id->lpa = 1 << 2;
>  
> @@ -1700,6 +1860,7 @@ static void nvme_exit(PCIDevice *pci_dev)
>      g_free(n->namespaces);
>      g_free(n->cq);
>      g_free(n->sq);
> +    g_free(n->aer_reqs);
>  
>      if (n->params.cmb_size_mb) {
>          g_free(n->cmbuf);
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 89b0aafa02a2..1e715ab1d75c 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -6,16 +6,20 @@
>  #define DEFINE_NVME_PROPERTIES(_state, _props) \
>      DEFINE_PROP_STRING("serial", _state, _props.serial), \
>      DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
> -    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64)
> +    DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
> +    DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
> +    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64)
>  
>  typedef struct NvmeParams {
>      char     *serial;
>      uint32_t num_queues;
>      uint32_t cmb_size_mb;
> +    uint8_t  aerl;
> +    uint32_t aer_max_queued;
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> -    QSIMPLEQ_ENTRY(NvmeAsyncEvent) entry;
> +    QTAILQ_ENTRY(NvmeAsyncEvent) entry;
>      NvmeAerResult result;
>  } NvmeAsyncEvent;
>  
> @@ -102,6 +106,7 @@ typedef struct NvmeCtrl {
>      uint32_t    num_namespaces;
>      uint32_t    max_q_ents;
>      uint64_t    ns_size;
> +    uint8_t     outstanding_aers;
>      uint32_t    cmbsz;
>      uint32_t    cmbloc;
>      uint8_t     *cmbuf;
> @@ -111,6 +116,11 @@ typedef struct NvmeCtrl {
>      uint16_t    temperature;
>      uint64_t    starttime_ms;
>  
> +    uint8_t     aer_mask;
> +    NvmeRequest **aer_reqs;
> +    QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
> +    int         aer_queued;
> +
>      NvmeNamespace   *namespaces;
>      NvmeSQueue      **sq;
>      NvmeCQueue      **cq;
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 7da088479f39..3952c36774cf 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -47,6 +47,15 @@ nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_
>  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
>  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
>  nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> +nvme_dev_process_aers(int queued) "queued %d"
> +nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
> +nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
> +nvme_dev_aer_masked(uint8_t type, uint8_t mask) "type 0x%"PRIx8" mask 0x%"PRIx8""
> +nvme_dev_aer_post_cqe(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
> +nvme_dev_enqueue_event(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
> +nvme_dev_enqueue_event_noqueue(int queued) "queued %d"
> +nvme_dev_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""
> +nvme_dev_no_outstanding_aers(void) "ignoring event; no outstanding AERs"
>  nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
>  nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
>  nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 9a6055adeb61..a24be047a311 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -386,8 +386,8 @@ enum NvmeAsyncEventRequest {
>      NVME_AER_TYPE_SMART                     = 1,
>      NVME_AER_TYPE_IO_SPECIFIC               = 6,
>      NVME_AER_TYPE_VENDOR_SPECIFIC           = 7,
> -    NVME_AER_INFO_ERR_INVALID_SQ            = 0,
> -    NVME_AER_INFO_ERR_INVALID_DB            = 1,
> +    NVME_AER_INFO_ERR_INVALID_DB_REGISTER   = 0,
> +    NVME_AER_INFO_ERR_INVALID_DB_VALUE      = 1,
>      NVME_AER_INFO_ERR_DIAG_FAIL             = 2,
>      NVME_AER_INFO_ERR_PERS_INTERNAL_ERR     = 3,
>      NVME_AER_INFO_ERR_TRANS_INTERNAL_ERR    = 4,
> @@ -640,6 +640,10 @@ typedef struct NvmeFeatureVal {
>  #define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
>  #define NVME_TEMP_TMPTH(temp)  (temp & 0xffff)
>  
> +#define NVME_AEC_SMART(aec)         (aec & 0xff)
> +#define NVME_AEC_NS_ATTR(aec)       ((aec >> 8) & 0x1)
> +#define NVME_AEC_FW_ACTIVATION(aec) ((aec >> 9) & 0x1)
> +
>  enum NvmeFeatureIds {
>      NVME_ARBITRATION                = 0x1,
>      NVME_POWER_MANAGEMENT           = 0x2,


Overall this looks very good. This feature is very tricky to
get right due to a somewhat unclear spec, but after reading the
spec again, it looks OK.

I might have missed something though. I cross-checked against my own
implementation of this, and it looks like I misunderstood the spec
in a few places back then in my nvme-mdev implementation.

A reminder to fix all the split-line alignment issues (when a C statement is split across lines, the continuation should be aligned on the first '(').
There are plenty of these here.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 12/26] nvme: add missing mandatory features
  2020-02-04  9:51     ` [PATCH v5 12/26] nvme: add missing mandatory features Klaus Jensen
@ 2020-02-12 10:27       ` Maxim Levitsky
  2020-03-16  7:47         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 10:27 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Add support for returning a reasonable response to Get/Set Features of
> mandatory features.
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
>  hw/block/trace-events |  2 ++
>  include/block/nvme.h  |  3 ++-
>  3 files changed, 58 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index a186d95df020..3267ee2de47a 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1008,7 +1008,15 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>      uint32_t result;
>  
> +    trace_nvme_dev_getfeat(nvme_cid(req), dw10);
> +
>      switch (dw10) {
> +    case NVME_ARBITRATION:
> +        result = cpu_to_le32(n->features.arbitration);
> +        break;
> +    case NVME_POWER_MANAGEMENT:
> +        result = cpu_to_le32(n->features.power_mgmt);
> +        break;
>      case NVME_TEMPERATURE_THRESHOLD:
>          result = 0;
>  
> @@ -1029,6 +1037,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>              break;
>          }
>  
> +        break;
> +    case NVME_ERROR_RECOVERY:
> +        result = cpu_to_le32(n->features.err_rec);
>          break;
>      case NVME_VOLATILE_WRITE_CACHE:
>          result = blk_enable_write_cache(n->conf.blk);

This is existing code, but I'd still like to point out that the endianness conversion is missing.
We also need to consider whether a flush is needed when the write cache is disabled.
I don't know that area well enough yet.

> @@ -1041,6 +1052,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          break;
>      case NVME_TIMESTAMP:
>          return nvme_get_feature_timestamp(n, cmd);
> +    case NVME_INTERRUPT_COALESCING:
> +        result = cpu_to_le32(n->features.int_coalescing);
> +        break;
> +    case NVME_INTERRUPT_VECTOR_CONF:
> +        if ((dw11 & 0xffff) > n->params.num_queues) {
Looks like it should be >=, since valid interrupt vector indices run from 0 to num_queues - 1.
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }
> +
> +        result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
> +        break;
> +    case NVME_WRITE_ATOMICITY:
> +        result = cpu_to_le32(n->features.write_atomicity);
> +        break;
>      case NVME_ASYNCHRONOUS_EVENT_CONF:
>          result = cpu_to_le32(n->features.async_config);
>          break;
> @@ -1076,6 +1100,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
>      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>  
> +    trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
> +
>      switch (dw10) {
>      case NVME_TEMPERATURE_THRESHOLD:
>          if (NVME_TEMP_TMPSEL(dw11)) {
> @@ -1116,6 +1142,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      case NVME_ASYNCHRONOUS_EVENT_CONF:
>          n->features.async_config = dw11;
>          break;
> +    case NVME_ARBITRATION:
> +    case NVME_POWER_MANAGEMENT:
> +    case NVME_ERROR_RECOVERY:
> +    case NVME_INTERRUPT_COALESCING:
> +    case NVME_INTERRUPT_VECTOR_CONF:
> +    case NVME_WRITE_ATOMICITY:
> +        return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
>      default:
>          trace_nvme_dev_err_invalid_setfeat(dw10);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1689,6 +1722,21 @@ static void nvme_init_state(NvmeCtrl *n)
>      n->temperature = NVME_TEMPERATURE;
>      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
>      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> +
> +    /*
> +     * There is no limit on the number of commands that the controller may
> +     * launch at one time from a particular Submission Queue.
> +     */
> +    n->features.arbitration = 0x7;
A #define in nvme.h stating that 0x7 means no burst limit would be nice to have here.

> +
> +    n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
> +        sizeof(*n->features.int_vector_config));
> +
> +    /* disable coalescing (not supported) */
> +    for (int i = 0; i < n->params.num_queues; i++) {
> +        n->features.int_vector_config[i] = i | (1 << 16);
Same here
> +    }
> +
>      n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
>  }
>  
> @@ -1782,15 +1830,17 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>      id->nn = cpu_to_le32(n->num_namespaces);
>      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
>  
> +
> +    if (blk_enable_write_cache(n->conf.blk)) {
> +        id->vwc = 1;
> +    }
> +
>      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
>      pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
>  
>      id->psd[0].mp = cpu_to_le16(0x9c4);
>      id->psd[0].enlat = cpu_to_le32(0x10);
>      id->psd[0].exlat = cpu_to_le32(0x4);
> -    if (blk_enable_write_cache(n->conf.blk)) {
> -        id->vwc = 1;
> -    }
>  
>      n->bar.cap = 0;
>      NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> @@ -1861,6 +1911,7 @@ static void nvme_exit(PCIDevice *pci_dev)
>      g_free(n->cq);
>      g_free(n->sq);
>      g_free(n->aer_reqs);
> +    g_free(n->features.int_vector_config);
>  
>      if (n->params.cmb_size_mb) {
>          g_free(n->cmbuf);
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 3952c36774cf..4cf39961989d 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -41,6 +41,8 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
>  nvme_dev_identify_ctrl(void) "identify controller"
>  nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
>  nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> +nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
> +nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
>  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
>  nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
>  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index a24be047a311..09419ed499d0 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -445,7 +445,8 @@ enum NvmeStatusCodes {
>      NVME_FW_REQ_RESET           = 0x010b,
>      NVME_INVALID_QUEUE_DEL      = 0x010c,
>      NVME_FID_NOT_SAVEABLE       = 0x010d,
> -    NVME_FID_NOT_NSID_SPEC      = 0x010f,
> +    NVME_FEAT_NOT_CHANGABLE     = 0x010e,
> +    NVME_FEAT_NOT_NSID_SPEC     = 0x010f,
>      NVME_FW_REQ_SUSYSTEM_RESET  = 0x0110,
>      NVME_CONFLICTING_ATTRS      = 0x0180,
>      NVME_INVALID_PROT_INFO      = 0x0181,

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 13/26] nvme: additional tracing
  2020-02-04  9:51     ` [PATCH v5 13/26] nvme: additional tracing Klaus Jensen
@ 2020-02-12 10:28       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 10:28 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Add a trace call for nvme_enqueue_req_completion.
> 
> Also, streamline nvme_identify_ns and nvme_identify_ns_list. They do not
> need to repeat the command, it is already in the trace name.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c       | 8 +++++---
>  hw/block/trace-events | 5 +++--
>  2 files changed, 8 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 3267ee2de47a..30c5b3e7a67d 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -320,6 +320,8 @@ static void nvme_post_cqes(void *opaque)
>  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>  {
>      assert(cq->cqid == req->sq->cqid);
> +    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid,
> +        req->status);
Split line alignment on that '('

>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> @@ -895,7 +897,7 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
>          prp1, prp2);
>  }
>  
> -static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
> +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
>  {
>      static const int data_len = 4 * KiB;
>      uint32_t min_nsid = le32_to_cpu(c->nsid);
> @@ -905,7 +907,7 @@ static uint16_t nvme_identify_nslist(NvmeCtrl *n, NvmeIdentify *c)
>      uint16_t ret;
>      int i, j = 0;
>  
> -    trace_nvme_dev_identify_nslist(min_nsid);
> +    trace_nvme_dev_identify_ns_list(min_nsid);
>  
>      list = g_malloc0(data_len);
>      for (i = 0; i < n->num_namespaces; i++) {
> @@ -932,7 +934,7 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>      case 0x01:
>          return nvme_identify_ctrl(n, c);
>      case 0x02:
> -        return nvme_identify_nslist(n, c);
> +        return nvme_identify_ns_list(n, c);
>      default:
>          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 4cf39961989d..f982ec1a3221 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -39,8 +39,8 @@ nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size,
>  nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
>  nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
>  nvme_dev_identify_ctrl(void) "identify controller"
> -nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> -nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> +nvme_dev_identify_ns(uint32_t ns) "nsid %"PRIu32""
> +nvme_dev_identify_ns_list(uint32_t ns) "nsid %"PRIu32""
>  nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
>  nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
>  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> @@ -54,6 +54,7 @@ nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
>  nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
>  nvme_dev_aer_masked(uint8_t type, uint8_t mask) "type 0x%"PRIx8" mask 0x%"PRIx8""
>  nvme_dev_aer_post_cqe(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
> +nvme_dev_enqueue_req_completion(uint16_t cid, uint16_t cqid, uint16_t status) "cid %"PRIu16" cqid %"PRIu16" status 0x%"PRIx16""
>  nvme_dev_enqueue_event(uint8_t typ, uint8_t info, uint8_t log_page) "type 0x%"PRIx8" info 0x%"PRIx8" lid 0x%"PRIx8""
>  nvme_dev_enqueue_event_noqueue(int queued) "queued %d"
>  nvme_dev_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""

With alignment fixed:

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid
  2020-02-04  9:51     ` [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid Klaus Jensen
@ 2020-02-12 10:30       ` Maxim Levitsky
  2020-03-16  7:48         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 10:30 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> 0xffff is not an allowed value for NCQR and NSQR in Set Features on
> Number of Queues.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 30c5b3e7a67d..900732bb2f38 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -1133,6 +1133,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
>          break;
>      case NVME_NUMBER_OF_QUEUES:
> +        if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
> +            return NVME_INVALID_FIELD | NVME_DNR;
> +        }
Very minor nitpick: since this spec requirement is not obvious, a quote/reference to the spec
would be nice to have here. 

> +
>          trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
>              ((dw11 >> 16) & 0xFFFF) + 1, n->params.num_queues - 1,
>              n->params.num_queues - 1);

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 15/26] nvme: bump supported specification to 1.3
  2020-02-04  9:51     ` [PATCH v5 15/26] nvme: bump supported specification to 1.3 Klaus Jensen
@ 2020-02-12 10:35       ` Maxim Levitsky
  2020-03-16  7:50         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 10:35 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Add new fields to the Identify Controller and Identify Namespace data
> structures according to NVM Express 1.3d.
> 
> NVM Express 1.3d requires the following additional features:
>   - addition of the Namespace Identification Descriptor List (CNS 03h)
>     for the Identify command
>   - support for returning Command Sequence Error if a Set Features
>     command is submitted for the Number of Queues feature after any I/O
>     queues have been created.
>   - The addition of the Log Specific Field (LSP) in the Get Log Page
>     command.

> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> ---
>  hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
>  hw/block/nvme.h       |  1 +
>  hw/block/trace-events |  3 ++-
>  include/block/nvme.h  | 20 ++++++++++-----
>  4 files changed, 71 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 900732bb2f38..4acfc85b56a2 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -9,7 +9,7 @@
>   */
>  
>  /**
> - * Reference Specification: NVM Express 1.2.1
> + * Reference Specification: NVM Express 1.3d
>   *
>   *   https://nvmexpress.org/resources/specifications/
>   */
> @@ -43,7 +43,7 @@
>  #include "trace.h"
>  #include "nvme.h"
>  
> -#define NVME_SPEC_VER 0x00010201
> +#define NVME_SPEC_VER 0x00010300
>  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
>  #define NVME_TEMPERATURE 0x143
>  #define NVME_TEMPERATURE_WARNING 0x157
> @@ -735,6 +735,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      uint32_t dw12 = le32_to_cpu(cmd->cdw12);
>      uint32_t dw13 = le32_to_cpu(cmd->cdw13);
>      uint8_t  lid = dw10 & 0xff;
> +    uint8_t  lsp = (dw10 >> 8) & 0xf;
>      uint8_t  rae = (dw10 >> 15) & 0x1;
>      uint32_t numdl, numdu;
>      uint64_t off, lpol, lpou;
> @@ -752,7 +753,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> +    trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
>  
>      switch (lid) {
>      case NVME_LOG_ERROR_INFO:
> @@ -863,6 +864,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      cq = g_malloc0(sizeof(*cq));
>      nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
>          NVME_CQ_FLAGS_IEN(qflags));
Code alignment on that '('
> +
> +    n->qs_created = true;
Should be done also at nvme_create_sq
>      return NVME_SUCCESS;
>  }
>  
> @@ -924,6 +927,47 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
>      return ret;
>  }
>  
> +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> +{
> +    static const int len = 4096;
The spec caps the Identify payload size at 4K,
so this constant should go in nvme.h.
> +
> +    struct ns_descr {
> +        uint8_t nidt;
> +        uint8_t nidl;
> +        uint8_t rsvd2[2];
> +        uint8_t nid[16];
> +    };
This is also part of the spec, so it should
move to nvme.h.

> +
> +    uint32_t nsid = le32_to_cpu(c->nsid);
> +    uint64_t prp1 = le64_to_cpu(c->prp1);
> +    uint64_t prp2 = le64_to_cpu(c->prp2);
> +
> +    struct ns_descr *list;
> +    uint16_t ret;
> +
> +    trace_nvme_dev_identify_ns_descr_list(nsid);
> +
> +    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> +        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> +        return NVME_INVALID_NSID | NVME_DNR;
> +    }
> +
> +    /*
> +     * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
> +     * structure, a Namespace UUID (nidt = 0x3) must be reported in the
> +     * Namespace Identification Descriptor. Add a very basic Namespace UUID
> +     * here.
A per-namespace uuid QEMU property would be very nice to have, so that the uuid
is at least somewhat unique.
I think the Linux kernel complains if it detects namespaces with duplicate uuids.

> +     */
> +    list = g_malloc0(len);
> +    list->nidt = 0x3;
> +    list->nidl = 0x10;
Those should also be #defined in nvme.h
> +    *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> +
> +    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> +    g_free(list);
> +    return ret;
> +}
> +
>  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)cmd;
> @@ -935,6 +979,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
>          return nvme_identify_ctrl(n, c);
>      case 0x02:
>          return nvme_identify_ns_list(n, c);
> +    case 0x03:
The CNS values should be defined in nvme.h.
> +        return nvme_identify_ns_descr_list(n, cmd);
>      default:
>          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1133,6 +1179,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
>          break;
>      case NVME_NUMBER_OF_QUEUES:
> +        if (n->qs_created) {
> +            return NVME_CMD_SEQ_ERROR | NVME_DNR;
> +        }
> +
>          if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
>              return NVME_INVALID_FIELD | NVME_DNR;
>          }
> @@ -1267,6 +1317,7 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>  
>      n->aer_queued = 0;
>      n->outstanding_aers = 0;
> +    n->qs_created = false;
>  
>      blk_flush(n->conf.blk);
>      n->bar.cc = 0;
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 1e715ab1d75c..7ced5fd485a9 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -97,6 +97,7 @@ typedef struct NvmeCtrl {
>      BlockConf    conf;
>      NvmeParams   params;
>  
> +    bool        qs_created;
>      uint32_t    page_size;
>      uint16_t    page_bits;
>      uint16_t    max_prp_ents;
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index f982ec1a3221..9e5a4548bde0 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -41,6 +41,7 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
>  nvme_dev_identify_ctrl(void) "identify controller"
>  nvme_dev_identify_ns(uint32_t ns) "nsid %"PRIu32""
>  nvme_dev_identify_ns_list(uint32_t ns) "nsid %"PRIu32""
> +nvme_dev_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
>  nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
>  nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
>  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> @@ -48,7 +49,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
>  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
>  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
>  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> -nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
>  nvme_dev_process_aers(int queued) "queued %d"
>  nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
>  nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 09419ed499d0..31eb9397d8c6 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -550,7 +550,9 @@ typedef struct NvmeIdCtrl {
>      uint32_t    rtd3e;
>      uint32_t    oaes;
>      uint32_t    ctratt;
> -    uint8_t     rsvd100[156];
> +    uint8_t     rsvd100[12];
> +    uint8_t     fguid[16];
> +    uint8_t     rsvd128[128];
looks OK
>      uint16_t    oacs;
>      uint8_t     acl;
>      uint8_t     aerl;
> @@ -568,9 +570,15 @@ typedef struct NvmeIdCtrl {
>      uint8_t     tnvmcap[16];
>      uint8_t     unvmcap[16];
>      uint32_t    rpmbs;
> -    uint8_t     rsvd316[4];
> +    uint16_t    edstt;
> +    uint8_t     dsto;
> +    uint8_t     fwug;
looks OK
>      uint16_t    kas;
> -    uint8_t     rsvd322[190];
> +    uint16_t    hctma;
> +    uint16_t    mntmt;
> +    uint16_t    mxtmt;
> +    uint32_t    sanicap;
> +    uint8_t     rsvd332[180];
looks OK
>      uint8_t     sqes;
>      uint8_t     cqes;
>      uint16_t    maxcmd;
> @@ -691,19 +699,19 @@ typedef struct NvmeIdNs {
>      uint8_t     rescap;
>      uint8_t     fpi;
>      uint8_t     dlfeat;
> -    uint8_t     rsvd33;
>      uint16_t    nawun;
>      uint16_t    nawupf;
> +    uint16_t    nacwu;
Aha! Here you 'fix' the bug you had in patch 4.
>      uint16_t    nabsn;
>      uint16_t    nabo;
>      uint16_t    nabspf;
> -    uint8_t     rsvd46[2];
> +    uint16_t    noiob;
>      uint8_t     nvmcap[16];
>      uint8_t     rsvd64[40];
>      uint8_t     nguid[16];
>      uint64_t    eui64;
>      NvmeLBAF    lbaf[16];
> -    uint8_t     res192[192];
> +    uint8_t     rsvd192[192];
And even do what I suggested with that field :-)
Please squash the changes.
>      uint8_t     vs[3712];
>  } NvmeIdNs;
>  

So I suggest you squash this set of changes with patch 4.
I also suggest splitting the other changes in this patch, one per feature added.
The tracing change can also be squashed with the other tracing patch you submitted.

In summary I would suggest:

1. a patch that only adds all the fields from the 1.3d spec, and overall updates nvme.h
to be up to the 1.3d spec

2. patches that do refactoring and add more tracing (also a form of refactoring, since tracing
isn't a functional change)

3. a set of patches that implement all the 1.3d features.

4. a patch that only bumps the supported version to 1.3d

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 16/26] nvme: refactor prp mapping
  2020-02-04  9:51     ` [PATCH v5 16/26] nvme: refactor prp mapping Klaus Jensen
@ 2020-02-12 11:44       ` Maxim Levitsky
  2020-03-16  7:51         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 11:44 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> Refactor nvme_map_prp and allow PRPs to be located in the CMB. The logic
> ensures that if some of the PRP is in the CMB, all of it must be located
> there, as per the specification.

To be honest this looks less like refactoring than a bugfix
(the old code just assumed that if the first PRP entry is in the CMB, the rest are too).
> 
> Also combine nvme_dma_{read,write}_prp into a single nvme_dma_prp that
> takes an additional DMADirection parameter.

To be honest, 'nvme_dma_prp' was not a clear function name to me at first glance.
Could you rename it to nvme_dma_prp_rw or so? (Although even that doesn't quite convey
the meaning: read/write data to/from the guest memory areas defined by the PRP list.)
Also, could you split this change into a new patch?

> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
Now you even use both of your addresses :-)

> ---
>  hw/block/nvme.c       | 245 +++++++++++++++++++++++++++---------------
>  hw/block/nvme.h       |   2 +-
>  hw/block/trace-events |   1 +
>  include/block/nvme.h  |   1 +
>  4 files changed, 160 insertions(+), 89 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 4acfc85b56a2..334265efb21e 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -58,6 +58,11 @@
>  
>  static void nvme_process_sq(void *opaque);
>  
> +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> +{
> +    return &n->cmbuf[addr - n->ctrl_mem.addr];
> +}

To my taste I would put this together with the patch that
added nvme_addr_is_cmb. I know that some people are against
this, citing the rule that code added in a patch should also
be used in that patch. Your call.

Regardless of this I also prefer to put refactoring patches first in the series.

> +
>  static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
>  {
>      hwaddr low = n->ctrl_mem.addr;
> @@ -152,138 +157,187 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
>      }
>  }
>  
> -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
> -                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
> +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> +    uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)

Split-line alignment (it was correct before).
Also, while you're refactoring, it would be great to add some documentation
to this and a few more functions, since it's not immediately clear what they do.


>  {
>      hwaddr trans_len = n->page_size - (prp1 % n->page_size);
>      trans_len = MIN(len, trans_len);
>      int num_prps = (len >> n->page_bits) + 1;
> +    uint16_t status = NVME_SUCCESS;
> +    bool is_cmb = false;
> +    bool prp_list_in_cmb = false;
> +
> +    trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> +        prp1, prp2, num_prps);
>  
>      if (unlikely(!prp1)) {
>          trace_nvme_dev_err_invalid_prp();
>          return NVME_INVALID_FIELD | NVME_DNR;
> -    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> -               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> -        qsg->nsg = 0;
> +    }
> +
> +    if (nvme_addr_is_cmb(n, prp1)) {
> +        is_cmb = true;
> +
>          qemu_iovec_init(iov, num_prps);
> -        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
> +
> +        /*
> +         * PRPs do not cross page boundaries, so if the start address (here,
> +         * prp1) is within the CMB, it cannot cross outside the controller
> +         * memory buffer range. This is ensured by
> +         *
> +         *   len = n->page_size - (addr % n->page_size)
> +         *
> +         * Thus, we can directly add to the iovec without risking an out of
> +         * bounds access. This also holds for the remaining qemu_iovec_add
> +         * calls.
> +         */
> +        qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp1), trans_len);
>      } else {
>          pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
>          qemu_sglist_add(qsg, prp1, trans_len);
>      }
> +
>      len -= trans_len;
>      if (len) {
>          if (unlikely(!prp2)) {
>              trace_nvme_dev_err_invalid_prp2_missing();
> +            status = NVME_INVALID_FIELD | NVME_DNR;
>              goto unmap;
>          }
> +
>          if (len > n->page_size) {
>              uint64_t prp_list[n->max_prp_ents];
>              uint32_t nents, prp_trans;
>              int i = 0;
>  
> +            if (nvme_addr_is_cmb(n, prp2)) {
> +                prp_list_in_cmb = true;
> +            }
> +
>              nents = (len + n->page_size - 1) >> n->page_bits;
>              prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> -            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
> +            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
>              while (len != 0) {
>                  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
>  
>                  if (i == n->max_prp_ents - 1 && len > n->page_size) {
>                      if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
>                          trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
> +                        status = NVME_INVALID_FIELD | NVME_DNR;
> +                        goto unmap;
> +                    }
> +
> +                    if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> +                        status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
>                          goto unmap;
>                      }
>  
>                      i = 0;
>                      nents = (len + n->page_size - 1) >> n->page_bits;
>                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> -                    nvme_addr_read(n, prp_ent, (void *)prp_list,
> -                        prp_trans);
> +                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
>                      prp_ent = le64_to_cpu(prp_list[i]);
>                  }
>  
>                  if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
>                      trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
> +                    status = NVME_INVALID_FIELD | NVME_DNR;
> +                    goto unmap;
> +                }
> +
> +                if (is_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> +                    status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
>                      goto unmap;
>                  }
>  
>                  trans_len = MIN(len, n->page_size);
> -                if (qsg->nsg){
> -                    qemu_sglist_add(qsg, prp_ent, trans_len);
> +                if (is_cmb) {
> +                    qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp_ent),
> +                        trans_len);
>                  } else {
> -                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
> +                    qemu_sglist_add(qsg, prp_ent, trans_len);
>                  }
> +
>                  len -= trans_len;
>                  i++;
>              }
>          } else {
> +            if (is_cmb != nvme_addr_is_cmb(n, prp2)) {
> +                status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +                goto unmap;
> +            }
> +
>              if (unlikely(prp2 & (n->page_size - 1))) {
>                  trace_nvme_dev_err_invalid_prp2_align(prp2);
> +                status = NVME_INVALID_FIELD | NVME_DNR;
>                  goto unmap;
>              }
> -            if (qsg->nsg) {
> +
> +            if (is_cmb) {
> +                qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp2), len);
> +            } else {
>                  qemu_sglist_add(qsg, prp2, len);
> -            } else {
> -                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
>              }
>          }
>      }
> +
>      return NVME_SUCCESS;
>  
> - unmap:
> -    qemu_sglist_destroy(qsg);
> -    return NVME_INVALID_FIELD | NVME_DNR;
> -}

I haven't checked the new nvme_map_prp closely enough to be sure that it is
correct, but it looks reasonable.

> -
> -static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> -                                   uint64_t prp1, uint64_t prp2)
> -{
> -    QEMUSGList qsg;
> -    QEMUIOVector iov;
> -    uint16_t status = NVME_SUCCESS;
> -
> -    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> -        return NVME_INVALID_FIELD | NVME_DNR;
> -    }
> -    if (qsg.nsg > 0) {
> -        if (dma_buf_write(ptr, len, &qsg)) {
> -            status = NVME_INVALID_FIELD | NVME_DNR;
> -        }
> -        qemu_sglist_destroy(&qsg);
> +unmap:
> +    if (is_cmb) {
> +        qemu_iovec_destroy(iov);
>      } else {
> -        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
> -            status = NVME_INVALID_FIELD | NVME_DNR;
> -        }
> -        qemu_iovec_destroy(&iov);
> +        qemu_sglist_destroy(qsg);
>      }
> +
>      return status;
>  }
>  
> -static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> -    uint64_t prp1, uint64_t prp2)
> +static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> +    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
>  {
>      QEMUSGList qsg;
>      QEMUIOVector iov;
>      uint16_t status = NVME_SUCCESS;
> +    size_t bytes;
>  
> -    trace_nvme_dev_dma_read(prp1, prp2);
> -
> -    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> -        return NVME_INVALID_FIELD | NVME_DNR;
> +    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
> +    if (status) {
> +        return status;
>      }
> +
>      if (qsg.nsg > 0) {
> -        if (unlikely(dma_buf_read(ptr, len, &qsg))) {
> +        uint64_t residual;
> +
> +        if (dir == DMA_DIRECTION_TO_DEVICE) {
> +            residual = dma_buf_write(ptr, len, &qsg);
> +        } else {
> +            residual = dma_buf_read(ptr, len, &qsg);
> +        }
> +
> +        if (unlikely(residual)) {
>              trace_nvme_dev_err_invalid_dma();
>              status = NVME_INVALID_FIELD | NVME_DNR;
>          }
> +
>          qemu_sglist_destroy(&qsg);
> +
> +        return status;

I would prefer an if/else here rather than the early return;
it would make the code more symmetric.
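Roughly this shape (just a sketch of the control flow, reusing the patch's own
calls):

```
    if (qsg.nsg > 0) {
        /* dma path: dma_buf_write()/dma_buf_read(), then
         * qemu_sglist_destroy() */
    } else {
        /* iovec path: qemu_iovec_to_buf()/qemu_iovec_from_buf(), then
         * qemu_iovec_destroy() */
    }

    return status;
```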

> +    }
> +
> +    if (dir == DMA_DIRECTION_TO_DEVICE) {
> +        bytes = qemu_iovec_to_buf(&iov, 0, ptr, len);
>      } else {
> -        if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
> -            trace_nvme_dev_err_invalid_dma();
> -            status = NVME_INVALID_FIELD | NVME_DNR;
> -        }
> -        qemu_iovec_destroy(&iov);
> +        bytes = qemu_iovec_from_buf(&iov, 0, ptr, len);
>      }
> +
> +    if (unlikely(bytes != len)) {
> +        trace_nvme_dev_err_invalid_dma();
> +        status = NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    qemu_iovec_destroy(&iov);
> +
>      return status;
>  }
>  
> @@ -420,16 +474,20 @@ static void nvme_rw_cb(void *opaque, int ret)
>          block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
>          req->status = NVME_INTERNAL_DEV_ERROR;
>      }
> -    if (req->has_sg) {
> +
> +    if (req->qsg.nalloc) {
>          qemu_sglist_destroy(&req->qsg);
>      }
> +    if (req->iov.nalloc) {
> +        qemu_iovec_destroy(&req->iov);
> +    }
> +
>      nvme_enqueue_req_completion(cq, req);
>  }
>  
>  static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
> -    req->has_sg = false;
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>           BLOCK_ACCT_FLUSH);
>      req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> @@ -453,7 +511,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> -    req->has_sg = false;
>      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
>                       BLOCK_ACCT_WRITE);
>      req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> @@ -485,21 +542,24 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
>          return NVME_LBA_RANGE | NVME_DNR;
>      }
>  
> -    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
> +    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
>          block_acct_invalid(blk_get_stats(n->conf.blk), acct);
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
>      if (req->qsg.nsg > 0) {
> -        req->has_sg = true;
> +        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
> +            acct);
> +
>          req->aiocb = is_write ?
>              dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
>                            nvme_rw_cb, req) :
>              dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
>                           nvme_rw_cb, req);
>      } else {
> -        req->has_sg = false;
> +        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
> +            acct);
> +
>          req->aiocb = is_write ?
>              blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
>                              req) :
> @@ -596,7 +656,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>      sq->size = size;
>      sq->cqid = cqid;
>      sq->head = sq->tail = 0;
> -    sq->io_req = g_new(NvmeRequest, sq->size);
> +    sq->io_req = g_new0(NvmeRequest, sq->size);
>  
>      QTAILQ_INIT(&sq->req_list);
>      QTAILQ_INIT(&sq->out_req_list);
> @@ -704,8 +764,8 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>          nvme_clear_events(n, NVME_AER_TYPE_SMART);
>      }
>  
> -    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> -        prp2);
> +    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> +        prp2, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> @@ -724,8 +784,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
>  
>      trans_len = MIN(sizeof(fw_log) - off, buf_len);
>  
> -    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> -        prp2);
> +    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> +        prp2, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> @@ -869,18 +929,20 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
> +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
> +    NvmeRequest *req)
>  {
>      uint64_t prp1 = le64_to_cpu(c->prp1);
>      uint64_t prp2 = le64_to_cpu(c->prp2);
>  
>      trace_nvme_dev_identify_ctrl();
>  
> -    return nvme_dma_read_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> -        prp1, prp2);
> +    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
> +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> +    NvmeRequest *req)
>  {
>      NvmeNamespace *ns;
>      uint32_t nsid = le32_to_cpu(c->nsid);
> @@ -896,11 +958,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
>  
>      ns = &n->namespaces[nsid - 1];
>  
> -    return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> -        prp1, prp2);
> +    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> +    NvmeRequest *req)
>  {
>      static const int data_len = 4 * KiB;
>      uint32_t min_nsid = le32_to_cpu(c->nsid);
> @@ -922,12 +985,14 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
>              break;
>          }
>      }
> -    ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> +    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>      g_free(list);
>      return ret;
>  }
>  
> -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> +    NvmeRequest *req)
>  {
>      static const int len = 4096;
>  
> @@ -963,24 +1028,25 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
>      list->nidl = 0x10;
>      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
>  
> -    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> +    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>      g_free(list);
>      return ret;
>  }
>  
> -static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      NvmeIdentify *c = (NvmeIdentify *)cmd;
>  
>      switch (le32_to_cpu(c->cns)) {
>      case 0x00:
> -        return nvme_identify_ns(n, c);
> +        return nvme_identify_ns(n, c, req);
>      case 0x01:
> -        return nvme_identify_ctrl(n, c);
> +        return nvme_identify_ctrl(n, c, req);
>      case 0x02:
> -        return nvme_identify_ns_list(n, c);
> +        return nvme_identify_ns_list(n, c, req);
>      case 0x03:
> -        return nvme_identify_ns_descr_list(n, cmd);
> +        return nvme_identify_ns_descr_list(n, c, req);
>      default:
>          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1039,15 +1105,16 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
>      return cpu_to_le64(ts.all);
>  }
>  
> -static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> +    NvmeRequest *req)
>  {
>      uint64_t prp1 = le64_to_cpu(cmd->prp1);
>      uint64_t prp2 = le64_to_cpu(cmd->prp2);
>  
>      uint64_t timestamp = nvme_get_timestamp(n);
>  
> -    return nvme_dma_read_prp(n, (uint8_t *)&timestamp,
> -                                 sizeof(timestamp), prp1, prp2);
> +    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
> +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> @@ -1099,7 +1166,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          trace_nvme_dev_getfeat_numq(result);
>          break;
>      case NVME_TIMESTAMP:
> -        return nvme_get_feature_timestamp(n, cmd);
> +        return nvme_get_feature_timestamp(n, cmd, req);
>      case NVME_INTERRUPT_COALESCING:
>          result = cpu_to_le32(n->features.int_coalescing);
>          break;
> @@ -1125,15 +1192,16 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> +    NvmeRequest *req)
>  {
>      uint16_t ret;
>      uint64_t timestamp;
>      uint64_t prp1 = le64_to_cpu(cmd->prp1);
>      uint64_t prp2 = le64_to_cpu(cmd->prp2);
>  
> -    ret = nvme_dma_write_prp(n, (uint8_t *)&timestamp,
> -                                sizeof(timestamp), prp1, prp2);
> +    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
> +        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
>      if (ret != NVME_SUCCESS) {
>          return ret;
>      }
> @@ -1194,7 +1262,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>              ((n->params.num_queues - 2) << 16));
>          break;
>      case NVME_TIMESTAMP:
> -        return nvme_set_feature_timestamp(n, cmd);
> +        return nvme_set_feature_timestamp(n, cmd, req);
>      case NVME_ASYNCHRONOUS_EVENT_CONF:
>          n->features.async_config = dw11;
>          break;
> @@ -1246,7 +1314,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      case NVME_ADM_CMD_CREATE_CQ:
>          return nvme_create_cq(n, cmd);
>      case NVME_ADM_CMD_IDENTIFY:
> -        return nvme_identify(n, cmd);
> +        return nvme_identify(n, cmd, req);
>      case NVME_ADM_CMD_ABORT:
>          return nvme_abort(n, cmd, req);
>      case NVME_ADM_CMD_SET_FEATURES:
> @@ -1282,6 +1350,7 @@ static void nvme_process_sq(void *opaque)
>          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
>          memset(&req->cqe, 0, sizeof(req->cqe));
>          req->cqe.cid = cmd.cid;
> +        memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
>  
>          status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
>              nvme_admin_cmd(n, &cmd, req);
> @@ -1804,7 +1873,7 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
>  
>      NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
>      NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> -    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> +    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 1);
>      NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
>      NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
>      NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 7ced5fd485a9..d27baa9d5391 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -27,11 +27,11 @@ typedef struct NvmeRequest {
>      struct NvmeSQueue       *sq;
>      BlockAIOCB              *aiocb;
>      uint16_t                status;
> -    bool                    has_sg;
>      NvmeCqe                 cqe;
>      BlockAcctCookie         acct;
>      QEMUSGList              qsg;
>      QEMUIOVector            iov;
> +    NvmeCmd                 cmd;
>      QTAILQ_ENTRY(NvmeRequest)entry;
>  } NvmeRequest;
>  
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 9e5a4548bde0..77aa0da99ee0 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -33,6 +33,7 @@ nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
>  nvme_dev_irq_pin(void) "pulsing IRQ pin"
>  nvme_dev_irq_masked(void) "IRQ is masked"
>  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> +nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
>  nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
>  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
>  nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 31eb9397d8c6..c1de92179596 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -427,6 +427,7 @@ enum NvmeStatusCodes {
>      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
>      NVME_INVALID_NSID           = 0x000b,
>      NVME_CMD_SEQ_ERROR          = 0x000c,
> +    NVME_INVALID_USE_OF_CMB     = 0x0012,
>      NVME_LBA_RANGE              = 0x0080,
>      NVME_CAP_EXCEEDED           = 0x0081,
>      NVME_NS_NOT_READY           = 0x0082,


Overall I would split this commit into real refactoring and bugfixes.
Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 17/26] nvme: allow multiple aios per command
  2020-02-04  9:51     ` [PATCH v5 17/26] nvme: allow multiple aios per command Klaus Jensen
@ 2020-02-12 11:48       ` Maxim Levitsky
  2020-03-16  7:53         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 11:48 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> This refactors how the device issues asynchronous block backend
> requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> associated with the command. This allows multiple aios to be issued for
> a command. Only when all requests have been completed will the device
> post a completion queue entry.
> 
> Because the device is currently guaranteed to only issue a single aio
> request per command, the benefit is not immediately obvious. But this
> functionality is required to support metadata, the dataset management
> command and other features.

I don't know which strategy will be chosen for supporting metadata
(qemu doesn't have any notion of metadata in the block layer), but for dataset
management you are right: a dataset management command can contain a table of
ranges to discard (although in reality I have seen no driver put more than one
entry there).


> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c       | 449 +++++++++++++++++++++++++++++++++---------
>  hw/block/nvme.h       | 134 +++++++++++--
>  hw/block/trace-events |   8 +
>  3 files changed, 480 insertions(+), 111 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 334265efb21e..e97da35c4ca1 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -19,7 +19,8 @@
>   *      -drive file=<file>,if=none,id=<drive_id>
>   *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
>   *              cmb_size_mb=<cmb_size_mb[optional]>, \
> - *              num_queues=<N[optional]>
> + *              num_queues=<N[optional]>, \
> + *              mdts=<mdts[optional]>

Could you split the mdts checks into a separate patch? They are not directly related to this one.

>   *
>   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
>   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> @@ -57,6 +58,7 @@
>      } while (0)
>  
>  static void nvme_process_sq(void *opaque);
> +static void nvme_aio_cb(void *opaque, int ret);
>  
>  static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
>  {
> @@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>      return status;
>  }
>  
> +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns = req->ns;
> +
> +    uint32_t len = req->nlb << nvme_ns_lbads(ns);
> +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +
> +    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> +}

Same here, this is another nice refactoring and it should be in separate patch.

> +
> +static void nvme_aio_destroy(NvmeAIO *aio)
> +{
> +    g_free(aio);
> +}
> +
> +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> +    NvmeAIOOp opc)
> +{
> +    aio->opc = opc;
> +
> +    trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> +        aio->offset, aio->len, nvme_aio_opc_str(aio), req);
> +
> +    if (req) {
> +        QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> +    }
> +}
> +
> +static void nvme_aio(NvmeAIO *aio)
The function name is not clear to me; maybe rename it to something like nvme_submit_aio.
> +{
> +    BlockBackend *blk = aio->blk;
> +    BlockAcctCookie *acct = &aio->acct;
> +    BlockAcctStats *stats = blk_get_stats(blk);
> +
> +    bool is_write, dma;
> +
> +    switch (aio->opc) {
> +    case NVME_AIO_OPC_NONE:
> +        break;
> +
> +    case NVME_AIO_OPC_FLUSH:
> +        block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH);
> +        aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio);
> +        break;
> +
> +    case NVME_AIO_OPC_WRITE_ZEROES:
> +        block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE);
> +        aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len,
> +            BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> +        break;
> +
> +    case NVME_AIO_OPC_READ:
> +    case NVME_AIO_OPC_WRITE:

> +        dma = aio->qsg != NULL;

This doesn't work.
aio->qsg is never NULL, since nvme_rw_aio sets it to &req->qsg,
which is then written to aio->qsg by nvme_aio_new.

That is yet another reason I really don't like these parallel QEMUSGList
and QEMUIOVector structures. However, I see that a few other qemu drivers
do this, so it is probably a necessary evil.

Maybe what we could do instead is dma_memory_map the SG list and then
deal only with a QEMUIOVector. Virtio does this
(virtqueue_pop/virtqueue_push).


> +        is_write = (aio->opc == NVME_AIO_OPC_WRITE);
> +
> +        block_acct_start(stats, acct, aio->len,
> +            is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> +
> +        if (dma) {
> +            aio->aiocb = is_write ?
> +                dma_blk_write(blk, aio->qsg, aio->offset,
> +                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio) :
> +                dma_blk_read(blk, aio->qsg, aio->offset,
> +                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio);
> +
Extra blank line
> +            return;
> +        }
> +
> +        aio->aiocb = is_write ?
> +            blk_aio_pwritev(blk, aio->offset, aio->iov, 0,
> +                nvme_aio_cb, aio) :
> +            blk_aio_preadv(blk, aio->offset, aio->iov, 0,
> +                nvme_aio_cb, aio);
> +
> +        break;
> +    }
> +}
> +
> +static void nvme_rw_aio(BlockBackend *blk, uint64_t offset, NvmeRequest *req)
> +{
> +    NvmeAIO *aio;
> +    size_t len = req->qsg.nsg > 0 ? req->qsg.size : req->iov.size;
> +
> +    aio = g_new0(NvmeAIO, 1);
> +
> +    *aio = (NvmeAIO) {
> +        .blk = blk,
> +        .offset = offset,
> +        .len = len,
> +        .req = req,
> +        .qsg = &req->qsg,
> +        .iov = &req->iov,
> +    };
> +
> +    nvme_req_register_aio(req, aio, nvme_req_is_write(req) ?
> +        NVME_AIO_OPC_WRITE : NVME_AIO_OPC_READ);
nitpick: I don't think I like the nvme_req_register_aio name either, but I
don't have a better suggestion for it yet.
> +    nvme_aio(aio);
> +}
> +
>  static void nvme_post_cqes(void *opaque)
>  {
>      NvmeCQueue *cq = opaque;
> @@ -364,6 +467,7 @@ static void nvme_post_cqes(void *opaque)
>          nvme_inc_cq_tail(cq);
>          pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
>              sizeof(req->cqe));
> +        nvme_req_clear(req);
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
>      if (cq->tail != cq->head) {
> @@ -374,8 +478,8 @@ static void nvme_post_cqes(void *opaque)
>  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
>  {
>      assert(cq->cqid == req->sq->cqid);
> -    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid,
> -        req->status);
> +    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid, req->status);
> +
>      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
>      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
>      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> @@ -460,135 +564,272 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
>      }
>  }
>  
> -static void nvme_rw_cb(void *opaque, int ret)
> +static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len,
> +    NvmeRequest *req)
> +{
> +    uint8_t mdts = n->params.mdts;
> +
> +    if (mdts && len > n->page_size << mdts) {
> +        trace_nvme_dev_err_mdts(nvme_cid(req), n->page_size << mdts, len);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
> +    NvmeNamespace *ns = req->ns;
> +
> +    uint16_t ctrl = le16_to_cpu(rw->control);
> +
> +    if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) {
> +        trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> +    uint32_t nlb, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns = req->ns;
> +    uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> +
> +    if (unlikely((slba + nlb) > nsze)) {
> +        block_acct_invalid(blk_get_stats(n->conf.blk),
> +            nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> +        trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
> +        return NVME_LBA_RANGE | NVME_DNR;
> +    }

Double-check this with regard to integer overflow, e.g. if slba + nlb overflows.

That is what I did in my nvme-mdev:

static inline bool check_range(u64 start, u64 size, u64 end)
{
	u64 test = start + size;

	/* check for overflow */
	if (test < start || test < size)
		return false;
	return test <= end;
}

> +
> +    return NVME_SUCCESS;
> +}
> +
> +static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> +{
> +    NvmeNamespace *ns = req->ns;
> +    size_t len = req->nlb << nvme_ns_lbads(ns);
> +    uint16_t status;
> +
> +    status = nvme_check_mdts(n, len, req);
> +    if (status) {
> +        return status;
> +    }
> +
> +    status = nvme_check_prinfo(n, req);
> +    if (status) {
> +        return status;
> +    }
> +
> +    status = nvme_check_bounds(n, req->slba, req->nlb, req);
> +    if (status) {
> +        return status;
> +    }
> +
> +    return NVME_SUCCESS;
> +}

Note that there are more things to check if we don't support metadata,
for instance that the metadata pointer in the submission queue entry is NULL.

All these check_ functions are very good, but they should move to a
separate patch, since they just implement parts of the spec and have
nothing to do with this patch's subject.

> +
> +static void nvme_rw_cb(NvmeRequest *req, void *opaque)
>  {
> -    NvmeRequest *req = opaque;
>      NvmeSQueue *sq = req->sq;
>      NvmeCtrl *n = sq->ctrl;
>      NvmeCQueue *cq = n->cq[sq->cqid];
>  
> -    if (!ret) {
> -        block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
> -        req->status = NVME_SUCCESS;
> -    } else {
> -        block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
> -        req->status = NVME_INTERNAL_DEV_ERROR;
> -    }
> -
> -    if (req->qsg.nalloc) {
> -        qemu_sglist_destroy(&req->qsg);
> -    }
> -    if (req->iov.nalloc) {
> -        qemu_iovec_destroy(&req->iov);
> -    }
> +    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
>  
>      nvme_enqueue_req_completion(cq, req);
>  }
>  
> -static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> -    NvmeRequest *req)
> +static void nvme_aio_cb(void *opaque, int ret)
>  {
> -    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> -         BLOCK_ACCT_FLUSH);
> -    req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> +    NvmeAIO *aio = opaque;
> +    NvmeRequest *req = aio->req;
>  
> -    return NVME_NO_COMPLETE;
> -}
> +    BlockBackend *blk = aio->blk;
> +    BlockAcctCookie *acct = &aio->acct;
> +    BlockAcctStats *stats = blk_get_stats(blk);
>  
> -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> -    NvmeRequest *req)
> -{
> -    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> -    const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> -    const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
> -    uint64_t slba = le64_to_cpu(rw->slba);
> -    uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
> -    uint64_t offset = slba << data_shift;
> -    uint32_t count = nlb << data_shift;
> -
> -    if (unlikely(slba + nlb > ns->id_ns.nsze)) {
> -        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> -        return NVME_LBA_RANGE | NVME_DNR;
> -    }
> -
> -    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> -                     BLOCK_ACCT_WRITE);
> -    req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> -                                        BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
> -    return NVME_NO_COMPLETE;
> -}
> -
> -static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> -    NvmeRequest *req)
> -{
> -    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> -    uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
> -    uint64_t slba = le64_to_cpu(rw->slba);
> -    uint64_t prp1 = le64_to_cpu(rw->prp1);
> -    uint64_t prp2 = le64_to_cpu(rw->prp2);
> -
> -    uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> -    uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
> -    uint64_t data_size = (uint64_t)nlb << data_shift;
> -    uint64_t data_offset = slba << data_shift;
> -    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
> -    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> +    Error *local_err = NULL;
>  
> -    trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
> +    trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
> +        nvme_aio_opc_str(aio), req);
>  
> -    if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
> -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> -        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> -        return NVME_LBA_RANGE | NVME_DNR;
> +    if (req) {

I wonder in which case the aio callback would be called without a req.
Looking at the code, it seems that can't happen:
NvmeAIO is created by nvme_aio_new, and all of its callers pass a non-NULL req.
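
Just to sketch what I mean (hypothetical stand-in types, not the patch's actual
definitions): if the invariant really holds, it could be enforced once at the
creation site with an assertion, and the NULL checks in the callback dropped:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Opaque stand-in for the real request type. */
typedef struct NvmeRequest NvmeRequest;

typedef struct NvmeAIO {
    NvmeRequest *req;
} NvmeAIO;

/* Enforce the "req is never NULL" invariant at the single creation site. */
static NvmeAIO *nvme_aio_new(NvmeRequest *req)
{
    NvmeAIO *aio;

    assert(req != NULL);

    aio = calloc(1, sizeof(*aio));
    aio->req = req;
    return aio;
}
```

With that in place, `nvme_aio_cb` could treat `aio->req` as always valid
instead of branching on it twice.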

> +        QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
>      }
>  
> -    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
> -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> -        return NVME_INVALID_FIELD | NVME_DNR;
> -    }
> -
> -    if (req->qsg.nsg > 0) {
> -        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
> -            acct);
> -
> -        req->aiocb = is_write ?
> -            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> -                          nvme_rw_cb, req) :
> -            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> -                         nvme_rw_cb, req);
> +    if (!ret) {
> +        block_acct_done(stats, acct);
>      } else {
> -        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
> -            acct);
> +        block_acct_failed(stats, acct);
>  
> -        req->aiocb = is_write ?
> -            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> -                            req) :
> -            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> -                           req);
> +        if (req) {
> +            uint16_t status;
> +
> +            switch (aio->opc) {
> +            case NVME_AIO_OPC_READ:
> +                status = NVME_UNRECOVERED_READ;
> +                break;
> +            case NVME_AIO_OPC_WRITE:
> +            case NVME_AIO_OPC_WRITE_ZEROES:
> +                status = NVME_WRITE_FAULT;
> +                break;
> +            default:
> +                status = NVME_INTERNAL_DEV_ERROR;
> +                break;
> +            }
> +
> +            trace_nvme_dev_err_aio(nvme_cid(req), aio, blk_name(blk),
> +                aio->offset, nvme_aio_opc_str(aio), req, status);
> +
> +            error_setg_errno(&local_err, -ret, "aio failed");
> +            error_report_err(local_err);
> +
> +            /*
> +             * An Internal Error trumps all other errors. For other errors,
> +             * only set the first error encountered. Any additional errors will
> +             * be recorded in the error information log page.
> +             */
> +            if (!req->status ||
> +                nvme_status_is_error(status, NVME_INTERNAL_DEV_ERROR)) {
> +                req->status = status;
> +            }
> +        }
> +    }
> +
> +    if (aio->cb) {
> +        aio->cb(aio, aio->cb_arg, ret);
> +    }
> +
> +    if (req && QTAILQ_EMPTY(&req->aio_tailq)) {
> +        if (req->cb) {
> +            req->cb(req, req->cb_arg);
> +        } else {
> +            NvmeSQueue *sq = req->sq;
> +            NvmeCtrl *n = sq->ctrl;
> +            NvmeCQueue *cq = n->cq[sq->cqid];
> +
> +            nvme_enqueue_req_completion(cq, req);
> +        }
>      }
>  
> +    nvme_aio_destroy(aio);
> +}
> +
> +static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    NvmeAIO *aio = g_new0(NvmeAIO, 1);
> +
> +    *aio = (NvmeAIO) {
> +        .blk = n->conf.blk,
> +        .req = req,
> +    };
> +
> +    nvme_req_register_aio(req, aio, NVME_AIO_OPC_FLUSH);
> +    nvme_aio(aio);
> +
> +    return NVME_NO_COMPLETE;
> +}
> +
> +static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    NvmeAIO *aio;
> +
> +    NvmeNamespace *ns = req->ns;
> +    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> +
> +    int64_t offset;
> +    size_t count;
> +    uint16_t status;
> +
> +    req->slba = le64_to_cpu(rw->slba);
> +    req->nlb  = le16_to_cpu(rw->nlb) + 1;
> +
> +    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
> +        req->slba, req->nlb);
> +
> +    status = nvme_check_bounds(n, req->slba, req->nlb, req);
> +    if (unlikely(status)) {
> +        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
> +        return status;
> +    }
This refactoring should also go in a separate patch.

> +
> +    offset = req->slba << nvme_ns_lbads(ns);
> +    count = req->nlb << nvme_ns_lbads(ns);
> +
> +    aio = g_new0(NvmeAIO, 1);
> +
> +    *aio = (NvmeAIO) {
> +        .blk = n->conf.blk,
> +        .offset = offset,
> +        .len = count,
> +        .req = req,
> +    };
> +
> +    nvme_req_register_aio(req, aio, NVME_AIO_OPC_WRITE_ZEROES);
> +    nvme_aio(aio);
> +
> +    return NVME_NO_COMPLETE;
> +}
> +
> +static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +{
> +    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> +    NvmeNamespace *ns = req->ns;
> +    int status;
> +
> +    enum BlockAcctType acct =
> +        nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> +
> +    req->nlb  = le16_to_cpu(rw->nlb) + 1;
> +    req->slba = le64_to_cpu(rw->slba);
> +
> +    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
> +        req->nlb << nvme_ns_lbads(req->ns), req->slba);
> +
> +    status = nvme_check_rw(n, req);
> +    if (status) {
> +        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> +        return status;
> +    }
> +
> +    status = nvme_map(n, cmd, req);
> +    if (status) {
> +        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> +        return status;
> +    }
> +
> +    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
> +    nvme_req_set_cb(req, nvme_rw_cb, NULL);
> +
>      return NVME_NO_COMPLETE;
>  }
>  
>  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
> -    NvmeNamespace *ns;
>      uint32_t nsid = le32_to_cpu(cmd->nsid);
>  
> +    trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
> +        cmd->opcode);
> +
>      if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
>          trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
>          return NVME_INVALID_NSID | NVME_DNR;
>      }
>  
> -    ns = &n->namespaces[nsid - 1];
> +    req->ns = &n->namespaces[nsid - 1];
> +
>      switch (cmd->opcode) {
>      case NVME_CMD_FLUSH:
> -        return nvme_flush(n, ns, cmd, req);
> +        return nvme_flush(n, cmd, req);
>      case NVME_CMD_WRITE_ZEROS:
> -        return nvme_write_zeros(n, ns, cmd, req);
> +        return nvme_write_zeros(n, cmd, req);
>      case NVME_CMD_WRITE:
>      case NVME_CMD_READ:
> -        return nvme_rw(n, ns, cmd, req);
> +        return nvme_rw(n, cmd, req);
>      default:
>          trace_nvme_dev_err_invalid_opc(cmd->opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
> @@ -612,6 +853,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      NvmeRequest *req, *next;
>      NvmeSQueue *sq;
>      NvmeCQueue *cq;
> +    NvmeAIO *aio;
>      uint16_t qid = le16_to_cpu(c->qid);
>  
>      if (unlikely(!qid || nvme_check_sqid(n, qid))) {
> @@ -624,8 +866,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      sq = n->sq[qid];
>      while (!QTAILQ_EMPTY(&sq->out_req_list)) {
>          req = QTAILQ_FIRST(&sq->out_req_list);
> -        assert(req->aiocb);
> -        blk_aio_cancel(req->aiocb);
> +        while (!QTAILQ_EMPTY(&req->aio_tailq)) {
> +            aio = QTAILQ_FIRST(&req->aio_tailq);
> +            assert(aio->aiocb);
> +            blk_aio_cancel(aio->aiocb);
> +        }
>      }
>      if (!nvme_check_cqid(n, sq->cqid)) {
>          cq = n->cq[sq->cqid];
> @@ -662,6 +907,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>      QTAILQ_INIT(&sq->out_req_list);
>      for (i = 0; i < sq->size; i++) {
>          sq->io_req[i].sq = sq;
> +        QTAILQ_INIT(&(sq->io_req[i].aio_tailq));
>          QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
>      }
>      sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
> @@ -800,6 +1046,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      uint32_t numdl, numdu;
>      uint64_t off, lpol, lpou;
>      size_t   len;
> +    uint16_t status;
>  
>      numdl = (dw10 >> 16);
>      numdu = (dw11 & 0xffff);
> @@ -815,6 +1062,11 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  
>      trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
>  
> +    status = nvme_check_mdts(n, len, req);
> +    if (status) {
> +        return status;
> +    }
> +
>      switch (lid) {
>      case NVME_LOG_ERROR_INFO:
>          if (!rae) {
> @@ -1348,7 +1600,7 @@ static void nvme_process_sq(void *opaque)
>          req = QTAILQ_FIRST(&sq->req_list);
>          QTAILQ_REMOVE(&sq->req_list, req, entry);
>          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> -        memset(&req->cqe, 0, sizeof(req->cqe));
> +
>          req->cqe.cid = cmd.cid;
>          memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
>  
> @@ -1928,6 +2180,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>      id->ieee[0] = 0x00;
>      id->ieee[1] = 0x02;
>      id->ieee[2] = 0xb3;
> +    id->mdts = params->mdts;
>      id->ver = cpu_to_le32(NVME_SPEC_VER);
>      id->oacs = cpu_to_le16(0);
>  
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index d27baa9d5391..3319f8edd7e1 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -8,7 +8,8 @@
>      DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
>      DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
>      DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
> -    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64)
> +    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \
> +    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
>  
>  typedef struct NvmeParams {
>      char     *serial;
> @@ -16,6 +17,7 @@ typedef struct NvmeParams {
>      uint32_t cmb_size_mb;
>      uint8_t  aerl;
>      uint32_t aer_max_queued;
> +    uint8_t  mdts;
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> @@ -23,17 +25,58 @@ typedef struct NvmeAsyncEvent {
>      NvmeAerResult result;
>  } NvmeAsyncEvent;
>  
> -typedef struct NvmeRequest {
> -    struct NvmeSQueue       *sq;
> -    BlockAIOCB              *aiocb;
> -    uint16_t                status;
> -    NvmeCqe                 cqe;
> -    BlockAcctCookie         acct;
> -    QEMUSGList              qsg;
> -    QEMUIOVector            iov;
> -    NvmeCmd                 cmd;
> -    QTAILQ_ENTRY(NvmeRequest)entry;
> -} NvmeRequest;
> +typedef struct NvmeRequest NvmeRequest;
> +typedef void NvmeRequestCompletionFunc(NvmeRequest *req, void *opaque);
> +
> +struct NvmeRequest {
> +    struct NvmeSQueue    *sq;
> +    struct NvmeNamespace *ns;
> +
> +    NvmeCqe  cqe;
> +    NvmeCmd  cmd;
> +    uint16_t status;
> +
> +    uint64_t slba;
> +    uint32_t nlb;
> +
> +    QEMUSGList   qsg;
> +    QEMUIOVector iov;
> +
> +    NvmeRequestCompletionFunc *cb;
> +    void                      *cb_arg;
> +
> +    QTAILQ_HEAD(, NvmeAIO)    aio_tailq;
> +    QTAILQ_ENTRY(NvmeRequest) entry;
> +};
> +
> +static inline void nvme_req_clear(NvmeRequest *req)
> +{
> +    req->ns = NULL;
> +    memset(&req->cqe, 0, sizeof(req->cqe));
> +    req->status = NVME_SUCCESS;
> +    req->slba = req->nlb = 0x0;
> +    req->cb = req->cb_arg = NULL;
> +
> +    if (req->qsg.sg) {
> +        qemu_sglist_destroy(&req->qsg);
> +    }
> +
> +    if (req->iov.iov) {
> +        qemu_iovec_destroy(&req->iov);
> +    }
> +}
> +
> +static inline void nvme_req_set_cb(NvmeRequest *req,
> +    NvmeRequestCompletionFunc *cb, void *cb_arg)
> +{
> +    req->cb = cb;
> +    req->cb_arg = cb_arg;
> +}
> +
> +static inline void nvme_req_clear_cb(NvmeRequest *req)
> +{
> +    req->cb = req->cb_arg = NULL;
> +}
>  
>  typedef struct NvmeSQueue {
>      struct NvmeCtrl *ctrl;
> @@ -85,6 +128,60 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
>      return 1 << nvme_ns_lbads(ns);
>  }
>  
> +typedef enum NvmeAIOOp {
> +    NVME_AIO_OPC_NONE         = 0x0,
> +    NVME_AIO_OPC_FLUSH        = 0x1,
> +    NVME_AIO_OPC_READ         = 0x2,
> +    NVME_AIO_OPC_WRITE        = 0x3,
> +    NVME_AIO_OPC_WRITE_ZEROES = 0x4,
> +} NvmeAIOOp;
> +
> +typedef struct NvmeAIO NvmeAIO;
> +typedef void NvmeAIOCompletionFunc(NvmeAIO *aio, void *opaque, int ret);
> +
> +struct NvmeAIO {
> +    NvmeRequest *req;
> +
> +    NvmeAIOOp       opc;
> +    int64_t         offset;
> +    size_t          len;
> +    BlockBackend    *blk;
> +    BlockAIOCB      *aiocb;
> +    BlockAcctCookie acct;
> +
> +    NvmeAIOCompletionFunc *cb;
> +    void                  *cb_arg;
> +
> +    QEMUSGList   *qsg;
> +    QEMUIOVector *iov;
> +
> +    QTAILQ_ENTRY(NvmeAIO) tailq_entry;
> +};
> +
> +static inline const char *nvme_aio_opc_str(NvmeAIO *aio)
> +{
> +    switch (aio->opc) {
> +    case NVME_AIO_OPC_NONE:         return "NVME_AIO_OP_NONE";
> +    case NVME_AIO_OPC_FLUSH:        return "NVME_AIO_OP_FLUSH";
> +    case NVME_AIO_OPC_READ:         return "NVME_AIO_OP_READ";
> +    case NVME_AIO_OPC_WRITE:        return "NVME_AIO_OP_WRITE";
> +    case NVME_AIO_OPC_WRITE_ZEROES: return "NVME_AIO_OP_WRITE_ZEROES";
> +    default:                        return "NVME_AIO_OP_UNKNOWN";
> +    }
> +}
> +
> +static inline bool nvme_req_is_write(NvmeRequest *req)
> +{
> +    switch (req->cmd.opcode) {
> +    case NVME_CMD_WRITE:
> +    case NVME_CMD_WRITE_UNCOR:
> +    case NVME_CMD_WRITE_ZEROS:
> +        return true;
> +    default:
> +        return false;
> +    }
> +}
> +
>  #define TYPE_NVME "nvme"
>  #define NVME(obj) \
>          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> @@ -139,10 +236,21 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
>  static inline uint16_t nvme_cid(NvmeRequest *req)
>  {
>      if (req) {
> -        return le16_to_cpu(req->cqe.cid);
> +        return le16_to_cpu(req->cmd.cid);
>      }
>  
>      return 0xffff;
>  }
>  
> +static inline bool nvme_status_is_error(uint16_t status, uint16_t err)
> +{
> +    /* strip DNR and MORE */
> +    return (status & 0xfff) == err;
> +}
> +
> +static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
> +{
> +    return req->sq->ctrl;
> +}
> +
>  #endif /* HW_NVME_H */
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 77aa0da99ee0..90a57fb6099a 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -34,7 +34,12 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
>  nvme_dev_irq_masked(void) "IRQ is masked"
>  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
>  nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> +nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p"
> +nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> +nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
>  nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> +nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
> +nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
>  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
>  nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
>  nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
> @@ -75,6 +80,9 @@ nvme_dev_mmio_shutdown_set(void) "shutdown bit set"
>  nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
>  
>  # nvme traces for error conditions
> +nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
> +nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> +nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
>  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
>  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
>  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""



The patch is large; I tried my best to spot issues, but I might have missed some.
Please split it up as I pointed out.
Overall I like most of the changes.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 18/26] nvme: use preallocated qsg/iov in nvme_dma_prp
  2020-02-04  9:52     ` [PATCH v5 18/26] nvme: use preallocated qsg/iov in nvme_dma_prp Klaus Jensen
@ 2020-02-12 11:49       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 11:49 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> Since clean up of the request qsg/iov has been moved to the common
> nvme_enqueue_req_completion function, there is no need to use a
> stack allocated qsg/iov in nvme_dma_prp.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c | 18 ++++++------------
>  1 file changed, 6 insertions(+), 12 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index e97da35c4ca1..f8c81b9e2202 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -298,23 +298,21 @@ unmap:
>  static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>      uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
>  {
> -    QEMUSGList qsg;
> -    QEMUIOVector iov;
>      uint16_t status = NVME_SUCCESS;
>      size_t bytes;
>  
> -    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
> +    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
>      if (status) {
>          return status;
>      }
>  
> -    if (qsg.nsg > 0) {
> +    if (req->qsg.nsg > 0) {
>          uint64_t residual;
>  
>          if (dir == DMA_DIRECTION_TO_DEVICE) {
> -            residual = dma_buf_write(ptr, len, &qsg);
> +            residual = dma_buf_write(ptr, len, &req->qsg);
>          } else {
> -            residual = dma_buf_read(ptr, len, &qsg);
> +            residual = dma_buf_read(ptr, len, &req->qsg);
>          }
>  
>          if (unlikely(residual)) {
> @@ -322,15 +320,13 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>              status = NVME_INVALID_FIELD | NVME_DNR;
>          }
>  
> -        qemu_sglist_destroy(&qsg);
> -
>          return status;
>      }
>  
>      if (dir == DMA_DIRECTION_TO_DEVICE) {
> -        bytes = qemu_iovec_to_buf(&iov, 0, ptr, len);
> +        bytes = qemu_iovec_to_buf(&req->iov, 0, ptr, len);
>      } else {
> -        bytes = qemu_iovec_from_buf(&iov, 0, ptr, len);
> +        bytes = qemu_iovec_from_buf(&req->iov, 0, ptr, len);
>      }
>  
>      if (unlikely(bytes != len)) {
> @@ -338,8 +334,6 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>          status = NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    qemu_iovec_destroy(&iov);
> -
>      return status;
>  }
>  


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 20/26] nvme: handle dma errors
  2020-02-04  9:52     ` [PATCH v5 20/26] nvme: handle dma errors Klaus Jensen
@ 2020-02-12 11:52       ` Maxim Levitsky
  2020-03-16  7:53         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 11:52 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> Handling DMA errors gracefully is required for the device to pass the
> block/011 test ("disable PCI device while doing I/O") in the blktests
> suite.
> 
> With this patch the device passes the test by retrying "critical"
> transfers (posting of completion entries and processing of submission
> queue entries).
> 
> If DMA errors occur at any other point in the execution of the command
> (say, while mapping the PRPs), the command is aborted with a Data
> Transfer Error status code.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c       | 42 +++++++++++++++++++++++++++++++++---------
>  hw/block/trace-events |  2 ++
>  include/block/nvme.h  |  2 +-
>  3 files changed, 36 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index f8c81b9e2202..204ae1d33234 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -73,14 +73,14 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
>      return addr >= low && addr < hi;
>  }
>  
> -static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> +static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
>  {
>      if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
>          memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> -        return;
> +        return 0;
>      }
>  
> -    pci_dma_read(&n->parent_obj, addr, buf, size);
> +    return pci_dma_read(&n->parent_obj, addr, buf, size);
>  }
>  
>  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> @@ -168,6 +168,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
>      uint16_t status = NVME_SUCCESS;
>      bool is_cmb = false;
>      bool prp_list_in_cmb = false;
> +    int ret;
>  
>      trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
>          prp1, prp2, num_prps);
> @@ -218,7 +219,12 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
>  
>              nents = (len + n->page_size - 1) >> n->page_bits;
>              prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> -            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> +            ret = nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> +            if (ret) {
> +                trace_nvme_dev_err_addr_read(prp2);
> +                status = NVME_DATA_TRANSFER_ERROR;
> +                goto unmap;
> +            }
>              while (len != 0) {
>                  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
>  
> @@ -237,7 +243,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
>                      i = 0;
>                      nents = (len + n->page_size - 1) >> n->page_bits;
>                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> -                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
> +                    ret = nvme_addr_read(n, prp_ent, (void *) prp_list,
> +                        prp_trans);
> +                    if (ret) {
> +                        trace_nvme_dev_err_addr_read(prp_ent);
> +                        status = NVME_DATA_TRANSFER_ERROR;
> +                        goto unmap;
> +                    }
>                      prp_ent = le64_to_cpu(prp_list[i]);
>                  }
>  
> @@ -443,6 +455,7 @@ static void nvme_post_cqes(void *opaque)
>      NvmeCQueue *cq = opaque;
>      NvmeCtrl *n = cq->ctrl;
>      NvmeRequest *req, *next;
> +    int ret;
>  
>      QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
>          NvmeSQueue *sq;
> @@ -452,15 +465,21 @@ static void nvme_post_cqes(void *opaque)
>              break;
>          }
>  
> -        QTAILQ_REMOVE(&cq->req_list, req, entry);
>          sq = req->sq;
>          req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
>          req->cqe.sq_id = cpu_to_le16(sq->sqid);
>          req->cqe.sq_head = cpu_to_le16(sq->head);
>          addr = cq->dma_addr + cq->tail * n->cqe_size;
> -        nvme_inc_cq_tail(cq);
> -        pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> +        ret = pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
>              sizeof(req->cqe));
> +        if (ret) {
> +            trace_nvme_dev_err_addr_write(addr);
> +            timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> +                100 * SCALE_MS);
> +            break;
> +        }
> +        QTAILQ_REMOVE(&cq->req_list, req, entry);
> +        nvme_inc_cq_tail(cq);
>          nvme_req_clear(req);
>          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
>      }
> @@ -1588,7 +1607,12 @@ static void nvme_process_sq(void *opaque)
>  
>      while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
>          addr = sq->dma_addr + sq->head * n->sqe_size;
> -        nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
> +        if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> +            trace_nvme_dev_err_addr_read(addr);
> +            timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> +                100 * SCALE_MS);
> +            break;
> +        }

Note that once the driver is optimized for performance, these timers will have to go,
since they run on the main thread and add latency to each request.
But for now this change is fine.

As for a user deliberately triggering this every 100 ms, I don't think it is a big issue.
Maybe raise it to 500 ms or even one second, since this condition will not
occur in real-life usage of the device anyway.
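
As a sketch of what I have in mind (hypothetical helper, not part of the
patch): instead of a fixed 100 ms, the retry delay could back off toward a
cap, so repeated DMA failures cost the main loop progressively less:

```c
#include <assert.h>
#include <stdint.h>

#define RETRY_DELAY_MIN_MS 100
#define RETRY_DELAY_MAX_MS 1000

/*
 * Double the retry delay on each consecutive DMA failure, starting at
 * 100 ms and capped at 1 s; a successful transfer would reset it to 0.
 */
static uint64_t next_retry_delay_ms(uint64_t cur_ms)
{
    if (cur_ms == 0) {
        return RETRY_DELAY_MIN_MS;
    }

    return cur_ms * 2 > RETRY_DELAY_MAX_MS ? RETRY_DELAY_MAX_MS : cur_ms * 2;
}
```

The result would feed the existing `timer_mod(..., qemu_clock_get_ns(...) +
delay * SCALE_MS)` call in place of the constant.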

>          nvme_inc_sq_head(sq);
>  
>          req = QTAILQ_FIRST(&sq->req_list);
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 90a57fb6099a..09bfb3782dd0 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -83,6 +83,8 @@ nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
>  nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
>  nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
>  nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
> +nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
> +nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
>  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
>  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
>  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index c1de92179596..a873776d98b8 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -418,7 +418,7 @@ enum NvmeStatusCodes {
>      NVME_INVALID_OPCODE         = 0x0001,
>      NVME_INVALID_FIELD          = 0x0002,
>      NVME_CID_CONFLICT           = 0x0003,
> -    NVME_DATA_TRAS_ERROR        = 0x0004,
> +    NVME_DATA_TRANSFER_ERROR    = 0x0004,
>      NVME_POWER_LOSS_ABORT       = 0x0005,
>      NVME_INTERNAL_DEV_ERROR     = 0x0006,
>      NVME_CMD_ABORT_REQ          = 0x0007,


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 21/26] nvme: add support for scatter gather lists
  2020-02-04  9:52     ` [PATCH v5 21/26] nvme: add support for scatter gather lists Klaus Jensen
@ 2020-02-12 12:07       ` Maxim Levitsky
  2020-03-16  7:54         ` Klaus Birkelund Jensen
  0 siblings, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 12:07 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> For now, support the Data Block, Segment and Last Segment descriptor
> types.
> 
> See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> Acked-by: Fam Zheng <fam@euphon.net>
> ---
>  block/nvme.c          |  18 +-
>  hw/block/nvme.c       | 375 +++++++++++++++++++++++++++++++++++-------
>  hw/block/trace-events |   4 +
>  include/block/nvme.h  |  62 ++++++-
>  4 files changed, 389 insertions(+), 70 deletions(-)
> 
> diff --git a/block/nvme.c b/block/nvme.c
> index d41c4bda6e39..521f521054d5 100644
> --- a/block/nvme.c
> +++ b/block/nvme.c
> @@ -446,7 +446,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
>          error_setg(errp, "Cannot map buffer for DMA");
>          goto out;
>      }
> -    cmd.prp1 = cpu_to_le64(iova);
> +    cmd.dptr.prp.prp1 = cpu_to_le64(iova);
>  
>      if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
>          error_setg(errp, "Failed to identify controller");
> @@ -545,7 +545,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
>      }
>      cmd = (NvmeCmd) {
>          .opcode = NVME_ADM_CMD_CREATE_CQ,
> -        .prp1 = cpu_to_le64(q->cq.iova),
> +        .dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
>          .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
>          .cdw11 = cpu_to_le32(0x3),
>      };
> @@ -556,7 +556,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
>      }
>      cmd = (NvmeCmd) {
>          .opcode = NVME_ADM_CMD_CREATE_SQ,
> -        .prp1 = cpu_to_le64(q->sq.iova),
> +        .dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
>          .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
>          .cdw11 = cpu_to_le32(0x1 | (n << 16)),
>      };
> @@ -906,16 +906,16 @@ try_map:
>      case 0:
>          abort();
>      case 1:
> -        cmd->prp1 = pagelist[0];
> -        cmd->prp2 = 0;
> +        cmd->dptr.prp.prp1 = pagelist[0];
> +        cmd->dptr.prp.prp2 = 0;
>          break;
>      case 2:
> -        cmd->prp1 = pagelist[0];
> -        cmd->prp2 = pagelist[1];
> +        cmd->dptr.prp.prp1 = pagelist[0];
> +        cmd->dptr.prp.prp2 = pagelist[1];
>          break;
>      default:
> -        cmd->prp1 = pagelist[0];
> -        cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> +        cmd->dptr.prp.prp1 = pagelist[0];
> +        cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
>          break;
>      }
>      trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 204ae1d33234..a91c60fdc111 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -75,8 +75,10 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
>  
>  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
>  {
> -    if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> -        memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> +    hwaddr hi = addr + size;
Are you sure you don't want to check for overflow here?
It's a theoretical issue, since addr would have to be close to the full
64 bits, but I still check such things very defensively.
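Something along these lines (a compilable sketch with plain stdint types; in QEMU, hwaddr is a 64-bit type):

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* True iff [addr, addr + size) does not wrap past the end of the
 * 64-bit address space, so computing 'hi = addr + size' is safe. */
static bool addr_range_ok(hwaddr addr, uint64_t size)
{
    return size <= UINT64_MAX - addr;
}
```

This is the same shape as the `UINT64_MAX - addr < length` test you use later for the SGL descriptors.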

> +
> +    if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
Here you fix the bug I mentioned in patch 6. I suggest moving the fix there.
> +        memcpy(buf, nvme_addr_to_cmb(n, addr), size);
>          return 0;
>      }
>  
> @@ -159,6 +161,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
>      }
>  }
>  
> +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
> +    size_t len)
> +{
> +    if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> +        return NVME_DATA_TRANSFER_ERROR;
> +    }
> +
> +    qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> +    hwaddr addr, size_t len)
> +{
> +    bool addr_is_cmb = nvme_addr_is_cmb(n, addr);
> +
> +    if (addr_is_cmb) {
> +        if (qsg->sg) {
> +            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +        }
> +
> +        if (!iov->iov) {
> +            qemu_iovec_init(iov, 1);
> +        }
> +
> +        return nvme_map_addr_cmb(n, iov, addr, len);
> +    }
> +
> +    if (iov->iov) {
> +        return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +    }
> +
> +    if (!qsg->sg) {
> +        pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> +    }
> +
> +    qemu_sglist_add(qsg, addr, len);
> +
> +    return NVME_SUCCESS;
> +}

Very good refactoring. I would also suggest moving this to a separate
patch. I always put refactoring first and then patches that add features.

> +
>  static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
>      uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
>  {
> @@ -307,15 +351,240 @@ unmap:
>      return status;
>  }
>  
> -static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> -    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
> +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> +    QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
> +    uint32_t *len, NvmeRequest *req)
> +{
> +    dma_addr_t addr, trans_len;
> +    uint32_t length;
> +    uint16_t status;
> +
> +    for (int i = 0; i < nsgld; i++) {
> +        uint8_t type = NVME_SGL_TYPE(segment[i].type);
> +
> +        if (type != NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> +            switch (type) {
> +            case NVME_SGL_DESCR_TYPE_BIT_BUCKET:
> +            case NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK:
> +                return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> +            default:
> +                break;
> +            }
> +
> +            return NVME_INVALID_NUM_SGL_DESCRIPTORS | NVME_DNR;
Since the only way to reach the above statement is through that 'default',
why not move it there?
> +        }
> +
> +        if (*len == 0) {
> +            if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> +                trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> +                return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> +            }
> +
> +            break;
> +        }
> +
> +        addr = le64_to_cpu(segment[i].addr);
> +        length = le32_to_cpu(segment[i].len);
> +
> +        if (!length) {
> +            continue;
> +        }
> +
> +        if (UINT64_MAX - addr < length) {
> +            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> +        }
> +
> +        trans_len = MIN(*len, length);
> +
> +        status = nvme_map_addr(n, qsg, iov, addr, trans_len);
> +        if (status) {
> +            return status;
> +        }
> +
> +        *len -= trans_len;
> +    }
> +
> +    return NVME_SUCCESS;
> +}
> +
> +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> +    NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
Minor nitpick:
Usually structs are passed by reference (that is, by pointer in C);
however, I see that you modify 'sgl' inside the function.
IMHO this is a bit hard to read; I usually prefer not to modify input parameters.

> +{
> +    const int MAX_NSGLD = 256;

I personally would rename that const to something like SG_CHUNK_SIZE and add a comment,
since it is just an arbitrary chunk size you use to avoid dynamic memory allocation;
that way we avoid confusion with limits from the spec.
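Something like this (the name SG_CHUNK_SIZE is my suggestion, not anything from the spec):

```c
/*
 * SGL segments are read from guest memory in chunks of at most this
 * many descriptors (256 descriptors * 16 bytes = 4k), so a fixed
 * on-stack buffer can be used instead of a dynamic allocation.
 * This is purely an implementation choice; it does not correspond
 * to any limit in the NVMe specification.
 */
#define SG_CHUNK_SIZE 256
```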

> +
> +    NvmeSglDescriptor segment[MAX_NSGLD], *sgld, *last_sgld;
> +    uint64_t nsgld;
> +    uint32_t length;
> +    uint16_t status;
> +    bool sgl_in_cmb = false;
> +    hwaddr addr;
> +    int ret;
> +
> +    sgld = &sgl;
> +    addr = le64_to_cpu(sgl.addr);
> +
> +    trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), req->nlb,
> +        len);
> +
> +    /*
> +     * If the entire transfer can be described with a single data block it can
> +     * be mapped directly.
> +     */
> +    if (NVME_SGL_TYPE(sgl.type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> +        status = nvme_map_sgl_data(n, qsg, iov, sgld, 1, &len, req);
> +        if (status) {
> +            goto unmap;
> +        }
> +
> +        goto out;
> +    }
> +
> +    /*
> +     * If the segment is located in the CMB, the submission queue of the
> +     * request must also reside there.
> +     */
> +    if (nvme_addr_is_cmb(n, addr)) {
> +        if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> +            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +        }
> +
> +        sgl_in_cmb = true;
> +    }
> +
> +    for (;;) {
> +        length = le32_to_cpu(sgld->len);
> +
> +        if (!length || length & 0xf) {
> +            return NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
> +        }
> +
> +        if (UINT64_MAX - addr < length) {
I assume you check for overflow here. Looks like a very nice way to do it.
This should be adopted in a few more places.
> +            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> +        }
> +
> +        nsgld = length / sizeof(NvmeSglDescriptor);
> +
> +        /* read the segment in chunks of 256 descriptors (4k) */
That comment would be perfect to move/copy to the definition of MAX_NSGLD.

> +        while (nsgld > MAX_NSGLD) {
> +            if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
> +                trace_nvme_dev_err_addr_read(addr);
> +                status = NVME_DATA_TRANSFER_ERROR;
> +                goto unmap;
> +            }
> +
> +            status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, &len,
> +                req);
> +            if (status) {
> +                goto unmap;
> +            }
> +
> +            nsgld -= MAX_NSGLD;
> +            addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
> +        }
> +
> +        ret = nvme_addr_read(n, addr, segment, nsgld *
> +            sizeof(NvmeSglDescriptor));
A reminder to fix the line-split issues (align the sizeof arguments on the '(').

> +        if (ret) {
> +            trace_nvme_dev_err_addr_read(addr);
> +            status = NVME_DATA_TRANSFER_ERROR;
> +            goto unmap;
> +        }
> +
> +        last_sgld = &segment[nsgld - 1];
> +
> +        /* if the segment ends with a Data Block, then we are done */
> +        if (NVME_SGL_TYPE(last_sgld->type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> +            status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld, &len, req);
> +            if (status) {
> +                goto unmap;
> +            }
> +
> +            break;
> +        }
> +
> +        /* a Last Segment must end with a Data Block descriptor */
> +        if (NVME_SGL_TYPE(sgld->type) == NVME_SGL_DESCR_TYPE_LAST_SEGMENT) {
> +            status = NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
> +            goto unmap;
> +        }
> +
> +        sgld = last_sgld;
> +        addr = le64_to_cpu(sgld->addr);
> +
> +        /*
> +         * Do not map the last descriptor; it will be a Segment or Last Segment
> +         * descriptor instead and handled by the next iteration.
> +         */
> +        status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld - 1, &len, req);
> +        if (status) {
> +            goto unmap;
> +        }
> +
> +        /*
> +         * If the next segment is in the CMB, make sure that the sgl was
> +         * already located there.
> +         */
> +        if (sgl_in_cmb != nvme_addr_is_cmb(n, addr)) {
> +            status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> +            goto unmap;
> +        }
> +    }
> +
> +out:
> +    /* if there is any residual left in len, the SGL was too short */
> +    if (len) {
> +        status = NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> +        goto unmap;
> +    }
> +
> +    return NVME_SUCCESS;
> +
> +unmap:
> +    if (iov->iov) {
> +        qemu_iovec_destroy(iov);
> +    }
> +
> +    if (qsg->sg) {
> +        qemu_sglist_destroy(qsg);
> +    }
> +
> +    return status;
> +}
Looks good, much better than in V4


> +
> +static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> +    NvmeCmd *cmd, DMADirection dir, NvmeRequest *req)
>  {
>      uint16_t status = NVME_SUCCESS;
>      size_t bytes;
>  
> -    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> -    if (status) {
> -        return status;
> +    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> +    case PSDT_PRP:
> +        status = nvme_map_prp(n, &req->qsg, &req->iov,
> +            le64_to_cpu(cmd->dptr.prp.prp1), le64_to_cpu(cmd->dptr.prp.prp2),
> +            len, req);
> +        if (status) {
> +            return status;
> +        }
> +
> +        break;
> +
> +    case PSDT_SGL_MPTR_CONTIGUOUS:
> +    case PSDT_SGL_MPTR_SGL:
> +        if (!req->sq->sqid) {
> +            /* SGLs shall not be used for Admin commands in NVMe over PCIe */
> +            return NVME_INVALID_FIELD;
> +        }
> +
> +        status = nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len,
> +            req);
> +        if (status) {
> +            return status;
> +        }
Minor nitpick: you could probably refactor this into an 'err' label at the end of the function.
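That is, something along these lines (a compilable sketch; map_stub is a hypothetical stand-in for nvme_map_prp/nvme_map_sgl):

```c
#include <stdint.h>

#define NVME_SUCCESS       0x0
#define NVME_INVALID_FIELD 0x2

/* Stand-in for the PRP/SGL mapping helpers; returns a status code. */
static uint16_t map_stub(int fail)
{
    return fail ? NVME_INVALID_FIELD : NVME_SUCCESS;
}

/* Dispatch on the PSDT field and funnel all failures through one
 * 'err' label instead of repeating the 'if (status) return' block. */
static uint16_t dma_demo(int psdt, int fail)
{
    uint16_t status;

    switch (psdt) {
    case 0: /* PSDT_PRP */
        status = map_stub(fail);
        break;
    case 1: /* PSDT_SGL_MPTR_CONTIGUOUS */
    case 2: /* PSDT_SGL_MPTR_SGL */
        status = map_stub(fail);
        break;
    default:
        status = NVME_INVALID_FIELD;
        break;
    }

    if (status) {
        goto err;
    }

    /* ... perform the actual transfer ... */
    return NVME_SUCCESS;

err:
    return status;
}
```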
> +
> +        break;
> +
> +    default:
> +        return NVME_INVALID_FIELD;
>      }


>  
>      if (req->qsg.nsg > 0) {
> @@ -351,13 +620,21 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>  
>  static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
> -    NvmeNamespace *ns = req->ns;
> +    uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
> +    uint64_t prp1, prp2;
>  
> -    uint32_t len = req->nlb << nvme_ns_lbads(ns);
> -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> +    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> +    case PSDT_PRP:
> +        prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
> +        prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
>  
> -    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> +        return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> +    case PSDT_SGL_MPTR_CONTIGUOUS:
> +    case PSDT_SGL_MPTR_SGL:
> +        return nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len, req);
> +    default:
> +        return NVME_INVALID_FIELD;
> +    }
>  }
>  
>  static void nvme_aio_destroy(NvmeAIO *aio)
> @@ -972,8 +1249,6 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
>  static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>      uint32_t buf_len, uint64_t off, NvmeRequest *req)
>  {
> -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
>      uint32_t nsid = le32_to_cpu(cmd->nsid);
>  
>      uint32_t trans_len;
> @@ -1023,16 +1298,14 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>          nvme_clear_events(n, NVME_AER_TYPE_SMART);
>      }
>  
> -    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> -        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> +    return nvme_dma(n, (uint8_t *) &smart + off, trans_len, cmd,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
>      uint64_t off, NvmeRequest *req)
>  {
>      uint32_t trans_len;
> -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
>      NvmeFwSlotInfoLog fw_log;
>  
>      if (off > sizeof(fw_log)) {
> @@ -1043,8 +1316,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
>  
>      trans_len = MIN(sizeof(fw_log) - off, buf_len);
>  
> -    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> -        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> +    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len, cmd,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> @@ -1194,25 +1467,18 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
> -    NvmeRequest *req)
> +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
> -    uint64_t prp1 = le64_to_cpu(c->prp1);
> -    uint64_t prp2 = le64_to_cpu(c->prp2);
> -
>      trace_nvme_dev_identify_ctrl();
>  
> -    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> +    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl), cmd,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> -    NvmeRequest *req)
> +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
>      NvmeNamespace *ns;
> -    uint32_t nsid = le32_to_cpu(c->nsid);
> -    uint64_t prp1 = le64_to_cpu(c->prp1);
> -    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    uint32_t nsid = le32_to_cpu(cmd->nsid);
>  
>      trace_nvme_dev_identify_ns(nsid);
>  
> @@ -1223,17 +1489,15 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
>  
>      ns = &n->namespaces[nsid - 1];
>  
> -    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> +    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
>      static const int data_len = 4 * KiB;
> -    uint32_t min_nsid = le32_to_cpu(c->nsid);
> -    uint64_t prp1 = le64_to_cpu(c->prp1);
> -    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    uint32_t min_nsid = le32_to_cpu(cmd->nsid);
>      uint32_t *list;
>      uint16_t ret;
>      int i, j = 0;
> @@ -1250,13 +1514,13 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
>              break;
>          }
>      }
> -    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
> +    ret = nvme_dma(n, (uint8_t *) list, data_len, cmd,
>          DMA_DIRECTION_FROM_DEVICE, req);
>      g_free(list);
>      return ret;
>  }
>  
> -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
>      static const int len = 4096;
> @@ -1268,9 +1532,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
>          uint8_t nid[16];
>      };
>  
> -    uint32_t nsid = le32_to_cpu(c->nsid);
> -    uint64_t prp1 = le64_to_cpu(c->prp1);
> -    uint64_t prp2 = le64_to_cpu(c->prp2);
> +    uint32_t nsid = le32_to_cpu(cmd->nsid);
>  
>      struct ns_descr *list;
>      uint16_t ret;
> @@ -1293,8 +1555,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
>      list->nidl = 0x10;
>      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
>  
> -    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
> -        DMA_DIRECTION_FROM_DEVICE, req);
> +    ret = nvme_dma(n, (uint8_t *) list, len, cmd, DMA_DIRECTION_FROM_DEVICE,
> +        req);
>      g_free(list);
>      return ret;
>  }
> @@ -1305,13 +1567,13 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  
>      switch (le32_to_cpu(c->cns)) {
>      case 0x00:
> -        return nvme_identify_ns(n, c, req);
> +        return nvme_identify_ns(n, cmd, req);
>      case 0x01:
> -        return nvme_identify_ctrl(n, c, req);
> +        return nvme_identify_ctrl(n, cmd, req);
>      case 0x02:
> -        return nvme_identify_ns_list(n, c, req);
> +        return nvme_identify_ns_list(n, cmd, req);
>      case 0x03:
> -        return nvme_identify_ns_descr_list(n, c, req);
> +        return nvme_identify_ns_descr_list(n, cmd, req);
>      default:
>          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1373,13 +1635,10 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
>  static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
>      NvmeRequest *req)
>  {
> -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> -
>      uint64_t timestamp = nvme_get_timestamp(n);
>  
> -    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
> -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> +    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp), cmd,
> +        DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
>  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> @@ -1462,11 +1721,9 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
>  {
>      uint16_t ret;
>      uint64_t timestamp;
> -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
>  
> -    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
> -        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
> +    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp), cmd,
> +        DMA_DIRECTION_TO_DEVICE, req);
>      if (ret != NVME_SUCCESS) {
>          return ret;
>      }
> @@ -2232,6 +2489,8 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>          id->vwc = 1;
>      }
>  
> +    id->sgls = cpu_to_le32(0x1);
Being part of the spec, it would be nice to #define this as well.
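For instance (the macro name here is my invention; the value is from the SGL Support field of Identify Controller in NVMe 1.3, where bits 1:0 = 01b mean SGLs are supported with no alignment requirement on data blocks):

```c
/* SGLS bits 1:0 = 01b: SGLs supported, no alignment or granularity
 * requirement on data blocks (NVMe 1.3, Identify Controller, SGLS). */
#define NVME_CTRL_SGLS_SUPPORT_NO_ALIGN 0x1

/* Extraction macro as added by this patch in include/block/nvme.h. */
#define NVME_CTRL_SGLS_SUPPORTED(sgls)  ((sgls) & 0x3)
```

so the assignment would read `id->sgls = cpu_to_le32(NVME_CTRL_SGLS_SUPPORT_NO_ALIGN);`.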
> +
>      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
>      pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
>  
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 09bfb3782dd0..81d69e15fc32 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -34,6 +34,7 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
>  nvme_dev_irq_masked(void) "IRQ is masked"
>  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
>  nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1
> 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> +nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"PRIu16" type 0x%"PRIx8" nlb %"PRIu32" len %"PRIu64""
>  nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count
> %"PRIu64" opc \"%s\" req %p"
>  nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
>  nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> @@ -85,6 +86,9 @@ nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
>  nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p
> status 0x%"PRIx16""
>  nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
>  nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
> +nvme_dev_err_invalid_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
> +nvme_dev_err_invalid_num_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
> +nvme_dev_err_invalid_sgl_excess_length(uint16_t cid) "cid %"PRIu16""
>  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
>  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
>  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index a873776d98b8..dbdeecf82358 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -205,15 +205,53 @@ enum NvmeCmbszMask {
>  #define NVME_CMBSZ_GETSIZE(cmbsz) \
>      (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz))))
>  
> +enum NvmeSglDescriptorType {
> +    NVME_SGL_DESCR_TYPE_DATA_BLOCK           = 0x0,
> +    NVME_SGL_DESCR_TYPE_BIT_BUCKET           = 0x1,
> +    NVME_SGL_DESCR_TYPE_SEGMENT              = 0x2,
> +    NVME_SGL_DESCR_TYPE_LAST_SEGMENT         = 0x3,
> +    NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK     = 0x4,
> +
> +    NVME_SGL_DESCR_TYPE_VENDOR_SPECIFIC      = 0xf,
> +};
> +
> +enum NvmeSglDescriptorSubtype {
> +    NVME_SGL_DESCR_SUBTYPE_ADDRESS = 0x0,
> +};
> +
> +typedef struct NvmeSglDescriptor {
> +    uint64_t addr;
> +    uint32_t len;
> +    uint8_t  rsvd[3];
> +    uint8_t  type;
> +} NvmeSglDescriptor;

I suggest you add a build-time struct size check for this,
just in case the compiler tries something funny
(look at _nvme_check_size in nvme.h).

Also, I think the spec update that adds NvmeSglDescriptor should be split into a separate
patch (or better, be added in one big patch that adds all the 1.3d features),
which would also make it easier to see the changes that touch the other nvme driver we have.
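A sketch of the size check I mean, using C11 static_assert here where nvme.h uses QEMU_BUILD_BUG_ON:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>

typedef struct NvmeSglDescriptor {
    uint64_t addr;
    uint32_t len;
    uint8_t  rsvd[3];
    uint8_t  type;
} NvmeSglDescriptor;

/* The on-the-wire descriptor is exactly 16 bytes; fail the build if
 * the compiler pads the struct differently. */
static_assert(sizeof(NvmeSglDescriptor) == 16,
              "NvmeSglDescriptor must be 16 bytes");
```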

> +
> +#define NVME_SGL_TYPE(type)     ((type >> 4) & 0xf)
> +#define NVME_SGL_SUBTYPE(type)  (type & 0xf)
> +
> +typedef union NvmeCmdDptr {
> +    struct {
> +        uint64_t    prp1;
> +        uint64_t    prp2;
> +    } prp;
> +
> +    NvmeSglDescriptor sgl;
> +} NvmeCmdDptr;
> +
> +enum NvmePsdt {
> +    PSDT_PRP                 = 0x0,
> +    PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
> +    PSDT_SGL_MPTR_SGL        = 0x2,
> +};
> +
>  typedef struct NvmeCmd {
>      uint8_t     opcode;
> -    uint8_t     fuse;
> +    uint8_t     flags;
>      uint16_t    cid;
>      uint32_t    nsid;
>      uint64_t    res1;
>      uint64_t    mptr;
> -    uint64_t    prp1;
> -    uint64_t    prp2;
> +    NvmeCmdDptr dptr;
>      uint32_t    cdw10;
>      uint32_t    cdw11;
>      uint32_t    cdw12;
> @@ -222,6 +260,9 @@ typedef struct NvmeCmd {
>      uint32_t    cdw15;
>  } NvmeCmd;
>  
> +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> +
>  enum NvmeAdminCommands {
>      NVME_ADM_CMD_DELETE_SQ      = 0x00,
>      NVME_ADM_CMD_CREATE_SQ      = 0x01,
> @@ -427,6 +468,11 @@ enum NvmeStatusCodes {
>      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
>      NVME_INVALID_NSID           = 0x000b,
>      NVME_CMD_SEQ_ERROR          = 0x000c,
> +    NVME_INVALID_SGL_SEG_DESCRIPTOR  = 0x000d,
> +    NVME_INVALID_NUM_SGL_DESCRIPTORS = 0x000e,
> +    NVME_DATA_SGL_LENGTH_INVALID     = 0x000f,
> +    NVME_METADATA_SGL_LENGTH_INVALID = 0x0010,
> +    NVME_SGL_DESCRIPTOR_TYPE_INVALID = 0x0011,
>      NVME_INVALID_USE_OF_CMB     = 0x0012,
>      NVME_LBA_RANGE              = 0x0080,
>      NVME_CAP_EXCEEDED           = 0x0081,
> @@ -623,6 +669,16 @@ enum NvmeIdCtrlOncs {
>  #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
>  #define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
>  
> +#define NVME_CTRL_SGLS_SUPPORTED(sgls)                 ((sgls) & 0x3)
> +#define NVME_CTRL_SGLS_SUPPORTED_NO_ALIGNMENT(sgls)    ((sgls) & (0x1 <<  0))
> +#define NVME_CTRL_SGLS_SUPPORTED_DWORD_ALIGNMENT(sgls) ((sgls) & (0x1 <<  1))
> +#define NVME_CTRL_SGLS_KEYED(sgls)                     ((sgls) & (0x1 <<  2))
> +#define NVME_CTRL_SGLS_BITBUCKET(sgls)                 ((sgls) & (0x1 << 16))
> +#define NVME_CTRL_SGLS_MPTR_CONTIGUOUS(sgls)           ((sgls) & (0x1 << 17))
> +#define NVME_CTRL_SGLS_EXCESS_LENGTH(sgls)             ((sgls) & (0x1 << 18))
> +#define NVME_CTRL_SGLS_MPTR_SGL(sgls)                  ((sgls) & (0x1 << 19))
> +#define NVME_CTRL_SGLS_ADDR_OFFSET(sgls)               ((sgls) & (0x1 << 20))
> +
>  typedef struct NvmeFeatureVal {
>      uint32_t    arbitration;
>      uint32_t    power_mgmt;

Best regards,
	Maxim Levitsky




* Re: [PATCH v5 22/26] nvme: support multiple namespaces
  2020-02-04  9:52     ` [PATCH v5 22/26] nvme: support multiple namespaces Klaus Jensen
  2020-02-04 16:31       ` Keith Busch
@ 2020-02-12 12:34       ` Maxim Levitsky
  2020-03-16  7:55         ` Klaus Birkelund Jensen
  1 sibling, 1 reply; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 12:34 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> This adds support for multiple namespaces by introducing a new 'nvme-ns'
> device model. The nvme device creates a bus named from the device name
> ('id'). The nvme-ns devices then connect to this and registers
> themselves with the nvme device.
> 
> This changes how an nvme device is created. Example with two namespaces:
> 
>   -drive file=nvme0n1.img,if=none,id=disk1
>   -drive file=nvme0n2.img,if=none,id=disk2
>   -device nvme,serial=deadbeef,id=nvme0
>   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
>   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> 
> The drive property is kept on the nvme device to keep the change
> backward compatible, but the property is now optional. Specifying a
> drive for the nvme device will always create the namespace with nsid 1.
Very reasonable way to do it. 
> 
> Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/Makefile.objs |   2 +-
>  hw/block/nvme-ns.c     | 158 +++++++++++++++++++++++++++
>  hw/block/nvme-ns.h     |  60 +++++++++++
>  hw/block/nvme.c        | 235 +++++++++++++++++++++++++----------------
>  hw/block/nvme.h        |  47 ++++-----
>  hw/block/trace-events  |   6 +-
>  6 files changed, 389 insertions(+), 119 deletions(-)
>  create mode 100644 hw/block/nvme-ns.c
>  create mode 100644 hw/block/nvme-ns.h
> 
> diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> index 28c2495a00dc..45f463462f1e 100644
> --- a/hw/block/Makefile.objs
> +++ b/hw/block/Makefile.objs
> @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
>  common-obj-$(CONFIG_XEN) += xen-block.o
>  common-obj-$(CONFIG_ECC) += ecc.o
>  common-obj-$(CONFIG_ONENAND) += onenand.o
> -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
>  common-obj-$(CONFIG_SWIM) += swim.o
>  
>  obj-$(CONFIG_SH4) += tc58128.o
> diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> new file mode 100644
> index 000000000000..0e5be44486f4
> --- /dev/null
> +++ b/hw/block/nvme-ns.c
> @@ -0,0 +1,158 @@
> +#include "qemu/osdep.h"
> +#include "qemu/units.h"
> +#include "qemu/cutils.h"
> +#include "qemu/log.h"
> +#include "hw/block/block.h"
> +#include "hw/pci/msix.h"
Do you need this include?
> +#include "sysemu/sysemu.h"
> +#include "sysemu/block-backend.h"
> +#include "qapi/error.h"
> +
> +#include "hw/qdev-properties.h"
> +#include "hw/qdev-core.h"
> +
> +#include "nvme.h"
> +#include "nvme-ns.h"
> +
> +static int nvme_ns_init(NvmeNamespace *ns)
> +{
> +    NvmeIdNs *id_ns = &ns->id_ns;
> +
> +    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> +    id_ns->nuse = id_ns->ncap = id_ns->nsze =
> +        cpu_to_le64(nvme_ns_nlbas(ns));
Nitpick: To be honest I don't really like that chained assignment,
especially since it forces the line to wrap, but that is just my
personal taste.
> +
> +    return 0;
> +}
> +
> +static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
> +    Error **errp)
> +{
> +    uint64_t perm, shared_perm;
> +
> +    Error *local_err = NULL;
> +    int ret;
> +
> +    perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> +    shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> +        BLK_PERM_GRAPH_MOD;
> +
> +    ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
> +    if (ret) {
> +        error_propagate_prepend(errp, local_err, "blk_set_perm: ");
> +        return ret;
> +    }

You should consider using blkconf_apply_backend_options.
Take a look at virtio_blk_device_realize, for example.
That will give you support for read-only block devices as well.

I personally have only grazed the area of block permissions once,
so I would prefer someone from the block layer to review this as well.

> +
> +    ns->size = blk_getlength(ns->blk);
> +    if (ns->size < 0) {
> +        error_setg_errno(errp, -ns->size, "blk_getlength");
> +        return 1;
> +    }
> +
> +    switch (n->conf.wce) {
> +    case ON_OFF_AUTO_ON:
> +        n->features.volatile_wc = 1;
> +        break;
> +    case ON_OFF_AUTO_OFF:
> +        n->features.volatile_wc = 0;
> +    case ON_OFF_AUTO_AUTO:
> +        n->features.volatile_wc = blk_enable_write_cache(ns->blk);
> +        break;
> +    default:
> +        abort();
> +    }
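One more thing I notice in this hunk: the ON_OFF_AUTO_OFF case is missing a 'break', so it falls through into the AUTO case and the assignment of 0 is immediately overwritten. A standalone model of the fixed logic (backend_wce stands in for blk_enable_write_cache):

```c
typedef enum {
    ON_OFF_AUTO_AUTO,
    ON_OFF_AUTO_ON,
    ON_OFF_AUTO_OFF,
} OnOffAuto;

/* Write-cache selection as in the hunk above, with the missing
 * 'break' after the OFF case restored. */
static int volatile_wc(OnOffAuto wce, int backend_wce)
{
    switch (wce) {
    case ON_OFF_AUTO_ON:
        return 1;
    case ON_OFF_AUTO_OFF:
        return 0;               /* without the break, AUTO wins here */
    case ON_OFF_AUTO_AUTO:
    default:
        return backend_wce;
    }
}
```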
> +
> +    blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
> +
> +    return 0;

Nitpick: I also just noticed that you call the controller 'n'; I hadn't paid attention to this
before. I think something like 'ctrl' or 'ctl' would be more readable.

> +}
> +
> +static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
> +{
> +    if (!ns->blk) {
> +        error_setg(errp, "block backend not configured");
> +        return 1;
> +    }
> +
> +    return 0;
> +}
> +
> +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> +{
> +    Error *local_err = NULL;
> +
> +    if (nvme_ns_check_constraints(ns, &local_err)) {
> +        error_propagate_prepend(errp, local_err,
> +            "nvme_ns_check_constraints: ");
> +        return 1;
> +    }
> +
> +    if (nvme_ns_init_blk(n, ns, &n->id_ctrl, &local_err)) {
> +        error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
> +        return 1;
> +    }
> +
> +    nvme_ns_init(ns);
> +    if (nvme_register_namespace(n, ns, &local_err)) {
> +        error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
> +        return 1;
> +    }
> +
> +    return 0;

Nitpick: to be honest I am not sure we want to expose internal function names like that;
error hints are supposed to be readable to a user who doesn't look at the source.

> +}
> +
> +static void nvme_ns_realize(DeviceState *dev, Error **errp)
> +{
> +    NvmeNamespace *ns = NVME_NS(dev);
> +    BusState *s = qdev_get_parent_bus(dev);
> +    NvmeCtrl *n = NVME(s->parent);

Nitpick: I don't know if you defined this or it was always like that,
but I would prefer something like NVME_CTL instead.

> +    Error *local_err = NULL;
> +
> +    if (nvme_ns_setup(n, ns, &local_err)) {
> +        error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> +        return;
> +    }
> +}
> +
> +static Property nvme_ns_props[] = {
> +    DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),

If you go with my suggestion to use blkconf, you would use
DEFINE_BLOCK_PROPERTIES_BASE here.

> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void nvme_ns_class_init(ObjectClass *oc, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(oc);
> +
> +    set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> +
> +    dc->bus_type = TYPE_NVME_BUS;
> +    dc->realize = nvme_ns_realize;
> +    device_class_set_props(dc, nvme_ns_props);
> +    dc->desc = "virtual nvme namespace";
> +}

Looks reasonable.
I don't know the device/bus model in depth, to be honest
(I studied it for a few days some time ago though),
so a review from someone who knows this area better than I do
is very welcome.

> +
> +static void nvme_ns_instance_init(Object *obj)
> +{
> +    NvmeNamespace *ns = NVME_NS(obj);
> +    char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
> +
> +    device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
> +        bootindex, DEVICE(obj), &error_abort);
> +
> +    g_free(bootindex);
> +}
> +
> +static const TypeInfo nvme_ns_info = {
> +    .name = TYPE_NVME_NS,
> +    .parent = TYPE_DEVICE,
> +    .class_init = nvme_ns_class_init,
> +    .instance_size = sizeof(NvmeNamespace),
> +    .instance_init = nvme_ns_instance_init,
> +};
> +
> +static void nvme_ns_register_types(void)
> +{
> +    type_register_static(&nvme_ns_info);
> +}
> +
> +type_init(nvme_ns_register_types)
> diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> new file mode 100644
> index 000000000000..b564bac25f6d
> --- /dev/null
> +++ b/hw/block/nvme-ns.h
> @@ -0,0 +1,60 @@
> +#ifndef NVME_NS_H
> +#define NVME_NS_H
> +
> +#define TYPE_NVME_NS "nvme-ns"
> +#define NVME_NS(obj) \
> +    OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
> +
> +#define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> +    DEFINE_PROP_DRIVE("drive", _state, blk), \
> +    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> +
> +typedef struct NvmeNamespaceParams {
> +    uint32_t nsid;
> +} NvmeNamespaceParams;
> +
> +typedef struct NvmeNamespace {
> +    DeviceState  parent_obj;
> +    BlockBackend *blk;
> +    int32_t      bootindex;
> +    int64_t      size;
> +
> +    NvmeIdNs            id_ns;
> +    NvmeNamespaceParams params;
> +} NvmeNamespace;
> +
> +static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> +{
> +    if (ns) {
> +        return ns->params.nsid;
> +    }
> +
> +    return -1;
> +}

To be honest, I would allow the user to omit the nsid,
and in that case pick a free slot out of the valid namespaces.

Let me explain the concept of valid/allocated/active namespaces
from the spec, as written in my summary:

Valid namespaces are the 1..N range of namespaces as reported in IDCTRL.NN.
That value is static, and it should either be set to some arbitrary large value (say 256)
or set using a qemu device parameter, and not changed dynamically as you currently do.
As I understand it, the IDCTRL output should not change during the lifetime of the controller,
although I didn't find exact confirmation of this in the spec.

Allocated namespaces are not relevant to us (this is only used for namespace management);
these are namespaces that exist but are not attached to the controller.

And then you have active namespaces, which are the namespaces the user can actually address.

However, if I understand this correctly, the NVMe 'bus' currently doesn't
support hotplug, thus all namespaces will already be plugged in on
VM startup, so the issue doesn't really exist yet.
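The auto-allocation suggested above can be sketched in a few lines. This is a hypothetical, simplified illustration (plain C, no QEMU types): `slots[]` stands in for the controller's `namespaces[]` array, and the function name is illustrative, not from the patch.

```c
#include <assert.h>
#include <stdint.h>

#define NVME_MAX_NAMESPACES 256

/* Hypothetical sketch: when the user omits the nsid, pick the lowest free
 * 1-based slot among the valid namespaces. Returns 0 if every slot is
 * already taken, which the caller would turn into an error. */
static uint32_t nvme_pick_free_nsid(void *slots[NVME_MAX_NAMESPACES])
{
    for (uint32_t nsid = 1; nsid <= NVME_MAX_NAMESPACES; nsid++) {
        if (!slots[nsid - 1]) {
            return nsid;
        }
    }
    return 0; /* no free slot among the valid namespaces */
}
```

In the real device this would run in `nvme_register_namespace` before the uniqueness check, only when `params.nsid == 0`.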



> +
> +static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> +{
> +    NvmeIdNs *id_ns = &ns->id_ns;
> +    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> +}
> +
> +static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> +{
> +    return nvme_ns_lbaf(ns).ds;
> +}
> +
> +static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> +{
> +    return 1 << nvme_ns_lbads(ns);
> +}
> +
> +static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
> +{
> +    return ns->size >> nvme_ns_lbads(ns);
> +}
> +
> +typedef struct NvmeCtrl NvmeCtrl;
> +
> +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
> +
> +#endif /* NVME_NS_H */
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index a91c60fdc111..3a377bc56734 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -17,10 +17,11 @@
>  /**
>   * Usage: add options:
>   *      -drive file=<file>,if=none,id=<drive_id>
> - *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
> + *      -device nvme,serial=<serial>,id=<bus_name>, \
>   *              cmb_size_mb=<cmb_size_mb[optional]>, \
>   *              num_queues=<N[optional]>, \
>   *              mdts=<mdts[optional]>
> + *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=1
>   *
>   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
>   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> @@ -28,6 +29,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/units.h"
> +#include "qemu/error-report.h"
>  #include "hw/block/block.h"
>  #include "hw/pci/msix.h"
>  #include "hw/pci/pci.h"
> @@ -43,6 +45,7 @@
>  #include "qemu/cutils.h"
>  #include "trace.h"
>  #include "nvme.h"
> +#include "nvme-ns.h"
>  
>  #define NVME_SPEC_VER 0x00010300
>  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> @@ -85,6 +88,17 @@ static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
>      return pci_dma_read(&n->parent_obj, addr, buf, size);
>  }
>  
> +static uint16_t nvme_nsid_err(NvmeCtrl *n, uint32_t nsid)
> +{
> +    if (nsid && nsid < n->num_namespaces) {
> +        trace_nvme_dev_err_inactive_ns(nsid, n->num_namespaces);
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> +    return NVME_INVALID_NSID | NVME_DNR;
> +}

I don't like that function, to be honest.
This function is called when nvme_ns returns NULL.
IMHO it would be better to make nvme_ns return both the namespace pointer and an error code instead.
In the kernel, we encode error values into the returned pointer.

> +
>  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
>  {
>      return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
> @@ -889,7 +903,7 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
>      uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
>  
>      if (unlikely((slba + nlb) > nsze)) {
> -        block_acct_invalid(blk_get_stats(n->conf.blk),
> +        block_acct_invalid(blk_get_stats(ns->blk),
>              nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
>          trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
>          return NVME_LBA_RANGE | NVME_DNR;
> @@ -924,11 +938,12 @@ static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
>  
>  static void nvme_rw_cb(NvmeRequest *req, void *opaque)
>  {
> +    NvmeNamespace *ns = req->ns;
>      NvmeSQueue *sq = req->sq;
>      NvmeCtrl *n = sq->ctrl;
>      NvmeCQueue *cq = n->cq[sq->cqid];
>  
> -    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
> +    trace_nvme_dev_rw_cb(nvme_cid(req), nvme_nsid(ns));
>  
>      nvme_enqueue_req_completion(cq, req);
>  }
> @@ -1011,10 +1026,11 @@ static void nvme_aio_cb(void *opaque, int ret)
>  
>  static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
> +    NvmeNamespace *ns = req->ns;
>      NvmeAIO *aio = g_new0(NvmeAIO, 1);
>  
>      *aio = (NvmeAIO) {
> -        .blk = n->conf.blk,
> +        .blk = ns->blk,
>          .req = req,
>      };
>  
> @@ -1038,12 +1054,12 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      req->slba = le64_to_cpu(rw->slba);
>      req->nlb  = le16_to_cpu(rw->nlb) + 1;
>  
> -    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
> -        req->slba, req->nlb);
> +    trace_nvme_dev_write_zeros(nvme_cid(req), nvme_nsid(ns), req->slba,
> +        req->nlb);
>  
>      status = nvme_check_bounds(n, req->slba, req->nlb, req);
>      if (unlikely(status)) {
> -        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
> +        block_acct_invalid(blk_get_stats(ns->blk), BLOCK_ACCT_WRITE);
>          return status;
>      }
>  
> @@ -1053,7 +1069,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      aio = g_new0(NvmeAIO, 1);
>  
>      *aio = (NvmeAIO) {
> -        .blk = n->conf.blk,
> +        .blk = ns->blk,
>          .offset = offset,
>          .len = count,
>          .req = req,
> @@ -1077,22 +1093,23 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      req->nlb  = le16_to_cpu(rw->nlb) + 1;
>      req->slba = le64_to_cpu(rw->slba);
>  
> -    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
> -        req->nlb << nvme_ns_lbads(req->ns), req->slba);
> +    trace_nvme_dev_rw(nvme_cid(req), nvme_req_is_write(req) ? "write" : "read",
> +        nvme_nsid(ns), req->nlb, req->nlb << nvme_ns_lbads(ns),
> +        req->slba);
>  
>      status = nvme_check_rw(n, req);
>      if (status) {
> -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> +        block_acct_invalid(blk_get_stats(ns->blk), acct);
>          return status;
>      }
>  
>      status = nvme_map(n, cmd, req);
>      if (status) {
> -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> +        block_acct_invalid(blk_get_stats(ns->blk), acct);
>          return status;
>      }
>  
> -    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
> +    nvme_rw_aio(ns->blk, req->slba << nvme_ns_lbads(ns), req);
>      nvme_req_set_cb(req, nvme_rw_cb, NULL);
>  
>      return NVME_NO_COMPLETE;
> @@ -1105,12 +1122,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
>          cmd->opcode);
>  
> -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> -        return NVME_INVALID_NSID | NVME_DNR;
> -    }
> +    req->ns = nvme_ns(n, nsid);
>  
> -    req->ns = &n->namespaces[nsid - 1];
> +    if (unlikely(!req->ns)) {
> +        return nvme_nsid_err(n, nsid);
> +    }
>  
>      switch (cmd->opcode) {
>      case NVME_CMD_FLUSH:
> @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>      uint64_t units_read = 0, units_written = 0, read_commands = 0,
>          write_commands = 0;
>      NvmeSmartLog smart;
> -    BlockAcctStats *s;
>  
>      if (nsid && nsid != 0xffffffff) {
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  
> -    s = blk_get_stats(n->conf.blk);
> +    for (int i = 1; i <= n->num_namespaces; i++) {
> +        NvmeNamespace *ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
>  
> -    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> -    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> -    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> -    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> +        BlockAcctStats *s = blk_get_stats(ns->blk);
> +
> +        units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> +        units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> +        read_commands += s->nr_ops[BLOCK_ACCT_READ];
> +        write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
> +    }
Very minor nitpick: something to do in the future would be to report the statistics per namespace.
>  
>      if (off > sizeof(smart)) {
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1477,19 +1499,25 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  
>  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
> -    NvmeNamespace *ns;
> +    NvmeIdNs *id_ns, inactive = { 0 };
>      uint32_t nsid = le32_to_cpu(cmd->nsid);
> +    NvmeNamespace *ns = nvme_ns(n, nsid);
>  
>      trace_nvme_dev_identify_ns(nsid);
>  
> -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> -        return NVME_INVALID_NSID | NVME_DNR;
> +    if (unlikely(!ns)) {
> +        uint16_t status = nvme_nsid_err(n, nsid);
> +
> +        if (!nvme_status_is_error(status, NVME_INVALID_FIELD)) {
> +            return status;
> +        }
I really don't like checking the error value like that.
It would be better IMHO to have something like
nvme_is_valid_ns or nvme_is_active_ns.

> +
> +        id_ns = &inactive;
> +    } else {
> +        id_ns = &ns->id_ns;
>      }
>  
> -    ns = &n->namespaces[nsid - 1];
> -
> -    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
> +    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs), cmd,
>          DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> @@ -1505,11 +1533,11 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
>      trace_nvme_dev_identify_ns_list(min_nsid);
>  
>      list = g_malloc0(data_len);
> -    for (i = 0; i < n->num_namespaces; i++) {
> -        if (i < min_nsid) {
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        if (i <= min_nsid || !nvme_ns(n, i)) {
>              continue;
>          }
> -        list[j++] = cpu_to_le32(i + 1);
> +        list[j++] = cpu_to_le32(i);
>          if (j == data_len / sizeof(uint32_t)) {
>              break;
>          }
The refactoring part (removing that +1), which is very nice IMHO, should move
to one of the earlier refactoring patches.

> @@ -1539,9 +1567,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
>  
>      trace_nvme_dev_identify_ns_descr_list(nsid);
>  
> -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> -        return NVME_INVALID_NSID | NVME_DNR;
> +    if (unlikely(!nvme_ns(n, nsid))) {
> +        return nvme_nsid_err(n, nsid);
>      }
>  
>      /*
> @@ -1681,7 +1708,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          result = cpu_to_le32(n->features.err_rec);
>          break;
>      case NVME_VOLATILE_WRITE_CACHE:
> -        result = blk_enable_write_cache(n->conf.blk);
> +        result = cpu_to_le32(n->features.volatile_wc);
OK, this fixes the lack of endianness conversion I pointed out in patch 12.
>          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
>          break;
>      case NVME_NUMBER_OF_QUEUES:
> @@ -1735,6 +1762,8 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
>  
>  static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  {
> +    NvmeNamespace *ns;
> +
>      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
>      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
>  
> @@ -1766,8 +1795,19 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  
>          break;
>      case NVME_VOLATILE_WRITE_CACHE:
> -        blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> +        n->features.volatile_wc = dw11;
> +
> +        for (int i = 1; i <= n->num_namespaces; i++) {
> +            ns = nvme_ns(n, i);
> +            if (!ns) {
> +                continue;
> +            }
> +
> +            blk_set_enable_write_cache(ns->blk, dw11 & 1);
> +        }
> +
Features are per namespace (page 79 in the spec), so this
is a good candidate for a per-namespace feature.
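A per-namespace variant of this feature could be modeled as below. A hypothetical, self-contained sketch (plain C, illustrative names — not QEMU's types): each simplified namespace keeps its own write-cache bit, and the NVMe broadcast NSID 0xffffffff updates them all while a specific NSID updates one:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool vwc; /* per-namespace volatile write cache state */
} DemoNamespace;

static void demo_set_vwc(DemoNamespace *ns, uint32_t count,
                         uint32_t nsid, bool enable)
{
    if (nsid == 0xffffffff) {
        /* broadcast NSID: apply to all namespaces */
        for (uint32_t i = 0; i < count; i++) {
            ns[i].vwc = enable;
        }
    } else if (nsid >= 1 && nsid <= count) {
        /* specific NSID: apply to just that namespace */
        ns[nsid - 1].vwc = enable;
    }
}
```

In the device, the per-namespace bit would then drive `blk_set_enable_write_cache` on that namespace's block backend only.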

>          break;
> +
>      case NVME_NUMBER_OF_QUEUES:
>          if (n->qs_created) {
>              return NVME_CMD_SEQ_ERROR | NVME_DNR;
> @@ -1890,9 +1930,17 @@ static void nvme_process_sq(void *opaque)
>  
>  static void nvme_clear_ctrl(NvmeCtrl *n)
>  {
> +    NvmeNamespace *ns;
>      int i;
>  
> -    blk_drain(n->conf.blk);
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +
> +        blk_drain(ns->blk);
> +    }
>  
>      for (i = 0; i < n->params.num_queues; i++) {
>          if (n->sq[i] != NULL) {
> @@ -1915,7 +1963,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
>      n->outstanding_aers = 0;
>      n->qs_created = false;
>  
> -    blk_flush(n->conf.blk);
> +    for (i = 1; i <= n->num_namespaces; i++) {
> +        ns = nvme_ns(n, i);
> +        if (!ns) {
> +            continue;
> +        }
> +
> +        blk_flush(ns->blk);
> +    }
> +
>      n->bar.cc = 0;
>  }
>  
> @@ -2335,8 +2391,8 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
>  {
>      NvmeParams *params = &n->params;
>  
> -    if (!n->conf.blk) {
> -        error_setg(errp, "nvme: block backend not configured");
> +    if (!n->namespace.blk && !n->parent_obj.qdev.id) {
> +        error_setg(errp, "nvme: invalid 'id' parameter");

Nitpick: I think QEMU usually allows the user to shoot themselves in the foot and specify a device
without an ID (to which you then can't attach devices), so I think this check is not needed.
You also probably mean 'missing ID'.

>          return 1;
>      }
>  
> @@ -2353,22 +2409,10 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
>      return 0;
>  }
>  
> -static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> -{
> -    blkconf_blocksizes(&n->conf);
> -    if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
> -        false, errp)) {
> -        return 1;
> -    }
> -
> -    return 0;
> -}
> -
>  static void nvme_init_state(NvmeCtrl *n)
>  {
> -    n->num_namespaces = 1;
> +    n->num_namespaces = 0;

And to repeat what I said earlier: since the number of valid namespaces should remain static,
here you should just initialize this to NVME_MAX_NAMESPACES, and remove the code
that changes IDCTRL.NN dynamically.


>      n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> -    n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
>      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
>      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
>  
> @@ -2483,12 +2527,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>      id->cqes = (0x4 << 4) | 0x4;
>      id->nn = cpu_to_le32(n->num_namespaces);
>      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> -
> -
> -    if (blk_enable_write_cache(n->conf.blk)) {
> -        id->vwc = 1;
> -    }
> -
> +    id->vwc = 1;
>      id->sgls = cpu_to_le32(0x1);
>  
>      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> @@ -2509,22 +2548,25 @@ static void nvme_init_ctrl(NvmeCtrl *n)
>      n->bar.intmc = n->bar.intms = 0;
>  }
>  
> -static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  {
> -    int64_t bs_size;
> -    NvmeIdNs *id_ns = &ns->id_ns;
> +    uint32_t nsid = nvme_nsid(ns);
>  
> -    bs_size = blk_getlength(n->conf.blk);
> -    if (bs_size < 0) {
> -        error_setg_errno(errp, -bs_size, "blk_getlength");
> +    if (nsid == 0 || nsid > NVME_MAX_NAMESPACES) {
> +        error_setg(errp, "invalid nsid");
>          return 1;
>      }

As I said above, it would be nice to find a valid namespace slot instead
of erroring out when nsid == 0.
Also, the error message could be improved a bit, IMHO.

>  
> -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> -    n->ns_size = bs_size;
> +    if (n->namespaces[nsid - 1]) {
> +        error_setg(errp, "nsid must be unique");
> +        return 1;
> +    }
> +
> +    trace_nvme_dev_register_namespace(nsid);
>  
> -    id_ns->ncap = id_ns->nuse = id_ns->nsze =
> -        cpu_to_le64(nvme_ns_nlbas(n, ns));
> +    n->namespaces[nsid - 1] = ns;

> +    n->num_namespaces = MAX(n->num_namespaces, nsid);
> +    n->id_ctrl.nn = cpu_to_le32(n->num_namespaces);

These should be removed once you set num_namespaces to a fixed number.

>  
>      return 0;
>  }
> @@ -2532,30 +2574,31 @@ static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
>  static void nvme_realize(PCIDevice *pci_dev, Error **errp)
>  {
>      NvmeCtrl *n = NVME(pci_dev);
> +    NvmeNamespace *ns;
>      Error *local_err = NULL;
> -    int i;
>  
>      if (nvme_check_constraints(n, &local_err)) {
>          error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
>          return;
>      }
>  
> +    qbus_create_inplace(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
> +        &pci_dev->qdev, n->parent_obj.qdev.id);
> +
>      nvme_init_state(n);
> -
> -    if (nvme_init_blk(n, &local_err)) {
> -        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
> -        return;
> -    }
> -
> -    for (i = 0; i < n->num_namespaces; i++) {
> -        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
> -            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
> -            return;
> -        }
> -    }
> -
>      nvme_init_pci(n, pci_dev);
>      nvme_init_ctrl(n);
> +
> +    /* setup a namespace if the controller drive property was given */
> +    if (n->namespace.blk) {
> +        ns = &n->namespace;
> +        ns->params.nsid = 1;
> +
> +        if (nvme_ns_setup(n, ns, &local_err)) {
> +            error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> +            return;
> +        }
> +    }
>  }
>  
>  static void nvme_exit(PCIDevice *pci_dev)
> @@ -2576,7 +2619,8 @@ static void nvme_exit(PCIDevice *pci_dev)
>  }
>  
>  static Property nvme_props[] = {
> -    DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
> +    DEFINE_BLOCK_PROPERTIES_BASE(NvmeCtrl, conf), \
> +    DEFINE_PROP_DRIVE("drive", NvmeCtrl, namespace.blk), \
>      DEFINE_NVME_PROPERTIES(NvmeCtrl, params),
>      DEFINE_PROP_END_OF_LIST(),
>  };
> @@ -2608,26 +2652,35 @@ static void nvme_instance_init(Object *obj)
>  {
>      NvmeCtrl *s = NVME(obj);
>  
> -    device_add_bootindex_property(obj, &s->conf.bootindex,
> -                                  "bootindex", "/namespace@1,0",
> -                                  DEVICE(obj), &error_abort);
> +    if (s->namespace.blk) {
> +        device_add_bootindex_property(obj, &s->conf.bootindex,
> +                                      "bootindex", "/namespace@1,0",
> +                                      DEVICE(obj), &error_abort);
> +    }
>  }
>  
>  static const TypeInfo nvme_info = {
>      .name          = TYPE_NVME,
>      .parent        = TYPE_PCI_DEVICE,
>      .instance_size = sizeof(NvmeCtrl),
> -    .class_init    = nvme_class_init,
>      .instance_init = nvme_instance_init,
> +    .class_init    = nvme_class_init,
>      .interfaces = (InterfaceInfo[]) {
>          { INTERFACE_PCIE_DEVICE },
>          { }
>      },
>  };
>  
> +static const TypeInfo nvme_bus_info = {
> +    .name = TYPE_NVME_BUS,
> +    .parent = TYPE_BUS,
> +    .instance_size = sizeof(NvmeBus),
> +};
> +
>  static void nvme_register_types(void)
>  {
>      type_register_static(&nvme_info);
> +    type_register_static(&nvme_bus_info);
>  }
>  
>  type_init(nvme_register_types)
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index 3319f8edd7e1..c3cef0f024da 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -2,6 +2,9 @@
>  #define HW_NVME_H
>  
>  #include "block/nvme.h"
> +#include "nvme-ns.h"
> +
> +#define NVME_MAX_NAMESPACES 256
>  
>  #define DEFINE_NVME_PROPERTIES(_state, _props) \
>      DEFINE_PROP_STRING("serial", _state, _props.serial), \
> @@ -108,26 +111,6 @@ typedef struct NvmeCQueue {
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>  } NvmeCQueue;
>  
> -typedef struct NvmeNamespace {
> -    NvmeIdNs        id_ns;
> -} NvmeNamespace;
> -
> -static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> -{
> -    NvmeIdNs *id_ns = &ns->id_ns;
> -    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> -}
> -
> -static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> -{
> -    return nvme_ns_lbaf(ns).ds;
> -}
> -
> -static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> -{
> -    return 1 << nvme_ns_lbads(ns);
> -}
> -
>  typedef enum NvmeAIOOp {
>      NVME_AIO_OPC_NONE         = 0x0,
>      NVME_AIO_OPC_FLUSH        = 0x1,
> @@ -182,6 +165,13 @@ static inline bool nvme_req_is_write(NvmeRequest *req)
>      }
>  }
>  
> +#define TYPE_NVME_BUS "nvme-bus"
> +#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
> +
> +typedef struct NvmeBus {
> +    BusState parent_bus;
> +} NvmeBus;
> +
>  #define TYPE_NVME "nvme"
>  #define NVME(obj) \
>          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> @@ -191,8 +181,9 @@ typedef struct NvmeCtrl {
>      MemoryRegion iomem;
>      MemoryRegion ctrl_mem;
>      NvmeBar      bar;
> -    BlockConf    conf;
>      NvmeParams   params;
> +    NvmeBus      bus;
> +    BlockConf    conf;
>  
>      bool        qs_created;
>      uint32_t    page_size;
> @@ -203,7 +194,6 @@ typedef struct NvmeCtrl {
>      uint32_t    reg_size;
>      uint32_t    num_namespaces;
>      uint32_t    max_q_ents;
> -    uint64_t    ns_size;
>      uint8_t     outstanding_aers;
>      uint32_t    cmbsz;
>      uint32_t    cmbloc;
> @@ -219,7 +209,8 @@ typedef struct NvmeCtrl {
>      QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
>      int         aer_queued;
>  
> -    NvmeNamespace   *namespaces;
> +    NvmeNamespace   namespace;
> +    NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
>      NvmeSQueue      **sq;
>      NvmeCQueue      **cq;
>      NvmeSQueue      admin_sq;
> @@ -228,9 +219,13 @@ typedef struct NvmeCtrl {
>      NvmeFeatureVal  features;
>  } NvmeCtrl;
>  
> -static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> +static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
>  {
> -    return n->ns_size >> nvme_ns_lbads(ns);
> +    if (!nsid || nsid > n->num_namespaces) {
> +        return NULL;
> +    }
> +
> +    return n->namespaces[nsid - 1];
>  }
>  
>  static inline uint16_t nvme_cid(NvmeRequest *req)
> @@ -253,4 +248,6 @@ static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
>      return req->sq->ctrl;
>  }
>  
> +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
> +
>  #endif /* HW_NVME_H */
> diff --git a/hw/block/trace-events b/hw/block/trace-events
> index 81d69e15fc32..aaf1fcda7923 100644
> --- a/hw/block/trace-events
> +++ b/hw/block/trace-events
> @@ -29,6 +29,7 @@ hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int t
>  
>  # nvme.c
>  # nvme traces for successful events
> +nvme_dev_register_namespace(uint32_t nsid) "nsid %"PRIu32""
>  nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
>  nvme_dev_irq_pin(void) "pulsing IRQ pin"
>  nvme_dev_irq_masked(void) "IRQ is masked"
> @@ -38,7 +39,7 @@ nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"P
>  nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count
> %"PRIu64" opc \"%s\" req %p"
>  nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
>  nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> -nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> +nvme_dev_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" %s nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
>  nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
>  nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
>  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16",
> qflags=%"PRIu16""
> @@ -94,7 +95,8 @@ nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or no
>  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
>  nvme_dev_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
>  nvme_dev_err_invalid_prp(void) "invalid PRP"
> -nvme_dev_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
> +nvme_dev_err_invalid_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
> +nvme_dev_err_inactive_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
>  nvme_dev_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
>  nvme_dev_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
>  nvme_dev_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""


Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 23/26] pci: allocate pci id for nvme
  2020-02-04  9:52     ` [PATCH v5 23/26] pci: allocate pci id for nvme Klaus Jensen
@ 2020-02-12 12:36       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 12:36 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Gerd Hoffmann, Beata Michalska, qemu-devel,
	Max Reitz, Keith Busch, Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> The emulated nvme device (hw/block/nvme.c) is currently using an
> internal Intel device id.
> 
> Prepare to change that by allocating a device id under the 1b36 (Red
> Hat, Inc.) vendor id.

> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  MAINTAINERS            |  1 +
>  docs/specs/nvme.txt    | 10 ++++++++++
>  docs/specs/pci-ids.txt |  1 +
>  include/hw/pci/pci.h   |  1 +
>  4 files changed, 13 insertions(+)
>  create mode 100644 docs/specs/nvme.txt
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 1f0bc72f2189..14a018e9c0ae 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -1645,6 +1645,7 @@ L: qemu-block@nongnu.org
>  S: Supported
>  F: hw/block/nvme*
>  F: tests/qtest/nvme-test.c
> +F: docs/specs/nvme.txt
>  
>  megasas
>  M: Hannes Reinecke <hare@suse.com>
> diff --git a/docs/specs/nvme.txt b/docs/specs/nvme.txt
> new file mode 100644
> index 000000000000..6ec7ddbc7ee0
> --- /dev/null
> +++ b/docs/specs/nvme.txt
> @@ -0,0 +1,10 @@
> +NVM Express Controller
> +======================
> +
> +The nvme device (-device nvme) emulates an NVM Express Controller.
> +
> +
> +Reference Specifications
> +------------------------
> +
> +  https://nvmexpress.org/resources/specifications/

Nitpick: maybe mention the NVMe version here, plus some TODOs that are left?

> diff --git a/docs/specs/pci-ids.txt b/docs/specs/pci-ids.txt
> index 4d53e5c7d9d5..abbdbca6be38 100644
> --- a/docs/specs/pci-ids.txt
> +++ b/docs/specs/pci-ids.txt
> @@ -63,6 +63,7 @@ PCI devices (other than virtio):
>  1b36:000b  PCIe Expander Bridge (-device pxb-pcie)
>  1b36:000d  PCI xhci usb host adapter
>  1b36:000f  mdpy (mdev sample device), linux/samples/vfio-mdev/mdpy.c
> +1b36:0010  PCIe NVMe device (-device nvme)
>  
>  All these devices are documented in docs/specs.
>  
> diff --git a/include/hw/pci/pci.h b/include/hw/pci/pci.h
> index b5013b834b20..9a20c309d0f2 100644
> --- a/include/hw/pci/pci.h
> +++ b/include/hw/pci/pci.h
> @@ -103,6 +103,7 @@ extern bool pci_available;
>  #define PCI_DEVICE_ID_REDHAT_XHCI        0x000d
>  #define PCI_DEVICE_ID_REDHAT_PCIE_BRIDGE 0x000e
>  #define PCI_DEVICE_ID_REDHAT_MDPY        0x000f
> +#define PCI_DEVICE_ID_REDHAT_NVME        0x0010
>  #define PCI_DEVICE_ID_REDHAT_QXL         0x0100
>  
>  #define FMT_PCIBUS                      PRIx64

Other than the actual ID assignment, which is not something
I can approve/allocate:

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 24/26] nvme: change controller pci id
  2020-02-04  9:52     ` [PATCH v5 24/26] nvme: change controller pci id Klaus Jensen
  2020-02-04 16:35       ` Keith Busch
@ 2020-02-12 12:37       ` Maxim Levitsky
  1 sibling, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 12:37 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> There are two reasons for changing this:
> 
>   1. The nvme device currently uses an internal Intel device id.
> 
>   2. Since commits "nvme: fix write zeroes offset and count" and "nvme:
>      support multiple namespaces" the controller device no longer has
>      the quirks that the Linux kernel thinks it has.
> 
>      As the quirks are applied based on pci vendor and device id, change
>      them to get rid of the quirks.
> 
> To keep backward compatibility, add a new 'x-use-intel-id' parameter to
> the nvme device to force use of the Intel vendor and device id. This is
> off by default, but a compat property turns it on for machine types
> 4.2 and older.
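[Editorial aside: the new property could be exercised from the command line
roughly as below. The image path, serial, and machine type are placeholders
for illustration, not taken from the patch.]

```shell
# Force the legacy Intel vendor/device id on a current machine type.
# Without x-use-intel-id=on, machines newer than 4.2 get the Red Hat id.
qemu-system-x86_64 -M q35 \
    -drive file=nvme.img,if=none,id=nvm \
    -device nvme,drive=nvm,serial=deadbeef,x-use-intel-id=on
```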
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c   | 13 +++++++++----
>  hw/block/nvme.h   |  4 +++-
>  hw/core/machine.c |  1 +
>  3 files changed, 13 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index 3a377bc56734..bdef53a590b0 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -2467,8 +2467,15 @@ static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
>  
>      pci_conf[PCI_INTERRUPT_PIN] = 1;
>      pci_config_set_prog_interface(pci_conf, 0x2);
> -    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> -    pci_config_set_device_id(pci_conf, 0x5845);
> +
> +    if (n->params.use_intel_id) {
> +        pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> +        pci_config_set_device_id(pci_conf, 0x5845);
> +    } else {
> +        pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_REDHAT);
> +        pci_config_set_device_id(pci_conf, PCI_DEVICE_ID_REDHAT_NVME);
> +    }
> +
>      pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
>      pcie_endpoint_cap_init(pci_dev, 0x80);
>  
> @@ -2638,8 +2645,6 @@ static void nvme_class_init(ObjectClass *oc, void *data)
>      pc->realize = nvme_realize;
>      pc->exit = nvme_exit;
>      pc->class_id = PCI_CLASS_STORAGE_EXPRESS;
> -    pc->vendor_id = PCI_VENDOR_ID_INTEL;
> -    pc->device_id = 0x5845;
>      pc->revision = 2;
>  
>      set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> index c3cef0f024da..6b584f53ed64 100644
> --- a/hw/block/nvme.h
> +++ b/hw/block/nvme.h
> @@ -12,7 +12,8 @@
>      DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
>      DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
>      DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \
> -    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
> +    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7), \
> +    DEFINE_PROP_BOOL("x-use-intel-id", _state, _props.use_intel_id, false)
>  
>  typedef struct NvmeParams {
>      char     *serial;
> @@ -21,6 +22,7 @@ typedef struct NvmeParams {
>      uint8_t  aerl;
>      uint32_t aer_max_queued;
>      uint8_t  mdts;
> +    bool     use_intel_id;
>  } NvmeParams;
>  
>  typedef struct NvmeAsyncEvent {
> diff --git a/hw/core/machine.c b/hw/core/machine.c
> index 3e288bfceb7f..984412d98c9d 100644
> --- a/hw/core/machine.c
> +++ b/hw/core/machine.c
> @@ -34,6 +34,7 @@ GlobalProperty hw_compat_4_2[] = {
>      { "vhost-blk-device", "seg_max_adjust", "off"},
>      { "usb-host", "suppress-remote-wake", "off" },
>      { "usb-redir", "suppress-remote-wake", "off" },
> +    { "nvme", "x-use-intel-id", "on"},
>  };
>  const size_t hw_compat_4_2_len = G_N_ELEMENTS(hw_compat_4_2);
>  

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




* Re: [PATCH v5 25/26] nvme: remove redundant NvmeCmd pointer parameter
  2020-02-04  9:52     ` [PATCH v5 25/26] nvme: remove redundant NvmeCmd pointer parameter Klaus Jensen
@ 2020-02-12 12:37       ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 12:37 UTC (permalink / raw)
  To: Klaus Jensen, qemu-block
  Cc: Kevin Wolf, Beata Michalska, qemu-devel, Max Reitz, Keith Busch,
	Klaus Jensen, Javier Gonzalez

On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> The command struct is available in the NvmeRequest that we generally
> pass around anyway.
> 
> Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> ---
>  hw/block/nvme.c | 198 ++++++++++++++++++++++++------------------------
>  1 file changed, 98 insertions(+), 100 deletions(-)
> 
> diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> index bdef53a590b0..5fe2e2fe1fa9 100644
> --- a/hw/block/nvme.c
> +++ b/hw/block/nvme.c
> @@ -566,16 +566,18 @@ unmap:
>  }
>  
>  static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> -    NvmeCmd *cmd, DMADirection dir, NvmeRequest *req)
> +    DMADirection dir, NvmeRequest *req)
>  {
>      uint16_t status = NVME_SUCCESS;
>      size_t bytes;
> +    uint64_t prp1, prp2;
>  
> -    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> +    switch (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
>      case PSDT_PRP:
> -        status = nvme_map_prp(n, &req->qsg, &req->iov,
> -            le64_to_cpu(cmd->dptr.prp.prp1), le64_to_cpu(cmd->dptr.prp.prp2),
> -            len, req);
> +        prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
> +        prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
> +
> +        status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
>          if (status) {
>              return status;
>          }
> @@ -589,7 +591,7 @@ static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>              return NVME_INVALID_FIELD;
>          }
>  
> -        status = nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len,
> +        status = nvme_map_sgl(n, &req->qsg, &req->iov, req->cmd.dptr.sgl, len,
>              req);
>          if (status) {
>              return status;
> @@ -632,20 +634,21 @@ static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
>      return status;
>  }
>  
> -static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_map(NvmeCtrl *n, NvmeRequest *req)
>  {
>      uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
>      uint64_t prp1, prp2;
>  
> -    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> +    switch (NVME_CMD_FLAGS_PSDT(req->cmd.flags)) {
>      case PSDT_PRP:
> -        prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
> -        prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
> +        prp1 = le64_to_cpu(req->cmd.dptr.prp.prp1);
> +        prp2 = le64_to_cpu(req->cmd.dptr.prp.prp2);
>  
>          return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
>      case PSDT_SGL_MPTR_CONTIGUOUS:
>      case PSDT_SGL_MPTR_SGL:
> -        return nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len, req);
> +        return nvme_map_sgl(n, &req->qsg, &req->iov, req->cmd.dptr.sgl, len,
> +            req);
>      default:
>          return NVME_INVALID_FIELD;
>      }
> @@ -1024,7 +1027,7 @@ static void nvme_aio_cb(void *opaque, int ret)
>      nvme_aio_destroy(aio);
>  }
>  
> -static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_flush(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeNamespace *ns = req->ns;
>      NvmeAIO *aio = g_new0(NvmeAIO, 1);
> @@ -1040,12 +1043,12 @@ static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_NO_COMPLETE;
>  }
>  
> -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeAIO *aio;
>  
>      NvmeNamespace *ns = req->ns;
> -    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> +    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
>  
>      int64_t offset;
>      size_t count;
> @@ -1081,9 +1084,9 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_NO_COMPLETE;
>  }
>  
> -static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_rw(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> +    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
>      NvmeNamespace *ns = req->ns;
>      int status;
>  
> @@ -1103,7 +1106,7 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return status;
>      }
>  
> -    status = nvme_map(n, cmd, req);
> +    status = nvme_map(n, req);
>      if (status) {
>          block_acct_invalid(blk_get_stats(ns->blk), acct);
>          return status;
> @@ -1115,12 +1118,12 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_NO_COMPLETE;
>  }
>  
> -static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    uint32_t nsid = le32_to_cpu(cmd->nsid);
> +    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
>  
>      trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
> -        cmd->opcode);
> +        req->cmd.opcode);
>  
>      req->ns = nvme_ns(n, nsid);
>  
> @@ -1128,16 +1131,16 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          return nvme_nsid_err(n, nsid);
>      }
>  
> -    switch (cmd->opcode) {
> +    switch (req->cmd.opcode) {
>      case NVME_CMD_FLUSH:
> -        return nvme_flush(n, cmd, req);
> +        return nvme_flush(n, req);
>      case NVME_CMD_WRITE_ZEROS:
> -        return nvme_write_zeros(n, cmd, req);
> +        return nvme_write_zeros(n, req);
>      case NVME_CMD_WRITE:
>      case NVME_CMD_READ:
> -        return nvme_rw(n, cmd, req);
> +        return nvme_rw(n, req);
>      default:
> -        trace_nvme_dev_err_invalid_opc(cmd->opcode);
> +        trace_nvme_dev_err_invalid_opc(req->cmd.opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
>      }
>  }
> @@ -1153,10 +1156,10 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
>      }
>  }
>  
> -static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    NvmeDeleteQ *c = (NvmeDeleteQ *)cmd;
> -    NvmeRequest *req, *next;
> +    NvmeDeleteQ *c = (NvmeDeleteQ *) &req->cmd;
> +    NvmeRequest *next;
>      NvmeSQueue *sq;
>      NvmeCQueue *cq;
>      NvmeAIO *aio;
> @@ -1224,10 +1227,10 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>      n->sq[sqid] = sq;
>  }
>  
> -static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeSQueue *sq;
> -    NvmeCreateSq *c = (NvmeCreateSq *)cmd;
> +    NvmeCreateSq *c = (NvmeCreateSq *) &req->cmd;
>  
>      uint16_t cqid = le16_to_cpu(c->cqid);
>      uint16_t sqid = le16_to_cpu(c->sqid);
> @@ -1262,10 +1265,10 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> -    uint32_t buf_len, uint64_t off, NvmeRequest *req)
> +static uint16_t nvme_smart_info(NvmeCtrl *n, uint8_t rae, uint32_t buf_len,
> +    uint64_t off, NvmeRequest *req)
>  {
> -    uint32_t nsid = le32_to_cpu(cmd->nsid);
> +    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
>  
>      uint32_t trans_len;
>      time_t current_ms;
> @@ -1320,12 +1323,12 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
>          nvme_clear_events(n, NVME_AER_TYPE_SMART);
>      }
>  
> -    return nvme_dma(n, (uint8_t *) &smart + off, trans_len, cmd,
> +    return nvme_dma(n, (uint8_t *) &smart + off, trans_len,
>          DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> -    uint64_t off, NvmeRequest *req)
> +static uint16_t nvme_fw_log_info(NvmeCtrl *n, uint32_t buf_len, uint64_t off,
> +    NvmeRequest *req)
>  {
>      uint32_t trans_len;
>      NvmeFwSlotInfoLog fw_log;
> @@ -1338,16 +1341,16 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
>  
>      trans_len = MIN(sizeof(fw_log) - off, buf_len);
>  
> -    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len, cmd,
> +    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len,
>          DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> -    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> -    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> -    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> +    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
> +    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
> +    uint32_t dw12 = le32_to_cpu(req->cmd.cdw12);
> +    uint32_t dw13 = le32_to_cpu(req->cmd.cdw13);
>      uint8_t  lid = dw10 & 0xff;
>      uint8_t  lsp = (dw10 >> 8) & 0xf;
>      uint8_t  rae = (dw10 >> 15) & 0x1;
> @@ -1387,9 +1390,9 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>  
>          return NVME_SUCCESS;
>      case NVME_LOG_SMART_INFO:
> -        return nvme_smart_info(n, cmd, rae, len, off, req);
> +        return nvme_smart_info(n, rae, len, off, req);
>      case NVME_LOG_FW_SLOT_INFO:
> -        return nvme_fw_log_info(n, cmd, len, off, req);
> +        return nvme_fw_log_info(n, len, off, req);
>      default:
>          trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
>          return NVME_INVALID_FIELD | NVME_DNR;
> @@ -1407,9 +1410,9 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>      }
>  }
>  
> -static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    NvmeDeleteQ *c = (NvmeDeleteQ *)cmd;
> +    NvmeDeleteQ *c = (NvmeDeleteQ *) &req->cmd;
>      NvmeCQueue *cq;
>      uint16_t qid = le16_to_cpu(c->qid);
>  
> @@ -1447,10 +1450,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>      cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
>  }
>  
> -static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> +static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeCQueue *cq;
> -    NvmeCreateCq *c = (NvmeCreateCq *)cmd;
> +    NvmeCreateCq *c = (NvmeCreateCq *) &req->cmd;
>      uint16_t cqid = le16_to_cpu(c->cqid);
>      uint16_t vector = le16_to_cpu(c->irq_vector);
>      uint16_t qsize = le16_to_cpu(c->qsize);
> @@ -1489,18 +1492,18 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeRequest *req)
>  {
>      trace_nvme_dev_identify_ctrl();
>  
> -    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl), cmd,
> +    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl),
>          DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeIdNs *id_ns, inactive = { 0 };
> -    uint32_t nsid = le32_to_cpu(cmd->nsid);
> +    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
>      NvmeNamespace *ns = nvme_ns(n, nsid);
>  
>      trace_nvme_dev_identify_ns(nsid);
> @@ -1517,15 +1520,14 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          id_ns = &ns->id_ns;
>      }
>  
> -    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs), cmd,
> +    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs),
>          DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
> -    NvmeRequest *req)
> +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeRequest *req)
>  {
>      static const int data_len = 4 * KiB;
> -    uint32_t min_nsid = le32_to_cpu(cmd->nsid);
> +    uint32_t min_nsid = le32_to_cpu(req->cmd.nsid);
>      uint32_t *list;
>      uint16_t ret;
>      int i, j = 0;
> @@ -1542,14 +1544,13 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
>              break;
>          }
>      }
> -    ret = nvme_dma(n, (uint8_t *) list, data_len, cmd,
> +    ret = nvme_dma(n, (uint8_t *) list, data_len,
>          DMA_DIRECTION_FROM_DEVICE, req);
>      g_free(list);
>      return ret;
>  }
>  
> -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
> -    NvmeRequest *req)
> +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeRequest *req)
>  {
>      static const int len = 4096;
>  
> @@ -1560,7 +1561,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
>          uint8_t nid[16];
>      };
>  
> -    uint32_t nsid = le32_to_cpu(cmd->nsid);
> +    uint32_t nsid = le32_to_cpu(req->cmd.nsid);
>  
>      struct ns_descr *list;
>      uint16_t ret;
> @@ -1582,34 +1583,33 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
>      list->nidl = 0x10;
>      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
>  
> -    ret = nvme_dma(n, (uint8_t *) list, len, cmd, DMA_DIRECTION_FROM_DEVICE,
> -        req);
> +    ret = nvme_dma(n, (uint8_t *) list, len, DMA_DIRECTION_FROM_DEVICE, req);
>      g_free(list);
>      return ret;
>  }
>  
> -static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_identify(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    NvmeIdentify *c = (NvmeIdentify *)cmd;
> +    NvmeIdentify *c = (NvmeIdentify *) &req->cmd;
>  
>      switch (le32_to_cpu(c->cns)) {
>      case 0x00:
> -        return nvme_identify_ns(n, cmd, req);
> +        return nvme_identify_ns(n, req);
>      case 0x01:
> -        return nvme_identify_ctrl(n, cmd, req);
> +        return nvme_identify_ctrl(n, req);
>      case 0x02:
> -        return nvme_identify_ns_list(n, cmd, req);
> +        return nvme_identify_ns_list(n, req);
>      case 0x03:
> -        return nvme_identify_ns_descr_list(n, cmd, req);
> +        return nvme_identify_ns_descr_list(n, req);
>      default:
>          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
>          return NVME_INVALID_FIELD | NVME_DNR;
>      }
>  }
>  
> -static uint16_t nvme_abort(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_abort(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    uint16_t sqid = le32_to_cpu(cmd->cdw10) & 0xffff;
> +    uint16_t sqid = le32_to_cpu(req->cmd.cdw10) & 0xffff;
>  
>      req->cqe.result = 1;
>      if (nvme_check_sqid(n, sqid)) {
> @@ -1659,19 +1659,18 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
>      return cpu_to_le64(ts.all);
>  }
>  
> -static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> -    NvmeRequest *req)
> +static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
>  {
>      uint64_t timestamp = nvme_get_timestamp(n);
>  
> -    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp), cmd,
> +    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp),
>          DMA_DIRECTION_FROM_DEVICE, req);
>  }
>  
> -static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> -    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> +    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
> +    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
>      uint32_t result;
>  
>      trace_nvme_dev_getfeat(nvme_cid(req), dw10);
> @@ -1717,7 +1716,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>          trace_nvme_dev_getfeat_numq(result);
>          break;
>      case NVME_TIMESTAMP:
> -        return nvme_get_feature_timestamp(n, cmd, req);
> +        return nvme_get_feature_timestamp(n, req);
>      case NVME_INTERRUPT_COALESCING:
>          result = cpu_to_le32(n->features.int_coalescing);
>          break;
> @@ -1743,13 +1742,12 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> -    NvmeRequest *req)
> +static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeRequest *req)
>  {
>      uint16_t ret;
>      uint64_t timestamp;
>  
> -    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp), cmd,
> +    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp),
>          DMA_DIRECTION_TO_DEVICE, req);
>      if (ret != NVME_SUCCESS) {
>          return ret;
> @@ -1760,12 +1758,12 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeRequest *req)
>  {
>      NvmeNamespace *ns;
>  
> -    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> -    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> +    uint32_t dw10 = le32_to_cpu(req->cmd.cdw10);
> +    uint32_t dw11 = le32_to_cpu(req->cmd.cdw11);
>  
>      trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
>  
> @@ -1824,7 +1822,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>              ((n->params.num_queues - 2) << 16));
>          break;
>      case NVME_TIMESTAMP:
> -        return nvme_set_feature_timestamp(n, cmd, req);
> +        return nvme_set_feature_timestamp(n, req);
>      case NVME_ASYNCHRONOUS_EVENT_CONF:
>          n->features.async_config = dw11;
>          break;
> @@ -1843,7 +1841,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_SUCCESS;
>  }
>  
> -static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_aer(NvmeCtrl *n, NvmeRequest *req)
>  {
>      trace_nvme_dev_aer(nvme_cid(req));
>  
> @@ -1862,31 +1860,31 @@ static uint16_t nvme_aer(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
>      return NVME_NO_COMPLETE;
>  }
>  
> -static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> +static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>  {
> -    switch (cmd->opcode) {
> +    switch (req->cmd.opcode) {
>      case NVME_ADM_CMD_DELETE_SQ:
> -        return nvme_del_sq(n, cmd);
> +        return nvme_del_sq(n, req);
>      case NVME_ADM_CMD_CREATE_SQ:
> -        return nvme_create_sq(n, cmd);
> +        return nvme_create_sq(n, req);
>      case NVME_ADM_CMD_GET_LOG_PAGE:
> -        return nvme_get_log(n, cmd, req);
> +        return nvme_get_log(n, req);
>      case NVME_ADM_CMD_DELETE_CQ:
> -        return nvme_del_cq(n, cmd);
> +        return nvme_del_cq(n, req);
>      case NVME_ADM_CMD_CREATE_CQ:
> -        return nvme_create_cq(n, cmd);
> +        return nvme_create_cq(n, req);
>      case NVME_ADM_CMD_IDENTIFY:
> -        return nvme_identify(n, cmd, req);
> +        return nvme_identify(n, req);
>      case NVME_ADM_CMD_ABORT:
> -        return nvme_abort(n, cmd, req);
> +        return nvme_abort(n, req);
>      case NVME_ADM_CMD_SET_FEATURES:
> -        return nvme_set_feature(n, cmd, req);
> +        return nvme_set_feature(n, req);
>      case NVME_ADM_CMD_GET_FEATURES:
> -        return nvme_get_feature(n, cmd, req);
> +        return nvme_get_feature(n, req);
>      case NVME_ADM_CMD_ASYNC_EV_REQ:
> -        return nvme_aer(n, cmd, req);
> +        return nvme_aer(n, req);
>      default:
> -        trace_nvme_dev_err_invalid_admin_opc(cmd->opcode);
> +        trace_nvme_dev_err_invalid_admin_opc(req->cmd.opcode);
>          return NVME_INVALID_OPCODE | NVME_DNR;
>      }
>  }
> @@ -1919,8 +1917,8 @@ static void nvme_process_sq(void *opaque)
>          req->cqe.cid = cmd.cid;
>          memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
>  
> -        status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
> -            nvme_admin_cmd(n, &cmd, req);
> +        status = sq->sqid ? nvme_io_cmd(n, req) :
> +            nvme_admin_cmd(n, req);
>          if (status != NVME_NO_COMPLETE) {
>              req->status = status;
>              nvme_enqueue_req_completion(cq, req);

Other than the line wrapping issues,

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky





* Re: [PATCH v5 26/26] nvme: make lba data size configurable
  2020-02-06  7:24         ` Klaus Birkelund Jensen
@ 2020-02-12 12:39           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 12:39 UTC (permalink / raw)
  To: Klaus Birkelund Jensen, Keith Busch
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Klaus Jensen, Javier Gonzalez

On Thu, 2020-02-06 at 08:24 +0100, Klaus Birkelund Jensen wrote:
> On Feb  5 01:43, Keith Busch wrote:
> > On Tue, Feb 04, 2020 at 10:52:08AM +0100, Klaus Jensen wrote:
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > > ---
> > >  hw/block/nvme-ns.c | 2 +-
> > >  hw/block/nvme-ns.h | 4 +++-
> > >  hw/block/nvme.c    | 1 +
> > >  3 files changed, 5 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > > index 0e5be44486f4..981d7101b8f2 100644
> > > --- a/hw/block/nvme-ns.c
> > > +++ b/hw/block/nvme-ns.c
> > > @@ -18,7 +18,7 @@ static int nvme_ns_init(NvmeNamespace *ns)
> > >  {
> > >      NvmeIdNs *id_ns = &ns->id_ns;
> > >  
> > > -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > > +    id_ns->lbaf[0].ds = ns->params.lbads;
> > >      id_ns->nuse = id_ns->ncap = id_ns->nsze =
> > >          cpu_to_le64(nvme_ns_nlbas(ns));
> > >  
> > > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > > index b564bac25f6d..f1fe4db78b41 100644
> > > --- a/hw/block/nvme-ns.h
> > > +++ b/hw/block/nvme-ns.h
> > > @@ -7,10 +7,12 @@
> > >  
> > >  #define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> > >      DEFINE_PROP_DRIVE("drive", _state, blk), \
> > > -    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > > +    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0), \
> > > +    DEFINE_PROP_UINT8("lbads", _state, _props.lbads, BDRV_SECTOR_BITS)
> > 
> > I think we need to validate the parameter is between 9 and 12 before
> > trusting it can be used safely.
> > 
> > Alternatively, add supported formats to the lbaf array and let the host
> > decide on a live system with the 'format' command.
> 
> The device does not yet support Format NVM, but we have a patch ready
> for that to be submitted with a new series when this is merged.
> 
> For now, while it does not support Format, I will change this patch so
> that it defaults to 9 (BDRV_SECTOR_BITS) and only accepts 12 as an
> alternative (while keeping the number of available formats at 1).
Looks like a good idea.

Best regards,
	Maxim Levitsky




* Re: [PATCH v5 01/26] nvme: rename trace events to nvme_dev
  2020-02-12  9:08       ` Maxim Levitsky
@ 2020-02-12 13:08         ` Klaus Birkelund Jensen
  2020-02-12 13:17           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-02-12 13:08 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Keith Busch, Klaus Jensen, Javier Gonzalez

On Feb 12 11:08, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > to not clash with trace events from the nvme block driver.
> > 

Hi Maxim,

Thank you very much for your thorough reviews! Utterly appreciated!

I'll start going through your suggested changes. There is a bit of work
to do on splitting patches into refactoring and bugfixes, but I can
definitely see the reason for this, so I'll get to work.

You mention the alignment of split lines a lot. I actually thought I
was following CODING_STYLE.rst (which allows a single 4-space indent
for functions, but not for statements such as if/else and while/for).
But since hw/block/nvme.c was originally written in the style of
aligning with the opening parenthesis, I'm in the wrong here, so I
will of course amend it. I should have done that from the beginning;
it's just my personal taste shining through ;)


Thanks again,
Klaus




* Re: [PATCH v5 01/26] nvme: rename trace events to nvme_dev
  2020-02-12 13:08         ` Klaus Birkelund Jensen
@ 2020-02-12 13:17           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-02-12 13:17 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, qemu-devel, Max Reitz,
	Keith Busch, Klaus Jensen, Javier Gonzalez

On Wed, 2020-02-12 at 14:08 +0100, Klaus Birkelund Jensen wrote:
> On Feb 12 11:08, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > Change the prefix of all nvme device related trace events to 'nvme_dev'
> > > to not clash with trace events from the nvme block driver.
> > > 
> 
> Hi Maxim,
> 
> Thank you very much for your thorough reviews! Utterly appreciated!

Thanks to you for the patch series!

> 
> I'll start going through your suggested changes. There is a bit of work
> to do on splitting patches into refactoring and bugfixes, but I can
> definitely see the reason for this, so I'll get to work.
> 
> You mention the alignment with split lines alot. I actually thought I
> was following CODING_STYLE.rst (which allows a single 4 space indent for
> functions, but not statements such as if/else and while/for). But since
> hw/block/nvme.c is originally written in the style of aligning with the
> opening paranthesis I'm in the wrong here, so I will of course amend
> it. Should have done that from the beginning, it's just my personal
> taste shining through ;)

To be honest this is my personal taste as well, but after *many*
review complaints about this I consider aligning on the opening
parenthesis to be kind of an official style.

If others are OK with this though, I am personally 100% fine with leaving the
split lines as is.


Best regards,
	Maxim Levitsky




* Re: [PATCH v5 08/26] nvme: refactor device realization
  2020-02-12  9:27       ` Maxim Levitsky
@ 2020-03-16  7:43         ` Klaus Birkelund Jensen
  2020-03-25 10:21           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:43 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 11:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This patch splits up nvme_realize into multiple individual functions,
> > each initializing a different subset of the device.
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > ---
> >  hw/block/nvme.c | 175 +++++++++++++++++++++++++++++++-----------------
> >  hw/block/nvme.h |  21 ++++++
> >  2 files changed, 133 insertions(+), 63 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index e1810260d40b..81514eaef63a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -44,6 +44,7 @@
> >  #include "nvme.h"
> >  
> >  #define NVME_SPEC_VER 0x00010201
> > +#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >      do { \
> > @@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
> >      },
> >  };
> >  
> > -static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > +static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >  {
> > -    NvmeCtrl *n = NVME(pci_dev);
> > -    NvmeIdCtrl *id = &n->id_ctrl;
> > -
> > -    int i;
> > -    int64_t bs_size;
> > -    uint8_t *pci_conf;
> > -
> > -    if (!n->params.num_queues) {
> > -        error_setg(errp, "num_queues can't be zero");
> > -        return;
> > -    }
> > +    NvmeParams *params = &n->params;
> >  
> >      if (!n->conf.blk) {
> > -        error_setg(errp, "drive property not set");
> > -        return;
> > +        error_setg(errp, "nvme: block backend not configured");
> > +        return 1;
> As a matter of taste, negative values indicate error, and 0 is the success value.
> In Linux kernel this is even an official rule.
> >      }

Fixed.

> >  
> > -    bs_size = blk_getlength(n->conf.blk);
> > -    if (bs_size < 0) {
> > -        error_setg(errp, "could not get backing file size");
> > -        return;
> > +    if (!params->serial) {
> > +        error_setg(errp, "nvme: serial not configured");
> > +        return 1;
> >      }
> >  
> > -    if (!n->params.serial) {
> > -        error_setg(errp, "serial property not set");
> > -        return;
> > +    if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
> > +        error_setg(errp, "nvme: invalid queue configuration");
> Maybe something like "nvme: invalid queue count specified, should be between 1 and ..."?
> > +        return 1;
> >      }

Fixed.

> > +
> > +    return 0;
> > +}
> > +
> > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > +{
> >      blkconf_blocksizes(&n->conf);
> >      if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
> > -                                       false, errp)) {
> > -        return;
> > +        false, errp)) {
> > +        return 1;
> >      }
> >  
> > -    pci_conf = pci_dev->config;
> > -    pci_conf[PCI_INTERRUPT_PIN] = 1;
> > -    pci_config_set_prog_interface(pci_dev->config, 0x2);
> > -    pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> > -    pcie_endpoint_cap_init(pci_dev, 0x80);
> > +    return 0;
> > +}
> >  
> > +static void nvme_init_state(NvmeCtrl *n)
> > +{
> >      n->num_namespaces = 1;
> >      n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> 
> Isn't that wrong?
> First 4K of mmio (0x1000) is the registers, and that is followed by the doorbells,
> and each doorbell takes 8 bytes (assuming regular doorbell stride).
> so n->params.num_queues + 1 should be total number of queues, thus the 0x1004 should be 0x1000 IMHO.
> I might miss some rounding magic here though.
> 

Yeah, I think you are right. It all becomes slightly more fishy because
the num_queues device parameter is 1's based and accounts for the
admin queue pair.

But in get/set features, the value has to be 0's based and only account
for the I/O queues, so we need to subtract 2 from the value. It's
confusing all around.

Since the admin queue pair isn't really optional, I think it would be
better to introduce a new max_ioqpairs parameter that is 1's based,
counts the number of queue pairs and obviously only accounts for the I/O
queues.

I guess we need to keep the num_queues parameter around for
compatibility.

The doorbells are only 4 bytes btw, but the calculation still looks
wrong. With a max_ioqpairs parameter in place, the reg_size should be

    pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4)

Right? That's 0x1000 for the core registers, 8 bytes for the sq/cq
doorbells for the admin queue pair, and then room for the I/O queue
pairs.

I added a patch for this in v6.
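
Just to spell out the intended layout, here is a self-contained sketch
(the max_ioqpairs name is the new parameter proposed above, and
pow2ceil is reimplemented here so the snippet compiles on its own; this
is not the actual patch):

```c
#include <assert.h>
#include <stdint.h>

/* Same semantics as QEMU's pow2ceil(): round up to the next power of two. */
static uint64_t pow2ceil(uint64_t value)
{
    uint64_t p = 1;

    while (p < value) {
        p <<= 1;
    }

    return p;
}

/*
 * BAR0 layout: 0x1000 bytes of core registers, then a 4-byte SQ doorbell
 * and a 4-byte CQ doorbell for the admin queue pair (8 bytes), then an
 * SQ/CQ doorbell pair per I/O queue pair (regular doorbell stride).
 */
static uint64_t nvme_reg_size(unsigned max_ioqpairs)
{
    return pow2ceil(0x1008 + 2 * max_ioqpairs * 4);
}
```

With this, e.g. 64 I/O queue pairs still fit in an 8 KiB region.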

> > -    n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> > -
> >      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +}
> >  
> > -    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
> > -                          "nvme", n->reg_size);
> > +static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > +{
> > +    NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
> It would be nice to have #define for CMB bar number

Added.

> > +    NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> > +
> > +    NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> > +    NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> > +    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> > +    NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> > +    NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> > +    NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
> > +    NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
> > +
> > +    n->cmbloc = n->bar.cmbloc;
> > +    n->cmbsz = n->bar.cmbsz;
> > +
> > +    n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > +    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
> > +                            "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > +    pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
> Same here although since you read it here from the controller register,
> then maybe leave it as is. I prefer though for this kind of thing
> to have a #define and use it everywhere. 
> 

Done.

> > +        PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > +        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
> > +}
> > +
> > +static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
> > +{
> > +    uint8_t *pci_conf = pci_dev->config;
> > +
> > +    pci_conf[PCI_INTERRUPT_PIN] = 1;
> > +    pci_config_set_prog_interface(pci_conf, 0x2);
> Nitpick: How about adding some #define for that as well?
> (I know that this code is copied as is but still)

Yeah. A PCI_PI_NVME or something would be nice. But this should probably
go in some PCI-related header file? Any idea where that would fit?

> > +    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > +    pci_config_set_device_id(pci_conf, 0x5845);
> > +    pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> > +    pcie_endpoint_cap_init(pci_dev, 0x80);
> > +
> > +    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
> > +        n->reg_size);
> 
> Code on split lines should start at column right after the '('
> Now its my turn to notice this - our checkpatch.pl doesn't check this,
> and I can't explain how often I am getting burnt on this myself.
> 
> There are *lot* of these issues, I pointed out some of them but you should
> check all the patches for this.
> 

I fixed all that :)

> 
> >      pci_register_bar(pci_dev, 0,
> >          PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
> >          &n->iomem);
> Split line alignment issue here as well.
> >      msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
> >  
> > +    if (n->params.cmb_size_mb) {
> > +        nvme_init_cmb(n, pci_dev);
> > +    }
> > +}
> > +
> > +static void nvme_init_ctrl(NvmeCtrl *n)
> > +{
> > +    NvmeIdCtrl *id = &n->id_ctrl;
> > +    NvmeParams *params = &n->params;
> > +    uint8_t *pci_conf = n->parent_obj.config;
> > +
> >      id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
> >      id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
> >      strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
> >      strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
> > -    strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
> > +    strpadcpy((char *)id->sn, sizeof(id->sn), params->serial, ' ');
> >      id->rab = 6;
> >      id->ieee[0] = 0x00;
> >      id->ieee[1] = 0x02;
> > @@ -1431,46 +1471,55 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >  
> >      n->bar.vs = NVME_SPEC_VER;
> >      n->bar.intmc = n->bar.intms = 0;
> > +}
> >  
> > -    if (n->params.cmb_size_mb) {
> > +static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > +{
> > +    int64_t bs_size;
> > +    NvmeIdNs *id_ns = &ns->id_ns;
> >  
> > -        NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
> > -        NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> > +    bs_size = blk_getlength(n->conf.blk);
> > +    if (bs_size < 0) {
> > +        error_setg_errno(errp, -bs_size, "blk_getlength");
> > +        return 1;
> > +    }
> >  
> > -        NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> > -        NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> > -        NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> > -        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> > -        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> > -        NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
> > -        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
> > +    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +    n->ns_size = bs_size;
> >  
> > -        n->cmbloc = n->bar.cmbloc;
> > -        n->cmbsz = n->bar.cmbsz;
> > +    id_ns->ncap = id_ns->nuse = id_ns->nsze =
> > +        cpu_to_le64(nvme_ns_nlbas(n, ns));
> I myself don't know how to align these splits to be honest.
> I would just split this into multiple statements.
> >  
> > -        n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > -        memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
> > -                              "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > -        pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
> > -            PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > -            PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
> > +    return 0;
> > +}
> >  
> > +static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > +{
> > +    NvmeCtrl *n = NVME(pci_dev);
> > +    Error *local_err = NULL;
> > +    int i;
> > +
> > +    if (nvme_check_constraints(n, &local_err)) {
> > +        error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
> Do we need that hint for the end user?

Removed.

> > +        return;
> > +    }
> > +
> > +    nvme_init_state(n);
> > +
> > +    if (nvme_init_blk(n, &local_err)) {
> > +        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
> Same here

Done.


> > +        return;
> >      }
> >  
> >      for (i = 0; i < n->num_namespaces; i++) {
> > -        NvmeNamespace *ns = &n->namespaces[i];
> > -        NvmeIdNs *id_ns = &ns->id_ns;
> > -        id_ns->nsfeat = 0;
> > -        id_ns->nlbaf = 0;
> > -        id_ns->flbas = 0;
> > -        id_ns->mc = 0;
> > -        id_ns->dpc = 0;
> > -        id_ns->dps = 0;
> > -        id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > -        id_ns->ncap  = id_ns->nuse = id_ns->nsze =
> > -            cpu_to_le64(n->ns_size >>
> > -                id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
> > +        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
> > +            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
> And here

Done.


> > +            return;
> > +        }
> >      }
> > +
> > +    nvme_init_pci(n, pci_dev);
> > +    nvme_init_ctrl(n);
> >  }
> >  
> >  static void nvme_exit(PCIDevice *pci_dev)
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 9957c4a200e2..a867bdfabafd 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -65,6 +65,22 @@ typedef struct NvmeNamespace {
> >      NvmeIdNs        id_ns;
> >  } NvmeNamespace;
> >  
> > +static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> > +{
> > It's not common to return a structure in C; usually a pointer is returned
> > to avoid copying. In this case it doesn't matter that much though.

It's actually going to be used a lot, so I swapped it to a pointer.
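
I.e. roughly this (with stub types standing in for the real QEMU structs
from hw/block/nvme.h and include/block/nvme.h):

```c
#include <assert.h>
#include <stdint.h>

/* Stub types, just enough to show the shape of the helper; the real
 * definitions live in hw/block/nvme.h and include/block/nvme.h. */
typedef struct { uint8_t ds; } NvmeLBAF;
typedef struct { uint8_t flbas; NvmeLBAF lbaf[16]; } NvmeIdNs;
typedef struct { NvmeIdNs id_ns; } NvmeNamespace;

/* FLBAS bits 3:0 select the active LBA format. */
#define NVME_ID_NS_FLBAS_INDEX(flbas) ((flbas) & 0xf)

/* Pointer-returning variant: no struct copy on every call. */
static inline NvmeLBAF *nvme_ns_lbaf(NvmeNamespace *ns)
{
    NvmeIdNs *id_ns = &ns->id_ns;
    return &id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
}
```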

> > +    NvmeIdNs *id_ns = &ns->id_ns;
> > +    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> > +}
> > +
> > +static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> > +{
> > +    return nvme_ns_lbaf(ns).ds;
> > +}
> > +
> > +static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > +{
> > +    return 1 << nvme_ns_lbads(ns);
> > +}
> > +
> >  #define TYPE_NVME "nvme"
> >  #define NVME(obj) \
> >          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > @@ -101,4 +117,9 @@ typedef struct NvmeCtrl {
> >      NvmeIdCtrl      id_ctrl;
> >  } NvmeCtrl;
> >  
> > +static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > +{
> > +    return n->ns_size >> nvme_ns_lbads(ns);
> > +}
> Unless you need all these functions in the future, this feels like
> it is a bit verbose.
> 

These will be used in various places later.
 



* Re: [PATCH v5 09/26] nvme: add temperature threshold feature
  2020-02-12  9:31       ` Maxim Levitsky
@ 2020-03-16  7:44         ` Klaus Birkelund Jensen
  2020-03-25 10:21           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:44 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 11:31, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > It might seem weird to implement this feature for an emulated device,
> > but it is mandatory to support and the feature is useful for testing
> > asynchronous event request support, which will be added in a later
> > patch.
> 
> Absolutely, but as the old saying goes, rules are rules.
> At least, to the defense of the spec, making this mandatory
> forced the vendors to actually report some statistics about
> the device in neutral format as opposed to yet another
> vendor proprietary thing (I am talking about SMART log page).
> 
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> 
> I noticed that you sign off some patches with your @samsung.com email,
> and some with @cnexlabs.com
> Is there a reason for that?

Yeah. Some of this code was made while I was at CNEX Labs. I've since
moved to Samsung. But credit where credit's due.

> 
> 
> > ---
> >  hw/block/nvme.c      | 50 ++++++++++++++++++++++++++++++++++++++++++++
> >  hw/block/nvme.h      |  2 ++
> >  include/block/nvme.h |  7 ++++++-
> >  3 files changed, 58 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 81514eaef63a..f72348344832 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -45,6 +45,9 @@
> >  
> >  #define NVME_SPEC_VER 0x00010201
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > +#define NVME_TEMPERATURE 0x143
> > +#define NVME_TEMPERATURE_WARNING 0x157
> > +#define NVME_TEMPERATURE_CRITICAL 0x175
> >  
> >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> >      do { \
> > @@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >      uint32_t result;
> >  
> >      switch (dw10) {
> > +    case NVME_TEMPERATURE_THRESHOLD:
> > +        result = 0;
> > +
> > +        /*
> > +         * The controller only implements the Composite Temperature sensor, so
> > +         * return 0 for all other sensors.
> > +         */
> > +        if (NVME_TEMP_TMPSEL(dw11)) {
> > +            break;
> > +        }
> > +
> > +        switch (NVME_TEMP_THSEL(dw11)) {
> > +        case 0x0:
> > +            result = cpu_to_le16(n->features.temp_thresh_hi);
> > +            break;
> > +        case 0x1:
> > +            result = cpu_to_le16(n->features.temp_thresh_low);
> > +            break;
> > +        }
> > +
> > +        break;
> >      case NVME_VOLATILE_WRITE_CACHE:
> >          result = blk_enable_write_cache(n->conf.blk);
> >          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> > @@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> >      switch (dw10) {
> > +    case NVME_TEMPERATURE_THRESHOLD:
> > +        if (NVME_TEMP_TMPSEL(dw11)) {
> > +            break;
> > +        }
> > +
> > +        switch (NVME_TEMP_THSEL(dw11)) {
> > +        case 0x0:
> > +            n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
> > +            break;
> > +        case 0x1:
> > +            n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
> > +            break;
> > +        default:
> > +            return NVME_INVALID_FIELD | NVME_DNR;
> > +        }
> > +
> > +        break;
> >      case NVME_VOLATILE_WRITE_CACHE:
> >          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >          break;
> > @@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
> >      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > +
> > +    n->temperature = NVME_TEMPERATURE;
> 
> This appears not to be used in the patch.
> I think you should move that to the next patch that
> adds the get log page support.
> 

Fixed.

> > +    n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >  }
> >  
> >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > @@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >      id->acl = 3;
> >      id->frmw = 7 << 1;
> >      id->lpa = 1 << 0;
> > +
> > +    /* recommended default value (~70 C) */
> > +    id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > +    id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
> > +
> >      id->sqes = (0x6 << 4) | 0x6;
> >      id->cqes = (0x4 << 4) | 0x4;
> >      id->nn = cpu_to_le32(n->num_namespaces);
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index a867bdfabafd..1518f32557a3 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> >      uint64_t    irq_status;
> >      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> >      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
> > +    uint16_t    temperature;
> >  
> >      NvmeNamespace   *namespaces;
> >      NvmeSQueue      **sq;
> > @@ -115,6 +116,7 @@ typedef struct NvmeCtrl {
> >      NvmeSQueue      admin_sq;
> >      NvmeCQueue      admin_cq;
> >      NvmeIdCtrl      id_ctrl;
> > +    NvmeFeatureVal  features;
> >  } NvmeCtrl;
> >  
> >  static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index d2f65e8fe496..ff31cb32117c 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -616,7 +616,8 @@ enum NvmeIdCtrlOncs {
> >  typedef struct NvmeFeatureVal {
> >      uint32_t    arbitration;
> >      uint32_t    power_mgmt;
> > -    uint32_t    temp_thresh;
> > +    uint16_t    temp_thresh_hi;
> > +    uint16_t    temp_thresh_low;
> >      uint32_t    err_rec;
> >      uint32_t    volatile_wc;
> >      uint32_t    num_queues;
> > @@ -635,6 +636,10 @@ typedef struct NvmeFeatureVal {
> >  #define NVME_INTC_THR(intc)     (intc & 0xff)
> >  #define NVME_INTC_TIME(intc)    ((intc >> 8) & 0xff)
> >  
> > +#define NVME_TEMP_THSEL(temp)  ((temp >> 20) & 0x3)
> > +#define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
> > +#define NVME_TEMP_TMPTH(temp)  (temp & 0xffff)
> > +
> >  enum NvmeFeatureIds {
> >      NVME_ARBITRATION                = 0x1,
> >      NVME_POWER_MANAGEMENT           = 0x2,
> 
> 
> Best regards,
> 	Maxim Levitsky
> 



* Re: [PATCH v5 10/26] nvme: add support for the get log page command
  2020-02-12  9:35       ` Maxim Levitsky
@ 2020-03-16  7:45         ` Klaus Birkelund Jensen
  2020-03-25 10:22           ` Maxim Levitsky
  2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 2 replies; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:45 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 11:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for the Get Log Page command and basic implementations of
> > the mandatory Error Information, SMART / Health Information and Firmware
> > Slot Information log pages.
> > 
> > In violation of the specification, the SMART / Health Information log
> > page does not persist information over the lifetime of the controller
> > because the device has no place to store such persistent state.
> Yea, not the end of the world.
> > 
> > Note that the LPA field in the Identify Controller data structure
> > intentionally has bit 0 cleared because there is no namespace specific
> > information in the SMART / Health information log page.
> Makes sense.
> > 
> > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > Section 5.10 ("Get Log Page command").
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > ---
> >  hw/block/nvme.c       | 122 +++++++++++++++++++++++++++++++++++++++++-
> >  hw/block/nvme.h       |  10 ++++
> >  hw/block/trace-events |   2 +
> >  include/block/nvme.h  |   2 +-
> >  4 files changed, 134 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f72348344832..468c36918042 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> >      return NVME_SUCCESS;
> >  }
> >  
> > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > +    uint64_t off, NvmeRequest *req)
> > +{
> > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> > +
> > +    uint32_t trans_len;
> > +    time_t current_ms;
> > +    uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > +        write_commands = 0;
> > +    NvmeSmartLog smart;
> > +    BlockAcctStats *s;
> > +
> > +    if (nsid && nsid != 0xffffffff) {
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    s = blk_get_stats(n->conf.blk);
> > +
> > +    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > +    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > +    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > +    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > +
> > +    if (off > sizeof(smart)) {
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    trans_len = MIN(sizeof(smart) - off, buf_len);
> > +
> > +    memset(&smart, 0x0, sizeof(smart));
> > +
> > +    smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> > +    smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> > +    smart.host_read_commands[0] = cpu_to_le64(read_commands);
> > +    smart.host_write_commands[0] = cpu_to_le64(write_commands);
> > +
> > +    smart.temperature[0] = n->temperature & 0xff;
> > +    smart.temperature[1] = (n->temperature >> 8) & 0xff;
> > +
> > +    if ((n->temperature > n->features.temp_thresh_hi) ||
> > +        (n->temperature < n->features.temp_thresh_low)) {
> > +        smart.critical_warning |= NVME_SMART_TEMPERATURE;
> > +    }
> > +
> > +    current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +    smart.power_on_hours[0] = cpu_to_le64(
> > +        (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> > +
> > +    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > +        prp2);
> > +}
> Looks OK.
> > +
> > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > +    uint64_t off, NvmeRequest *req)
> > +{
> > +    uint32_t trans_len;
> > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +    NvmeFwSlotInfoLog fw_log;
> > +
> > +    if (off > sizeof(fw_log)) {
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> > +
> > +    trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > +
> > +    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > +        prp2);
> > +}
> Looks OK
> > +
> > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > +    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > +    uint8_t  lid = dw10 & 0xff;
> > +    uint8_t  rae = (dw10 >> 15) & 0x1;
> > +    uint32_t numdl, numdu;
> > +    uint64_t off, lpol, lpou;
> > +    size_t   len;
> > +
> > +    numdl = (dw10 >> 16);
> > +    numdu = (dw11 & 0xffff);
> > +    lpol = dw12;
> > +    lpou = dw13;
> > +
> > +    len = (((numdu << 16) | numdl) + 1) << 2;
> > +    off = (lpou << 32ULL) | lpol;
> > +
> > +    if (off & 0x3) {
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> 
> Good. 
> Note that there are plenty of other places in the driver that don't honor
> such tiny formal bits of the spec, like for instance checking for the reserved
> bits in commands.

Yeah, I know. Do you think it's fair if we leave that for subsequent
patches? It's not like it's breaking the device, but compliance is not
complete.
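
For the record, a generic check for, say, the reserved command flag bits
could look roughly like this (bit positions from my reading of NVMe
1.2.1, command dword 0; just a sketch of what a follow-up patch could do):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Command DW0 as seen in the 'flags' byte (bits 15:8 of DW0): bits 1:0
 * are FUSE, bits 5:2 are reserved and bits 7:6 are PSDT.
 */
#define NVME_CMD_FLAGS_FUSE(flags) ((flags) & 0x3)
#define NVME_CMD_FLAGS_RES(flags)  ((flags) & 0x3c)
#define NVME_CMD_FLAGS_PSDT(flags) (((flags) >> 6) & 0x3)

/* Return non-zero if any reserved flag bit is set; the command would
 * then fail with Invalid Field in Command. */
static int nvme_check_reserved_flags(uint8_t flags)
{
    return NVME_CMD_FLAGS_RES(flags) ? -1 : 0;
}
```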

> > +
> > +    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > +
> > +    switch (lid) {
> > +    case NVME_LOG_ERROR_INFO:
> > +        if (off) {
> > +            return NVME_INVALID_FIELD | NVME_DNR;
> > +        }
> 
> I think you might want to memset the user given buffer to zero:
> 
> "This is a 64-bit incrementing error count, indicating a unique identifier for this error.
> The error count starts at 1h, is incremented for each unique error log entry, and is retained across
> power off conditions. A value of 0h indicates an invalid entry; this value is used when there are
> lost entries or when there are fewer errors than the maximum number of entries the controller
> supports."

Good catch. Fixed!
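
For the record, the fix boils down to handing the host all-zero entries,
roughly like this (with a stand-in struct, not the real NvmeErrorLog; in
the device the zeroed buffer is then transferred via nvme_dma_read_prp):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the real NvmeErrorLog structure (which has more fields). */
typedef struct {
    uint64_t error_count; /* 0h marks an invalid (unused) entry */
    uint16_t sqid;
    uint16_t cid;
} ErrorLogEntry;

/*
 * Since the device records no errors, the Error Information log handler
 * can simply return all-zero entries ("invalid entry" per the spec text
 * quoted above) for the requested transfer length.
 */
static void nvme_error_log_fill_empty(ErrorLogEntry *buf, size_t nentries)
{
    memset(buf, 0, nentries * sizeof(*buf));
}
```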

> > +
> > +        return NVME_SUCCESS;
> > +    case NVME_LOG_SMART_INFO:
> > +        return nvme_smart_info(n, cmd, len, off, req);
> > +    case NVME_LOG_FW_SLOT_INFO:
> > +        return nvme_fw_log_info(n, cmd, len, off, req);
> > +    default:
> > +        trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +}
> 
> 
> > +
> >  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
> >  {
> >      n->cq[cq->cqid] = NULL;
> > @@ -914,6 +1031,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          return nvme_del_sq(n, cmd);
> >      case NVME_ADM_CMD_CREATE_SQ:
> >          return nvme_create_sq(n, cmd);
> > +    case NVME_ADM_CMD_GET_LOG_PAGE:
> > +        return nvme_get_log(n, cmd, req);
> >      case NVME_ADM_CMD_DELETE_CQ:
> >          return nvme_del_cq(n, cmd);
> >      case NVME_ADM_CMD_CREATE_CQ:
> > @@ -1411,6 +1530,7 @@ static void nvme_init_state(NvmeCtrl *n)
> >  
> >      n->temperature = NVME_TEMPERATURE;
> >      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> > +    n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> >  }
> >  
> >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > @@ -1491,7 +1611,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >       */
> >      id->acl = 3;
> >      id->frmw = 7 << 1;
> > -    id->lpa = 1 << 0;
> > +    id->lpa = 1 << 2;
> >  
> >      /* recommended default value (~70 C) */
> >      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 1518f32557a3..89b0aafa02a2 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -109,6 +109,7 @@ typedef struct NvmeCtrl {
> >      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> >      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
> >      uint16_t    temperature;
> > +    uint64_t    starttime_ms;
> >  
> >      NvmeNamespace   *namespaces;
> >      NvmeSQueue      **sq;
> > @@ -124,4 +125,13 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> >      return n->ns_size >> nvme_ns_lbads(ns);
> >  }
> >  
> > +static inline uint16_t nvme_cid(NvmeRequest *req)
> > +{
> > +    if (req) {
> > +        return le16_to_cpu(req->cqe.cid);
> > +    }
> > +
> > +    return 0xffff;
> > +}
> 
> I see that you added command ID reporting to trace events you added,
> which makes sense.
> I think it would be nice later to add it to existing trace events where it makes sense.
> 

Exactly. I'm doing that as I encounter it and it makes sense to have it
in the patch.

> 
> > +
> >  #endif /* HW_NVME_H */
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index ade506ea2bb2..7da088479f39 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -46,6 +46,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> >  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> >  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> > +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> >  nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> >  nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> >  nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> > @@ -85,6 +86,7 @@ nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completi
> >  nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
> >  nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
> >  nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
> > +nvme_dev_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
> >  nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
> >  nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
> >  nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index ff31cb32117c..9a6055adeb61 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -515,7 +515,7 @@ enum NvmeSmartWarn {
> >      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
> >  };
> >  
> > -enum LogIdentifier {
> > +enum NvmeLogIdentifier {
> >      NVME_LOG_ERROR_INFO     = 0x01,
> >      NVME_LOG_SMART_INFO     = 0x02,
> >      NVME_LOG_FW_SLOT_INFO   = 0x03,
> 
> Best regards,
> 	Maxim Levitsky
> 



* Re: [PATCH v5 12/26] nvme: add missing mandatory features
  2020-02-12 10:27       ` Maxim Levitsky
@ 2020-03-16  7:47         ` Klaus Birkelund Jensen
  2020-03-25 10:22           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:47 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 12:27, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add support for returning a reasonable response to Get/Set Features of
> > mandatory features.
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > ---
> >  hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  3 ++-
> >  3 files changed, 58 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index a186d95df020..3267ee2de47a 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1008,7 +1008,15 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >      uint32_t result;
> >  
> > +    trace_nvme_dev_getfeat(nvme_cid(req), dw10);
> > +
> >      switch (dw10) {
> > +    case NVME_ARBITRATION:
> > +        result = cpu_to_le32(n->features.arbitration);
> > +        break;
> > +    case NVME_POWER_MANAGEMENT:
> > +        result = cpu_to_le32(n->features.power_mgmt);
> > +        break;
> >      case NVME_TEMPERATURE_THRESHOLD:
> >          result = 0;
> >  
> > @@ -1029,6 +1037,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >              break;
> >          }
> >  
> > +        break;
> > +    case NVME_ERROR_RECOVERY:
> > +        result = cpu_to_le32(n->features.err_rec);
> >          break;
> >      case NVME_VOLATILE_WRITE_CACHE:
> >          result = blk_enable_write_cache(n->conf.blk);
> 
> This is existing code but I'd still like to point out that the endianness conversion is missing.

Fixed.

> Also we need to think if we need to do some flush if the write cache is disabled.
> I don't know yet that area well enough.
> 

Looking at the block layer code it just sets a flag when disabling, but
subsequent requests will have BDRV_REQ_FUA set. So to make sure that
stuff in the cache is flushed, let's do a flush.

> > @@ -1041,6 +1052,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          break;
> >      case NVME_TIMESTAMP:
> >          return nvme_get_feature_timestamp(n, cmd);
> > +    case NVME_INTERRUPT_COALESCING:
> > +        result = cpu_to_le32(n->features.int_coalescing);
> > +        break;
> > +    case NVME_INTERRUPT_VECTOR_CONF:
> > +        if ((dw11 & 0xffff) > n->params.num_queues) {
> Looks like it should be >= since interrupt vector is not zero based.

Fixed in other patch.

> > +            return NVME_INVALID_FIELD | NVME_DNR;
> > +        }
> > +
> > +        result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
> > +        break;
> > +    case NVME_WRITE_ATOMICITY:
> > +        result = cpu_to_le32(n->features.write_atomicity);
> > +        break;
> >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> >          result = cpu_to_le32(n->features.async_config);
> >          break;
> > @@ -1076,6 +1100,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> > +    trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
> > +
> >      switch (dw10) {
> >      case NVME_TEMPERATURE_THRESHOLD:
> >          if (NVME_TEMP_TMPSEL(dw11)) {
> > @@ -1116,6 +1142,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> >          n->features.async_config = dw11;
> >          break;
> > +    case NVME_ARBITRATION:
> > +    case NVME_POWER_MANAGEMENT:
> > +    case NVME_ERROR_RECOVERY:
> > +    case NVME_INTERRUPT_COALESCING:
> > +    case NVME_INTERRUPT_VECTOR_CONF:
> > +    case NVME_WRITE_ATOMICITY:
> > +        return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
> >      default:
> >          trace_nvme_dev_err_invalid_setfeat(dw10);
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1689,6 +1722,21 @@ static void nvme_init_state(NvmeCtrl *n)
> >      n->temperature = NVME_TEMPERATURE;
> >      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> >      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > +
> > +    /*
> > +     * There is no limit on the number of commands that the controller may
> > +     * launch at one time from a particular Submission Queue.
> > +     */
> > +    n->features.arbitration = 0x7;
> A nice #define in nvme.h stating that 0x7 means no burst limit would be nice.
> 

Done.

> > +
> > +    n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
> > +        sizeof(*n->features.int_vector_config));
> > +
> > +    /* disable coalescing (not supported) */
> > +    for (int i = 0; i < n->params.num_queues; i++) {
> > +        n->features.int_vector_config[i] = i | (1 << 16);
> Same here

Done.

> > +    }
> > +
> >      n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> >  }
> >  
> > @@ -1782,15 +1830,17 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >      id->nn = cpu_to_le32(n->num_namespaces);
> >      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> >  
> > +
> > +    if (blk_enable_write_cache(n->conf.blk)) {
> > +        id->vwc = 1;
> > +    }
> > +
> >      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> >      pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
> >  
> >      id->psd[0].mp = cpu_to_le16(0x9c4);
> >      id->psd[0].enlat = cpu_to_le32(0x10);
> >      id->psd[0].exlat = cpu_to_le32(0x4);
> > -    if (blk_enable_write_cache(n->conf.blk)) {
> > -        id->vwc = 1;
> > -    }
> >  
> >      n->bar.cap = 0;
> >      NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> > @@ -1861,6 +1911,7 @@ static void nvme_exit(PCIDevice *pci_dev)
> >      g_free(n->cq);
> >      g_free(n->sq);
> >      g_free(n->aer_reqs);
> > +    g_free(n->features.int_vector_config);
> >  
> >      if (n->params.cmb_size_mb) {
> >          g_free(n->cmbuf);
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index 3952c36774cf..4cf39961989d 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -41,6 +41,8 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
> >  nvme_dev_identify_ctrl(void) "identify controller"
> >  nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> >  nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> > +nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
> > +nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
> >  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> >  nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index a24be047a311..09419ed499d0 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -445,7 +445,8 @@ enum NvmeStatusCodes {
> >      NVME_FW_REQ_RESET           = 0x010b,
> >      NVME_INVALID_QUEUE_DEL      = 0x010c,
> >      NVME_FID_NOT_SAVEABLE       = 0x010d,
> > -    NVME_FID_NOT_NSID_SPEC      = 0x010f,
> > +    NVME_FEAT_NOT_CHANGABLE     = 0x010e,
> > +    NVME_FEAT_NOT_NSID_SPEC     = 0x010f,
> >      NVME_FW_REQ_SUSYSTEM_RESET  = 0x0110,
> >      NVME_CONFLICTING_ATTRS      = 0x0180,
> >      NVME_INVALID_PROT_INFO      = 0x0181,
> 
> Best regards,
> 	Maxim Levitsky
> 



* Re: [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid
  2020-02-12 10:30       ` Maxim Levitsky
@ 2020-03-16  7:48         ` Klaus Birkelund Jensen
  2020-03-25 10:25           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:48 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 12:30, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > 0xffff is not an allowed value for NCQR and NSQR in Set Features on
> > Number of Queues.
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme.c | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 30c5b3e7a67d..900732bb2f38 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -1133,6 +1133,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >          break;
> >      case NVME_NUMBER_OF_QUEUES:
> > +        if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
> > +            return NVME_INVALID_FIELD | NVME_DNR;
> > +        }
> Very minor nitpick: since this spec requirement is not obvious, a quote/reference to the spec
> would be nice to have here. 
> 

Added.

> > +
> >          trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
> >              ((dw11 >> 16) & 0xFFFF) + 1, n->params.num_queues - 1,
> >              n->params.num_queues - 1);
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> Best regards,
> 	Maxim Levitsky
> 



* Re: [PATCH v5 15/26] nvme: bump supported specification to 1.3
  2020-02-12 10:35       ` Maxim Levitsky
@ 2020-03-16  7:50         ` Klaus Birkelund Jensen
  2020-03-25 10:22           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:50 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 12:35, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Add new fields to the Identify Controller and Identify Namespace data
> > structures according to NVM Express 1.3d.
> > 
> > NVM Express 1.3d requires the following additional features:
> >   - addition of the Namespace Identification Descriptor List (CNS 03h)
> >     for the Identify command
> >   - support for returning Command Sequence Error if a Set Features
> >     command is submitted for the Number of Queues feature after any I/O
> >     queues have been created.
> >   - The addition of the Log Specific Field (LSP) in the Get Log Page
> >     command.
> 
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > ---
> >  hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
> >  hw/block/nvme.h       |  1 +
> >  hw/block/trace-events |  3 ++-
> >  include/block/nvme.h  | 20 ++++++++++-----
> >  4 files changed, 71 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 900732bb2f38..4acfc85b56a2 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -9,7 +9,7 @@
> >   */
> >  
> >  /**
> > - * Reference Specification: NVM Express 1.2.1
> > + * Reference Specification: NVM Express 1.3d
> >   *
> >   *   https://nvmexpress.org/resources/specifications/
> >   */
> > @@ -43,7 +43,7 @@
> >  #include "trace.h"
> >  #include "nvme.h"
> >  
> > -#define NVME_SPEC_VER 0x00010201
> > +#define NVME_SPEC_VER 0x00010300
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> >  #define NVME_TEMPERATURE 0x143
> >  #define NVME_TEMPERATURE_WARNING 0x157
> > @@ -735,6 +735,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> >      uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> >      uint8_t  lid = dw10 & 0xff;
> > +    uint8_t  lsp = (dw10 >> 8) & 0xf;
> >      uint8_t  rae = (dw10 >> 15) & 0x1;
> >      uint32_t numdl, numdu;
> >      uint64_t off, lpol, lpou;
> > @@ -752,7 +753,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >  
> > -    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > +    trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
> >  
> >      switch (lid) {
> >      case NVME_LOG_ERROR_INFO:
> > @@ -863,6 +864,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> >      cq = g_malloc0(sizeof(*cq));
> >      nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
> >          NVME_CQ_FLAGS_IEN(qflags));
> Code alignment on that '('
> > +
> > +    n->qs_created = true;
> Should be done also at nvme_create_sq

No, because you can't create a SQ without a matching CQ:

    if (unlikely(!cqid || nvme_check_cqid(n, cqid))) {
        trace_nvme_dev_err_invalid_create_sq_cqid(cqid);
        return NVME_INVALID_CQID | NVME_DNR;
    }


So if there is a matching cq, then qs_created = true.

> >      return NVME_SUCCESS;
> >  }
> >  
> > @@ -924,6 +927,47 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> >      return ret;
> >  }
> >  
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > +{
> > +    static const int len = 4096;
> The spec caps the Identify payload size at 4K,
> thus this should go to nvme.h

Done.

> > +
> > +    struct ns_descr {
> > +        uint8_t nidt;
> > +        uint8_t nidl;
> > +        uint8_t rsvd2[2];
> > +        uint8_t nid[16];
> > +    };
> This is also part of the spec, thus should
> move to nvme.h
> 

Done - and cleaned up.

> > +
> > +    uint32_t nsid = le32_to_cpu(c->nsid);
> > +    uint64_t prp1 = le64_to_cpu(c->prp1);
> > +    uint64_t prp2 = le64_to_cpu(c->prp2);
> > +
> > +    struct ns_descr *list;
> > +    uint16_t ret;
> > +
> > +    trace_nvme_dev_identify_ns_descr_list(nsid);
> > +
> > +    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > +        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > +        return NVME_INVALID_NSID | NVME_DNR;
> > +    }
> > +
> > +    /*
> > +     * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
> > +     * structure, a Namespace UUID (nidt = 0x3) must be reported in the
> > +     * Namespace Identification Descriptor. Add a very basic Namespace UUID
> > +     * here.
> A per-namespace uuid qemu property would be very nice to have, so that the uuid
> is at least somewhat unique.
> I think the Linux kernel might complain if it detects namespaces with duplicate uuids.

It will be "unique" per controller (because it's just the namespace id).
The spec also says that it should be fixed for the lifetime of the
namespace, but I'm not sure how to ensure that without keeping that
state on disk somehow. I have a solution for this in a later series, but
for now, I think this is ok.

But since we actually support multiple controllers, there certainly is
an issue here. Maybe we can blend in some PCI id or something to make it
unique across controllers.

> 
> > +     */
> > +    list = g_malloc0(len);
> > +    list->nidt = 0x3;
> > +    list->nidl = 0x10;
> Those should also be #defined in nvme.h

Fixed.

> > +    *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > +
> > +    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > +    g_free(list);
> > +    return ret;
> > +}
> > +
> >  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >  {
> >      NvmeIdentify *c = (NvmeIdentify *)cmd;
> > @@ -935,6 +979,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> >          return nvme_identify_ctrl(n, c);
> >      case 0x02:
> >          return nvme_identify_ns_list(n, c);
> > +    case 0x03:
> The CNS values should be defined in nvme.h.

Fixed.

> > +        return nvme_identify_ns_descr_list(n, cmd);
> >      default:
> >          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1133,6 +1179,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> >          break;
> >      case NVME_NUMBER_OF_QUEUES:
> > +        if (n->qs_created) {
> > +            return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > +        }
> > +
> >          if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
> >              return NVME_INVALID_FIELD | NVME_DNR;
> >          }
> > @@ -1267,6 +1317,7 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
> >  
> >      n->aer_queued = 0;
> >      n->outstanding_aers = 0;
> > +    n->qs_created = false;
> >  
> >      blk_flush(n->conf.blk);
> >      n->bar.cc = 0;
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 1e715ab1d75c..7ced5fd485a9 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -97,6 +97,7 @@ typedef struct NvmeCtrl {
> >      BlockConf    conf;
> >      NvmeParams   params;
> >  
> > +    bool        qs_created;
> >      uint32_t    page_size;
> >      uint16_t    page_bits;
> >      uint16_t    max_prp_ents;
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index f982ec1a3221..9e5a4548bde0 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -41,6 +41,7 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
> >  nvme_dev_identify_ctrl(void) "identify controller"
> >  nvme_dev_identify_ns(uint32_t ns) "nsid %"PRIu32""
> >  nvme_dev_identify_ns_list(uint32_t ns) "nsid %"PRIu32""
> > +nvme_dev_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
> >  nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
> >  nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
> >  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> > @@ -48,7 +49,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> >  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> >  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> > -nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> > +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> >  nvme_dev_process_aers(int queued) "queued %d"
> >  nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
> >  nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 09419ed499d0..31eb9397d8c6 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -550,7 +550,9 @@ typedef struct NvmeIdCtrl {
> >      uint32_t    rtd3e;
> >      uint32_t    oaes;
> >      uint32_t    ctratt;
> > -    uint8_t     rsvd100[156];
> > +    uint8_t     rsvd100[12];
> > +    uint8_t     fguid[16];
> > +    uint8_t     rsvd128[128];
> looks OK
> >      uint16_t    oacs;
> >      uint8_t     acl;
> >      uint8_t     aerl;
> > @@ -568,9 +570,15 @@ typedef struct NvmeIdCtrl {
> >      uint8_t     tnvmcap[16];
> >      uint8_t     unvmcap[16];
> >      uint32_t    rpmbs;
> > -    uint8_t     rsvd316[4];
> > +    uint16_t    edstt;
> > +    uint8_t     dsto;
> > +    uint8_t     fwug;
> looks OK
> >      uint16_t    kas;
> > -    uint8_t     rsvd322[190];
> > +    uint16_t    hctma;
> > +    uint16_t    mntmt;
> > +    uint16_t    mxtmt;
> > +    uint32_t    sanicap;
> > +    uint8_t     rsvd332[180];
> looks OK
> >      uint8_t     sqes;
> >      uint8_t     cqes;
> >      uint16_t    maxcmd;
> > @@ -691,19 +699,19 @@ typedef struct NvmeIdNs {
> >      uint8_t     rescap;
> >      uint8_t     fpi;
> >      uint8_t     dlfeat;
> > -    uint8_t     rsvd33;
> >      uint16_t    nawun;
> >      uint16_t    nawupf;
> > +    uint16_t    nacwu;
> Aha! Here you 'fix' the bug you had in patch 4.
> >      uint16_t    nabsn;
> >      uint16_t    nabo;
> >      uint16_t    nabspf;
> > -    uint8_t     rsvd46[2];
> > +    uint16_t    noiob;
> >      uint8_t     nvmcap[16];
> >      uint8_t     rsvd64[40];
> >      uint8_t     nguid[16];
> >      uint64_t    eui64;
> >      NvmeLBAF    lbaf[16];
> > -    uint8_t     res192[192];
> > +    uint8_t     rsvd192[192];
> And even do what I suggested with that field :-)
> Please squash the changes.
> >      uint8_t     vs[3712];
> >  } NvmeIdNs;
> >  
> 
> So I suggest you squash this set of changes with patch 4.
> I also suggest you to split the other changes in this patch, 1 per feature added.
> The tracing change can also be squashed with the other tracing patch you submitted.
> 
> In summary I would suggest you to have:
> 
> 1. patch that only adds all the fields from the 1.3d spec, and overall updates nvme.h
> to be up to 1.3d spec
> 
> 2. patches that do refactoring, add more tracing (also a form of refactoring, since tracing
> isn't a functional thing)
> 
> 3. set of patches that implement all the 1.3d features.
> 
> 4. patch that only bumps the supported version to 1.3d
> 

Did this! :)



* Re: [PATCH v5 16/26] nvme: refactor prp mapping
  2020-02-12 11:44       ` Maxim Levitsky
@ 2020-03-16  7:51         ` Klaus Birkelund Jensen
  2020-03-25 10:23           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:51 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 13:44, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > Refactor nvme_map_prp and allow PRPs to be located in the CMB. The logic
> > ensures that if some of the PRP is in the CMB, all of it must be located
> > there, as per the specification.
> 
> To be honest this looks like a bugfix rather than refactoring
> (the old code was just assuming that if the first prp entry is in the cmb, the rest are too)

I split it up into a separate bugfix patch.

> > 
> > Also combine nvme_dma_{read,write}_prp into a single nvme_dma_prp that
> > takes an additional DMADirection parameter.
> 
> To be honest 'nvme_dma_prp' was not a clear function name to me at first glance.
> Could you rename this to nvme_dma_prp_rw or so? (Although even that is somewhat unclear
> to convey the meaning of read/write the data to/from the guest memory areas defined by the prp list.
> Also could you split this change into a new patch?
> 

Splitting into new patch.

> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> Now you even use both your addresses :-)
> 
> > ---
> >  hw/block/nvme.c       | 245 +++++++++++++++++++++++++++---------------
> >  hw/block/nvme.h       |   2 +-
> >  hw/block/trace-events |   1 +
> >  include/block/nvme.h  |   1 +
> >  4 files changed, 160 insertions(+), 89 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 4acfc85b56a2..334265efb21e 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -58,6 +58,11 @@
> >  
> >  static void nvme_process_sq(void *opaque);
> >  
> > +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> > +{
> > +    return &n->cmbuf[addr - n->ctrl_mem.addr];
> > +}
> 
> To my taste I would put this together with the patch that
> added nvme_addr_is_cmb. I know that some people are against
> this citing the fact that you should use the code you add
> in the same patch. Your call.
> 
> Regardless of this I also prefer to put refactoring patches first in the series.
> 
> > +
> >  static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> >  {
> >      hwaddr low = n->ctrl_mem.addr;
> > @@ -152,138 +157,187 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
> >      }
> >  }
> >  
> > -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
> > -                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
> > +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > +    uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
> 
> Split line alignment (it was correct before).
> Also while at the refactoring, it would be great to add some documentation
> to this and few more functions, since its not clear immediately what this does.
> 
> 
> >  {
> >      hwaddr trans_len = n->page_size - (prp1 % n->page_size);
> >      trans_len = MIN(len, trans_len);
> >      int num_prps = (len >> n->page_bits) + 1;
> > +    uint16_t status = NVME_SUCCESS;
> > +    bool is_cmb = false;
> > +    bool prp_list_in_cmb = false;
> > +
> > +    trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> > +        prp1, prp2, num_prps);
> >  
> >      if (unlikely(!prp1)) {
> >          trace_nvme_dev_err_invalid_prp();
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > -    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> > -               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> > -        qsg->nsg = 0;
> > +    }
> > +
> > +    if (nvme_addr_is_cmb(n, prp1)) {
> > +        is_cmb = true;
> > +
> >          qemu_iovec_init(iov, num_prps);
> > -        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
> > +
> > +        /*
> > +         * PRPs do not cross page boundaries, so if the start address (here,
> > +         * prp1) is within the CMB, it cannot cross outside the controller
> > +         * memory buffer range. This is ensured by
> > +         *
> > +         *   len = n->page_size - (addr % n->page_size)
> > +         *
> > +         * Thus, we can directly add to the iovec without risking an out of
> > +         * bounds access. This also holds for the remaining qemu_iovec_add
> > +         * calls.
> > +         */
> > +        qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp1), trans_len);
> >      } else {
> >          pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> >          qemu_sglist_add(qsg, prp1, trans_len);
> >      }
> > +
> >      len -= trans_len;
> >      if (len) {
> >          if (unlikely(!prp2)) {
> >              trace_nvme_dev_err_invalid_prp2_missing();
> > +            status = NVME_INVALID_FIELD | NVME_DNR;
> >              goto unmap;
> >          }
> > +
> >          if (len > n->page_size) {
> >              uint64_t prp_list[n->max_prp_ents];
> >              uint32_t nents, prp_trans;
> >              int i = 0;
> >  
> > +            if (nvme_addr_is_cmb(n, prp2)) {
> > +                prp_list_in_cmb = true;
> > +            }
> > +
> >              nents = (len + n->page_size - 1) >> n->page_bits;
> >              prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
> > +            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> >              while (len != 0) {
> >                  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> >  
> >                  if (i == n->max_prp_ents - 1 && len > n->page_size) {
> >                      if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> >                          trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
> > +                        status = NVME_INVALID_FIELD | NVME_DNR;
> > +                        goto unmap;
> > +                    }
> > +
> > +                    if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> > +                        status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> >                          goto unmap;
> >                      }
> >  
> >                      i = 0;
> >                      nents = (len + n->page_size - 1) >> n->page_bits;
> >                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -                    nvme_addr_read(n, prp_ent, (void *)prp_list,
> > -                        prp_trans);
> > +                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
> >                      prp_ent = le64_to_cpu(prp_list[i]);
> >                  }
> >  
> >                  if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> >                      trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
> > +                    status = NVME_INVALID_FIELD | NVME_DNR;
> > +                    goto unmap;
> > +                }
> > +
> > +                if (is_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> > +                    status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> >                      goto unmap;
> >                  }
> >  
> >                  trans_len = MIN(len, n->page_size);
> > -                if (qsg->nsg){
> > -                    qemu_sglist_add(qsg, prp_ent, trans_len);
> > +                if (is_cmb) {
> > +                    qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp_ent),
> > +                        trans_len);
> >                  } else {
> > -                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
> > +                    qemu_sglist_add(qsg, prp_ent, trans_len);
> >                  }
> > +
> >                  len -= trans_len;
> >                  i++;
> >              }
> >          } else {
> > +            if (is_cmb != nvme_addr_is_cmb(n, prp2)) {
> > +                status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +                goto unmap;
> > +            }
> > +
> >              if (unlikely(prp2 & (n->page_size - 1))) {
> >                  trace_nvme_dev_err_invalid_prp2_align(prp2);
> > +                status = NVME_INVALID_FIELD | NVME_DNR;
> >                  goto unmap;
> >              }
> > -            if (qsg->nsg) {
> > +
> > +            if (is_cmb) {
> > +                qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp2), len);
> > +            } else {
> >                  qemu_sglist_add(qsg, prp2, len);
> > -            } else {
> > -                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
> >              }
> >          }
> >      }
> > +
> >      return NVME_SUCCESS;
> >  
> > - unmap:
> > -    qemu_sglist_destroy(qsg);
> > -    return NVME_INVALID_FIELD | NVME_DNR;
> > -}
> 
> I haven't checked the new nvme_map_prp to the extent that I am sure that
> it is correct, but it looks reasonable.
> 
> > -
> > -static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > -                                   uint64_t prp1, uint64_t prp2)
> > -{
> > -    QEMUSGList qsg;
> > -    QEMUIOVector iov;
> > -    uint16_t status = NVME_SUCCESS;
> > -
> > -    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > -        return NVME_INVALID_FIELD | NVME_DNR;
> > -    }
> > -    if (qsg.nsg > 0) {
> > -        if (dma_buf_write(ptr, len, &qsg)) {
> > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > -        }
> > -        qemu_sglist_destroy(&qsg);
> > +unmap:
> > +    if (is_cmb) {
> > +        qemu_iovec_destroy(iov);
> >      } else {
> > -        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
> > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > -        }
> > -        qemu_iovec_destroy(&iov);
> > +        qemu_sglist_destroy(qsg);
> >      }
> > +
> >      return status;
> >  }
> >  
> > -static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > -    uint64_t prp1, uint64_t prp2)
> > +static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > +    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
> >  {
> >      QEMUSGList qsg;
> >      QEMUIOVector iov;
> >      uint16_t status = NVME_SUCCESS;
> > +    size_t bytes;
> >  
> > -    trace_nvme_dev_dma_read(prp1, prp2);
> > -
> > -    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > -        return NVME_INVALID_FIELD | NVME_DNR;
> > +    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
> > +    if (status) {
> > +        return status;
> >      }
> > +
> >      if (qsg.nsg > 0) {
> > -        if (unlikely(dma_buf_read(ptr, len, &qsg))) {
> > +        uint64_t residual;
> > +
> > +        if (dir == DMA_DIRECTION_TO_DEVICE) {
> > +            residual = dma_buf_write(ptr, len, &qsg);
> > +        } else {
> > +            residual = dma_buf_read(ptr, len, &qsg);
> > +        }
> > +
> > +        if (unlikely(residual)) {
> >              trace_nvme_dev_err_invalid_dma();
> >              status = NVME_INVALID_FIELD | NVME_DNR;
> >          }
> > +
> >          qemu_sglist_destroy(&qsg);
> > +
> > +        return status;
> 
> I would prefer an if/else here rather than the early return.
> It would make the code more symmetric.
> 

Looks nicer yeah. Done.

> > +    }
> > +
> > +    if (dir == DMA_DIRECTION_TO_DEVICE) {
> > +        bytes = qemu_iovec_to_buf(&iov, 0, ptr, len);
> >      } else {
> > -        if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
> > -            trace_nvme_dev_err_invalid_dma();
> > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > -        }
> > -        qemu_iovec_destroy(&iov);
> > +        bytes = qemu_iovec_from_buf(&iov, 0, ptr, len);
> >      }
> > +
> > +    if (unlikely(bytes != len)) {
> > +        trace_nvme_dev_err_invalid_dma();
> > +        status = NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    qemu_iovec_destroy(&iov);
> > +
> >      return status;
> >  }
> >  
> > @@ -420,16 +474,20 @@ static void nvme_rw_cb(void *opaque, int ret)
> >          block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
> >          req->status = NVME_INTERNAL_DEV_ERROR;
> >      }
> > -    if (req->has_sg) {
> > +
> > +    if (req->qsg.nalloc) {
> >          qemu_sglist_destroy(&req->qsg);
> >      }
> > +    if (req->iov.nalloc) {
> > +        qemu_iovec_destroy(&req->iov);
> > +    }
> > +
> >      nvme_enqueue_req_completion(cq, req);
> >  }
> >  
> >  static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> >      NvmeRequest *req)
> >  {
> > -    req->has_sg = false;
> >      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> >           BLOCK_ACCT_FLUSH);
> >      req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> > @@ -453,7 +511,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> >          return NVME_LBA_RANGE | NVME_DNR;
> >      }
> >  
> > -    req->has_sg = false;
> >      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> >                       BLOCK_ACCT_WRITE);
> >      req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> > @@ -485,21 +542,24 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> >          return NVME_LBA_RANGE | NVME_DNR;
> >      }
> >  
> > -    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
> > +    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
> >          block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >  
> > -    dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
> >      if (req->qsg.nsg > 0) {
> > -        req->has_sg = true;
> > +        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
> > +            acct);
> > +
> >          req->aiocb = is_write ?
> >              dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> >                            nvme_rw_cb, req) :
> >              dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> >                           nvme_rw_cb, req);
> >      } else {
> > -        req->has_sg = false;
> > +        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
> > +            acct);
> > +
> >          req->aiocb = is_write ?
> >              blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> >                              req) :
> > @@ -596,7 +656,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
> >      sq->size = size;
> >      sq->cqid = cqid;
> >      sq->head = sq->tail = 0;
> > -    sq->io_req = g_new(NvmeRequest, sq->size);
> > +    sq->io_req = g_new0(NvmeRequest, sq->size);
> >  
> >      QTAILQ_INIT(&sq->req_list);
> >      QTAILQ_INIT(&sq->out_req_list);
> > @@ -704,8 +764,8 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> >          nvme_clear_events(n, NVME_AER_TYPE_SMART);
> >      }
> >  
> > -    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > -        prp2);
> > +    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > +        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > @@ -724,8 +784,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> >  
> >      trans_len = MIN(sizeof(fw_log) - off, buf_len);
> >  
> > -    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > -        prp2);
> > +    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > +        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > @@ -869,18 +929,20 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> >      return NVME_SUCCESS;
> >  }
> >  
> > -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
> > +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
> > +    NvmeRequest *req)
> >  {
> >      uint64_t prp1 = le64_to_cpu(c->prp1);
> >      uint64_t prp2 = le64_to_cpu(c->prp2);
> >  
> >      trace_nvme_dev_identify_ctrl();
> >  
> > -    return nvme_dma_read_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> > -        prp1, prp2);
> > +    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> > +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
> > +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> > +    NvmeRequest *req)
> >  {
> >      NvmeNamespace *ns;
> >      uint32_t nsid = le32_to_cpu(c->nsid);
> > @@ -896,11 +958,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
> >  
> >      ns = &n->namespaces[nsid - 1];
> >  
> > -    return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> > -        prp1, prp2);
> > +    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> > +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> > +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> > +    NvmeRequest *req)
> >  {
> >      static const int data_len = 4 * KiB;
> >      uint32_t min_nsid = le32_to_cpu(c->nsid);
> > @@ -922,12 +985,14 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> >              break;
> >          }
> >      }
> > -    ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> > +    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >      g_free(list);
> >      return ret;
> >  }
> >  
> > -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> > +    NvmeRequest *req)
> >  {
> >      static const int len = 4096;
> >  
> > @@ -963,24 +1028,25 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> >      list->nidl = 0x10;
> >      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> >  
> > -    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > +    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >      g_free(list);
> >      return ret;
> >  }
> >  
> > -static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> > +static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> >      NvmeIdentify *c = (NvmeIdentify *)cmd;
> >  
> >      switch (le32_to_cpu(c->cns)) {
> >      case 0x00:
> > -        return nvme_identify_ns(n, c);
> > +        return nvme_identify_ns(n, c, req);
> >      case 0x01:
> > -        return nvme_identify_ctrl(n, c);
> > +        return nvme_identify_ctrl(n, c, req);
> >      case 0x02:
> > -        return nvme_identify_ns_list(n, c);
> > +        return nvme_identify_ns_list(n, c, req);
> >      case 0x03:
> > -        return nvme_identify_ns_descr_list(n, cmd);
> > +        return nvme_identify_ns_descr_list(n, c, req);
> >      default:
> >          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1039,15 +1105,16 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> >      return cpu_to_le64(ts.all);
> >  }
> >  
> > -static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> > +static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > +    NvmeRequest *req)
> >  {
> >      uint64_t prp1 = le64_to_cpu(cmd->prp1);
> >      uint64_t prp2 = le64_to_cpu(cmd->prp2);
> >  
> >      uint64_t timestamp = nvme_get_timestamp(n);
> >  
> > -    return nvme_dma_read_prp(n, (uint8_t *)&timestamp,
> > -                                 sizeof(timestamp), prp1, prp2);
> > +    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
> > +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > @@ -1099,7 +1166,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          trace_nvme_dev_getfeat_numq(result);
> >          break;
> >      case NVME_TIMESTAMP:
> > -        return nvme_get_feature_timestamp(n, cmd);
> > +        return nvme_get_feature_timestamp(n, cmd, req);
> >      case NVME_INTERRUPT_COALESCING:
> >          result = cpu_to_le32(n->features.int_coalescing);
> >          break;
> > @@ -1125,15 +1192,16 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      return NVME_SUCCESS;
> >  }
> >  
> > -static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> > +static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > +    NvmeRequest *req)
> >  {
> >      uint16_t ret;
> >      uint64_t timestamp;
> >      uint64_t prp1 = le64_to_cpu(cmd->prp1);
> >      uint64_t prp2 = le64_to_cpu(cmd->prp2);
> >  
> > -    ret = nvme_dma_write_prp(n, (uint8_t *)&timestamp,
> > -                                sizeof(timestamp), prp1, prp2);
> > +    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
> > +        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
> >      if (ret != NVME_SUCCESS) {
> >          return ret;
> >      }
> > @@ -1194,7 +1262,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >              ((n->params.num_queues - 2) << 16));
> >          break;
> >      case NVME_TIMESTAMP:
> > -        return nvme_set_feature_timestamp(n, cmd);
> > +        return nvme_set_feature_timestamp(n, cmd, req);
> >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> >          n->features.async_config = dw11;
> >          break;
> > @@ -1246,7 +1314,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      case NVME_ADM_CMD_CREATE_CQ:
> >          return nvme_create_cq(n, cmd);
> >      case NVME_ADM_CMD_IDENTIFY:
> > -        return nvme_identify(n, cmd);
> > +        return nvme_identify(n, cmd, req);
> >      case NVME_ADM_CMD_ABORT:
> >          return nvme_abort(n, cmd, req);
> >      case NVME_ADM_CMD_SET_FEATURES:
> > @@ -1282,6 +1350,7 @@ static void nvme_process_sq(void *opaque)
> >          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> >          memset(&req->cqe, 0, sizeof(req->cqe));
> >          req->cqe.cid = cmd.cid;
> > +        memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
> >  
> >          status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
> >              nvme_admin_cmd(n, &cmd, req);
> > @@ -1804,7 +1873,7 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> >  
> >      NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> >      NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> > -    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> > +    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 1);
> >      NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> >      NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> >      NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 7ced5fd485a9..d27baa9d5391 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -27,11 +27,11 @@ typedef struct NvmeRequest {
> >      struct NvmeSQueue       *sq;
> >      BlockAIOCB              *aiocb;
> >      uint16_t                status;
> > -    bool                    has_sg;
> >      NvmeCqe                 cqe;
> >      BlockAcctCookie         acct;
> >      QEMUSGList              qsg;
> >      QEMUIOVector            iov;
> > +    NvmeCmd                 cmd;
> >      QTAILQ_ENTRY(NvmeRequest)entry;
> >  } NvmeRequest;
> >  
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index 9e5a4548bde0..77aa0da99ee0 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -33,6 +33,7 @@ nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
> >  nvme_dev_irq_pin(void) "pulsing IRQ pin"
> >  nvme_dev_irq_masked(void) "IRQ is masked"
> >  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> > +nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> >  nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> >  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
> >  nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index 31eb9397d8c6..c1de92179596 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -427,6 +427,7 @@ enum NvmeStatusCodes {
> >      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
> >      NVME_INVALID_NSID           = 0x000b,
> >      NVME_CMD_SEQ_ERROR          = 0x000c,
> > +    NVME_INVALID_USE_OF_CMB     = 0x0012,
> >      NVME_LBA_RANGE              = 0x0080,
> >      NVME_CAP_EXCEEDED           = 0x0081,
> >      NVME_NS_NOT_READY           = 0x0082,
> 
> 
> Overall I would split this commit into real refactoring and bugfixes.

Done!

> Best regards,
> 	Maxim Levitsky
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 17/26] nvme: allow multiple aios per command
  2020-02-12 11:48       ` Maxim Levitsky
@ 2020-03-16  7:53         ` Klaus Birkelund Jensen
  2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:53 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 13:48, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > This refactors how the device issues asynchronous block backend
> > requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> > associated with the command. This allows multiple aios to be issued for
> > a command. Only when all requests have been completed will the device
> > post a completion queue entry.
> > 
> > Because the device is currently guaranteed to only issue a single aio
> > request per command, the benefit is not immediately obvious. But this
> > functionality is required to support metadata, the dataset management
> > command and other features.
> 
> I don't know which strategy will be chosen for supporting metadata
> (qemu has no notion of metadata in the block layer), but for dataset management
> you are right. The Dataset Management command can contain a table of ranges to
> discard (although in reality I have seen no driver put more than one entry there).
> 

The strategy is different depending on how the metadata is transferred
between host and device. For the "separate buffer" case, metadata is
transferred using a separate memory pointer in the nvme command (MPTR).
In this case the metadata is kept separately on a new blockdev attached
to the namespace.

In the other case, metadata is transferred as part of an extended lba
(say 512 + 8 bytes) and kept inline on the main namespace blockdev. This
is challenging for QEMU as it breaks interoperability of the image with
other devices. But that is a discussion for a fresh RFC ;)

Note that the support for multiple AIOs is also used for DULBE support
down the line when I get around to posting those patches. So this is
preparatory for a lot of features that require persistent state across
device power off.

> 
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme.c       | 449 +++++++++++++++++++++++++++++++++---------
> >  hw/block/nvme.h       | 134 +++++++++++--
> >  hw/block/trace-events |   8 +
> >  3 files changed, 480 insertions(+), 111 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 334265efb21e..e97da35c4ca1 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -19,7 +19,8 @@
> >   *      -drive file=<file>,if=none,id=<drive_id>
> >   *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
> >   *              cmb_size_mb=<cmb_size_mb[optional]>, \
> > - *              num_queues=<N[optional]>
> > + *              num_queues=<N[optional]>, \
> > + *              mdts=<mdts[optional]>
> 
> Could you split mdts checks into a separate patch? This is not related to the series.

Absolutely. Done.

> 
> >   *
> >   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> >   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> > @@ -57,6 +58,7 @@
> >      } while (0)
> >  
> >  static void nvme_process_sq(void *opaque);
> > +static void nvme_aio_cb(void *opaque, int ret);
> >  
> >  static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> >  {
> > @@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> >      return status;
> >  }
> >  
> > +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns = req->ns;
> > +
> > +    uint32_t len = req->nlb << nvme_ns_lbads(ns);
> > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +
> > +    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > +}
> 
> Same here, this is another nice refactoring and it should be in separate patch.

Done.

> 
> > +
> > +static void nvme_aio_destroy(NvmeAIO *aio)
> > +{
> > +    g_free(aio);
> > +}
> > +
> > +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> > +    NvmeAIOOp opc)
> > +{
> > +    aio->opc = opc;
> > +
> > +    trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> > +        aio->offset, aio->len, nvme_aio_opc_str(aio), req);
> > +
> > +    if (req) {
> > +        QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> > +    }
> > +}
> > +
> > +static void nvme_aio(NvmeAIO *aio)
> Function name not clear to me. Maybe change this to something like nvme_submit_aio.

Fixed.

> > +{
> > +    BlockBackend *blk = aio->blk;
> > +    BlockAcctCookie *acct = &aio->acct;
> > +    BlockAcctStats *stats = blk_get_stats(blk);
> > +
> > +    bool is_write, dma;
> > +
> > +    switch (aio->opc) {
> > +    case NVME_AIO_OPC_NONE:
> > +        break;
> > +
> > +    case NVME_AIO_OPC_FLUSH:
> > +        block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH);
> > +        aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio);
> > +        break;
> > +
> > +    case NVME_AIO_OPC_WRITE_ZEROES:
> > +        block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE);
> > +        aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len,
> > +            BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> > +        break;
> > +
> > +    case NVME_AIO_OPC_READ:
> > +    case NVME_AIO_OPC_WRITE:
> 
> > +        dma = aio->qsg != NULL;
> 
> This doesn't work.
> aio->qsg is always not null since nvme_rw_aio sets this to &req->qsg
> which is then written to aio->qsg by nvme_aio_new.

Ouch. This is a refactoring gone awry. Very nicely spotted.

> 
> That is yet another reason I really don't like these parallel QEMUSGList
> and QEMUIOVector. However I see that a few other qemu drivers do this,
> thus this is probably a necessary evil.
> 
> What we can do maybe is to do dma_memory_map on the SG list,
> and then deal with QEMUIOVector only. Virtio does this
> (virtqueue_pop/virtqueue_push)

Yeah, I agree. But I really want to use the dma helpers to avoid messing
around with that complexity.

> 
> 
> > +        is_write = (aio->opc == NVME_AIO_OPC_WRITE);
> > +
> > +        block_acct_start(stats, acct, aio->len,
> > +            is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> > +
> > +        if (dma) {
> > +            aio->aiocb = is_write ?
> > +                dma_blk_write(blk, aio->qsg, aio->offset,
> > +                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio) :
> > +                dma_blk_read(blk, aio->qsg, aio->offset,
> > +                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio);
> > +
> Extra space
> > +            return;
> > +        }
> > +
> > +        aio->aiocb = is_write ?
> > +            blk_aio_pwritev(blk, aio->offset, aio->iov, 0,
> > +                nvme_aio_cb, aio) :
> > +            blk_aio_preadv(blk, aio->offset, aio->iov, 0,
> > +                nvme_aio_cb, aio);
> > +
> > +        break;
> > +    }
> > +}
> > +
> > +static void nvme_rw_aio(BlockBackend *blk, uint64_t offset, NvmeRequest *req)
> > +{
> > +    NvmeAIO *aio;
> > +    size_t len = req->qsg.nsg > 0 ? req->qsg.size : req->iov.size;
> > +
> > +    aio = g_new0(NvmeAIO, 1);
> > +
> > +    *aio = (NvmeAIO) {
> > +        .blk = blk,
> > +        .offset = offset,
> > +        .len = len,
> > +        .req = req,
> > +        .qsg = &req->qsg,
> > +        .iov = &req->iov,
> > +    };
> > +
> > +    nvme_req_register_aio(req, aio, nvme_req_is_write(req) ?
> > +        NVME_AIO_OPC_WRITE : NVME_AIO_OPC_READ);
> nitpick: I think I don't like the nvme_req_register_aio name either, but I don't think I have
> a better name for it yet. 

If you figure out a better name, let me know ;) I thought about
"enqueue", but that's not really what it's doing. It is just registering
that an AIO is associated with the request. Maybe "post" or something,
not sure.

> > +    nvme_aio(aio);
> > +}
> > +
> >  static void nvme_post_cqes(void *opaque)
> >  {
> >      NvmeCQueue *cq = opaque;
> > @@ -364,6 +467,7 @@ static void nvme_post_cqes(void *opaque)
> >          nvme_inc_cq_tail(cq);
> >          pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> >              sizeof(req->cqe));
> > +        nvme_req_clear(req);
> >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >      }
> >      if (cq->tail != cq->head) {
> > @@ -374,8 +478,8 @@ static void nvme_post_cqes(void *opaque)
> >  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
> >  {
> >      assert(cq->cqid == req->sq->cqid);
> > -    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid,
> > -        req->status);
> > +    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid, req->status);
> > +
> >      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
> >      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> >      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> > @@ -460,135 +564,272 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
> >      }
> >  }
> >  
> > -static void nvme_rw_cb(void *opaque, int ret)
> > +static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len,
> > +    NvmeRequest *req)
> > +{
> > +    uint8_t mdts = n->params.mdts;
> > +
> > +    if (mdts && len > n->page_size << mdts) {
> > +        trace_nvme_dev_err_mdts(nvme_cid(req), n->page_size << mdts, len);
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    return NVME_SUCCESS;
> > +}
> > +
> > +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
> > +    NvmeNamespace *ns = req->ns;
> > +
> > +    uint16_t ctrl = le16_to_cpu(rw->control);
> > +
> > +    if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) {
> > +        trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl);
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    return NVME_SUCCESS;
> > +}
> > +
> > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > +    uint32_t nlb, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns = req->ns;
> > +    uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> > +
> > +    if (unlikely((slba + nlb) > nsze)) {
> > +        block_acct_invalid(blk_get_stats(n->conf.blk),
> > +            nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> > +        trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
> > +        return NVME_LBA_RANGE | NVME_DNR;
> > +    }
> 
> Double check this in regard to integer overflows, e.g. if slba + nlb overflows.
> 
> That is what I did in my nvme-mdev:
> 
> static inline bool check_range(u64 start, u64 size, u64 end)
> {
> 	u64 test = start + size;
> 
> 	/* check for overflow */
> 	if (test < start || test < size)
> 		return false;
> 	return test <= end;
> }
> 

Fixed in new patch.

> > +
> > +    return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> > +{
> > +    NvmeNamespace *ns = req->ns;
> > +    size_t len = req->nlb << nvme_ns_lbads(ns);
> > +    uint16_t status;
> > +
> > +    status = nvme_check_mdts(n, len, req);
> > +    if (status) {
> > +        return status;
> > +    }
> > +
> > +    status = nvme_check_prinfo(n, req);
> > +    if (status) {
> > +        return status;
> > +    }
> > +
> > +    status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > +    if (status) {
> > +        return status;
> > +    }
> > +
> > +    return NVME_SUCCESS;
> > +}
> 
> Note that there are more things to check if we don't support metadata,
> like for instance the metadata pointer in the submission entry is NULL.
> 

Yeah. I think these will be introduced along the way. It's a step
towards better compliance, but it doesn't break the device.

> All these check_ functions are very good but they should move to
> a separate patch since they just implement parts of the spec
> and have nothing to do with the patch subject.
> 

Done. 

> > +
> > +static void nvme_rw_cb(NvmeRequest *req, void *opaque)
> >  {
> > -    NvmeRequest *req = opaque;
> >      NvmeSQueue *sq = req->sq;
> >      NvmeCtrl *n = sq->ctrl;
> >      NvmeCQueue *cq = n->cq[sq->cqid];
> >  
> > -    if (!ret) {
> > -        block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
> > -        req->status = NVME_SUCCESS;
> > -    } else {
> > -        block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
> > -        req->status = NVME_INTERNAL_DEV_ERROR;
> > -    }
> > -
> > -    if (req->qsg.nalloc) {
> > -        qemu_sglist_destroy(&req->qsg);
> > -    }
> > -    if (req->iov.nalloc) {
> > -        qemu_iovec_destroy(&req->iov);
> > -    }
> > +    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
> >  
> >      nvme_enqueue_req_completion(cq, req);
> >  }
> >  
> > -static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > -    NvmeRequest *req)
> > +static void nvme_aio_cb(void *opaque, int ret)
> >  {
> > -    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> > -         BLOCK_ACCT_FLUSH);
> > -    req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> > +    NvmeAIO *aio = opaque;
> > +    NvmeRequest *req = aio->req;
> >  
> > -    return NVME_NO_COMPLETE;
> > -}
> > +    BlockBackend *blk = aio->blk;
> > +    BlockAcctCookie *acct = &aio->acct;
> > +    BlockAcctStats *stats = blk_get_stats(blk);
> >  
> > -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > -    NvmeRequest *req)
> > -{
> > -    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> > -    const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > -    const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
> > -    uint64_t slba = le64_to_cpu(rw->slba);
> > -    uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
> > -    uint64_t offset = slba << data_shift;
> > -    uint32_t count = nlb << data_shift;
> > -
> > -    if (unlikely(slba + nlb > ns->id_ns.nsze)) {
> > -        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> > -        return NVME_LBA_RANGE | NVME_DNR;
> > -    }
> > -
> > -    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> > -                     BLOCK_ACCT_WRITE);
> > -    req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> > -                                        BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
> > -    return NVME_NO_COMPLETE;
> > -}
> > -
> > -static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > -    NvmeRequest *req)
> > -{
> > -    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> > -    uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
> > -    uint64_t slba = le64_to_cpu(rw->slba);
> > -    uint64_t prp1 = le64_to_cpu(rw->prp1);
> > -    uint64_t prp2 = le64_to_cpu(rw->prp2);
> > -
> > -    uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > -    uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
> > -    uint64_t data_size = (uint64_t)nlb << data_shift;
> > -    uint64_t data_offset = slba << data_shift;
> > -    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
> > -    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> > +    Error *local_err = NULL;
> >  
> > -    trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
> > +    trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
> > +        nvme_aio_opc_str(aio), req);
> >  
> > -    if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
> > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > -        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> > -        return NVME_LBA_RANGE | NVME_DNR;
> > +    if (req) {
> 
> I wonder in which case the aio callback will be called without req.
> Looking at the code it looks like that can't happen.
> (NvmeAIO is created by nvme_aio_new and all its callers pass not null req)

Yeah, this is preparatory for a patchset I have where an AIO can be
issued by the controller autonomously.
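
The completion flow under discussion can be sketched outside QEMU: each request tracks its outstanding AIOs, an AIO dequeues itself on completion, and the request-level completion only runs once the list drains. Below is a minimal sketch with counters standing in for the QTAILQ; `Req` and `aio_complete_sketch` are illustrative names, not the device's actual API.

```c
#include <stdbool.h>

/* Minimal sketch of the flow in nvme_aio_cb: each request keeps a list
 * of outstanding AIOs; an AIO removes itself on completion, and only
 * when the list drains does the request complete (via its callback or
 * by posting a CQE). The req == NULL branch models the autonomous AIOs
 * mentioned above, which carry no request. */

typedef struct {
    int outstanding;   /* stands in for the aio_tailq length */
    int completions;   /* times the request-level completion ran */
} Req;

static void aio_complete_sketch(Req *req)
{
    if (!req) {
        return;  /* autonomous AIO: nothing to dequeue or complete */
    }
    if (--req->outstanding == 0) {
        req->completions++;  /* nvme_enqueue_req_completion(cq, req) */
    }
}
```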

> 
> > +        QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
> >      }
> >  
> > -    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
> > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > -        return NVME_INVALID_FIELD | NVME_DNR;
> > -    }
> > -
> > -    if (req->qsg.nsg > 0) {
> > -        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
> > -            acct);
> > -
> > -        req->aiocb = is_write ?
> > -            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> > -                          nvme_rw_cb, req) :
> > -            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> > -                         nvme_rw_cb, req);
> > +    if (!ret) {
> > +        block_acct_done(stats, acct);
> >      } else {
> > -        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
> > -            acct);
> > +        block_acct_failed(stats, acct);
> >  
> > -        req->aiocb = is_write ?
> > -            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> > -                            req) :
> > -            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> > -                           req);
> > +        if (req) {
> > +            uint16_t status;
> > +
> > +            switch (aio->opc) {
> > +            case NVME_AIO_OPC_READ:
> > +                status = NVME_UNRECOVERED_READ;
> > +                break;
> > +            case NVME_AIO_OPC_WRITE:
> > +            case NVME_AIO_OPC_WRITE_ZEROES:
> > +                status = NVME_WRITE_FAULT;
> > +                break;
> > +            default:
> > +                status = NVME_INTERNAL_DEV_ERROR;
> > +                break;
> > +            }
> > +
> > +            trace_nvme_dev_err_aio(nvme_cid(req), aio, blk_name(blk),
> > +                aio->offset, nvme_aio_opc_str(aio), req, status);
> > +
> > +            error_setg_errno(&local_err, -ret, "aio failed");
> > +            error_report_err(local_err);
> > +
> > +            /*
> > +             * An Internal Error trumps all other errors. For other errors,
> > +             * only set the first error encountered. Any additional errors will
> > +             * be recorded in the error information log page.
> > +             */
> > +            if (!req->status ||
> > +                nvme_status_is_error(status, NVME_INTERNAL_DEV_ERROR)) {
> > +                req->status = status;
> > +            }
> > +        }
> > +    }
> > +
> > +    if (aio->cb) {
> > +        aio->cb(aio, aio->cb_arg, ret);
> > +    }
> > +
> > +    if (req && QTAILQ_EMPTY(&req->aio_tailq)) {
> > +        if (req->cb) {
> > +            req->cb(req, req->cb_arg);
> > +        } else {
> > +            NvmeSQueue *sq = req->sq;
> > +            NvmeCtrl *n = sq->ctrl;
> > +            NvmeCQueue *cq = n->cq[sq->cqid];
> > +
> > +            nvme_enqueue_req_completion(cq, req);
> > +        }
> >      }
> >  
> > +    nvme_aio_destroy(aio);
> > +}
> > +
> > +static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +    NvmeAIO *aio = g_new0(NvmeAIO, 1);
> > +
> > +    *aio = (NvmeAIO) {
> > +        .blk = n->conf.blk,
> > +        .req = req,
> > +    };
> > +
> > +    nvme_req_register_aio(req, aio, NVME_AIO_OPC_FLUSH);
> > +    nvme_aio(aio);
> > +
> > +    return NVME_NO_COMPLETE;
> > +}
> > +
> > +static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +    NvmeAIO *aio;
> > +
> > +    NvmeNamespace *ns = req->ns;
> > +    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> > +
> > +    int64_t offset;
> > +    size_t count;
> > +    uint16_t status;
> > +
> > +    req->slba = le64_to_cpu(rw->slba);
> > +    req->nlb  = le16_to_cpu(rw->nlb) + 1;
> > +
> > +    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
> > +        req->slba, req->nlb);
> > +
> > +    status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > +    if (unlikely(status)) {
> > +        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
> > +        return status;
> > +    }
> This refactoring should also be in a separate patch.

Done.
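
The factored-out check can be sketched as a standalone function. This is an illustrative stand-in (names and status values are assumptions, not the real nvme_check_bounds), with `nlb` already converted to a 1-based count as in the patch, and the comparison arranged so it cannot wrap:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical status codes mirroring the NVMe values used in the patch. */
#define NVME_SUCCESS   0x0000
#define NVME_LBA_RANGE 0x0080
#define NVME_DNR       0x4000

/* Sketch of the factored-out bounds check: `nlb` is a 1-based block
 * count (the command field plus one), so the out-of-range case is
 * slba + nlb > nsze. Comparing against the remaining space instead of
 * computing slba + nlb avoids 64-bit wraparound. */
static uint16_t nvme_check_bounds_sketch(uint64_t nsze, uint64_t slba,
                                         uint32_t nlb)
{
    if (slba > nsze || nlb > nsze - slba) {
        return NVME_LBA_RANGE | NVME_DNR;
    }
    return NVME_SUCCESS;
}
```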

> 
> > +
> > +    offset = req->slba << nvme_ns_lbads(ns);
> > +    count = req->nlb << nvme_ns_lbads(ns);
> > +
> > +    aio = g_new0(NvmeAIO, 1);
> > +
> > +    *aio = (NvmeAIO) {
> > +        .blk = n->conf.blk,
> > +        .offset = offset,
> > +        .len = count,
> > +        .req = req,
> > +    };
> > +
> > +    nvme_req_register_aio(req, aio, NVME_AIO_OPC_WRITE_ZEROES);
> > +    nvme_aio(aio);
> > +
> > +    return NVME_NO_COMPLETE;
> > +}
> > +
> > +static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > +{
> > +    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> > +    NvmeNamespace *ns = req->ns;
> > +    int status;
> > +
> > +    enum BlockAcctType acct =
> > +        nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> > +
> > +    req->nlb  = le16_to_cpu(rw->nlb) + 1;
> > +    req->slba = le64_to_cpu(rw->slba);
> > +
> > +    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
> > +        req->nlb << nvme_ns_lbads(req->ns), req->slba);
> > +
> > +    status = nvme_check_rw(n, req);
> > +    if (status) {
> > +        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > +        return status;
> > +    }
> > +
> > +    status = nvme_map(n, cmd, req);
> > +    if (status) {
> > +        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > +        return status;
> > +    }
> > +
> > +    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
> > +    nvme_req_set_cb(req, nvme_rw_cb, NULL);
> > +
> >      return NVME_NO_COMPLETE;
> >  }
> >  
> >  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> > -    NvmeNamespace *ns;
> >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> >  
> > +    trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
> > +        cmd->opcode);
> > +
> >      if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> >          trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> >          return NVME_INVALID_NSID | NVME_DNR;
> >      }
> >  
> > -    ns = &n->namespaces[nsid - 1];
> > +    req->ns = &n->namespaces[nsid - 1];
> > +
> >      switch (cmd->opcode) {
> >      case NVME_CMD_FLUSH:
> > -        return nvme_flush(n, ns, cmd, req);
> > +        return nvme_flush(n, cmd, req);
> >      case NVME_CMD_WRITE_ZEROS:
> > -        return nvme_write_zeros(n, ns, cmd, req);
> > +        return nvme_write_zeros(n, cmd, req);
> >      case NVME_CMD_WRITE:
> >      case NVME_CMD_READ:
> > -        return nvme_rw(n, ns, cmd, req);
> > +        return nvme_rw(n, cmd, req);
> >      default:
> >          trace_nvme_dev_err_invalid_opc(cmd->opcode);
> >          return NVME_INVALID_OPCODE | NVME_DNR;
> > @@ -612,6 +853,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> >      NvmeRequest *req, *next;
> >      NvmeSQueue *sq;
> >      NvmeCQueue *cq;
> > +    NvmeAIO *aio;
> >      uint16_t qid = le16_to_cpu(c->qid);
> >  
> >      if (unlikely(!qid || nvme_check_sqid(n, qid))) {
> > @@ -624,8 +866,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> >      sq = n->sq[qid];
> >      while (!QTAILQ_EMPTY(&sq->out_req_list)) {
> >          req = QTAILQ_FIRST(&sq->out_req_list);
> > -        assert(req->aiocb);
> > -        blk_aio_cancel(req->aiocb);
> > +        while (!QTAILQ_EMPTY(&req->aio_tailq)) {
> > +            aio = QTAILQ_FIRST(&req->aio_tailq);
> > +            assert(aio->aiocb);
> > +            blk_aio_cancel(aio->aiocb);
> > +        }
> >      }
> >      if (!nvme_check_cqid(n, sq->cqid)) {
> >          cq = n->cq[sq->cqid];
> > @@ -662,6 +907,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
> >      QTAILQ_INIT(&sq->out_req_list);
> >      for (i = 0; i < sq->size; i++) {
> >          sq->io_req[i].sq = sq;
> > +        QTAILQ_INIT(&(sq->io_req[i].aio_tailq));
> >          QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
> >      }
> >      sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
> > @@ -800,6 +1046,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      uint32_t numdl, numdu;
> >      uint64_t off, lpol, lpou;
> >      size_t   len;
> > +    uint16_t status;
> >  
> >      numdl = (dw10 >> 16);
> >      numdu = (dw11 & 0xffff);
> > @@ -815,6 +1062,11 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  
> >      trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
> >  
> > +    status = nvme_check_mdts(n, len, req);
> > +    if (status) {
> > +        return status;
> > +    }
> > +
> >      switch (lid) {
> >      case NVME_LOG_ERROR_INFO:
> >          if (!rae) {
> > @@ -1348,7 +1600,7 @@ static void nvme_process_sq(void *opaque)
> >          req = QTAILQ_FIRST(&sq->req_list);
> >          QTAILQ_REMOVE(&sq->req_list, req, entry);
> >          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> > -        memset(&req->cqe, 0, sizeof(req->cqe));
> > +
> >          req->cqe.cid = cmd.cid;
> >          memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
> >  
> > @@ -1928,6 +2180,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >      id->ieee[0] = 0x00;
> >      id->ieee[1] = 0x02;
> >      id->ieee[2] = 0xb3;
> > +    id->mdts = params->mdts;
> >      id->ver = cpu_to_le32(NVME_SPEC_VER);
> >      id->oacs = cpu_to_le16(0);
> >  
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index d27baa9d5391..3319f8edd7e1 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -8,7 +8,8 @@
> >      DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
> >      DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
> >      DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
> > -    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64)
> > +    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \
> > +    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
> >  
> >  typedef struct NvmeParams {
> >      char     *serial;
> > @@ -16,6 +17,7 @@ typedef struct NvmeParams {
> >      uint32_t cmb_size_mb;
> >      uint8_t  aerl;
> >      uint32_t aer_max_queued;
> > +    uint8_t  mdts;
> >  } NvmeParams;
> >  
> >  typedef struct NvmeAsyncEvent {
> > @@ -23,17 +25,58 @@ typedef struct NvmeAsyncEvent {
> >      NvmeAerResult result;
> >  } NvmeAsyncEvent;
> >  
> > -typedef struct NvmeRequest {
> > -    struct NvmeSQueue       *sq;
> > -    BlockAIOCB              *aiocb;
> > -    uint16_t                status;
> > -    NvmeCqe                 cqe;
> > -    BlockAcctCookie         acct;
> > -    QEMUSGList              qsg;
> > -    QEMUIOVector            iov;
> > -    NvmeCmd                 cmd;
> > -    QTAILQ_ENTRY(NvmeRequest)entry;
> > -} NvmeRequest;
> > +typedef struct NvmeRequest NvmeRequest;
> > +typedef void NvmeRequestCompletionFunc(NvmeRequest *req, void *opaque);
> > +
> > +struct NvmeRequest {
> > +    struct NvmeSQueue    *sq;
> > +    struct NvmeNamespace *ns;
> > +
> > +    NvmeCqe  cqe;
> > +    NvmeCmd  cmd;
> > +    uint16_t status;
> > +
> > +    uint64_t slba;
> > +    uint32_t nlb;
> > +
> > +    QEMUSGList   qsg;
> > +    QEMUIOVector iov;
> > +
> > +    NvmeRequestCompletionFunc *cb;
> > +    void                      *cb_arg;
> > +
> > +    QTAILQ_HEAD(, NvmeAIO)    aio_tailq;
> > +    QTAILQ_ENTRY(NvmeRequest) entry;
> > +};
> > +
> > +static inline void nvme_req_clear(NvmeRequest *req)
> > +{
> > +    req->ns = NULL;
> > +    memset(&req->cqe, 0, sizeof(req->cqe));
> > +    req->status = NVME_SUCCESS;
> > +    req->slba = req->nlb = 0x0;
> > +    req->cb = req->cb_arg = NULL;
> > +
> > +    if (req->qsg.sg) {
> > +        qemu_sglist_destroy(&req->qsg);
> > +    }
> > +
> > +    if (req->iov.iov) {
> > +        qemu_iovec_destroy(&req->iov);
> > +    }
> > +}
> > +
> > +static inline void nvme_req_set_cb(NvmeRequest *req,
> > +    NvmeRequestCompletionFunc *cb, void *cb_arg)
> > +{
> > +    req->cb = cb;
> > +    req->cb_arg = cb_arg;
> > +}
> > +
> > +static inline void nvme_req_clear_cb(NvmeRequest *req)
> > +{
> > +    req->cb = req->cb_arg = NULL;
> > +}
> >  
> >  typedef struct NvmeSQueue {
> >      struct NvmeCtrl *ctrl;
> > @@ -85,6 +128,60 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> >      return 1 << nvme_ns_lbads(ns);
> >  }
> >  
> > +typedef enum NvmeAIOOp {
> > +    NVME_AIO_OPC_NONE         = 0x0,
> > +    NVME_AIO_OPC_FLUSH        = 0x1,
> > +    NVME_AIO_OPC_READ         = 0x2,
> > +    NVME_AIO_OPC_WRITE        = 0x3,
> > +    NVME_AIO_OPC_WRITE_ZEROES = 0x4,
> > +} NvmeAIOOp;
> > +
> > +typedef struct NvmeAIO NvmeAIO;
> > +typedef void NvmeAIOCompletionFunc(NvmeAIO *aio, void *opaque, int ret);
> > +
> > +struct NvmeAIO {
> > +    NvmeRequest *req;
> > +
> > +    NvmeAIOOp       opc;
> > +    int64_t         offset;
> > +    size_t          len;
> > +    BlockBackend    *blk;
> > +    BlockAIOCB      *aiocb;
> > +    BlockAcctCookie acct;
> > +
> > +    NvmeAIOCompletionFunc *cb;
> > +    void                  *cb_arg;
> > +
> > +    QEMUSGList   *qsg;
> > +    QEMUIOVector *iov;
> > +
> > +    QTAILQ_ENTRY(NvmeAIO) tailq_entry;
> > +};
> > +
> > +static inline const char *nvme_aio_opc_str(NvmeAIO *aio)
> > +{
> > +    switch (aio->opc) {
> > +    case NVME_AIO_OPC_NONE:         return "NVME_AIO_OP_NONE";
> > +    case NVME_AIO_OPC_FLUSH:        return "NVME_AIO_OP_FLUSH";
> > +    case NVME_AIO_OPC_READ:         return "NVME_AIO_OP_READ";
> > +    case NVME_AIO_OPC_WRITE:        return "NVME_AIO_OP_WRITE";
> > +    case NVME_AIO_OPC_WRITE_ZEROES: return "NVME_AIO_OP_WRITE_ZEROES";
> > +    default:                        return "NVME_AIO_OP_UNKNOWN";
> > +    }
> > +}
> > +
> > +static inline bool nvme_req_is_write(NvmeRequest *req)
> > +{
> > +    switch (req->cmd.opcode) {
> > +    case NVME_CMD_WRITE:
> > +    case NVME_CMD_WRITE_UNCOR:
> > +    case NVME_CMD_WRITE_ZEROS:
> > +        return true;
> > +    default:
> > +        return false;
> > +    }
> > +}
> > +
> >  #define TYPE_NVME "nvme"
> >  #define NVME(obj) \
> >          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > @@ -139,10 +236,21 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> >  static inline uint16_t nvme_cid(NvmeRequest *req)
> >  {
> >      if (req) {
> > -        return le16_to_cpu(req->cqe.cid);
> > +        return le16_to_cpu(req->cmd.cid);
> >      }
> >  
> >      return 0xffff;
> >  }
> >  
> > +static inline bool nvme_status_is_error(uint16_t status, uint16_t err)
> > +{
> > +    /* strip DNR and MORE */
> > +    return (status & 0xfff) == err;
> > +}
> > +
> > +static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
> > +{
> > +    return req->sq->ctrl;
> > +}
> > +
> >  #endif /* HW_NVME_H */
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index 77aa0da99ee0..90a57fb6099a 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -34,7 +34,12 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
> >  nvme_dev_irq_masked(void) "IRQ is masked"
> >  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> >  nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> > +nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p"
> > +nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> > +nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> >  nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> > +nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
> > +nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
> >  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
> >  nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> >  nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
> > @@ -75,6 +80,9 @@ nvme_dev_mmio_shutdown_set(void) "shutdown bit set"
> >  nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
> >  
> >  # nvme traces for error conditions
> > +nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
> > +nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> > +nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
> >  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> >  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> 
> 
> 
> The patch is large, I tried my best to spot issues, but I might have missed some.
> Please split it as I pointed out.

Done!

> Overall I do like most of the changes.
> 
> Best regards,
> 	Maxim Levitsky
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 20/26] nvme: handle dma errors
  2020-02-12 11:52       ` Maxim Levitsky
@ 2020-03-16  7:53         ` Klaus Birkelund Jensen
  2020-03-25 10:23           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:53 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 13:52, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > Handling DMA errors gracefully is required for the device to pass the
> > block/011 test ("disable PCI device while doing I/O") in the blktests
> > suite.
> > 
> > With this patch the device passes the test by retrying "critical"
> > transfers (posting of completion entries and processing of submission
> > queue entries).
> > 
> > If DMA errors occur at any other point in the execution of the command
> > (say, while mapping the PRPs), the command is aborted with a Data
> > Transfer Error status code.
> > 
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/nvme.c       | 42 +++++++++++++++++++++++++++++++++---------
> >  hw/block/trace-events |  2 ++
> >  include/block/nvme.h  |  2 +-
> >  3 files changed, 36 insertions(+), 10 deletions(-)
> > 
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index f8c81b9e2202..204ae1d33234 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -73,14 +73,14 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> >      return addr >= low && addr < hi;
> >  }
> >  
> > -static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > +static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> >      if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> >          memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > -        return;
> > +        return 0;
> >      }
> >  
> > -    pci_dma_read(&n->parent_obj, addr, buf, size);
> > +    return pci_dma_read(&n->parent_obj, addr, buf, size);
> >  }
> >  
> >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> > @@ -168,6 +168,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >      uint16_t status = NVME_SUCCESS;
> >      bool is_cmb = false;
> >      bool prp_list_in_cmb = false;
> > +    int ret;
> >  
> >      trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> >          prp1, prp2, num_prps);
> > @@ -218,7 +219,12 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >  
> >              nents = (len + n->page_size - 1) >> n->page_bits;
> >              prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > +            ret = nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > +            if (ret) {
> > +                trace_nvme_dev_err_addr_read(prp2);
> > +                status = NVME_DATA_TRANSFER_ERROR;
> > +                goto unmap;
> > +            }
> >              while (len != 0) {
> >                  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> >  
> > @@ -237,7 +243,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >                      i = 0;
> >                      nents = (len + n->page_size - 1) >> n->page_bits;
> >                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > -                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
> > +                    ret = nvme_addr_read(n, prp_ent, (void *) prp_list,
> > +                        prp_trans);
> > +                    if (ret) {
> > +                        trace_nvme_dev_err_addr_read(prp_ent);
> > +                        status = NVME_DATA_TRANSFER_ERROR;
> > +                        goto unmap;
> > +                    }
> >                      prp_ent = le64_to_cpu(prp_list[i]);
> >                  }
> >  
> > @@ -443,6 +455,7 @@ static void nvme_post_cqes(void *opaque)
> >      NvmeCQueue *cq = opaque;
> >      NvmeCtrl *n = cq->ctrl;
> >      NvmeRequest *req, *next;
> > +    int ret;
> >  
> >      QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
> >          NvmeSQueue *sq;
> > @@ -452,15 +465,21 @@ static void nvme_post_cqes(void *opaque)
> >              break;
> >          }
> >  
> > -        QTAILQ_REMOVE(&cq->req_list, req, entry);
> >          sq = req->sq;
> >          req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
> >          req->cqe.sq_id = cpu_to_le16(sq->sqid);
> >          req->cqe.sq_head = cpu_to_le16(sq->head);
> >          addr = cq->dma_addr + cq->tail * n->cqe_size;
> > -        nvme_inc_cq_tail(cq);
> > -        pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> > +        ret = pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> >              sizeof(req->cqe));
> > +        if (ret) {
> > +            trace_nvme_dev_err_addr_write(addr);
> > +            timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +                100 * SCALE_MS);
> > +            break;
> > +        }
> > +        QTAILQ_REMOVE(&cq->req_list, req, entry);
> > +        nvme_inc_cq_tail(cq);
> >          nvme_req_clear(req);
> >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> >      }
> > @@ -1588,7 +1607,12 @@ static void nvme_process_sq(void *opaque)
> >  
> >      while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
> >          addr = sq->dma_addr + sq->head * n->sqe_size;
> > -        nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
> > +        if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> > +            trace_nvme_dev_err_addr_read(addr);
> > +            timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > +                100 * SCALE_MS);
> > +            break;
> > +        }
> 
> Note that once the driver is optimized for performance, these timers must go,
> since they run on the main thread and add latency to each request.
> But for now this change is all right.
> 
> As for a user triggering this every 100ms on purpose, I don't think that is
> such a big issue. Maybe up it to 500ms or even one second, since this
> condition will not occur in real-life usage of the device anyway.
> 

I bumped it to 500ms.
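
The retry strategy being discussed can be sketched in isolation: on a failed DMA write of a completion entry, the request stays queued and the timer is re-armed rather than the entry being dropped. In this sketch the QEMU timer is modeled as a simple retry loop; `dma_write_fn` and `post_cqe_with_retry` are illustrative stand-ins for pci_dma_write and nvme_post_cqes, not QEMU APIs.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int (*dma_write_fn)(void);

/* Returns the number of attempts it took to post the entry, or -1 if
 * the retry budget was exhausted. In the real device there is no
 * budget; the timer simply fires again until the OS re-enables bus
 * mastering or resets the controller. */
static int post_cqe_with_retry(dma_write_fn dma_write, int max_retries)
{
    for (int attempt = 1; attempt <= max_retries; attempt++) {
        if (dma_write() == 0) {
            return attempt;       /* posted; dequeue the request */
        }
        /* In the device model this is timer_mod(cq->timer, now + 500ms);
         * here we just loop to the next attempt. */
    }
    return -1;
}

/* Helper used for testing: fails a fixed number of times, then succeeds. */
static int failures_left;
static int flaky_dma_write(void)
{
    return failures_left-- > 0 ? -1 : 0;
}
```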

> >          nvme_inc_sq_head(sq);
> >  
> >          req = QTAILQ_FIRST(&sq->req_list);
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index 90a57fb6099a..09bfb3782dd0 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -83,6 +83,8 @@ nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
> >  nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
> >  nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> >  nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
> > +nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
> > +nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
> >  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> >  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index c1de92179596..a873776d98b8 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -418,7 +418,7 @@ enum NvmeStatusCodes {
> >      NVME_INVALID_OPCODE         = 0x0001,
> >      NVME_INVALID_FIELD          = 0x0002,
> >      NVME_CID_CONFLICT           = 0x0003,
> > -    NVME_DATA_TRAS_ERROR        = 0x0004,
> > +    NVME_DATA_TRANSFER_ERROR    = 0x0004,
> >      NVME_POWER_LOSS_ABORT       = 0x0005,
> >      NVME_INTERNAL_DEV_ERROR     = 0x0006,
> >      NVME_CMD_ABORT_REQ          = 0x0007,
> 
> 
> Best regards,
> 	Maxim Levitsky
> 



* Re: [PATCH v5 21/26] nvme: add support for scatter gather lists
  2020-02-12 12:07       ` Maxim Levitsky
@ 2020-03-16  7:54         ` Klaus Birkelund Jensen
  2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:54 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 14:07, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > For now, support the Data Block, Segment and Last Segment descriptor
> > types.
> > 
> > See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > Acked-by: Fam Zheng <fam@euphon.net>
> > ---
> >  block/nvme.c          |  18 +-
> >  hw/block/nvme.c       | 375 +++++++++++++++++++++++++++++++++++-------
> >  hw/block/trace-events |   4 +
> >  include/block/nvme.h  |  62 ++++++-
> >  4 files changed, 389 insertions(+), 70 deletions(-)
> > 
> > diff --git a/block/nvme.c b/block/nvme.c
> > index d41c4bda6e39..521f521054d5 100644
> > --- a/block/nvme.c
> > +++ b/block/nvme.c
> > @@ -446,7 +446,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
> >          error_setg(errp, "Cannot map buffer for DMA");
> >          goto out;
> >      }
> > -    cmd.prp1 = cpu_to_le64(iova);
> > +    cmd.dptr.prp.prp1 = cpu_to_le64(iova);
> >  
> >      if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> >          error_setg(errp, "Failed to identify controller");
> > @@ -545,7 +545,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> >      }
> >      cmd = (NvmeCmd) {
> >          .opcode = NVME_ADM_CMD_CREATE_CQ,
> > -        .prp1 = cpu_to_le64(q->cq.iova),
> > +        .dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
> >          .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> >          .cdw11 = cpu_to_le32(0x3),
> >      };
> > @@ -556,7 +556,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> >      }
> >      cmd = (NvmeCmd) {
> >          .opcode = NVME_ADM_CMD_CREATE_SQ,
> > -        .prp1 = cpu_to_le64(q->sq.iova),
> > +        .dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
> >          .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> >          .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> >      };
> > @@ -906,16 +906,16 @@ try_map:
> >      case 0:
> >          abort();
> >      case 1:
> > -        cmd->prp1 = pagelist[0];
> > -        cmd->prp2 = 0;
> > +        cmd->dptr.prp.prp1 = pagelist[0];
> > +        cmd->dptr.prp.prp2 = 0;
> >          break;
> >      case 2:
> > -        cmd->prp1 = pagelist[0];
> > -        cmd->prp2 = pagelist[1];
> > +        cmd->dptr.prp.prp1 = pagelist[0];
> > +        cmd->dptr.prp.prp2 = pagelist[1];
> >          break;
> >      default:
> > -        cmd->prp1 = pagelist[0];
> > -        cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> > +        cmd->dptr.prp.prp1 = pagelist[0];
> > +        cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> >          break;
> >      }
> >      trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index 204ae1d33234..a91c60fdc111 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -75,8 +75,10 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> >  
> >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >  {
> > -    if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > -        memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > +    hwaddr hi = addr + size;
> Are you sure you don't want to check for overflow here?
> It's a theoretical issue, since addr would have to be almost the full
> 64 bits, but I still check things like this very defensively.
> 

The use of nvme_addr_read in map_prp simply cannot overflow due to how
the size is calculated, but for SGLs it's different; there, the overflow
is checked in map_sgl because we have to return a special error code in
that case.

On the other hand, there may be other callers of nvme_addr_read in the
future that do not check this, so I'll re-add it.
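
The overflow hazard being discussed can be illustrated with a standalone range check: computing `addr + size` first can wrap for addresses near the top of the 64-bit space, making the naive `addr >= low && hi <= high` test pass spuriously. Comparing the length against the remaining space avoids the wrap. This is a sketch; `cmb_base`/`cmb_size` are assumed stand-ins for n->ctrl_mem.addr and the CMB size, not QEMU fields.

```c
#include <stdbool.h>
#include <stdint.h>

/* Overflow-safe "does [addr, addr + len) fall inside the CMB?" check.
 * The subtraction addr - cmb_base is only evaluated once addr is known
 * to be >= cmb_base, and len is compared against the space remaining
 * in the window, so no intermediate sum can wrap. */
static bool addr_range_in_cmb(uint64_t cmb_base, uint64_t cmb_size,
                              uint64_t addr, uint64_t len)
{
    if (addr < cmb_base || addr - cmb_base >= cmb_size) {
        return false;
    }
    return len <= cmb_size - (addr - cmb_base);
}
```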

> > +
> > +    if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
> Here you fix the bug I mentioned in patch 6. I suggest you to move the fix there.

Done.

> > +        memcpy(buf, nvme_addr_to_cmb(n, addr), size);
> >          return 0;
> >      }
> >  
> > @@ -159,6 +161,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
> >      }
> >  }
> >  
> > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
> > +    size_t len)
> > +{
> > +    if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > +        return NVME_DATA_TRANSFER_ERROR;
> > +    }
> > +
> > +    qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> > +
> > +    return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > +    hwaddr addr, size_t len)
> > +{
> > +    bool addr_is_cmb = nvme_addr_is_cmb(n, addr);
> > +
> > +    if (addr_is_cmb) {
> > +        if (qsg->sg) {
> > +            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +        }
> > +
> > +        if (!iov->iov) {
> > +            qemu_iovec_init(iov, 1);
> > +        }
> > +
> > +        return nvme_map_addr_cmb(n, iov, addr, len);
> > +    }
> > +
> > +    if (iov->iov) {
> > +        return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +    }
> > +
> > +    if (!qsg->sg) {
> > +        pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > +    }
> > +
> > +    qemu_sglist_add(qsg, addr, len);
> > +
> > +    return NVME_SUCCESS;
> > +}
> 
> Very good refactoring. I would also suggest you to move this to a separate
> patch. I always put refactoring first and then patches that add features.
> 

Done.

> > +
> >  static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> >      uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
> >  {
> > @@ -307,15 +351,240 @@ unmap:
> >      return status;
> >  }
> >  
> > -static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > -    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
> > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > +    QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
> > +    uint32_t *len, NvmeRequest *req)
> > +{
> > +    dma_addr_t addr, trans_len;
> > +    uint32_t length;
> > +    uint16_t status;
> > +
> > +    for (int i = 0; i < nsgld; i++) {
> > +        uint8_t type = NVME_SGL_TYPE(segment[i].type);
> > +
> > +        if (type != NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > +            switch (type) {
> > +            case NVME_SGL_DESCR_TYPE_BIT_BUCKET:
> > +            case NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK:
> > +                return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> > +            default:
> > +                break;
> > +            }
> > +
> > +            return NVME_INVALID_NUM_SGL_DESCRIPTORS | NVME_DNR;
> Since the only way to reach the above statement is by that 'default'
> why not to move it there?

True. Fixed!

> > +        }
> > +
> > +        if (*len == 0) {
> > +            if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> > +                trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > +                return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +            }
> > +
> > +            break;
> > +        }
> > +
> > +        addr = le64_to_cpu(segment[i].addr);
> > +        length = le32_to_cpu(segment[i].len);
> > +
> > +        if (!length) {
> > +            continue;
> > +        }
> > +
> > +        if (UINT64_MAX - addr < length) {
> > +            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +        }
> > +
> > +        trans_len = MIN(*len, length);
> > +
> > +        status = nvme_map_addr(n, qsg, iov, addr, trans_len);
> > +        if (status) {
> > +            return status;
> > +        }
> > +
> > +        *len -= trans_len;
> > +    }
> > +
> > +    return NVME_SUCCESS;
> > +}
> > +
> > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > +    NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> Minor nitpick: 
> Usually structs are passed by reference (that is pointer in C), 
> however I see that you change 'sgl' it in the function.
> IMHO this is a bit hard to read, I usually prefer not to change input parameters.
> 

Uhm, please help me, where am I changing it? That is unintentional, I
think.

I *think* I prefer passing it by value, just because it fits nicely with
how other fields of the command are passed in other places. We are
"copying" the same amount of data as with PRPs (2x64 bits vs. 1x128
bits).
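
The two layouts are indeed the same size, which can be verified with a
standalone sketch mirroring the struct definitions added by the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the definitions added to include/block/nvme.h. */
typedef struct NvmeSglDescriptor {
    uint64_t addr;
    uint32_t len;
    uint8_t  rsvd[3];
    uint8_t  type;
} NvmeSglDescriptor;

typedef union NvmeCmdDptr {
    struct {
        uint64_t prp1;
        uint64_t prp2;
    } prp;

    NvmeSglDescriptor sgl;
} NvmeCmdDptr;
```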

> > +{
> > +    const int MAX_NSGLD = 256;
> 
> I personally would rename that const to something like SG_CHUNK_SIZE and add a comment, since
> it is just an arbitrary chunk size you use to avoid dynamic memory allocation,
> that is so we can avoid confusion vs the spec.

Good point. Done.

> 
> > +
> > +    NvmeSglDescriptor segment[MAX_NSGLD], *sgld, *last_sgld;
> > +    uint64_t nsgld;
> > +    uint32_t length;
> > +    uint16_t status;
> > +    bool sgl_in_cmb = false;
> > +    hwaddr addr;
> > +    int ret;
> > +
> > +    sgld = &sgl;
> > +    addr = le64_to_cpu(sgl.addr);
> > +
> > +    trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), req->nlb,
> > +        len);
> > +
> > +    /*
> > +     * If the entire transfer can be described with a single data block it can
> > +     * be mapped directly.
> > +     */
> > +    if (NVME_SGL_TYPE(sgl.type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > +        status = nvme_map_sgl_data(n, qsg, iov, sgld, 1, &len, req);
> > +        if (status) {
> > +            goto unmap;
> > +        }
> > +
> > +        goto out;
> > +    }
> > +
> > +    /*
> > +     * If the segment is located in the CMB, the submission queue of the
> > +     * request must also reside there.
> > +     */
> > +    if (nvme_addr_is_cmb(n, addr)) {
> > +        if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > +            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +        }
> > +
> > +        sgl_in_cmb = true;
> > +    }
> > +
> > +    for (;;) {
> > +        length = le32_to_cpu(sgld->len);
> > +
> > +        if (!length || length & 0xf) {
> > +            return NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
> > +        }
> > +
> > +        if (UINT64_MAX - addr < length) {
> I assume you check for overflow here. Looks like very nice way to do it.
> This should be adopted in few more places
> > +            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +        }
> > +
> > +        nsgld = length / sizeof(NvmeSglDescriptor);
> > +
> > +        /* read the segment in chunks of 256 descriptors (4k) */
> That comment is perfect to move/copy to definition of MAX_NSGLD

Done.
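
The effect of the chunked read can be illustrated with a small
standalone sketch of the loop structure (only the chunk accounting, none
of the DMA or mapping logic):

```c
#include <assert.h>
#include <stdint.h>

enum { SG_CHUNK_SIZE = 256 };   /* 256 descriptors * 16 bytes = 4 KiB */

/* Count the buffer reads needed to walk a segment of `nsgld`
 * descriptors, mirroring the loop structure in nvme_map_sgl: full
 * chunks first, then one read for the trailing partial (or exactly
 * full) chunk. */
static int segment_reads(uint64_t nsgld)
{
    int reads = 0;

    while (nsgld > SG_CHUNK_SIZE) {
        reads++;
        nsgld -= SG_CHUNK_SIZE;
    }

    return reads + 1;
}
```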

> 
> > +        while (nsgld > MAX_NSGLD) {
> > +            if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
> > +                trace_nvme_dev_err_addr_read(addr);
> > +                status = NVME_DATA_TRANSFER_ERROR;
> > +                goto unmap;
> > +            }
> > +
> > +            status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, &len,
> > +                req);
> > +            if (status) {
> > +                goto unmap;
> > +            }
> > +
> > +            nsgld -= MAX_NSGLD;
> > +            addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
> > +        }
> > +
> > +        ret = nvme_addr_read(n, addr, segment, nsgld *
> > +            sizeof(NvmeSglDescriptor));
> Reminding you to fix the line split issues. (align the sizeof on '(')

Done.

> 
> > +        if (ret) {
> > +            trace_nvme_dev_err_addr_read(addr);
> > +            status = NVME_DATA_TRANSFER_ERROR;
> > +            goto unmap;
> > +        }
> > +
> > +        last_sgld = &segment[nsgld - 1];
> > +
> > +        /* if the segment ends with a Data Block, then we are done */
> > +        if (NVME_SGL_TYPE(last_sgld->type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > +            status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld, &len, req);
> > +            if (status) {
> > +                goto unmap;
> > +            }
> > +
> > +            break;
> > +        }
> > +
> > +        /* a Last Segment must end with a Data Block descriptor */
> > +        if (NVME_SGL_TYPE(sgld->type) == NVME_SGL_DESCR_TYPE_LAST_SEGMENT) {
> > +            status = NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
> > +            goto unmap;
> > +        }
> > +
> > +        sgld = last_sgld;
> > +        addr = le64_to_cpu(sgld->addr);
> > +
> > +        /*
> > +         * Do not map the last descriptor; it will be a Segment or Last Segment
> > +         * descriptor instead and handled by the next iteration.
> > +         */
> > +        status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld - 1, &len, req);
> > +        if (status) {
> > +            goto unmap;
> > +        }
> > +
> > +        /*
> > +         * If the next segment is in the CMB, make sure that the sgl was
> > +         * already located there.
> > +         */
> > +        if (sgl_in_cmb != nvme_addr_is_cmb(n, addr)) {
> > +            status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > +            goto unmap;
> > +        }
> > +    }
> > +
> > +out:
> > +    /* if there is any residual left in len, the SGL was too short */
> > +    if (len) {
> > +        status = NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > +        goto unmap;
> > +    }
> > +
> > +    return NVME_SUCCESS;
> > +
> > +unmap:
> > +    if (iov->iov) {
> > +        qemu_iovec_destroy(iov);
> > +    }
> > +
> > +    if (qsg->sg) {
> > +        qemu_sglist_destroy(qsg);
> > +    }
> > +
> > +    return status;
> > +}
> Looks good, much better than in V4
> 
> 
> > +
> > +static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > +    NvmeCmd *cmd, DMADirection dir, NvmeRequest *req)
> >  {
> >      uint16_t status = NVME_SUCCESS;
> >      size_t bytes;
> >  
> > -    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > -    if (status) {
> > -        return status;
> > +    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> > +    case PSDT_PRP:
> > +        status = nvme_map_prp(n, &req->qsg, &req->iov,
> > +            le64_to_cpu(cmd->dptr.prp.prp1), le64_to_cpu(cmd->dptr.prp.prp2),
> > +            len, req);
> > +        if (status) {
> > +            return status;
> > +        }
> > +
> > +        break;
> > +
> > +    case PSDT_SGL_MPTR_CONTIGUOUS:
> > +    case PSDT_SGL_MPTR_SGL:
> > +        if (!req->sq->sqid) {
> > +            /* SGLs shall not be used for Admin commands in NVMe over PCIe */
> > +            return NVME_INVALID_FIELD;
> > +        }
> > +
> > +        status = nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len,
> > +            req);
> > +        if (status) {
> > +            return status;
> > +        }
> Minor nitpick: you can probably refactor this to an 'err' label in the end of function.

This has been refactored in another patch.

> > +
> > +        break;
> > +
> > +    default:
> > +        return NVME_INVALID_FIELD;
> >      }
> 
> 
> >  
> >      if (req->qsg.nsg > 0) {
> > @@ -351,13 +620,21 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> >  
> >  static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> > -    NvmeNamespace *ns = req->ns;
> > +    uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
> > +    uint64_t prp1, prp2;
> >  
> > -    uint32_t len = req->nlb << nvme_ns_lbads(ns);
> > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > +    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> > +    case PSDT_PRP:
> > +        prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
> > +        prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
> >  
> > -    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > +        return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > +    case PSDT_SGL_MPTR_CONTIGUOUS:
> > +    case PSDT_SGL_MPTR_SGL:
> > +        return nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len, req);
> > +    default:
> > +        return NVME_INVALID_FIELD;
> > +    }
> >  }
> >  
> >  static void nvme_aio_destroy(NvmeAIO *aio)
> > @@ -972,8 +1249,6 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> >  static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> >      uint32_t buf_len, uint64_t off, NvmeRequest *req)
> >  {
> > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> >  
> >      uint32_t trans_len;
> > @@ -1023,16 +1298,14 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> >          nvme_clear_events(n, NVME_AER_TYPE_SMART);
> >      }
> >  
> > -    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > -        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > +    return nvme_dma(n, (uint8_t *) &smart + off, trans_len, cmd,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> >      uint64_t off, NvmeRequest *req)
> >  {
> >      uint32_t trans_len;
> > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> >      NvmeFwSlotInfoLog fw_log;
> >  
> >      if (off > sizeof(fw_log)) {
> > @@ -1043,8 +1316,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> >  
> >      trans_len = MIN(sizeof(fw_log) - off, buf_len);
> >  
> > -    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > -        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > +    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len, cmd,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > @@ -1194,25 +1467,18 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> >      return NVME_SUCCESS;
> >  }
> >  
> > -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
> > -    NvmeRequest *req)
> > +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > -
> >      trace_nvme_dev_identify_ctrl();
> >  
> > -    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> > -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > +    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl), cmd,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> > -    NvmeRequest *req)
> > +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> >      NvmeNamespace *ns;
> > -    uint32_t nsid = le32_to_cpu(c->nsid);
> > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> >  
> >      trace_nvme_dev_identify_ns(nsid);
> >  
> > @@ -1223,17 +1489,15 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> >  
> >      ns = &n->namespaces[nsid - 1];
> >  
> > -    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> > -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > +    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> > +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
> >      NvmeRequest *req)
> >  {
> >      static const int data_len = 4 * KiB;
> > -    uint32_t min_nsid = le32_to_cpu(c->nsid);
> > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > +    uint32_t min_nsid = le32_to_cpu(cmd->nsid);
> >      uint32_t *list;
> >      uint16_t ret;
> >      int i, j = 0;
> > @@ -1250,13 +1514,13 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> >              break;
> >          }
> >      }
> > -    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
> > +    ret = nvme_dma(n, (uint8_t *) list, data_len, cmd,
> >          DMA_DIRECTION_FROM_DEVICE, req);
> >      g_free(list);
> >      return ret;
> >  }
> >  
> > -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
> >      NvmeRequest *req)
> >  {
> >      static const int len = 4096;
> > @@ -1268,9 +1532,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> >          uint8_t nid[16];
> >      };
> >  
> > -    uint32_t nsid = le32_to_cpu(c->nsid);
> > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> >  
> >      struct ns_descr *list;
> >      uint16_t ret;
> > @@ -1293,8 +1555,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> >      list->nidl = 0x10;
> >      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> >  
> > -    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
> > -        DMA_DIRECTION_FROM_DEVICE, req);
> > +    ret = nvme_dma(n, (uint8_t *) list, len, cmd, DMA_DIRECTION_FROM_DEVICE,
> > +        req);
> >      g_free(list);
> >      return ret;
> >  }
> > @@ -1305,13 +1567,13 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  
> >      switch (le32_to_cpu(c->cns)) {
> >      case 0x00:
> > -        return nvme_identify_ns(n, c, req);
> > +        return nvme_identify_ns(n, cmd, req);
> >      case 0x01:
> > -        return nvme_identify_ctrl(n, c, req);
> > +        return nvme_identify_ctrl(n, cmd, req);
> >      case 0x02:
> > -        return nvme_identify_ns_list(n, c, req);
> > +        return nvme_identify_ns_list(n, cmd, req);
> >      case 0x03:
> > -        return nvme_identify_ns_descr_list(n, c, req);
> > +        return nvme_identify_ns_descr_list(n, cmd, req);
> >      default:
> >          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1373,13 +1635,10 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> >  static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> >      NvmeRequest *req)
> >  {
> > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > -
> >      uint64_t timestamp = nvme_get_timestamp(n);
> >  
> > -    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
> > -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > +    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp), cmd,
> > +        DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > @@ -1462,11 +1721,9 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> >  {
> >      uint16_t ret;
> >      uint64_t timestamp;
> > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> >  
> > -    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
> > -        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
> > +    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp), cmd,
> > +        DMA_DIRECTION_TO_DEVICE, req);
> >      if (ret != NVME_SUCCESS) {
> >          return ret;
> >      }
> > @@ -2232,6 +2489,8 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >          id->vwc = 1;
> >      }
> >  
> > +    id->sgls = cpu_to_le32(0x1);
> Being part of the spec, it would be nice to #define this as well.

Done.
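
Something along these lines (the name is hypothetical; the define
actually added to the patch may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical name for illustration. Per NVMe 1.3d, setting bits 1:0
 * of the Identify Controller SGLS field to 01b indicates that SGLs are
 * supported without alignment or granularity requirements. */
#define NVME_CTRL_SGLS_SUPPORT_NO_ALIGNMENT (0x1 << 0)
```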

> > +
> >      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> >      pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
> >  
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index 09bfb3782dd0..81d69e15fc32 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -34,6 +34,7 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
> >  nvme_dev_irq_masked(void) "IRQ is masked"
> >  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> >  nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1
> > 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> > +nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"PRIu16" type 0x%"PRIx8" nlb %"PRIu32" len %"PRIu64""
> >  nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count
> > %"PRIu64" opc \"%s\" req %p"
> >  nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> >  nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> > @@ -85,6 +86,9 @@ nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> >  nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p
> > status 0x%"PRIx16""
> >  nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
> >  nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
> > +nvme_dev_err_invalid_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
> > +nvme_dev_err_invalid_num_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
> > +nvme_dev_err_invalid_sgl_excess_length(uint16_t cid) "cid %"PRIu16""
> >  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> >  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > index a873776d98b8..dbdeecf82358 100644
> > --- a/include/block/nvme.h
> > +++ b/include/block/nvme.h
> > @@ -205,15 +205,53 @@ enum NvmeCmbszMask {
> >  #define NVME_CMBSZ_GETSIZE(cmbsz) \
> >      (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz))))
> >  
> > +enum NvmeSglDescriptorType {
> > +    NVME_SGL_DESCR_TYPE_DATA_BLOCK           = 0x0,
> > +    NVME_SGL_DESCR_TYPE_BIT_BUCKET           = 0x1,
> > +    NVME_SGL_DESCR_TYPE_SEGMENT              = 0x2,
> > +    NVME_SGL_DESCR_TYPE_LAST_SEGMENT         = 0x3,
> > +    NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK     = 0x4,
> > +
> > +    NVME_SGL_DESCR_TYPE_VENDOR_SPECIFIC      = 0xf,
> > +};
> > +
> > +enum NvmeSglDescriptorSubtype {
> > +    NVME_SGL_DESCR_SUBTYPE_ADDRESS = 0x0,
> > +};
> > +
> > +typedef struct NvmeSglDescriptor {
> > +    uint64_t addr;
> > +    uint32_t len;
> > +    uint8_t  rsvd[3];
> > +    uint8_t  type;
> > +} NvmeSglDescriptor;
> 
> I suggest you add a build time struct size check for this,
> just in case compiler tries something funny.
> (look at _nvme_check_size, at nvme.h)
> 

Done.

> Also I think that the spec update change that adds the NvmeSglDescriptor
> should be split into separate patch (or better be added in one big patch that adds all 1.3d features), 
> which would make it also easier to see changes that touch the other nvme driver we have.
> 

Done.

> > +
> > +#define NVME_SGL_TYPE(type)     ((type >> 4) & 0xf)
> > +#define NVME_SGL_SUBTYPE(type)  (type & 0xf)
> > +
> > +typedef union NvmeCmdDptr {
> > +    struct {
> > +        uint64_t    prp1;
> > +        uint64_t    prp2;
> > +    } prp;
> > +
> > +    NvmeSglDescriptor sgl;
> > +} NvmeCmdDptr;
> > +
> > +enum NvmePsdt {
> > +    PSDT_PRP                 = 0x0,
> > +    PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
> > +    PSDT_SGL_MPTR_SGL        = 0x2,
> > +};
> > +
> >  typedef struct NvmeCmd {
> >      uint8_t     opcode;
> > -    uint8_t     fuse;
> > +    uint8_t     flags;
> >      uint16_t    cid;
> >      uint32_t    nsid;
> >      uint64_t    res1;
> >      uint64_t    mptr;
> > -    uint64_t    prp1;
> > -    uint64_t    prp2;
> > +    NvmeCmdDptr dptr;
> >      uint32_t    cdw10;
> >      uint32_t    cdw11;
> >      uint32_t    cdw12;
> > @@ -222,6 +260,9 @@ typedef struct NvmeCmd {
> >      uint32_t    cdw15;
> >  } NvmeCmd;
> >  
> > +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> > +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> > +
> >  enum NvmeAdminCommands {
> >      NVME_ADM_CMD_DELETE_SQ      = 0x00,
> >      NVME_ADM_CMD_CREATE_SQ      = 0x01,
> > @@ -427,6 +468,11 @@ enum NvmeStatusCodes {
> >      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
> >      NVME_INVALID_NSID           = 0x000b,
> >      NVME_CMD_SEQ_ERROR          = 0x000c,
> > +    NVME_INVALID_SGL_SEG_DESCRIPTOR  = 0x000d,
> > +    NVME_INVALID_NUM_SGL_DESCRIPTORS = 0x000e,
> > +    NVME_DATA_SGL_LENGTH_INVALID     = 0x000f,
> > +    NVME_METADATA_SGL_LENGTH_INVALID = 0x0010,
> > +    NVME_SGL_DESCRIPTOR_TYPE_INVALID = 0x0011,
> >      NVME_INVALID_USE_OF_CMB     = 0x0012,
> >      NVME_LBA_RANGE              = 0x0080,
> >      NVME_CAP_EXCEEDED           = 0x0081,
> > @@ -623,6 +669,16 @@ enum NvmeIdCtrlOncs {
> >  #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
> >  #define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
> >  
> > +#define NVME_CTRL_SGLS_SUPPORTED(sgls)                 ((sgls) & 0x3)
> > +#define NVME_CTRL_SGLS_SUPPORTED_NO_ALIGNMENT(sgls)    ((sgls) & (0x1 <<  0))
> > +#define NVME_CTRL_SGLS_SUPPORTED_DWORD_ALIGNMENT(sgls) ((sgls) & (0x1 <<  1))
> > +#define NVME_CTRL_SGLS_KEYED(sgls)                     ((sgls) & (0x1 <<  2))
> > +#define NVME_CTRL_SGLS_BITBUCKET(sgls)                 ((sgls) & (0x1 << 16))
> > +#define NVME_CTRL_SGLS_MPTR_CONTIGUOUS(sgls)           ((sgls) & (0x1 << 17))
> > +#define NVME_CTRL_SGLS_EXCESS_LENGTH(sgls)             ((sgls) & (0x1 << 18))
> > +#define NVME_CTRL_SGLS_MPTR_SGL(sgls)                  ((sgls) & (0x1 << 19))
> > +#define NVME_CTRL_SGLS_ADDR_OFFSET(sgls)               ((sgls) & (0x1 << 20))
> > +
> >  typedef struct NvmeFeatureVal {
> >      uint32_t    arbitration;
> >      uint32_t    power_mgmt;
> 
> Best regards,
> 	Maxim Levitsky
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 22/26] nvme: support multiple namespaces
  2020-02-12 12:34       ` Maxim Levitsky
@ 2020-03-16  7:55         ` Klaus Birkelund Jensen
  2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 1 reply; 86+ messages in thread
From: Klaus Birkelund Jensen @ 2020-03-16  7:55 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Feb 12 14:34, Maxim Levitsky wrote:
> On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > device model. The nvme device creates a bus named from the device name
> > ('id'). The nvme-ns devices then connect to this and registers
> > themselves with the nvme device.
> > 
> > This changes how an nvme device is created. Example with two namespaces:
> > 
> >   -drive file=nvme0n1.img,if=none,id=disk1
> >   -drive file=nvme0n2.img,if=none,id=disk2
> >   -device nvme,serial=deadbeef,id=nvme0
> >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > 
> > The drive property is kept on the nvme device to keep the change
> > backward compatible, but the property is now optional. Specifying a
> > drive for the nvme device will always create the namespace with nsid 1.
> Very reasonable way to do it. 
> > 
> > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > ---
> >  hw/block/Makefile.objs |   2 +-
> >  hw/block/nvme-ns.c     | 158 +++++++++++++++++++++++++++
> >  hw/block/nvme-ns.h     |  60 +++++++++++
> >  hw/block/nvme.c        | 235 +++++++++++++++++++++++++----------------
> >  hw/block/nvme.h        |  47 ++++-----
> >  hw/block/trace-events  |   6 +-
> >  6 files changed, 389 insertions(+), 119 deletions(-)
> >  create mode 100644 hw/block/nvme-ns.c
> >  create mode 100644 hw/block/nvme-ns.h
> > 
> > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > index 28c2495a00dc..45f463462f1e 100644
> > --- a/hw/block/Makefile.objs
> > +++ b/hw/block/Makefile.objs
> > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> >  common-obj-$(CONFIG_XEN) += xen-block.o
> >  common-obj-$(CONFIG_ECC) += ecc.o
> >  common-obj-$(CONFIG_ONENAND) += onenand.o
> > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> >  common-obj-$(CONFIG_SWIM) += swim.o
> >  
> >  obj-$(CONFIG_SH4) += tc58128.o
> > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > new file mode 100644
> > index 000000000000..0e5be44486f4
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.c
> > @@ -0,0 +1,158 @@
> > +#include "qemu/osdep.h"
> > +#include "qemu/units.h"
> > +#include "qemu/cutils.h"
> > +#include "qemu/log.h"
> > +#include "hw/block/block.h"
> > +#include "hw/pci/msix.h"
> Do you need this include?

No, I needed hw/pci/pci.h instead :)

> > +#include "sysemu/sysemu.h"
> > +#include "sysemu/block-backend.h"
> > +#include "qapi/error.h"
> > +
> > +#include "hw/qdev-properties.h"
> > +#include "hw/qdev-core.h"
> > +
> > +#include "nvme.h"
> > +#include "nvme-ns.h"
> > +
> > +static int nvme_ns_init(NvmeNamespace *ns)
> > +{
> > +    NvmeIdNs *id_ns = &ns->id_ns;
> > +
> > +    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > +    id_ns->nuse = id_ns->ncap = id_ns->nsze =
> > +        cpu_to_le64(nvme_ns_nlbas(ns));
> Nitpick: To be honest I don't really like that chain assignment, 
> especially since it forces to wrap the line, but that is just my
> personal taste.

Fixed, and also added a comment as to why they are the same.

> > +
> > +    return 0;
> > +}
> > +
> > +static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
> > +    Error **errp)
> > +{
> > +    uint64_t perm, shared_perm;
> > +
> > +    Error *local_err = NULL;
> > +    int ret;
> > +
> > +    perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> > +    shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> > +        BLK_PERM_GRAPH_MOD;
> > +
> > +    ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
> > +    if (ret) {
> > +        error_propagate_prepend(errp, local_err, "blk_set_perm: ");
> > +        return ret;
> > +    }
> 
> You should consider using blkconf_apply_backend_options.
> Take a look at for example virtio_blk_device_realize.
> That will give you support for read only block devices as well.

So, yeah, there is a reason for this. I will add it as a comment, but I
will also write it here for posterity.

The problem arises when the nvme-ns device starts getting more than a
single drive attached (I have patches ready that will add a "metadata"
and a "state" drive). The blkconf_ functions work on a BlockConf that
embeds a single BlockBackend, so you can't have one BlockConf with
multiple BlockBackends. That is why I'm copying the "good parts" of
the blkconf_apply_backend_options code here.
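
The constraint can be sketched with simplified struct shapes
(illustrative only; the real definitions live in the QEMU block layer,
and the "metadata"/"state" members are hypothetical, from the pending
patches):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the block layer's backend handle. */
typedef struct BlockBackend { int dummy; } BlockBackend;

/* A BlockConf carries exactly one BlockBackend, so the blkconf_
 * helpers can only configure a single backing drive per device... */
typedef struct BlockConf {
    BlockBackend *blk;
    uint16_t physical_block_size;
    uint16_t logical_block_size;
} BlockConf;

/* ...while a namespace that also gets "metadata" and "state" drives
 * has to hold several backends itself and apply permissions per
 * backend. */
typedef struct NvmeNamespaceSketch {
    BlockBackend *data;
    BlockBackend *metadata; /* hypothetical, from the pending patches */
    BlockBackend *state;    /* hypothetical, from the pending patches */
} NvmeNamespaceSketch;
```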

> 
> I personally only once grazed the area of block permissions,
> so I prefer someone from the block layer to review this as well.
> 
> > +
> > +    ns->size = blk_getlength(ns->blk);
> > +    if (ns->size < 0) {
> > +        error_setg_errno(errp, -ns->size, "blk_getlength");
> > +        return 1;
> > +    }
> > +
> > +    switch (n->conf.wce) {
> > +    case ON_OFF_AUTO_ON:
> > +        n->features.volatile_wc = 1;
> > +        break;
> > +    case ON_OFF_AUTO_OFF:
> > +        n->features.volatile_wc = 0;
> > +    case ON_OFF_AUTO_AUTO:
> > +        n->features.volatile_wc = blk_enable_write_cache(ns->blk);
> > +        break;
> > +    default:
> > +        abort();
> > +    }
> > +
> > +    blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
> > +
> > +    return 0;
> 
> Nitpick: also I just noticed that you call the controller 'n' I didn't paid attention to this
> before. I think something like 'ctrl' or ctl would be more readable.
> 

Yeah, but using 'n' is done in all the existing code, so I think we
should stick with it.

> > +}
> > +
> > +static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
> > +{
> > +    if (!ns->blk) {
> > +        error_setg(errp, "block backend not configured");
> > +        return 1;
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > +{
> > +    Error *local_err = NULL;
> > +
> > +    if (nvme_ns_check_constraints(ns, &local_err)) {
> > +        error_propagate_prepend(errp, local_err,
> > +            "nvme_ns_check_constraints: ");
> > +        return 1;
> > +    }
> > +
> > +    if (nvme_ns_init_blk(n, ns, &n->id_ctrl, &local_err)) {
> > +        error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
> > +        return 1;
> > +    }
> > +
> > +    nvme_ns_init(ns);
> > +    if (nvme_register_namespace(n, ns, &local_err)) {
> > +        error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
> > +        return 1;
> > +    }
> > +
> > +    return 0;
> 
> Nitpick: to be honest I am not sure we want to expose internal function names like that in
> errors. Hints are supposed to be readable to a user that doesn't look at the source.
> 

Fixed.

> > +}
> > +
> > +static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > +{
> > +    NvmeNamespace *ns = NVME_NS(dev);
> > +    BusState *s = qdev_get_parent_bus(dev);
> > +    NvmeCtrl *n = NVME(s->parent);
> 
> Nitpick: I don't know if you defined this or it was always like that,
> but I would prefer something like NVME_CTL instead.
> 

This is also grandfathered from the nvme device.

> > +    Error *local_err = NULL;
> > +
> > +    if (nvme_ns_setup(n, ns, &local_err)) {
> > +        error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> > +        return;
> > +    }
> > +}
> > +
> > +static Property nvme_ns_props[] = {
> > +    DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
> 
> If you go with my suggestion to use blkconf you will use here the
> DEFINE_BLOCK_PROPERTIES_BASE
> 

See my comment above about that.

> > +    DEFINE_PROP_END_OF_LIST(),
> > +};
> > +
> > +static void nvme_ns_class_init(ObjectClass *oc, void *data)
> > +{
> > +    DeviceClass *dc = DEVICE_CLASS(oc);
> > +
> > +    set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> > +
> > +    dc->bus_type = TYPE_NVME_BUS;
> > +    dc->realize = nvme_ns_realize;
> > +    device_class_set_props(dc, nvme_ns_props);
> > +    dc->desc = "virtual nvme namespace";
> > +}
> 
> Looks reasonable.
> I don't know the device/bus model in depth to be honest
> (I studied it for a few days some time ago though)
> so a review from someone that knows this area better than I do
> is very welcome.
> 
> > +
> > +static void nvme_ns_instance_init(Object *obj)
> > +{
> > +    NvmeNamespace *ns = NVME_NS(obj);
> > +    char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
> > +
> > +    device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
> > +        bootindex, DEVICE(obj), &error_abort);
> > +
> > +    g_free(bootindex);
> > +}
> > +
> > +static const TypeInfo nvme_ns_info = {
> > +    .name = TYPE_NVME_NS,
> > +    .parent = TYPE_DEVICE,
> > +    .class_init = nvme_ns_class_init,
> > +    .instance_size = sizeof(NvmeNamespace),
> > +    .instance_init = nvme_ns_instance_init,
> > +};
> > +
> > +static void nvme_ns_register_types(void)
> > +{
> > +    type_register_static(&nvme_ns_info);
> > +}
> > +
> > +type_init(nvme_ns_register_types)
> > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > new file mode 100644
> > index 000000000000..b564bac25f6d
> > --- /dev/null
> > +++ b/hw/block/nvme-ns.h
> > @@ -0,0 +1,60 @@
> > +#ifndef NVME_NS_H
> > +#define NVME_NS_H
> > +
> > +#define TYPE_NVME_NS "nvme-ns"
> > +#define NVME_NS(obj) \
> > +    OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
> > +
> > +#define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> > +    DEFINE_PROP_DRIVE("drive", _state, blk), \
> > +    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > +
> > +typedef struct NvmeNamespaceParams {
> > +    uint32_t nsid;
> > +} NvmeNamespaceParams;
> > +
> > +typedef struct NvmeNamespace {
> > +    DeviceState  parent_obj;
> > +    BlockBackend *blk;
> > +    int32_t      bootindex;
> > +    int64_t      size;
> > +
> > +    NvmeIdNs            id_ns;
> > +    NvmeNamespaceParams params;
> > +} NvmeNamespace;
> > +
> > +static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> > +{
> > +    if (ns) {
> > +        return ns->params.nsid;
> > +    }
> > +
> > +    return -1;
> > +}
> 
> To be honest I would allow the user to omit nsid,
> and in that case pick a free slot from the valid namespaces.
> 
> Let me explain the concept of valid/allocated/active namespaces
> from the spec as written in my summary:
> 
> Valid namespaces are the 1..N range of namespaces as reported in IDCTRL.NN.
> That value is static, and it should either be set to some arbitrarily large value (say 256)
> or set using a qemu device parameter, and not changed dynamically as you currently do.
> As I understand it, IDCTRL output should not change during the lifetime of the controller,
> although I didn't find exact confirmation of this in the spec.
> 
> Allocated namespaces (namespaces that exist but are not attached to the controller)
> are not relevant to us; they are only used for namespace management.
> 
> And then you have Active namespaces which are the namespaces the user can actually address.
> 
> However, if I understand this correctly, the NVMe 'bus' currently doesn't
> support hotplug, so all namespaces will already be plugged in at
> VM startup, and thus the issue doesn't really exist yet.
> 
> 
> 

I added support for this. It's a nice addition and it makes the code
much cleaner.

> > +
> > +static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> > +{
> > +    NvmeIdNs *id_ns = &ns->id_ns;
> > +    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> > +}
> > +
> > +static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> > +{
> > +    return nvme_ns_lbaf(ns).ds;
> > +}
> > +
> > +static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > +{
> > +    return 1 << nvme_ns_lbads(ns);
> > +}
> > +
> > +static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
> > +{
> > +    return ns->size >> nvme_ns_lbads(ns);
> > +}
> > +
> > +typedef struct NvmeCtrl NvmeCtrl;
> > +
> > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
> > +
> > +#endif /* NVME_NS_H */
> > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > index a91c60fdc111..3a377bc56734 100644
> > --- a/hw/block/nvme.c
> > +++ b/hw/block/nvme.c
> > @@ -17,10 +17,11 @@
> >  /**
> >   * Usage: add options:
> >   *      -drive file=<file>,if=none,id=<drive_id>
> > - *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
> > + *      -device nvme,serial=<serial>,id=<bus_name>, \
> >   *              cmb_size_mb=<cmb_size_mb[optional]>, \
> >   *              num_queues=<N[optional]>, \
> >   *              mdts=<mdts[optional]>
> > + *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=1
> >   *
> >   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> >   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> > @@ -28,6 +29,7 @@
> >  
> >  #include "qemu/osdep.h"
> >  #include "qemu/units.h"
> > +#include "qemu/error-report.h"
> >  #include "hw/block/block.h"
> >  #include "hw/pci/msix.h"
> >  #include "hw/pci/pci.h"
> > @@ -43,6 +45,7 @@
> >  #include "qemu/cutils.h"
> >  #include "trace.h"
> >  #include "nvme.h"
> > +#include "nvme-ns.h"
> >  
> >  #define NVME_SPEC_VER 0x00010300
> >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > @@ -85,6 +88,17 @@ static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> >      return pci_dma_read(&n->parent_obj, addr, buf, size);
> >  }
> >  
> > +static uint16_t nvme_nsid_err(NvmeCtrl *n, uint32_t nsid)
> > +{
> > +    if (nsid && nsid <= n->num_namespaces) {
> > +        trace_nvme_dev_err_inactive_ns(nsid, n->num_namespaces);
> > +        return NVME_INVALID_FIELD | NVME_DNR;
> > +    }
> > +
> > +    trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > +    return NVME_INVALID_NSID | NVME_DNR;
> > +}
> 
> I don't like that function to be honest.
> This function is called when nvme_ns returns NULL.
> IMHO it would be better to make nvme_ns return both the namespace pointer and an error code instead.
> In the kernel we encode error values into the returned pointer.
> 

I'm not sure how you want me to do this; I'm not familiar with the way
the kernel does it.

> 
> > +
> >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> >  {
> >      return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
> > @@ -889,7 +903,7 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> >      uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> >  
> >      if (unlikely((slba + nlb) > nsze)) {
> > -        block_acct_invalid(blk_get_stats(n->conf.blk),
> > +        block_acct_invalid(blk_get_stats(ns->blk),
> >              nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> >          trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
> >          return NVME_LBA_RANGE | NVME_DNR;
> > @@ -924,11 +938,12 @@ static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> >  
> >  static void nvme_rw_cb(NvmeRequest *req, void *opaque)
> >  {
> > +    NvmeNamespace *ns = req->ns;
> >      NvmeSQueue *sq = req->sq;
> >      NvmeCtrl *n = sq->ctrl;
> >      NvmeCQueue *cq = n->cq[sq->cqid];
> >  
> > -    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
> > +    trace_nvme_dev_rw_cb(nvme_cid(req), nvme_nsid(ns));
> >  
> >      nvme_enqueue_req_completion(cq, req);
> >  }
> > @@ -1011,10 +1026,11 @@ static void nvme_aio_cb(void *opaque, int ret)
> >  
> >  static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> > +    NvmeNamespace *ns = req->ns;
> >      NvmeAIO *aio = g_new0(NvmeAIO, 1);
> >  
> >      *aio = (NvmeAIO) {
> > -        .blk = n->conf.blk,
> > +        .blk = ns->blk,
> >          .req = req,
> >      };
> >  
> > @@ -1038,12 +1054,12 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      req->slba = le64_to_cpu(rw->slba);
> >      req->nlb  = le16_to_cpu(rw->nlb) + 1;
> >  
> > -    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
> > -        req->slba, req->nlb);
> > +    trace_nvme_dev_write_zeros(nvme_cid(req), nvme_nsid(ns), req->slba,
> > +        req->nlb);
> >  
> >      status = nvme_check_bounds(n, req->slba, req->nlb, req);
> >      if (unlikely(status)) {
> > -        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
> > +        block_acct_invalid(blk_get_stats(ns->blk), BLOCK_ACCT_WRITE);
> >          return status;
> >      }
> >  
> > @@ -1053,7 +1069,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      aio = g_new0(NvmeAIO, 1);
> >  
> >      *aio = (NvmeAIO) {
> > -        .blk = n->conf.blk,
> > +        .blk = ns->blk,
> >          .offset = offset,
> >          .len = count,
> >          .req = req,
> > @@ -1077,22 +1093,23 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      req->nlb  = le16_to_cpu(rw->nlb) + 1;
> >      req->slba = le64_to_cpu(rw->slba);
> >  
> > -    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
> > -        req->nlb << nvme_ns_lbads(req->ns), req->slba);
> > +    trace_nvme_dev_rw(nvme_cid(req), nvme_req_is_write(req) ? "write" : "read",
> > +        nvme_nsid(ns), req->nlb, req->nlb << nvme_ns_lbads(ns),
> > +        req->slba);
> >  
> >      status = nvme_check_rw(n, req);
> >      if (status) {
> > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > +        block_acct_invalid(blk_get_stats(ns->blk), acct);
> >          return status;
> >      }
> >  
> >      status = nvme_map(n, cmd, req);
> >      if (status) {
> > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > +        block_acct_invalid(blk_get_stats(ns->blk), acct);
> >          return status;
> >      }
> >  
> > -    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
> > +    nvme_rw_aio(ns->blk, req->slba << nvme_ns_lbads(ns), req);
> >      nvme_req_set_cb(req, nvme_rw_cb, NULL);
> >  
> >      return NVME_NO_COMPLETE;
> > @@ -1105,12 +1122,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >      trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
> >          cmd->opcode);
> >  
> > -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > -        return NVME_INVALID_NSID | NVME_DNR;
> > -    }
> > +    req->ns = nvme_ns(n, nsid);
> >  
> > -    req->ns = &n->namespaces[nsid - 1];
> > +    if (unlikely(!req->ns)) {
> > +        return nvme_nsid_err(n, nsid);
> > +    }
> >  
> >      switch (cmd->opcode) {
> >      case NVME_CMD_FLUSH:
> > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> >      uint64_t units_read = 0, units_written = 0, read_commands = 0,
> >          write_commands = 0;
> >      NvmeSmartLog smart;
> > -    BlockAcctStats *s;
> >  
> >      if (nsid && nsid != 0xffffffff) {
> >          return NVME_INVALID_FIELD | NVME_DNR;
> >      }
> >  
> > -    s = blk_get_stats(n->conf.blk);
> > +    for (int i = 1; i <= n->num_namespaces; i++) {
> > +        NvmeNamespace *ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> >  
> > -    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > -    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > -    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > -    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > +        BlockAcctStats *s = blk_get_stats(ns->blk);
> > +
> > +        units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > +        units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > +        read_commands += s->nr_ops[BLOCK_ACCT_READ];
> > +        write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
> > +    }
> Very minor nitpick: something to do in the future is to
> report the statistics per namespace.

In NVMe v1.4 there is no namespace-specific information in the
SMART/Health log page, so this is valid for both v1.3 and v1.4.

> >  
> >      if (off > sizeof(smart)) {
> >          return NVME_INVALID_FIELD | NVME_DNR;
> > @@ -1477,19 +1499,25 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  
> >  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> > -    NvmeNamespace *ns;
> > +    NvmeIdNs *id_ns, inactive = { 0 };
> >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> > +    NvmeNamespace *ns = nvme_ns(n, nsid);
> >  
> >      trace_nvme_dev_identify_ns(nsid);
> >  
> > -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > -        return NVME_INVALID_NSID | NVME_DNR;
> > +    if (unlikely(!ns)) {
> > +        uint16_t status = nvme_nsid_err(n, nsid);
> > +
> > +        if (!nvme_status_is_error(status, NVME_INVALID_FIELD)) {
> > +            return status;
> > +        }
> I really don't like checking the error value like that. 
> It would be better IMHO to have something like
> nvme_is_valid_ns, nvme_is_active_ns or something like that.
> 

Fixed.

> > +
> > +        id_ns = &inactive;
> > +    } else {
> > +        id_ns = &ns->id_ns;
> >      }
> >  
> > -    ns = &n->namespaces[nsid - 1];
> > -
> > -    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
> > +    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs), cmd,
> >          DMA_DIRECTION_FROM_DEVICE, req);
> >  }
> >  
> > @@ -1505,11 +1533,11 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
> >      trace_nvme_dev_identify_ns_list(min_nsid);
> >  
> >      list = g_malloc0(data_len);
> > -    for (i = 0; i < n->num_namespaces; i++) {
> > -        if (i < min_nsid) {
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        if (i <= min_nsid || !nvme_ns(n, i)) {
> >              continue;
> >          }
> > -        list[j++] = cpu_to_le32(i + 1);
> > +        list[j++] = cpu_to_le32(i);
> >          if (j == data_len / sizeof(uint32_t)) {
> >              break;
> >          }
> The refactoring part (removing that +1), which is very nice IMHO, should be moved
> to one of the earlier refactoring patches.
> 

Done.

> > @@ -1539,9 +1567,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
> >  
> >      trace_nvme_dev_identify_ns_descr_list(nsid);
> >  
> > -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > -        return NVME_INVALID_NSID | NVME_DNR;
> > +    if (unlikely(!nvme_ns(n, nsid))) {
> > +        return nvme_nsid_err(n, nsid);
> >      }
> >  
> >      /*
> > @@ -1681,7 +1708,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >          result = cpu_to_le32(n->features.err_rec);
> >          break;
> >      case NVME_VOLATILE_WRITE_CACHE:
> > -        result = blk_enable_write_cache(n->conf.blk);
> > +        result = cpu_to_le32(n->features.volatile_wc);
> OK, this fixes the lack of endianness conversion I pointed out in patch 12.
> >          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> >          break;
> >      case NVME_NUMBER_OF_QUEUES:
> > @@ -1735,6 +1762,8 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> >  
> >  static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  {
> > +    NvmeNamespace *ns;
> > +
> >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> >  
> > @@ -1766,8 +1795,19 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> >  
> >          break;
> >      case NVME_VOLATILE_WRITE_CACHE:
> > -        blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > +        n->features.volatile_wc = dw11;
> > +
> > +        for (int i = 1; i <= n->num_namespaces; i++) {
> > +            ns = nvme_ns(n, i);
> > +            if (!ns) {
> > +                continue;
> > +            }
> > +
> > +            blk_set_enable_write_cache(ns->blk, dw11 & 1);
> > +        }
> > +
> Features are per namespace (page 79 in the spec), so this
> is a good candidate for a per-namespace feature.
> 

Some features are, but the Volatile Write Cache feature is actually not.

> >          break;
> > +
> >      case NVME_NUMBER_OF_QUEUES:
> >          if (n->qs_created) {
> >              return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > @@ -1890,9 +1930,17 @@ static void nvme_process_sq(void *opaque)
> >  
> >  static void nvme_clear_ctrl(NvmeCtrl *n)
> >  {
> > +    NvmeNamespace *ns;
> >      int i;
> >  
> > -    blk_drain(n->conf.blk);
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> > +
> > +        blk_drain(ns->blk);
> > +    }
> >  
> >      for (i = 0; i < n->params.num_queues; i++) {
> >          if (n->sq[i] != NULL) {
> > @@ -1915,7 +1963,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
> >      n->outstanding_aers = 0;
> >      n->qs_created = false;
> >  
> > -    blk_flush(n->conf.blk);
> > +    for (i = 1; i <= n->num_namespaces; i++) {
> > +        ns = nvme_ns(n, i);
> > +        if (!ns) {
> > +            continue;
> > +        }
> > +
> > +        blk_flush(ns->blk);
> > +    }
> > +
> >      n->bar.cc = 0;
> >  }
> >  
> > @@ -2335,8 +2391,8 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >  {
> >      NvmeParams *params = &n->params;
> >  
> > -    if (!n->conf.blk) {
> > -        error_setg(errp, "nvme: block backend not configured");
> > +    if (!n->namespace.blk && !n->parent_obj.qdev.id) {
> > +        error_setg(errp, "nvme: invalid 'id' parameter");
> 
> Nitpick: I think that qemu usually allows the user to shoot himself in the foot by specifying a device without an ID,
> to which you can't attach devices, so I think that this check is not needed.
> You also probably meant 'missing ID'.
> 

Right. I added a deprecation warning when the drive parameter is used
instead.

> >          return 1;
> >      }
> >  
> > @@ -2353,22 +2409,10 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> >      return 0;
> >  }
> >  
> > -static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > -{
> > -    blkconf_blocksizes(&n->conf);
> > -    if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
> > -        false, errp)) {
> > -        return 1;
> > -    }
> > -
> > -    return 0;
> > -}
> > -
> >  static void nvme_init_state(NvmeCtrl *n)
> >  {
> > -    n->num_namespaces = 1;
> > +    n->num_namespaces = 0;
> 
> And to say it again: since the number of valid namespaces should remain static,
> here you should just initialize this to NVME_MAX_NAMESPACES, and remove the code
> that changes IDCTRL.NN dynamically.
> 

Done.

> 
> >      n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> > -    n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> >      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> >      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> >  
> > @@ -2483,12 +2527,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >      id->cqes = (0x4 << 4) | 0x4;
> >      id->nn = cpu_to_le32(n->num_namespaces);
> >      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> > -
> > -
> > -    if (blk_enable_write_cache(n->conf.blk)) {
> > -        id->vwc = 1;
> > -    }
> > -
> > +    id->vwc = 1;
> >      id->sgls = cpu_to_le32(0x1);
> >  
> >      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> > @@ -2509,22 +2548,25 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> >      n->bar.intmc = n->bar.intms = 0;
> >  }
> >  
> > -static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >  {
> > -    int64_t bs_size;
> > -    NvmeIdNs *id_ns = &ns->id_ns;
> > +    uint32_t nsid = nvme_nsid(ns);
> >  
> > -    bs_size = blk_getlength(n->conf.blk);
> > -    if (bs_size < 0) {
> > -        error_setg_errno(errp, -bs_size, "blk_getlength");
> > +    if (nsid == 0 || nsid > NVME_MAX_NAMESPACES) {
> > +        error_setg(errp, "invalid nsid");
> >          return 1;
> >      }
> 
> As I said above, it would be nice to find a free namespace slot instead
> of erroring out when nsid == 0.
> Also, the error message could be improved a bit IMHO.
> 

Done.

> >  
> > -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > -    n->ns_size = bs_size;
> > +    if (n->namespaces[nsid - 1]) {
> > +        error_setg(errp, "nsid must be unique");
> > +        return 1;
> > +    }
> > +
> > +    trace_nvme_dev_register_namespace(nsid);
> >  
> > -    id_ns->ncap = id_ns->nuse = id_ns->nsze =
> > -        cpu_to_le64(nvme_ns_nlbas(n, ns));
> > +    n->namespaces[nsid - 1] = ns;
> 
> > +    n->num_namespaces = MAX(n->num_namespaces, nsid);
> > +    n->id_ctrl.nn = cpu_to_le32(n->num_namespaces);
> 
> These should be removed once you set num_namespaces to be fixed number.
> 

Done.

> >  
> >      return 0;
> >  }
> > @@ -2532,30 +2574,31 @@ static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> >  static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> >  {
> >      NvmeCtrl *n = NVME(pci_dev);
> > +    NvmeNamespace *ns;
> >      Error *local_err = NULL;
> > -    int i;
> >  
> >      if (nvme_check_constraints(n, &local_err)) {
> >          error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
> >          return;
> >      }
> >  
> > +    qbus_create_inplace(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
> > +        &pci_dev->qdev, n->parent_obj.qdev.id);
> > +
> >      nvme_init_state(n);
> > -
> > -    if (nvme_init_blk(n, &local_err)) {
> > -        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
> > -        return;
> > -    }
> > -
> > -    for (i = 0; i < n->num_namespaces; i++) {
> > -        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
> > -            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
> > -            return;
> > -        }
> > -    }
> > -
> >      nvme_init_pci(n, pci_dev);
> >      nvme_init_ctrl(n);
> > +
> > +    /* setup a namespace if the controller drive property was given */
> > +    if (n->namespace.blk) {
> > +        ns = &n->namespace;
> > +        ns->params.nsid = 1;
> > +
> > +        if (nvme_ns_setup(n, ns, &local_err)) {
> > +            error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> > +            return;
> > +        }
> > +    }
> >  }
> >  
> >  static void nvme_exit(PCIDevice *pci_dev)
> > @@ -2576,7 +2619,8 @@ static void nvme_exit(PCIDevice *pci_dev)
> >  }
> >  
> >  static Property nvme_props[] = {
> > -    DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
> > +    DEFINE_BLOCK_PROPERTIES_BASE(NvmeCtrl, conf), \
> > +    DEFINE_PROP_DRIVE("drive", NvmeCtrl, namespace.blk), \
> >      DEFINE_NVME_PROPERTIES(NvmeCtrl, params),
> >      DEFINE_PROP_END_OF_LIST(),
> >  };
> > @@ -2608,26 +2652,35 @@ static void nvme_instance_init(Object *obj)
> >  {
> >      NvmeCtrl *s = NVME(obj);
> >  
> > -    device_add_bootindex_property(obj, &s->conf.bootindex,
> > -                                  "bootindex", "/namespace@1,0",
> > -                                  DEVICE(obj), &error_abort);
> > +    if (s->namespace.blk) {
> > +        device_add_bootindex_property(obj, &s->conf.bootindex,
> > +                                      "bootindex", "/namespace@1,0",
> > +                                      DEVICE(obj), &error_abort);
> > +    }
> >  }
> >  
> >  static const TypeInfo nvme_info = {
> >      .name          = TYPE_NVME,
> >      .parent        = TYPE_PCI_DEVICE,
> >      .instance_size = sizeof(NvmeCtrl),
> > -    .class_init    = nvme_class_init,
> >      .instance_init = nvme_instance_init,
> > +    .class_init    = nvme_class_init,
> >      .interfaces = (InterfaceInfo[]) {
> >          { INTERFACE_PCIE_DEVICE },
> >          { }
> >      },
> >  };
> >  
> > +static const TypeInfo nvme_bus_info = {
> > +    .name = TYPE_NVME_BUS,
> > +    .parent = TYPE_BUS,
> > +    .instance_size = sizeof(NvmeBus),
> > +};
> > +
> >  static void nvme_register_types(void)
> >  {
> >      type_register_static(&nvme_info);
> > +    type_register_static(&nvme_bus_info);
> >  }
> >  
> >  type_init(nvme_register_types)
> > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > index 3319f8edd7e1..c3cef0f024da 100644
> > --- a/hw/block/nvme.h
> > +++ b/hw/block/nvme.h
> > @@ -2,6 +2,9 @@
> >  #define HW_NVME_H
> >  
> >  #include "block/nvme.h"
> > +#include "nvme-ns.h"
> > +
> > +#define NVME_MAX_NAMESPACES 256
> >  
> >  #define DEFINE_NVME_PROPERTIES(_state, _props) \
> >      DEFINE_PROP_STRING("serial", _state, _props.serial), \
> > @@ -108,26 +111,6 @@ typedef struct NvmeCQueue {
> >      QTAILQ_HEAD(, NvmeRequest) req_list;
> >  } NvmeCQueue;
> >  
> > -typedef struct NvmeNamespace {
> > -    NvmeIdNs        id_ns;
> > -} NvmeNamespace;
> > -
> > -static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> > -{
> > -    NvmeIdNs *id_ns = &ns->id_ns;
> > -    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> > -}
> > -
> > -static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> > -{
> > -    return nvme_ns_lbaf(ns).ds;
> > -}
> > -
> > -static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > -{
> > -    return 1 << nvme_ns_lbads(ns);
> > -}
> > -
> >  typedef enum NvmeAIOOp {
> >      NVME_AIO_OPC_NONE         = 0x0,
> >      NVME_AIO_OPC_FLUSH        = 0x1,
> > @@ -182,6 +165,13 @@ static inline bool nvme_req_is_write(NvmeRequest *req)
> >      }
> >  }
> >  
> > +#define TYPE_NVME_BUS "nvme-bus"
> > +#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
> > +
> > +typedef struct NvmeBus {
> > +    BusState parent_bus;
> > +} NvmeBus;
> > +
> >  #define TYPE_NVME "nvme"
> >  #define NVME(obj) \
> >          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > @@ -191,8 +181,9 @@ typedef struct NvmeCtrl {
> >      MemoryRegion iomem;
> >      MemoryRegion ctrl_mem;
> >      NvmeBar      bar;
> > -    BlockConf    conf;
> >      NvmeParams   params;
> > +    NvmeBus      bus;
> > +    BlockConf    conf;
> >  
> >      bool        qs_created;
> >      uint32_t    page_size;
> > @@ -203,7 +194,6 @@ typedef struct NvmeCtrl {
> >      uint32_t    reg_size;
> >      uint32_t    num_namespaces;
> >      uint32_t    max_q_ents;
> > -    uint64_t    ns_size;
> >      uint8_t     outstanding_aers;
> >      uint32_t    cmbsz;
> >      uint32_t    cmbloc;
> > @@ -219,7 +209,8 @@ typedef struct NvmeCtrl {
> >      QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
> >      int         aer_queued;
> >  
> > -    NvmeNamespace   *namespaces;
> > +    NvmeNamespace   namespace;
> > +    NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
> >      NvmeSQueue      **sq;
> >      NvmeCQueue      **cq;
> >      NvmeSQueue      admin_sq;
> > @@ -228,9 +219,13 @@ typedef struct NvmeCtrl {
> >      NvmeFeatureVal  features;
> >  } NvmeCtrl;
> >  
> > -static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > +static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
> >  {
> > -    return n->ns_size >> nvme_ns_lbads(ns);
> > +    if (!nsid || nsid > n->num_namespaces) {
> > +        return NULL;
> > +    }
> > +
> > +    return n->namespaces[nsid - 1];
> >  }
> >  
> >  static inline uint16_t nvme_cid(NvmeRequest *req)
> > @@ -253,4 +248,6 @@ static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
> >      return req->sq->ctrl;
> >  }
> >  
> > +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
> > +
> >  #endif /* HW_NVME_H */
> > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > index 81d69e15fc32..aaf1fcda7923 100644
> > --- a/hw/block/trace-events
> > +++ b/hw/block/trace-events
> > @@ -29,6 +29,7 @@ hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int t
> >  
> >  # nvme.c
> >  # nvme traces for successful events
> > +nvme_dev_register_namespace(uint32_t nsid) "nsid %"PRIu32""
> >  nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
> >  nvme_dev_irq_pin(void) "pulsing IRQ pin"
> >  nvme_dev_irq_masked(void) "IRQ is masked"
> > @@ -38,7 +39,7 @@ nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"P
> >  nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count
> > %"PRIu64" opc \"%s\" req %p"
> >  nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> >  nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> > -nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> > +nvme_dev_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" %s nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
> >  nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
> >  nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
> >  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16",
> > qflags=%"PRIu16""
> > @@ -94,7 +95,8 @@ nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or no
> >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> >  nvme_dev_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
> >  nvme_dev_err_invalid_prp(void) "invalid PRP"
> > -nvme_dev_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
> > +nvme_dev_err_invalid_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
> > +nvme_dev_err_inactive_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
> >  nvme_dev_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
> >  nvme_dev_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
> >  nvme_dev_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> 
> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 08/26] nvme: refactor device realization
  2020-03-16  7:43         ` Klaus Birkelund Jensen
@ 2020-03-25 10:21           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:21 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:43 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 11:27, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > This patch splits up nvme_realize into multiple individual functions,
> > > each initializing a different subset of the device.
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > ---
> > >  hw/block/nvme.c | 175 +++++++++++++++++++++++++++++++-----------------
> > >  hw/block/nvme.h |  21 ++++++
> > >  2 files changed, 133 insertions(+), 63 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index e1810260d40b..81514eaef63a 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -44,6 +44,7 @@
> > >  #include "nvme.h"
> > >  
> > >  #define NVME_SPEC_VER 0x00010201
> > > +#define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > >  
> > >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> > >      do { \
> > > @@ -1325,67 +1326,106 @@ static const MemoryRegionOps nvme_cmb_ops = {
> > >      },
> > >  };
> > >  
> > > -static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > > +static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> > >  {
> > > -    NvmeCtrl *n = NVME(pci_dev);
> > > -    NvmeIdCtrl *id = &n->id_ctrl;
> > > -
> > > -    int i;
> > > -    int64_t bs_size;
> > > -    uint8_t *pci_conf;
> > > -
> > > -    if (!n->params.num_queues) {
> > > -        error_setg(errp, "num_queues can't be zero");
> > > -        return;
> > > -    }
> > > +    NvmeParams *params = &n->params;
> > >  
> > >      if (!n->conf.blk) {
> > > -        error_setg(errp, "drive property not set");
> > > -        return;
> > > +        error_setg(errp, "nvme: block backend not configured");
> > > +        return 1;
> > 
> > As a matter of taste, negative values indicate error, and 0 is the success value.
> > In Linux kernel this is even an official rule.
> > >      }
> 
> Fixed.
> 
> > >  
> > > -    bs_size = blk_getlength(n->conf.blk);
> > > -    if (bs_size < 0) {
> > > -        error_setg(errp, "could not get backing file size");
> > > -        return;
> > > +    if (!params->serial) {
> > > +        error_setg(errp, "nvme: serial not configured");
> > > +        return 1;
> > >      }
> > >  
> > > -    if (!n->params.serial) {
> > > -        error_setg(errp, "serial property not set");
> > > -        return;
> > > +    if ((params->num_queues < 1 || params->num_queues > NVME_MAX_QS)) {
> > > +        error_setg(errp, "nvme: invalid queue configuration");
> > 
> > Maybe something like "nvme: invalid queue count specified, should be between 1 and ..."?
> > > +        return 1;
> > >      }
> 
> Fixed.
Thanks
> 
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > > +{
> > >      blkconf_blocksizes(&n->conf);
> > >      if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
> > > -                                       false, errp)) {
> > > -        return;
> > > +        false, errp)) {
> > > +        return 1;
> > >      }
> > >  
> > > -    pci_conf = pci_dev->config;
> > > -    pci_conf[PCI_INTERRUPT_PIN] = 1;
> > > -    pci_config_set_prog_interface(pci_dev->config, 0x2);
> > > -    pci_config_set_class(pci_dev->config, PCI_CLASS_STORAGE_EXPRESS);
> > > -    pcie_endpoint_cap_init(pci_dev, 0x80);
> > > +    return 0;
> > > +}
> > >  
> > > +static void nvme_init_state(NvmeCtrl *n)
> > > +{
> > >      n->num_namespaces = 1;
> > >      n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> > 
> > Isn't that wrong?
> > The first 4K of mmio (0x1000) is the registers, and that is followed by the doorbells,
> > and each doorbell takes 8 bytes (assuming regular doorbell stride),
> > so n->params.num_queues + 1 should be the total number of queues; thus the 0x1004 should be 0x1000 IMHO.
> > I might be missing some rounding magic here though.
> > 
> 
> Yeah. I think you are right. It all becomes slightly more fishy due to
> the num_queues device parameter being 1's based and accounting for the
> admin queue pair.
> 
> But in get/set features, the value has to be 0's based and only account
> for the I/O queues, so we need to subtract 2 from the value. It's
> confusing all around.
Yea, I can't agree more on that. The zero-based values have bitten
me a few times while I developed nvme-mdev as well.

> 
> Since the admin queue pair isn't really optional, I think it would be
> better to introduce a new max_ioqpairs parameter that is 1's based,
> counts the number of pairs and obviously only accounts for the I/O
> queues.
> 
> I guess we need to keep the num_queues parameter around for
> compatibility.
> 
> The doorbells are only 4 bytes btw, but the calculation still looks
I don't understand that. Each doorbell is indeed 4 bytes, but they come
in pairs so each doorbell pair is 8 bytes.

BTW, the spec has a so-called doorbell stride, which allows artificially increasing
the size of each doorbell slot by a power of two. This was intended for software
implementations (like my nvme-mdev), to make sure that each doorbell takes exactly
one cacheline.

I personally wasn't able to notice any measurable difference, but then my nvme-mdev
adds so little overhead that it might not be measurable.
You might want to support this sometime in the future to increase the feature coverage
of this nvme device.

> wrong. With a max_ioqpairs parameter in place, the reg_size should be
> 
>     pow2ceil(0x1008 + 2 * (n->params.max_ioqpairs) * 4)
> 
> Right? That's 0x1000 for the core registers, 8 bytes for the sq/cq
> doorbells for the admin queue pair, and then room for the i/o queue
> pairs.
Looks great.

> 
> I added a patch for this in v6.
> 
> > > -    n->ns_size = bs_size / (uint64_t)n->num_namespaces;
> > > -
> > >      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> > >      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> > >      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > > +}
> > >  
> > > -    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n,
> > > -                          "nvme", n->reg_size);
> > > +static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > > +{
> > > +    NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
> > 
> > It would be nice to have #define for CMB bar number
> 
> Added.
Thanks!
> 
> > > +    NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> > > +
> > > +    NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> > > +    NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> > > +    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> > > +    NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> > > +    NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> > > +    NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
> > > +    NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
> > > +
> > > +    n->cmbloc = n->bar.cmbloc;
> > > +    n->cmbsz = n->bar.cmbsz;
> > > +
> > > +    n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > > +    memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
> > > +                            "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > > +    pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
> > 
> > Same here, although since you read it here from the controller register,
> > maybe leave it as is. For this kind of thing, though, I prefer
> > to have a #define and use it everywhere.
> > 
> 
> Done.
> 
> > > +        PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > > +        PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
> > > +}
> > > +
> > > +static void nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev)
> > > +{
> > > +    uint8_t *pci_conf = pci_dev->config;
> > > +
> > > +    pci_conf[PCI_INTERRUPT_PIN] = 1;
> > > +    pci_config_set_prog_interface(pci_conf, 0x2);
> > 
> > Nitpick: How about adding some #define for that as well?
> > (I know that this code is copied as is but still)
> 
> Yeah. A PCI_PI_NVME or something would be nice. But this should probably
> go in some PCI-related header file? Any idea where that would fit?

in include/hw/pci/pci_ids.h maybe?

> 
> > > +    pci_config_set_vendor_id(pci_conf, PCI_VENDOR_ID_INTEL);
> > > +    pci_config_set_device_id(pci_conf, 0x5845);
> > > +    pci_config_set_class(pci_conf, PCI_CLASS_STORAGE_EXPRESS);
> > > +    pcie_endpoint_cap_init(pci_dev, 0x80);
> > > +
> > > +    memory_region_init_io(&n->iomem, OBJECT(n), &nvme_mmio_ops, n, "nvme",
> > > +        n->reg_size);
> > 
> > Code on split lines should start at the column right after the '('.
> > Now it's my turn to notice this - our checkpatch.pl doesn't check this,
> > and I can't count how often I have been burnt by this myself.
> > 
> > There are a *lot* of these issues; I pointed out some of them, but you should
> > check all the patches for this.
> > 
> 
> I fixed all that :)

Thanks, but I bet that some of these remained. Speaking from my own
experience: like you, I wasn't used to this rule either, so I haven't yet
internalized it, and since our checkpatch.pl doesn't check for it, I keep
violating it in most patches I send despite going over each patch a few
times.
I'll go over v6, and if I spot any of these I'll take a note, now that you
have fixed most of them.
Thanks again.

> 
> > 
> > >      pci_register_bar(pci_dev, 0,
> > >          PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64,
> > >          &n->iomem);
> > 
> > Split line alignment issue here as well.
> > >      msix_init_exclusive_bar(pci_dev, n->params.num_queues, 4, NULL);
> > >  
> > > +    if (n->params.cmb_size_mb) {
> > > +        nvme_init_cmb(n, pci_dev);
> > > +    }
> > > +}
> > > +
> > > +static void nvme_init_ctrl(NvmeCtrl *n)
> > > +{
> > > +    NvmeIdCtrl *id = &n->id_ctrl;
> > > +    NvmeParams *params = &n->params;
> > > +    uint8_t *pci_conf = n->parent_obj.config;
> > > +
> > >      id->vid = cpu_to_le16(pci_get_word(pci_conf + PCI_VENDOR_ID));
> > >      id->ssvid = cpu_to_le16(pci_get_word(pci_conf + PCI_SUBSYSTEM_VENDOR_ID));
> > >      strpadcpy((char *)id->mn, sizeof(id->mn), "QEMU NVMe Ctrl", ' ');
> > >      strpadcpy((char *)id->fr, sizeof(id->fr), "1.0", ' ');
> > > -    strpadcpy((char *)id->sn, sizeof(id->sn), n->params.serial, ' ');
> > > +    strpadcpy((char *)id->sn, sizeof(id->sn), params->serial, ' ');
> > >      id->rab = 6;
> > >      id->ieee[0] = 0x00;
> > >      id->ieee[1] = 0x02;
> > > @@ -1431,46 +1471,55 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > >  
> > >      n->bar.vs = NVME_SPEC_VER;
> > >      n->bar.intmc = n->bar.intms = 0;
> > > +}
> > >  
> > > -    if (n->params.cmb_size_mb) {
> > > +static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > > +{
> > > +    int64_t bs_size;
> > > +    NvmeIdNs *id_ns = &ns->id_ns;
> > >  
> > > -        NVME_CMBLOC_SET_BIR(n->bar.cmbloc, 2);
> > > -        NVME_CMBLOC_SET_OFST(n->bar.cmbloc, 0);
> > > +    bs_size = blk_getlength(n->conf.blk);
> > > +    if (bs_size < 0) {
> > > +        error_setg_errno(errp, -bs_size, "blk_getlength");
> > > +        return 1;
> > > +    }
> > >  
> > > -        NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> > > -        NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> > > -        NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> > > -        NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> > > -        NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> > > -        NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2); /* MBs */
> > > -        NVME_CMBSZ_SET_SZ(n->bar.cmbsz, n->params.cmb_size_mb);
> > > +    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > > +    n->ns_size = bs_size;
> > >  
> > > -        n->cmbloc = n->bar.cmbloc;
> > > -        n->cmbsz = n->bar.cmbsz;
> > > +    id_ns->ncap = id_ns->nuse = id_ns->nsze =
> > > +        cpu_to_le64(nvme_ns_nlbas(n, ns));
> > 
> > I myself don't know how to align these splits to be honest.
> > I would just split this into multiple statements.
> > >  
> > > -        n->cmbuf = g_malloc0(NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > > -        memory_region_init_io(&n->ctrl_mem, OBJECT(n), &nvme_cmb_ops, n,
> > > -                              "nvme-cmb", NVME_CMBSZ_GETSIZE(n->bar.cmbsz));
> > > -        pci_register_bar(pci_dev, NVME_CMBLOC_BIR(n->bar.cmbloc),
> > > -            PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64 |
> > > -            PCI_BASE_ADDRESS_MEM_PREFETCH, &n->ctrl_mem);
> > > +    return 0;
> > > +}
> > >  
> > > +static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > > +{
> > > +    NvmeCtrl *n = NVME(pci_dev);
> > > +    Error *local_err = NULL;
> > > +    int i;
> > > +
> > > +    if (nvme_check_constraints(n, &local_err)) {
> > > +        error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
> > 
> > Do we need that hint for the end user?
> 
> Removed.
> 
> > > +        return;
> > > +    }
> > > +
> > > +    nvme_init_state(n);
> > > +
> > > +    if (nvme_init_blk(n, &local_err)) {
> > > +        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
> > 
> > Same here
> 
> Done.
> 
> 
> > > +        return;
> > >      }
> > >  
> > >      for (i = 0; i < n->num_namespaces; i++) {
> > > -        NvmeNamespace *ns = &n->namespaces[i];
> > > -        NvmeIdNs *id_ns = &ns->id_ns;
> > > -        id_ns->nsfeat = 0;
> > > -        id_ns->nlbaf = 0;
> > > -        id_ns->flbas = 0;
> > > -        id_ns->mc = 0;
> > > -        id_ns->dpc = 0;
> > > -        id_ns->dps = 0;
> > > -        id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > > -        id_ns->ncap  = id_ns->nuse = id_ns->nsze =
> > > -            cpu_to_le64(n->ns_size >>
> > > -                id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas)].ds);
> > > +        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
> > > +            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
> > 
> > And here
> 
> Done.
> 
> 
> > > +            return;
> > > +        }
> > >      }
> > > +
> > > +    nvme_init_pci(n, pci_dev);
> > > +    nvme_init_ctrl(n);
> > >  }
> > >  
> > >  static void nvme_exit(PCIDevice *pci_dev)
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index 9957c4a200e2..a867bdfabafd 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -65,6 +65,22 @@ typedef struct NvmeNamespace {
> > >      NvmeIdNs        id_ns;
> > >  } NvmeNamespace;
> > >  
> > > +static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> > > +{
> > 
> > It's not common to return a structure in C; usually a pointer is returned to
> > avoid copying. In this case it doesn't matter that much though.
> 
> It's actually going to be used a lot, so I swapped to a pointer.
Thanks.

> 
> > > +    NvmeIdNs *id_ns = &ns->id_ns;
> > > +    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> > > +}
> > > +
> > > +static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> > > +{
> > > +    return nvme_ns_lbaf(ns).ds;
> > > +}
> > > +
> > > +static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > > +{
> > > +    return 1 << nvme_ns_lbads(ns);
> > > +}
> > > +
> > >  #define TYPE_NVME "nvme"
> > >  #define NVME(obj) \
> > >          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > > @@ -101,4 +117,9 @@ typedef struct NvmeCtrl {
> > >      NvmeIdCtrl      id_ctrl;
> > >  } NvmeCtrl;
> > >  
> > > +static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > > +{
> > > +    return n->ns_size >> nvme_ns_lbads(ns);
> > > +}
> > 
> > Unless you need all these functions in the future, this feels like
> > it is a bit verbose.
> > 
> 
> These will be used in various places later.
OK, then it is all right.

>  
> 

Best regards,
	Maxim Levitsky






* Re: [PATCH v5 09/26] nvme: add temperature threshold feature
  2020-03-16  7:44         ` Klaus Birkelund Jensen
@ 2020-03-25 10:21           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:21 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:44 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 11:31, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > It might seem weird to implement this feature for an emulated device,
> > > but it is mandatory to support and the feature is useful for testing
> > > asynchronous event request support, which will be added in a later
> > > patch.
> > 
> > Absolutely, but as the old saying goes, rules are rules.
> > At least, in defense of the spec, making this mandatory
> > forced the vendors to actually report some statistics about
> > the device in a neutral format, as opposed to yet another
> > vendor-proprietary thing (I am talking about the SMART log page).
> > 
> > > 
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > 
> > I noticed that you sign off some patches with your @samsung.com email,
> > and some with @cnexlabs.com.
> > Is there a reason for that?
> 
> Yeah. Some of this code was written while I was at CNEX Labs. I've since
> moved to Samsung. But credit where credit's due.
I suspected something like that, but I just wanted to be sure that this is intentional,
and it looks all right to me now.

> 
> > 
> > 
> > > ---
> > >  hw/block/nvme.c      | 50 ++++++++++++++++++++++++++++++++++++++++++++
> > >  hw/block/nvme.h      |  2 ++
> > >  include/block/nvme.h |  7 ++++++-
> > >  3 files changed, 58 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 81514eaef63a..f72348344832 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -45,6 +45,9 @@
> > >  
> > >  #define NVME_SPEC_VER 0x00010201
> > >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > > +#define NVME_TEMPERATURE 0x143
> > > +#define NVME_TEMPERATURE_WARNING 0x157
> > > +#define NVME_TEMPERATURE_CRITICAL 0x175
> > >  
> > >  #define NVME_GUEST_ERR(trace, fmt, ...) \
> > >      do { \
> > > @@ -798,9 +801,31 @@ static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> > >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > > +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > >      uint32_t result;
> > >  
> > >      switch (dw10) {
> > > +    case NVME_TEMPERATURE_THRESHOLD:
> > > +        result = 0;
> > > +
> > > +        /*
> > > +         * The controller only implements the Composite Temperature sensor, so
> > > +         * return 0 for all other sensors.
> > > +         */
> > > +        if (NVME_TEMP_TMPSEL(dw11)) {
> > > +            break;
> > > +        }
> > > +
> > > +        switch (NVME_TEMP_THSEL(dw11)) {
> > > +        case 0x0:
> > > +            result = cpu_to_le16(n->features.temp_thresh_hi);
> > > +            break;
> > > +        case 0x1:
> > > +            result = cpu_to_le16(n->features.temp_thresh_low);
> > > +            break;
> > > +        }
> > > +
> > > +        break;
> > >      case NVME_VOLATILE_WRITE_CACHE:
> > >          result = blk_enable_write_cache(n->conf.blk);
> > >          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> > > @@ -845,6 +870,23 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > >  
> > >      switch (dw10) {
> > > +    case NVME_TEMPERATURE_THRESHOLD:
> > > +        if (NVME_TEMP_TMPSEL(dw11)) {
> > > +            break;
> > > +        }
> > > +
> > > +        switch (NVME_TEMP_THSEL(dw11)) {
> > > +        case 0x0:
> > > +            n->features.temp_thresh_hi = NVME_TEMP_TMPTH(dw11);
> > > +            break;
> > > +        case 0x1:
> > > +            n->features.temp_thresh_low = NVME_TEMP_TMPTH(dw11);
> > > +            break;
> > > +        default:
> > > +            return NVME_INVALID_FIELD | NVME_DNR;
> > > +        }
> > > +
> > > +        break;
> > >      case NVME_VOLATILE_WRITE_CACHE:
> > >          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > >          break;
> > > @@ -1366,6 +1408,9 @@ static void nvme_init_state(NvmeCtrl *n)
> > >      n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> > >      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> > >      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > > +
> > > +    n->temperature = NVME_TEMPERATURE;
> > 
> > This appears not to be used in the patch.
> > I think you should move that to the next patch that
> > adds the get log page support.
> > 
> 
> Fixed.
Thanks
> 
> > > +    n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> > >  }
> > >  
> > >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > > @@ -1447,6 +1492,11 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >      id->acl = 3;
> > >      id->frmw = 7 << 1;
> > >      id->lpa = 1 << 0;
> > > +
> > > +    /* recommended default value (~70 C) */
> > > +    id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > > +    id->cctemp = cpu_to_le16(NVME_TEMPERATURE_CRITICAL);
> > > +
> > >      id->sqes = (0x6 << 4) | 0x6;
> > >      id->cqes = (0x4 << 4) | 0x4;
> > >      id->nn = cpu_to_le32(n->num_namespaces);
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index a867bdfabafd..1518f32557a3 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -108,6 +108,7 @@ typedef struct NvmeCtrl {
> > >      uint64_t    irq_status;
> > >      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> > >      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
> > > +    uint16_t    temperature;
> > >  
> > >      NvmeNamespace   *namespaces;
> > >      NvmeSQueue      **sq;
> > > @@ -115,6 +116,7 @@ typedef struct NvmeCtrl {
> > >      NvmeSQueue      admin_sq;
> > >      NvmeCQueue      admin_cq;
> > >      NvmeIdCtrl      id_ctrl;
> > > +    NvmeFeatureVal  features;
> > >  } NvmeCtrl;
> > >  
> > >  static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index d2f65e8fe496..ff31cb32117c 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -616,7 +616,8 @@ enum NvmeIdCtrlOncs {
> > >  typedef struct NvmeFeatureVal {
> > >      uint32_t    arbitration;
> > >      uint32_t    power_mgmt;
> > > -    uint32_t    temp_thresh;
> > > +    uint16_t    temp_thresh_hi;
> > > +    uint16_t    temp_thresh_low;
> > >      uint32_t    err_rec;
> > >      uint32_t    volatile_wc;
> > >      uint32_t    num_queues;
> > > @@ -635,6 +636,10 @@ typedef struct NvmeFeatureVal {
> > >  #define NVME_INTC_THR(intc)     (intc & 0xff)
> > >  #define NVME_INTC_TIME(intc)    ((intc >> 8) & 0xff)
> > >  
> > > +#define NVME_TEMP_THSEL(temp)  ((temp >> 20) & 0x3)
> > > +#define NVME_TEMP_TMPSEL(temp) ((temp >> 16) & 0xf)
> > > +#define NVME_TEMP_TMPTH(temp)  (temp & 0xffff)
> > > +
> > >  enum NvmeFeatureIds {
> > >      NVME_ARBITRATION                = 0x1,
> > >      NVME_POWER_MANAGEMENT           = 0x2,
> > 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 

Best regards,
	Maxim Levitsky








* Re: [PATCH v5 10/26] nvme: add support for the get log page command
  2020-03-16  7:45         ` Klaus Birkelund Jensen
@ 2020-03-25 10:22           ` Maxim Levitsky
  2020-03-25 10:24           ` Maxim Levitsky
  1 sibling, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:22 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:45 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 11:35, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > Add support for the Get Log Page command and basic implementations of
> > > the mandatory Error Information, SMART / Health Information and Firmware
> > > Slot Information log pages.
> > > 
> > > In violation of the specification, the SMART / Health Information log
> > > page does not persist information over the lifetime of the controller
> > > because the device has no place to store such persistent state.
> > 
> > Yea, not the end of the world.
> > > 
> > > Note that the LPA field in the Identify Controller data structure
> > > intentionally has bit 0 cleared because there is no namespace specific
> > > information in the SMART / Health information log page.
> > 
> > Makes sense.
> > > 
> > > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > > Section 5.10 ("Get Log Page command").
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > ---
> > >  hw/block/nvme.c       | 122 +++++++++++++++++++++++++++++++++++++++++-
> > >  hw/block/nvme.h       |  10 ++++
> > >  hw/block/trace-events |   2 +
> > >  include/block/nvme.h  |   2 +-
> > >  4 files changed, 134 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index f72348344832..468c36918042 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      return NVME_SUCCESS;
> > >  }
> > >  
> > > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > > +    uint64_t off, NvmeRequest *req)
> > > +{
> > > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> > > +
> > > +    uint32_t trans_len;
> > > +    time_t current_ms;
> > > +    uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > > +        write_commands = 0;
> > > +    NvmeSmartLog smart;
> > > +    BlockAcctStats *s;
> > > +
> > > +    if (nsid && nsid != 0xffffffff) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    s = blk_get_stats(n->conf.blk);
> > > +
> > > +    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > > +    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > > +    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > > +    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > > +
> > > +    if (off > sizeof(smart)) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    trans_len = MIN(sizeof(smart) - off, buf_len);
> > > +
> > > +    memset(&smart, 0x0, sizeof(smart));
> > > +
> > > +    smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> > > +    smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> > > +    smart.host_read_commands[0] = cpu_to_le64(read_commands);
> > > +    smart.host_write_commands[0] = cpu_to_le64(write_commands);
> > > +
> > > +    smart.temperature[0] = n->temperature & 0xff;
> > > +    smart.temperature[1] = (n->temperature >> 8) & 0xff;
> > > +
> > > +    if ((n->temperature > n->features.temp_thresh_hi) ||
> > > +        (n->temperature < n->features.temp_thresh_low)) {
> > > +        smart.critical_warning |= NVME_SMART_TEMPERATURE;
> > > +    }
> > > +
> > > +    current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > > +    smart.power_on_hours[0] = cpu_to_le64(
> > > +        (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> > > +
> > > +    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > > +        prp2);
> > > +}
> > 
> > Looks OK.
> > > +
> > > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > > +    uint64_t off, NvmeRequest *req)
> > > +{
> > > +    uint32_t trans_len;
> > > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > +    NvmeFwSlotInfoLog fw_log;
> > > +
> > > +    if (off > sizeof(fw_log)) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> > > +
> > > +    trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > > +
> > > +    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > > +        prp2);
> > > +}
> > 
> > Looks OK
> > > +
> > > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > +{
> > > +    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > > +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > > +    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > > +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > > +    uint8_t  lid = dw10 & 0xff;
> > > +    uint8_t  rae = (dw10 >> 15) & 0x1;
> > > +    uint32_t numdl, numdu;
> > > +    uint64_t off, lpol, lpou;
> > > +    size_t   len;
> > > +
> > > +    numdl = (dw10 >> 16);
> > > +    numdu = (dw11 & 0xffff);
> > > +    lpol = dw12;
> > > +    lpou = dw13;
> > > +
> > > +    len = (((numdu << 16) | numdl) + 1) << 2;
> > > +    off = (lpou << 32ULL) | lpol;
> > > +
> > > +    if (off & 0x3) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > 
> > Good. 
> > Note that there are plenty of other places in the driver that don't honor
> > such tiny formal bits of the spec, like for instance checking for the reserved
> > bits in commands.
> 
> Yeah. I know. Do you think it's fair if we leave that for subsequent patches?
> It's not like it's breaking the device, but compliance is not complete.
I don't have a strong opinion on this one; I would just bump the spec version in the last patch.

> 
> > > +
> > > +    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > > +
> > > +    switch (lid) {
> > > +    case NVME_LOG_ERROR_INFO:
> > > +        if (off) {
> > > +            return NVME_INVALID_FIELD | NVME_DNR;
> > > +        }
> > 
> > I think you might want to memset the user-given buffer to zero:
> > 
> > "This is a 64-bit incrementing error count, indicating a unique identifier for this error.
> > The error count starts at 1h, is incremented for each unique error log entry, and is retained across
> > power off conditions. A value of 0h indicates an invalid entry; this value is used when there are
> > lost entries or when there are fewer errors than the maximum number of entries the controller
> > supports."
> 
> Good catch. Fixed!
> 
> > > +
> > > +        return NVME_SUCCESS;
> > > +    case NVME_LOG_SMART_INFO:
> > > +        return nvme_smart_info(n, cmd, len, off, req);
> > > +    case NVME_LOG_FW_SLOT_INFO:
> > > +        return nvme_fw_log_info(n, cmd, len, off, req);
> > > +    default:
> > > +        trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +}
> > 
> > 
> > > +
> > >  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
> > >  {
> > >      n->cq[cq->cqid] = NULL;
> > > @@ -914,6 +1031,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          return nvme_del_sq(n, cmd);
> > >      case NVME_ADM_CMD_CREATE_SQ:
> > >          return nvme_create_sq(n, cmd);
> > > +    case NVME_ADM_CMD_GET_LOG_PAGE:
> > > +        return nvme_get_log(n, cmd, req);
> > >      case NVME_ADM_CMD_DELETE_CQ:
> > >          return nvme_del_cq(n, cmd);
> > >      case NVME_ADM_CMD_CREATE_CQ:
> > > @@ -1411,6 +1530,7 @@ static void nvme_init_state(NvmeCtrl *n)
> > >  
> > >      n->temperature = NVME_TEMPERATURE;
> > >      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> > > +    n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > >  }
> > >  
> > >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > > @@ -1491,7 +1611,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >       */
> > >      id->acl = 3;
> > >      id->frmw = 7 << 1;
> > > -    id->lpa = 1 << 0;
> > > +    id->lpa = 1 << 2;
> > >  
> > >      /* recommended default value (~70 C) */
> > >      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index 1518f32557a3..89b0aafa02a2 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -109,6 +109,7 @@ typedef struct NvmeCtrl {
> > >      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> > >      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
> > >      uint16_t    temperature;
> > > +    uint64_t    starttime_ms;
> > >  
> > >      NvmeNamespace   *namespaces;
> > >      NvmeSQueue      **sq;
> > > @@ -124,4 +125,13 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > >      return n->ns_size >> nvme_ns_lbads(ns);
> > >  }
> > >  
> > > +static inline uint16_t nvme_cid(NvmeRequest *req)
> > > +{
> > > +    if (req) {
> > > +        return le16_to_cpu(req->cqe.cid);
> > > +    }
> > > +
> > > +    return 0xffff;
> > > +}
> > 
> > I see that you added command ID reporting to trace events you added,
> > which makes sense.
> > I think it would be nice later to add it to existing trace events where it makes sense.
> > 
> 
> Exactly. I'm doing that as I encounter it and it makes sense to have it
> in the patch.
OK, I don't mind.
> 
> > 
> > > +
> > >  #endif /* HW_NVME_H */
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index ade506ea2bb2..7da088479f39 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -46,6 +46,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> > >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> > >  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> > >  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> > > +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> > >  nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> > >  nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> > >  nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> > > @@ -85,6 +86,7 @@ nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completi
> > >  nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
> > >  nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
> > >  nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
> > > +nvme_dev_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
> > >  nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
> > >  nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
> > >  nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index ff31cb32117c..9a6055adeb61 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -515,7 +515,7 @@ enum NvmeSmartWarn {
> > >      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
> > >  };
> > >  
> > > -enum LogIdentifier {
> > > +enum NvmeLogIdentifier {
> > >      NVME_LOG_ERROR_INFO     = 0x01,
> > >      NVME_LOG_SMART_INFO     = 0x02,
> > >      NVME_LOG_FW_SLOT_INFO   = 0x03,
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 12/26] nvme: add missing mandatory features
  2020-03-16  7:47         ` Klaus Birkelund Jensen
@ 2020-03-25 10:22           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:22 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:47 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 12:27, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > Add support for returning a reasonable response to Get/Set Features of
> > > mandatory features.
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > ---
> > >  hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
> > >  hw/block/trace-events |  2 ++
> > >  include/block/nvme.h  |  3 ++-
> > >  3 files changed, 58 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index a186d95df020..3267ee2de47a 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -1008,7 +1008,15 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > >      uint32_t result;
> > >  
> > > +    trace_nvme_dev_getfeat(nvme_cid(req), dw10);
> > > +
> > >      switch (dw10) {
> > > +    case NVME_ARBITRATION:
> > > +        result = cpu_to_le32(n->features.arbitration);
> > > +        break;
> > > +    case NVME_POWER_MANAGEMENT:
> > > +        result = cpu_to_le32(n->features.power_mgmt);
> > > +        break;
> > >      case NVME_TEMPERATURE_THRESHOLD:
> > >          result = 0;
> > >  
> > > @@ -1029,6 +1037,9 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >              break;
> > >          }
> > >  
> > > +        break;
> > > +    case NVME_ERROR_RECOVERY:
> > > +        result = cpu_to_le32(n->features.err_rec);
> > >          break;
> > >      case NVME_VOLATILE_WRITE_CACHE:
> > >          result = blk_enable_write_cache(n->conf.blk);
> > 
> > This is existing code but still like to point out that endianess conversion is missing.
> 
> Fixed.
> 
> > Also, we should consider whether a flush is needed when the write cache is disabled.
> > I don't know that area well enough yet.
> > 
> 
> Looking at the block layer code it just sets a flag when disabling, but
> subsequent requests will have BDRV_REQ_FUA set. So to make sure that
> stuff in the cache is flushed, let's do a flush.
Good to know!
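The behavior settled on here can be sketched as follows (the `Blk` type and the two helpers are simplified stand-ins for QEMU's BlockBackend API; the control flow — flush once when the host disables the cache — is the point):

```c
#include <assert.h>
#include <stdint.h>

/* Minimal stand-in for QEMU's BlockBackend state. */
typedef struct {
    int write_cache_enabled;
    int flush_count;
} Blk;

static void blk_set_enable_write_cache(Blk *blk, int on)
{
    blk->write_cache_enabled = on;
}

static void blk_flush(Blk *blk)
{
    blk->flush_count++;
}

/* Set Features, Volatile Write Cache: flush when the host disables the
 * cache, so previously cached writes are made durable immediately rather
 * than relying only on subsequent BDRV_REQ_FUA-flagged requests. */
static void nvme_set_vwc(Blk *blk, uint32_t dw11)
{
    int enable = dw11 & 1;
    if (!enable && blk->write_cache_enabled) {
        blk_flush(blk);
    }
    blk_set_enable_write_cache(blk, enable);
}
```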

> 
> > > @@ -1041,6 +1052,19 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          break;
> > >      case NVME_TIMESTAMP:
> > >          return nvme_get_feature_timestamp(n, cmd);
> > > +    case NVME_INTERRUPT_COALESCING:
> > > +        result = cpu_to_le32(n->features.int_coalescing);
> > > +        break;
> > > +    case NVME_INTERRUPT_VECTOR_CONF:
> > > +        if ((dw11 & 0xffff) > n->params.num_queues) {
> > 
> > Looks like it should be >=, since valid interrupt vector indices only go up to num_queues - 1.
> 
> Fixed in other patch.
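The off-by-one being pointed out: the per-vector configuration array holds `num_queues` entries, so an index equal to `num_queues` is already out of range. A minimal sketch of the corrected guard (the status value is assumed from include/block/nvme.h):

```c
#include <assert.h>
#include <stdint.h>

#define NVME_INVALID_FIELD 0x0002 /* assumed status code value */

/* Get Features, Interrupt Vector Configuration: dw11 bits 15:0 select the
 * vector. The config array has num_queues entries, so the check must be
 * '>=' — '>' alone would allow a one-past-the-end index. */
static uint16_t nvme_get_feature_iv_conf(uint32_t dw11, uint32_t num_queues)
{
    uint32_t iv = dw11 & 0xffff;
    if (iv >= num_queues) {
        return NVME_INVALID_FIELD;
    }
    return 0; /* NVME_SUCCESS */
}
```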
> 
> > > +            return NVME_INVALID_FIELD | NVME_DNR;
> > > +        }
> > > +
> > > +        result = cpu_to_le32(n->features.int_vector_config[dw11 & 0xffff]);
> > > +        break;
> > > +    case NVME_WRITE_ATOMICITY:
> > > +        result = cpu_to_le32(n->features.write_atomicity);
> > > +        break;
> > >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> > >          result = cpu_to_le32(n->features.async_config);
> > >          break;
> > > @@ -1076,6 +1100,8 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > >  
> > > +    trace_nvme_dev_setfeat(nvme_cid(req), dw10, dw11);
> > > +
> > >      switch (dw10) {
> > >      case NVME_TEMPERATURE_THRESHOLD:
> > >          if (NVME_TEMP_TMPSEL(dw11)) {
> > > @@ -1116,6 +1142,13 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> > >          n->features.async_config = dw11;
> > >          break;
> > > +    case NVME_ARBITRATION:
> > > +    case NVME_POWER_MANAGEMENT:
> > > +    case NVME_ERROR_RECOVERY:
> > > +    case NVME_INTERRUPT_COALESCING:
> > > +    case NVME_INTERRUPT_VECTOR_CONF:
> > > +    case NVME_WRITE_ATOMICITY:
> > > +        return NVME_FEAT_NOT_CHANGABLE | NVME_DNR;
> > >      default:
> > >          trace_nvme_dev_err_invalid_setfeat(dw10);
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > > @@ -1689,6 +1722,21 @@ static void nvme_init_state(NvmeCtrl *n)
> > >      n->temperature = NVME_TEMPERATURE;
> > >      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> > >      n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > > +
> > > +    /*
> > > +     * There is no limit on the number of commands that the controller may
> > > +     * launch at one time from a particular Submission Queue.
> > > +     */
> > > +    n->features.arbitration = 0x7;
> > 
> > A #define in nvme.h stating that 0x7 means "no burst limit" would be nice.
> > 
> 
> Done.
> 
> > > +
> > > +    n->features.int_vector_config = g_malloc0_n(n->params.num_queues,
> > > +        sizeof(*n->features.int_vector_config));
> > > +
> > > +    /* disable coalescing (not supported) */
> > > +    for (int i = 0; i < n->params.num_queues; i++) {
> > > +        n->features.int_vector_config[i] = i | (1 << 16);
> > 
> > Same here
> 
> Done.
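The suggested constants might look like this (the names are illustrative; the ones that actually landed upstream may differ):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names for the magic values in the hunks above. */
#define NVME_ARB_BURST_NO_LIMIT  0x7        /* Arbitration Burst: unlimited */
#define NVME_INTVC_NOCOALESCING  (1 << 16)  /* coalescing disabled for vector */

/* Per-vector Interrupt Vector Configuration value: the vector id sits in
 * the low 16 bits, with the coalescing-disable bit above it. */
static uint32_t int_vector_config(uint32_t iv)
{
    return iv | NVME_INTVC_NOCOALESCING;
}
```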
> 
> > > +    }
> > > +
> > >      n->aer_reqs = g_new0(NvmeRequest *, n->params.aerl + 1);
> > >  }
> > >  
> > > @@ -1782,15 +1830,17 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >      id->nn = cpu_to_le32(n->num_namespaces);
> > >      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> > >  
> > > +
> > > +    if (blk_enable_write_cache(n->conf.blk)) {
> > > +        id->vwc = 1;
> > > +    }
> > > +
> > >      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> > >      pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
> > >  
> > >      id->psd[0].mp = cpu_to_le16(0x9c4);
> > >      id->psd[0].enlat = cpu_to_le32(0x10);
> > >      id->psd[0].exlat = cpu_to_le32(0x4);
> > > -    if (blk_enable_write_cache(n->conf.blk)) {
> > > -        id->vwc = 1;
> > > -    }
> > >  
> > >      n->bar.cap = 0;
> > >      NVME_CAP_SET_MQES(n->bar.cap, 0x7ff);
> > > @@ -1861,6 +1911,7 @@ static void nvme_exit(PCIDevice *pci_dev)
> > >      g_free(n->cq);
> > >      g_free(n->sq);
> > >      g_free(n->aer_reqs);
> > > +    g_free(n->features.int_vector_config);
> > >  
> > >      if (n->params.cmb_size_mb) {
> > >          g_free(n->cmbuf);
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index 3952c36774cf..4cf39961989d 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -41,6 +41,8 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
> > >  nvme_dev_identify_ctrl(void) "identify controller"
> > >  nvme_dev_identify_ns(uint16_t ns) "identify namespace, nsid=%"PRIu16""
> > >  nvme_dev_identify_nslist(uint16_t ns) "identify namespace list, nsid=%"PRIu16""
> > > +nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
> > > +nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
> > >  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> > >  nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> > >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index a24be047a311..09419ed499d0 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -445,7 +445,8 @@ enum NvmeStatusCodes {
> > >      NVME_FW_REQ_RESET           = 0x010b,
> > >      NVME_INVALID_QUEUE_DEL      = 0x010c,
> > >      NVME_FID_NOT_SAVEABLE       = 0x010d,
> > > -    NVME_FID_NOT_NSID_SPEC      = 0x010f,
> > > +    NVME_FEAT_NOT_CHANGABLE     = 0x010e,
> > > +    NVME_FEAT_NOT_NSID_SPEC     = 0x010f,
> > >      NVME_FW_REQ_SUSYSTEM_RESET  = 0x0110,
> > >      NVME_CONFLICTING_ATTRS      = 0x0180,
> > >      NVME_INVALID_PROT_INFO      = 0x0181,
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 

Thanks,
	Maxim Levitsky


* Re: [PATCH v5 15/26] nvme: bump supported specification to 1.3
  2020-03-16  7:50         ` Klaus Birkelund Jensen
@ 2020-03-25 10:22           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:22 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:50 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 12:35, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > Add new fields to the Identify Controller and Identify Namespace data
> > > structures according to NVM Express 1.3d.
> > > 
> > > NVM Express 1.3d requires the following additional features:
> > >   - addition of the Namespace Identification Descriptor List (CNS 03h)
> > >     for the Identify command
> > >   - support for returning Command Sequence Error if a Set Features
> > >     command is submitted for the Number of Queues feature after any I/O
> > >     queues have been created.
> > >   - The addition of the Log Specific Field (LSP) in the Get Log Page
> > >     command.
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > ---
> > >  hw/block/nvme.c       | 57 ++++++++++++++++++++++++++++++++++++++++---
> > >  hw/block/nvme.h       |  1 +
> > >  hw/block/trace-events |  3 ++-
> > >  include/block/nvme.h  | 20 ++++++++++-----
> > >  4 files changed, 71 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 900732bb2f38..4acfc85b56a2 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -9,7 +9,7 @@
> > >   */
> > >  
> > >  /**
> > > - * Reference Specification: NVM Express 1.2.1
> > > + * Reference Specification: NVM Express 1.3d
> > >   *
> > >   *   https://nvmexpress.org/resources/specifications/
> > >   */
> > > @@ -43,7 +43,7 @@
> > >  #include "trace.h"
> > >  #include "nvme.h"
> > >  
> > > -#define NVME_SPEC_VER 0x00010201
> > > +#define NVME_SPEC_VER 0x00010300
> > >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > >  #define NVME_TEMPERATURE 0x143
> > >  #define NVME_TEMPERATURE_WARNING 0x157
> > > @@ -735,6 +735,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > >      uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > >      uint8_t  lid = dw10 & 0xff;
> > > +    uint8_t  lsp = (dw10 >> 8) & 0xf;
> > >      uint8_t  rae = (dw10 >> 15) & 0x1;
> > >      uint32_t numdl, numdu;
> > >      uint64_t off, lpol, lpou;
> > > @@ -752,7 +753,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > >      }
> > >  
> > > -    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > > +    trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
> > >  
> > >      switch (lid) {
> > >      case NVME_LOG_ERROR_INFO:
> > > @@ -863,6 +864,8 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      cq = g_malloc0(sizeof(*cq));
> > >      nvme_init_cq(cq, n, prp1, cqid, vector, qsize + 1,
> > >          NVME_CQ_FLAGS_IEN(qflags));
> > 
> > Code alignment on that '('
> > > +
> > > +    n->qs_created = true;
> > 
> > Should be done also at nvme_create_sq
> 
> No, because you can't create an SQ without a matching CQ:
True, I missed that.

> 
>     if (unlikely(!cqid || nvme_check_cqid(n, cqid))) {
>         trace_nvme_dev_err_invalid_create_sq_cqid(cqid);
>         return NVME_INVALID_CQID | NVME_DNR;
>     }
> 
> 
> So if there is a matching cq, then qs_created = true.
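The invariant being relied on — an SQ cannot be created without an existing CQ, so setting the flag only on CQ creation suffices — as a tiny sketch (types and return codes are simplified stand-ins for the real QEMU ones):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified controller state. */
typedef struct {
    bool cq_exists[8];
    bool qs_created;
} Ctrl;

static int nvme_create_cq(Ctrl *n, uint16_t cqid)
{
    n->cq_exists[cqid] = true;
    n->qs_created = true;   /* the flag is set here, and only here */
    return 0;
}

static int nvme_create_sq(Ctrl *n, uint16_t cqid)
{
    if (!cqid || !n->cq_exists[cqid]) {
        return -1;          /* NVME_INVALID_CQID | NVME_DNR */
    }
    return 0;               /* reaching here implies qs_created is already true */
}
```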
> 
> > >      return NVME_SUCCESS;
> > >  }
> > >  
> > > @@ -924,6 +927,47 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> > >      return ret;
> > >  }
> > >  
> > > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > > +{
> > > +    static const int len = 4096;
> > 
> > The spec caps the Identify payload size at 4K,
> > so this constant should go to nvme.h.
> 
> Done.
> 
> > > +
> > > +    struct ns_descr {
> > > +        uint8_t nidt;
> > > +        uint8_t nidl;
> > > +        uint8_t rsvd2[2];
> > > +        uint8_t nid[16];
> > > +    };
> > 
> > This is also part of the spec, thus should
> > move to nvme.h
> > 
> 
> Done - and cleaned up.
Perfect, thanks!
> 
> > > +
> > > +    uint32_t nsid = le32_to_cpu(c->nsid);
> > > +    uint64_t prp1 = le64_to_cpu(c->prp1);
> > > +    uint64_t prp2 = le64_to_cpu(c->prp2);
> > > +
> > > +    struct ns_descr *list;
> > > +    uint16_t ret;
> > > +
> > > +    trace_nvme_dev_identify_ns_descr_list(nsid);
> > > +
> > > +    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > > +        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > > +        return NVME_INVALID_NSID | NVME_DNR;
> > > +    }
> > > +
> > > +    /*
> > > +     * Because the NGUID and EUI64 fields are 0 in the Identify Namespace data
> > > +     * structure, a Namespace UUID (nidt = 0x3) must be reported in the
> > > +     * Namespace Identification Descriptor. Add a very basic Namespace UUID
> > > +     * here.
> > 
> > A per-namespace uuid qemu property would be very nice to have, so that the
> > uuid is at least somewhat unique.
> > I think the Linux kernel might complain if it detects namespaces with duplicate uuids.
> 
> It will be "unique" per controller (because it's just the namespace id).
> The spec also says that it should be fixed for the lifetime of the
> namespace, but I'm not sure how to ensure that without keeping that
> state on disk somehow. I have a solution for this in a later series, but
> for now, I think this is ok.
> 
> But since we actually support multiple controllers, there certainly is
> an issue here. Maybe we can blend in some PCI id or something to make it
> unique across controllers.
IMHO, a qemu device property nicely shifts the blame for this to an external
management program (e.g. libvirt), which can indeed store it in its XML file
for the lifetime of the namespace.
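The scheme from the quoted hunk — a "UUID" with the big-endian namespace id in the last four bytes of the 16-byte NID — can be sketched without QEMU's cpu_to_be32 helper like this (unique per controller only, as the discussion above notes):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Build the 16-byte NID with the big-endian nsid in bytes 12..15,
 * mirroring '*(uint32_t *) &list->nid[12] = cpu_to_be32(nsid)'. */
static void build_nid(uint8_t nid[16], uint32_t nsid)
{
    memset(nid, 0, 16);
    nid[12] = (uint8_t)(nsid >> 24);
    nid[13] = (uint8_t)(nsid >> 16);
    nid[14] = (uint8_t)(nsid >> 8);
    nid[15] = (uint8_t)nsid;
}
```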

> 
> > 
> > > +     */
> > > +    list = g_malloc0(len);
> > > +    list->nidt = 0x3;
> > > +    list->nidl = 0x10;
> > 
> > Those should also be #defined in nvme.h
> 
> Fixed.
> 
> > > +    *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > > +
> > > +    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > > +    g_free(list);
> > > +    return ret;
> > > +}
> > > +
> > >  static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> > >  {
> > >      NvmeIdentify *c = (NvmeIdentify *)cmd;
> > > @@ -935,6 +979,8 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> > >          return nvme_identify_ctrl(n, c);
> > >      case 0x02:
> > >          return nvme_identify_ns_list(n, c);
> > > +    case 0x03:
> > 
> > The CNS values should be defined in nvme.h.
> 
> Fixed.
> 
> > > +        return nvme_identify_ns_descr_list(n, cmd);
> > >      default:
> > >          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > > @@ -1133,6 +1179,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > >          break;
> > >      case NVME_NUMBER_OF_QUEUES:
> > > +        if (n->qs_created) {
> > > +            return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > > +        }
> > > +
> > >          if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
> > >              return NVME_INVALID_FIELD | NVME_DNR;
> > >          }
> > > @@ -1267,6 +1317,7 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
> > >  
> > >      n->aer_queued = 0;
> > >      n->outstanding_aers = 0;
> > > +    n->qs_created = false;
> > >  
> > >      blk_flush(n->conf.blk);
> > >      n->bar.cc = 0;
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index 1e715ab1d75c..7ced5fd485a9 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -97,6 +97,7 @@ typedef struct NvmeCtrl {
> > >      BlockConf    conf;
> > >      NvmeParams   params;
> > >  
> > > +    bool        qs_created;
> > >      uint32_t    page_size;
> > >      uint16_t    page_bits;
> > >      uint16_t    max_prp_ents;
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index f982ec1a3221..9e5a4548bde0 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -41,6 +41,7 @@ nvme_dev_del_cq(uint16_t cqid) "deleted completion queue, sqid=%"PRIu16""
> > >  nvme_dev_identify_ctrl(void) "identify controller"
> > >  nvme_dev_identify_ns(uint32_t ns) "nsid %"PRIu32""
> > >  nvme_dev_identify_ns_list(uint32_t ns) "nsid %"PRIu32""
> > > +nvme_dev_identify_ns_descr_list(uint32_t ns) "nsid %"PRIu32""
> > >  nvme_dev_getfeat(uint16_t cid, uint32_t fid) "cid %"PRIu16" fid 0x%"PRIx32""
> > >  nvme_dev_setfeat(uint16_t cid, uint32_t fid, uint32_t val) "cid %"PRIu16" fid 0x%"PRIx32" val 0x%"PRIx32""
> > >  nvme_dev_getfeat_vwcache(const char* result) "get feature volatile write cache, result=%s"
> > > @@ -48,7 +49,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> > >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> > >  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> > >  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> > > -nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> > > +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t lsp, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" lsp 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> > >  nvme_dev_process_aers(int queued) "queued %d"
> > >  nvme_dev_aer(uint16_t cid) "cid %"PRIu16""
> > >  nvme_dev_aer_aerl_exceeded(void) "aerl exceeded"
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index 09419ed499d0..31eb9397d8c6 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -550,7 +550,9 @@ typedef struct NvmeIdCtrl {
> > >      uint32_t    rtd3e;
> > >      uint32_t    oaes;
> > >      uint32_t    ctratt;
> > > -    uint8_t     rsvd100[156];
> > > +    uint8_t     rsvd100[12];
> > > +    uint8_t     fguid[16];
> > > +    uint8_t     rsvd128[128];
> > 
> > looks OK
> > >      uint16_t    oacs;
> > >      uint8_t     acl;
> > >      uint8_t     aerl;
> > > @@ -568,9 +570,15 @@ typedef struct NvmeIdCtrl {
> > >      uint8_t     tnvmcap[16];
> > >      uint8_t     unvmcap[16];
> > >      uint32_t    rpmbs;
> > > -    uint8_t     rsvd316[4];
> > > +    uint16_t    edstt;
> > > +    uint8_t     dsto;
> > > +    uint8_t     fwug;
> > 
> > looks OK
> > >      uint16_t    kas;
> > > -    uint8_t     rsvd322[190];
> > > +    uint16_t    hctma;
> > > +    uint16_t    mntmt;
> > > +    uint16_t    mxtmt;
> > > +    uint32_t    sanicap;
> > > +    uint8_t     rsvd332[180];
> > 
> > looks OK
> > >      uint8_t     sqes;
> > >      uint8_t     cqes;
> > >      uint16_t    maxcmd;
> > > @@ -691,19 +699,19 @@ typedef struct NvmeIdNs {
> > >      uint8_t     rescap;
> > >      uint8_t     fpi;
> > >      uint8_t     dlfeat;
> > > -    uint8_t     rsvd33;
> > >      uint16_t    nawun;
> > >      uint16_t    nawupf;
> > > +    uint16_t    nacwu;
> > 
> > Aha! Here you 'fix' the bug you had in patch 4.

I thought for a moment that you didn't fix this, but after looking at v6, it
is fixed. I didn't get any feedback on patch 4, so I double-checked this.

> > >      uint16_t    nabsn;
> > >      uint16_t    nabo;
> > >      uint16_t    nabspf;
> > > -    uint8_t     rsvd46[2];
> > > +    uint16_t    noiob;
> > >      uint8_t     nvmcap[16];
> > >      uint8_t     rsvd64[40];
> > >      uint8_t     nguid[16];
> > >      uint64_t    eui64;
> > >      NvmeLBAF    lbaf[16];
> > > -    uint8_t     res192[192];
> > > +    uint8_t     rsvd192[192];
> > 
> > And even do what I suggested with that field :-)
> > Please squash the changes.
> > >      uint8_t     vs[3712];
> > >  } NvmeIdNs;
> > >  
> > 
> > So I suggest you squash this set of changes with patch 4.
> > I also suggest you to split the other changes in this patch, 1 per feature added.
> > The tracing change can also be squashed with the other tracing patch you submitted.
> > 
> > In summary I would suggest you to have:
> > 
> > 1. patch that only adds all the fields from the 1.3d spec, and overall updates nvme.h
> > to be up to 1.3d spec
> > 
> > 2. patches that do refactoring, add more tracing (also form of refactoring, since tracing
> > isn't a functional thing)
> > 
> > 3. set of patches that implement all the 1.3d features.
> > 
> > 4. patch that only bumps the supported version right to 1.3d
> > 
> 
> Did this! :)

Thank you!

Best regards,
	Maxim Levitsky

> 


* Re: [PATCH v5 16/26] nvme: refactor prp mapping
  2020-03-16  7:51         ` Klaus Birkelund Jensen
@ 2020-03-25 10:23           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:23 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:51 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 13:44, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > Refactor nvme_map_prp and allow PRPs to be located in the CMB. The logic
> > > ensures that if some of the PRP is in the CMB, all of it must be located
> > > there, as per the specification.
> > 
> > To be honest this looks not like refactoring but like a bugfix
> > (the old code just assumed that if the first prp entry is in the cmb, the rest are too).
> 
> I split it up into a separate bugfix patch.
> 
> > > 
> > > Also combine nvme_dma_{read,write}_prp into a single nvme_dma_prp that
> > > takes an additional DMADirection parameter.
> > 
> > To be honest 'nvme_dma_prp' was not a clear function name to me at first glance.
> > Could you rename this to nvme_dma_prp_rw or so? (Although even that only partly
> > conveys the meaning: read/write data to/from the guest memory areas defined by the prp list.)
> > Also, could you split this change into a new patch?
> > 
> 
> Splitting into new patch.
> 
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > 
> > Now you even use both your addresses :-)
> > 
> > > ---
> > >  hw/block/nvme.c       | 245 +++++++++++++++++++++++++++---------------
> > >  hw/block/nvme.h       |   2 +-
> > >  hw/block/trace-events |   1 +
> > >  include/block/nvme.h  |   1 +
> > >  4 files changed, 160 insertions(+), 89 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 4acfc85b56a2..334265efb21e 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -58,6 +58,11 @@
> > >  
> > >  static void nvme_process_sq(void *opaque);
> > >  
> > > +static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> > > +{
> > > +    return &n->cmbuf[addr - n->ctrl_mem.addr];
> > > +}
> > 
> > To my taste I would put this together with the patch that
> > added nvme_addr_is_cmb. I know that some people are against
> > this, citing the rule that you should use the code you add
> > in the same patch. Your call.
> > 
> > Regardless of this I also prefer to put refactoring patches first in the series.
Thanks!
> > 
> > > +
> > >  static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> > >  {
> > >      hwaddr low = n->ctrl_mem.addr;
> > > @@ -152,138 +157,187 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
> > >      }
> > >  }
> > >  
> > > -static uint16_t nvme_map_prp(QEMUSGList *qsg, QEMUIOVector *iov, uint64_t prp1,
> > > -                             uint64_t prp2, uint32_t len, NvmeCtrl *n)
> > > +static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > > +    uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
> > 
> > Split-line alignment (it was correct before).
> > Also, while at the refactoring, it would be great to add some documentation
> > to this and a few more functions, since it's not immediately clear what this does.
> > 
> > 
> > >  {
> > >      hwaddr trans_len = n->page_size - (prp1 % n->page_size);
> > >      trans_len = MIN(len, trans_len);
> > >      int num_prps = (len >> n->page_bits) + 1;
> > > +    uint16_t status = NVME_SUCCESS;
> > > +    bool is_cmb = false;
> > > +    bool prp_list_in_cmb = false;
> > > +
> > > +    trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> > > +        prp1, prp2, num_prps);
> > >  
> > >      if (unlikely(!prp1)) {
> > >          trace_nvme_dev_err_invalid_prp();
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > > -    } else if (n->cmbsz && prp1 >= n->ctrl_mem.addr &&
> > > -               prp1 < n->ctrl_mem.addr + int128_get64(n->ctrl_mem.size)) {
> > > -        qsg->nsg = 0;
> > > +    }
> > > +
> > > +    if (nvme_addr_is_cmb(n, prp1)) {
> > > +        is_cmb = true;
> > > +
> > >          qemu_iovec_init(iov, num_prps);
> > > -        qemu_iovec_add(iov, (void *)&n->cmbuf[prp1 - n->ctrl_mem.addr], trans_len);
> > > +
> > > +        /*
> > > +         * PRPs do not cross page boundaries, so if the start address (here,
> > > +         * prp1) is within the CMB, it cannot cross outside the controller
> > > +         * memory buffer range. This is ensured by
> > > +         *
> > > +         *   len = n->page_size - (addr % n->page_size)
> > > +         *
> > > +         * Thus, we can directly add to the iovec without risking an out of
> > > +         * bounds access. This also holds for the remaining qemu_iovec_add
> > > +         * calls.
> > > +         */
> > > +        qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp1), trans_len);
> > >      } else {
> > >          pci_dma_sglist_init(qsg, &n->parent_obj, num_prps);
> > >          qemu_sglist_add(qsg, prp1, trans_len);
> > >      }
> > > +
> > >      len -= trans_len;
> > >      if (len) {
> > >          if (unlikely(!prp2)) {
> > >              trace_nvme_dev_err_invalid_prp2_missing();
> > > +            status = NVME_INVALID_FIELD | NVME_DNR;
> > >              goto unmap;
> > >          }
> > > +
> > >          if (len > n->page_size) {
> > >              uint64_t prp_list[n->max_prp_ents];
> > >              uint32_t nents, prp_trans;
> > >              int i = 0;
> > >  
> > > +            if (nvme_addr_is_cmb(n, prp2)) {
> > > +                prp_list_in_cmb = true;
> > > +            }
> > > +
> > >              nents = (len + n->page_size - 1) >> n->page_bits;
> > >              prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > > -            nvme_addr_read(n, prp2, (void *)prp_list, prp_trans);
> > > +            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > >              while (len != 0) {
> > >                  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> > >  
> > >                  if (i == n->max_prp_ents - 1 && len > n->page_size) {
> > >                      if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> > >                          trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
> > > +                        status = NVME_INVALID_FIELD | NVME_DNR;
> > > +                        goto unmap;
> > > +                    }
> > > +
> > > +                    if (prp_list_in_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> > > +                        status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > >                          goto unmap;
> > >                      }
> > >  
> > >                      i = 0;
> > >                      nents = (len + n->page_size - 1) >> n->page_bits;
> > >                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > > -                    nvme_addr_read(n, prp_ent, (void *)prp_list,
> > > -                        prp_trans);
> > > +                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
> > >                      prp_ent = le64_to_cpu(prp_list[i]);
> > >                  }
> > >  
> > >                  if (unlikely(!prp_ent || prp_ent & (n->page_size - 1))) {
> > >                      trace_nvme_dev_err_invalid_prplist_ent(prp_ent);
> > > +                    status = NVME_INVALID_FIELD | NVME_DNR;
> > > +                    goto unmap;
> > > +                }
> > > +
> > > +                if (is_cmb != nvme_addr_is_cmb(n, prp_ent)) {
> > > +                    status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > >                      goto unmap;
> > >                  }
> > >  
> > >                  trans_len = MIN(len, n->page_size);
> > > -                if (qsg->nsg){
> > > -                    qemu_sglist_add(qsg, prp_ent, trans_len);
> > > +                if (is_cmb) {
> > > +                    qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp_ent),
> > > +                        trans_len);
> > >                  } else {
> > > -                    qemu_iovec_add(iov, (void *)&n->cmbuf[prp_ent - n->ctrl_mem.addr], trans_len);
> > > +                    qemu_sglist_add(qsg, prp_ent, trans_len);
> > >                  }
> > > +
> > >                  len -= trans_len;
> > >                  i++;
> > >              }
> > >          } else {
> > > +            if (is_cmb != nvme_addr_is_cmb(n, prp2)) {
> > > +                status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > > +                goto unmap;
> > > +            }
> > > +
> > >              if (unlikely(prp2 & (n->page_size - 1))) {
> > >                  trace_nvme_dev_err_invalid_prp2_align(prp2);
> > > +                status = NVME_INVALID_FIELD | NVME_DNR;
> > >                  goto unmap;
> > >              }
> > > -            if (qsg->nsg) {
> > > +
> > > +            if (is_cmb) {
> > > +                qemu_iovec_add(iov, nvme_addr_to_cmb(n, prp2), len);
> > > +            } else {
> > >                  qemu_sglist_add(qsg, prp2, len);
> > > -            } else {
> > > -                qemu_iovec_add(iov, (void *)&n->cmbuf[prp2 - n->ctrl_mem.addr], trans_len);
> > >              }
> > >          }
> > >      }
> > > +
> > >      return NVME_SUCCESS;
> > >  
> > > - unmap:
> > > -    qemu_sglist_destroy(qsg);
> > > -    return NVME_INVALID_FIELD | NVME_DNR;
> > > -}
> > 
> > I haven't checked the new nvme_map_prp closely enough to be sure that
> > it is correct, but it looks reasonable.
> > 
> > > -
> > > -static uint16_t nvme_dma_write_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > > -                                   uint64_t prp1, uint64_t prp2)
> > > -{
> > > -    QEMUSGList qsg;
> > > -    QEMUIOVector iov;
> > > -    uint16_t status = NVME_SUCCESS;
> > > -
> > > -    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > > -        return NVME_INVALID_FIELD | NVME_DNR;
> > > -    }
> > > -    if (qsg.nsg > 0) {
> > > -        if (dma_buf_write(ptr, len, &qsg)) {
> > > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > > -        }
> > > -        qemu_sglist_destroy(&qsg);
> > > +unmap:
> > > +    if (is_cmb) {
> > > +        qemu_iovec_destroy(iov);
> > >      } else {
> > > -        if (qemu_iovec_to_buf(&iov, 0, ptr, len) != len) {
> > > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > > -        }
> > > -        qemu_iovec_destroy(&iov);
> > > +        qemu_sglist_destroy(qsg);
> > >      }
> > > +
> > >      return status;
> > >  }
> > >  
> > > -static uint16_t nvme_dma_read_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > > -    uint64_t prp1, uint64_t prp2)
> > > +static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > > +    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
> > >  {
> > >      QEMUSGList qsg;
> > >      QEMUIOVector iov;
> > >      uint16_t status = NVME_SUCCESS;
> > > +    size_t bytes;
> > >  
> > > -    trace_nvme_dev_dma_read(prp1, prp2);
> > > -
> > > -    if (nvme_map_prp(&qsg, &iov, prp1, prp2, len, n)) {
> > > -        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    status = nvme_map_prp(n, &qsg, &iov, prp1, prp2, len, req);
> > > +    if (status) {
> > > +        return status;
> > >      }
> > > +
> > >      if (qsg.nsg > 0) {
> > > -        if (unlikely(dma_buf_read(ptr, len, &qsg))) {
> > > +        uint64_t residual;
> > > +
> > > +        if (dir == DMA_DIRECTION_TO_DEVICE) {
> > > +            residual = dma_buf_write(ptr, len, &qsg);
> > > +        } else {
> > > +            residual = dma_buf_read(ptr, len, &qsg);
> > > +        }
> > > +
> > > +        if (unlikely(residual)) {
> > >              trace_nvme_dev_err_invalid_dma();
> > >              status = NVME_INVALID_FIELD | NVME_DNR;
> > >          }
> > > +
> > >          qemu_sglist_destroy(&qsg);
> > > +
> > > +        return status;
> > 
> > I would prefer an if/else here rather than the early return.
> > It would make the code more symmetric.
> > 
> 
> Looks nicer yeah. Done.
> 
> > > +    }
> > > +
> > > +    if (dir == DMA_DIRECTION_TO_DEVICE) {
> > > +        bytes = qemu_iovec_to_buf(&iov, 0, ptr, len);
> > >      } else {
> > > -        if (unlikely(qemu_iovec_from_buf(&iov, 0, ptr, len) != len)) {
> > > -            trace_nvme_dev_err_invalid_dma();
> > > -            status = NVME_INVALID_FIELD | NVME_DNR;
> > > -        }
> > > -        qemu_iovec_destroy(&iov);
> > > +        bytes = qemu_iovec_from_buf(&iov, 0, ptr, len);
> > >      }
> > > +
> > > +    if (unlikely(bytes != len)) {
> > > +        trace_nvme_dev_err_invalid_dma();
> > > +        status = NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    qemu_iovec_destroy(&iov);
> > > +
> > >      return status;
> > >  }
> > >  
> > > @@ -420,16 +474,20 @@ static void nvme_rw_cb(void *opaque, int ret)
> > >          block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
> > >          req->status = NVME_INTERNAL_DEV_ERROR;
> > >      }
> > > -    if (req->has_sg) {
> > > +
> > > +    if (req->qsg.nalloc) {
> > >          qemu_sglist_destroy(&req->qsg);
> > >      }
> > > +    if (req->iov.nalloc) {
> > > +        qemu_iovec_destroy(&req->iov);
> > > +    }
> > > +
> > >      nvme_enqueue_req_completion(cq, req);
> > >  }
> > >  
> > >  static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > >      NvmeRequest *req)
> > >  {
> > > -    req->has_sg = false;
> > >      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> > >           BLOCK_ACCT_FLUSH);
> > >      req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> > > @@ -453,7 +511,6 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > >          return NVME_LBA_RANGE | NVME_DNR;
> > >      }
> > >  
> > > -    req->has_sg = false;
> > >      block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> > >                       BLOCK_ACCT_WRITE);
> > >      req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> > > @@ -485,21 +542,24 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > >          return NVME_LBA_RANGE | NVME_DNR;
> > >      }
> > >  
> > > -    if (nvme_map_prp(&req->qsg, &req->iov, prp1, prp2, data_size, n)) {
> > > +    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
> > >          block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > >      }
> > >  
> > > -    dma_acct_start(n->conf.blk, &req->acct, &req->qsg, acct);
> > >      if (req->qsg.nsg > 0) {
> > > -        req->has_sg = true;
> > > +        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
> > > +            acct);
> > > +
> > >          req->aiocb = is_write ?
> > >              dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> > >                            nvme_rw_cb, req) :
> > >              dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> > >                           nvme_rw_cb, req);
> > >      } else {
> > > -        req->has_sg = false;
> > > +        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
> > > +            acct);
> > > +
> > >          req->aiocb = is_write ?
> > >              blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> > >                              req) :
> > > @@ -596,7 +656,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
> > >      sq->size = size;
> > >      sq->cqid = cqid;
> > >      sq->head = sq->tail = 0;
> > > -    sq->io_req = g_new(NvmeRequest, sq->size);
> > > +    sq->io_req = g_new0(NvmeRequest, sq->size);
> > >  
> > >      QTAILQ_INIT(&sq->req_list);
> > >      QTAILQ_INIT(&sq->out_req_list);
> > > @@ -704,8 +764,8 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> > >          nvme_clear_events(n, NVME_AER_TYPE_SMART);
> > >      }
> > >  
> > > -    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > > -        prp2);
> > > +    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > > +        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > >  static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > > @@ -724,8 +784,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > >  
> > >      trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > >  
> > > -    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > > -        prp2);
> > > +    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > > +        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > >  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > @@ -869,18 +929,20 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      return NVME_SUCCESS;
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c)
> > > +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
> > > +    NvmeRequest *req)
> > >  {
> > >      uint64_t prp1 = le64_to_cpu(c->prp1);
> > >      uint64_t prp2 = le64_to_cpu(c->prp2);
> > >  
> > >      trace_nvme_dev_identify_ctrl();
> > >  
> > > -    return nvme_dma_read_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> > > -        prp1, prp2);
> > > +    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> > > +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
> > > +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> > > +    NvmeRequest *req)
> > >  {
> > >      NvmeNamespace *ns;
> > >      uint32_t nsid = le32_to_cpu(c->nsid);
> > > @@ -896,11 +958,12 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c)
> > >  
> > >      ns = &n->namespaces[nsid - 1];
> > >  
> > > -    return nvme_dma_read_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> > > -        prp1, prp2);
> > > +    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> > > +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> > > +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> > > +    NvmeRequest *req)
> > >  {
> > >      static const int data_len = 4 * KiB;
> > >      uint32_t min_nsid = le32_to_cpu(c->nsid);
> > > @@ -922,12 +985,14 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c)
> > >              break;
> > >          }
> > >      }
> > > -    ret = nvme_dma_read_prp(n, (uint8_t *)list, data_len, prp1, prp2);
> > > +    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >      g_free(list);
> > >      return ret;
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> > > +    NvmeRequest *req)
> > >  {
> > >      static const int len = 4096;
> > >  
> > > @@ -963,24 +1028,25 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *c)
> > >      list->nidl = 0x10;
> > >      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > >  
> > > -    ret = nvme_dma_read_prp(n, (uint8_t *) list, len, prp1, prp2);
> > > +    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >      g_free(list);
> > >      return ret;
> > >  }
> > >  
> > > -static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd)
> > > +static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > >      NvmeIdentify *c = (NvmeIdentify *)cmd;
> > >  
> > >      switch (le32_to_cpu(c->cns)) {
> > >      case 0x00:
> > > -        return nvme_identify_ns(n, c);
> > > +        return nvme_identify_ns(n, c, req);
> > >      case 0x01:
> > > -        return nvme_identify_ctrl(n, c);
> > > +        return nvme_identify_ctrl(n, c, req);
> > >      case 0x02:
> > > -        return nvme_identify_ns_list(n, c);
> > > +        return nvme_identify_ns_list(n, c, req);
> > >      case 0x03:
> > > -        return nvme_identify_ns_descr_list(n, cmd);
> > > +        return nvme_identify_ns_descr_list(n, c, req);
> > >      default:
> > >          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > > @@ -1039,15 +1105,16 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> > >      return cpu_to_le64(ts.all);
> > >  }
> > >  
> > > -static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> > > +static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > > +    NvmeRequest *req)
> > >  {
> > >      uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > >      uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > >  
> > >      uint64_t timestamp = nvme_get_timestamp(n);
> > >  
> > > -    return nvme_dma_read_prp(n, (uint8_t *)&timestamp,
> > > -                                 sizeof(timestamp), prp1, prp2);
> > > +    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
> > > +        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > @@ -1099,7 +1166,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          trace_nvme_dev_getfeat_numq(result);
> > >          break;
> > >      case NVME_TIMESTAMP:
> > > -        return nvme_get_feature_timestamp(n, cmd);
> > > +        return nvme_get_feature_timestamp(n, cmd, req);
> > >      case NVME_INTERRUPT_COALESCING:
> > >          result = cpu_to_le32(n->features.int_coalescing);
> > >          break;
> > > @@ -1125,15 +1192,16 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      return NVME_SUCCESS;
> > >  }
> > >  
> > > -static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd)
> > > +static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > > +    NvmeRequest *req)
> > >  {
> > >      uint16_t ret;
> > >      uint64_t timestamp;
> > >      uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > >      uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > >  
> > > -    ret = nvme_dma_write_prp(n, (uint8_t *)&timestamp,
> > > -                                sizeof(timestamp), prp1, prp2);
> > > +    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
> > > +        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
> > >      if (ret != NVME_SUCCESS) {
> > >          return ret;
> > >      }
> > > @@ -1194,7 +1262,7 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >              ((n->params.num_queues - 2) << 16));
> > >          break;
> > >      case NVME_TIMESTAMP:
> > > -        return nvme_set_feature_timestamp(n, cmd);
> > > +        return nvme_set_feature_timestamp(n, cmd, req);
> > >      case NVME_ASYNCHRONOUS_EVENT_CONF:
> > >          n->features.async_config = dw11;
> > >          break;
> > > @@ -1246,7 +1314,7 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      case NVME_ADM_CMD_CREATE_CQ:
> > >          return nvme_create_cq(n, cmd);
> > >      case NVME_ADM_CMD_IDENTIFY:
> > > -        return nvme_identify(n, cmd);
> > > +        return nvme_identify(n, cmd, req);
> > >      case NVME_ADM_CMD_ABORT:
> > >          return nvme_abort(n, cmd, req);
> > >      case NVME_ADM_CMD_SET_FEATURES:
> > > @@ -1282,6 +1350,7 @@ static void nvme_process_sq(void *opaque)
> > >          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> > >          memset(&req->cqe, 0, sizeof(req->cqe));
> > >          req->cqe.cid = cmd.cid;
> > > +        memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
> > >  
> > >          status = sq->sqid ? nvme_io_cmd(n, &cmd, req) :
> > >              nvme_admin_cmd(n, &cmd, req);
> > > @@ -1804,7 +1873,7 @@ static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > >  
> > >      NVME_CMBSZ_SET_SQS(n->bar.cmbsz, 1);
> > >      NVME_CMBSZ_SET_CQS(n->bar.cmbsz, 0);
> > > -    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 0);
> > > +    NVME_CMBSZ_SET_LISTS(n->bar.cmbsz, 1);
> > >      NVME_CMBSZ_SET_RDS(n->bar.cmbsz, 1);
> > >      NVME_CMBSZ_SET_WDS(n->bar.cmbsz, 1);
> > >      NVME_CMBSZ_SET_SZU(n->bar.cmbsz, 2);
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index 7ced5fd485a9..d27baa9d5391 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -27,11 +27,11 @@ typedef struct NvmeRequest {
> > >      struct NvmeSQueue       *sq;
> > >      BlockAIOCB              *aiocb;
> > >      uint16_t                status;
> > > -    bool                    has_sg;
> > >      NvmeCqe                 cqe;
> > >      BlockAcctCookie         acct;
> > >      QEMUSGList              qsg;
> > >      QEMUIOVector            iov;
> > > +    NvmeCmd                 cmd;
> > >      QTAILQ_ENTRY(NvmeRequest)entry;
> > >  } NvmeRequest;
> > >  
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index 9e5a4548bde0..77aa0da99ee0 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -33,6 +33,7 @@ nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
> > >  nvme_dev_irq_pin(void) "pulsing IRQ pin"
> > >  nvme_dev_irq_masked(void) "IRQ is masked"
> > >  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> > > +nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> > >  nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> > >  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
> > >  nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index 31eb9397d8c6..c1de92179596 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -427,6 +427,7 @@ enum NvmeStatusCodes {
> > >      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
> > >      NVME_INVALID_NSID           = 0x000b,
> > >      NVME_CMD_SEQ_ERROR          = 0x000c,
> > > +    NVME_INVALID_USE_OF_CMB     = 0x0012,
> > >      NVME_LBA_RANGE              = 0x0080,
> > >      NVME_CAP_EXCEEDED           = 0x0081,
> > >      NVME_NS_NOT_READY           = 0x0082,
> > 
> > 
> > Overall I would split this commit into real refactoring and bugfixes.
> 
> Done!
> 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 20/26] nvme: handle dma errors
  2020-03-16  7:53         ` Klaus Birkelund Jensen
@ 2020-03-25 10:23           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:23 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:53 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 13:52, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > > Handling DMA errors gracefully is required for the device to pass the
> > > block/011 test ("disable PCI device while doing I/O") in the blktests
> > > suite.
> > > 
> > > With this patch the device passes the test by retrying "critical"
> > > transfers (posting of completion entries and processing of submission
> > > queue entries).
> > > 
> > > If DMA errors occur at any other point in the execution of the command
> > > (say, while mapping the PRPs), the command is aborted with a Data
> > > Transfer Error status code.
> > > 
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > > ---
> > >  hw/block/nvme.c       | 42 +++++++++++++++++++++++++++++++++---------
> > >  hw/block/trace-events |  2 ++
> > >  include/block/nvme.h  |  2 +-
> > >  3 files changed, 36 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index f8c81b9e2202..204ae1d33234 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -73,14 +73,14 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> > >      return addr >= low && addr < hi;
> > >  }
> > >  
> > > -static void nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > > +static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > >  {
> > >      if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > >          memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > > -        return;
> > > +        return 0;
> > >      }
> > >  
> > > -    pci_dma_read(&n->parent_obj, addr, buf, size);
> > > +    return pci_dma_read(&n->parent_obj, addr, buf, size);
> > >  }
> > >  
> > >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> > > @@ -168,6 +168,7 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > >      uint16_t status = NVME_SUCCESS;
> > >      bool is_cmb = false;
> > >      bool prp_list_in_cmb = false;
> > > +    int ret;
> > >  
> > >      trace_nvme_dev_map_prp(nvme_cid(req), req->cmd.opcode, trans_len, len,
> > >          prp1, prp2, num_prps);
> > > @@ -218,7 +219,12 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > >  
> > >              nents = (len + n->page_size - 1) >> n->page_bits;
> > >              prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > > -            nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > > +            ret = nvme_addr_read(n, prp2, (void *) prp_list, prp_trans);
> > > +            if (ret) {
> > > +                trace_nvme_dev_err_addr_read(prp2);
> > > +                status = NVME_DATA_TRANSFER_ERROR;
> > > +                goto unmap;
> > > +            }
> > >              while (len != 0) {
> > >                  uint64_t prp_ent = le64_to_cpu(prp_list[i]);
> > >  
> > > @@ -237,7 +243,13 @@ static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > >                      i = 0;
> > >                      nents = (len + n->page_size - 1) >> n->page_bits;
> > >                      prp_trans = MIN(n->max_prp_ents, nents) * sizeof(uint64_t);
> > > -                    nvme_addr_read(n, prp_ent, (void *) prp_list, prp_trans);
> > > +                    ret = nvme_addr_read(n, prp_ent, (void *) prp_list,
> > > +                        prp_trans);
> > > +                    if (ret) {
> > > +                        trace_nvme_dev_err_addr_read(prp_ent);
> > > +                        status = NVME_DATA_TRANSFER_ERROR;
> > > +                        goto unmap;
> > > +                    }
> > >                      prp_ent = le64_to_cpu(prp_list[i]);
> > >                  }
> > >  
> > > @@ -443,6 +455,7 @@ static void nvme_post_cqes(void *opaque)
> > >      NvmeCQueue *cq = opaque;
> > >      NvmeCtrl *n = cq->ctrl;
> > >      NvmeRequest *req, *next;
> > > +    int ret;
> > >  
> > >      QTAILQ_FOREACH_SAFE(req, &cq->req_list, entry, next) {
> > >          NvmeSQueue *sq;
> > > @@ -452,15 +465,21 @@ static void nvme_post_cqes(void *opaque)
> > >              break;
> > >          }
> > >  
> > > -        QTAILQ_REMOVE(&cq->req_list, req, entry);
> > >          sq = req->sq;
> > >          req->cqe.status = cpu_to_le16((req->status << 1) | cq->phase);
> > >          req->cqe.sq_id = cpu_to_le16(sq->sqid);
> > >          req->cqe.sq_head = cpu_to_le16(sq->head);
> > >          addr = cq->dma_addr + cq->tail * n->cqe_size;
> > > -        nvme_inc_cq_tail(cq);
> > > -        pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> > > +        ret = pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> > >              sizeof(req->cqe));
> > > +        if (ret) {
> > > +            trace_nvme_dev_err_addr_write(addr);
> > > +            timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > > +                100 * SCALE_MS);
> > > +            break;
> > > +        }
> > > +        QTAILQ_REMOVE(&cq->req_list, req, entry);
> > > +        nvme_inc_cq_tail(cq);
> > >          nvme_req_clear(req);
> > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > >      }
> > > @@ -1588,7 +1607,12 @@ static void nvme_process_sq(void *opaque)
> > >  
> > >      while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
> > >          addr = sq->dma_addr + sq->head * n->sqe_size;
> > > -        nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd));
> > > +        if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> > > +            trace_nvme_dev_err_addr_read(addr);
> > > +            timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) +
> > > +                100 * SCALE_MS);
> > > +            break;
> > > +        }
> > 
> > Note that once the driver is optimized for performance, these timers must
> > go, since they run on the main thread and add latency to each request.
> > But for now this change is all right.
> > 
> > As for a user deliberately triggering this every 100ms, I don't think it
> > is such a big issue. Maybe bump it to 500ms or even one second, since this
> > condition will not happen in real-life usage of the device anyway.
> > 
> 
> I bumped it to 500ms.
Great!
> 
> > >          nvme_inc_sq_head(sq);
> > >  
> > >          req = QTAILQ_FIRST(&sq->req_list);
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index 90a57fb6099a..09bfb3782dd0 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -83,6 +83,8 @@ nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
> > >  nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
> > >  nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> > > +nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
> > > +nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
> > > +nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
> > >  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> > >  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> > >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index c1de92179596..a873776d98b8 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -418,7 +418,7 @@ enum NvmeStatusCodes {
> > >      NVME_INVALID_OPCODE         = 0x0001,
> > >      NVME_INVALID_FIELD          = 0x0002,
> > >      NVME_CID_CONFLICT           = 0x0003,
> > > -    NVME_DATA_TRAS_ERROR        = 0x0004,
> > > +    NVME_DATA_TRANSFER_ERROR    = 0x0004,
> > >      NVME_POWER_LOSS_ABORT       = 0x0005,
> > >      NVME_INTERNAL_DEV_ERROR     = 0x0006,
> > >      NVME_CMD_ABORT_REQ          = 0x0007,
> > 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 


Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 21/26] nvme: add support for scatter gather lists
  2020-03-16  7:54         ` Klaus Birkelund Jensen
@ 2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:24 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:54 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 14:07, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > > For now, support the Data Block, Segment and Last Segment descriptor
> > > types.
> > > 
> > > See NVM Express 1.3d, Section 4.4 ("Scatter Gather List (SGL)").
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > Acked-by: Fam Zheng <fam@euphon.net>
> > > ---
> > >  block/nvme.c          |  18 +-
> > >  hw/block/nvme.c       | 375 +++++++++++++++++++++++++++++++++++-------
> > >  hw/block/trace-events |   4 +
> > >  include/block/nvme.h  |  62 ++++++-
> > >  4 files changed, 389 insertions(+), 70 deletions(-)
> > > 
> > > diff --git a/block/nvme.c b/block/nvme.c
> > > index d41c4bda6e39..521f521054d5 100644
> > > --- a/block/nvme.c
> > > +++ b/block/nvme.c
> > > @@ -446,7 +446,7 @@ static void nvme_identify(BlockDriverState *bs, int namespace, Error **errp)
> > >          error_setg(errp, "Cannot map buffer for DMA");
> > >          goto out;
> > >      }
> > > -    cmd.prp1 = cpu_to_le64(iova);
> > > +    cmd.dptr.prp.prp1 = cpu_to_le64(iova);
> > >  
> > >      if (nvme_cmd_sync(bs, s->queues[0], &cmd)) {
> > >          error_setg(errp, "Failed to identify controller");
> > > @@ -545,7 +545,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> > >      }
> > >      cmd = (NvmeCmd) {
> > >          .opcode = NVME_ADM_CMD_CREATE_CQ,
> > > -        .prp1 = cpu_to_le64(q->cq.iova),
> > > +        .dptr.prp.prp1 = cpu_to_le64(q->cq.iova),
> > >          .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> > >          .cdw11 = cpu_to_le32(0x3),
> > >      };
> > > @@ -556,7 +556,7 @@ static bool nvme_add_io_queue(BlockDriverState *bs, Error **errp)
> > >      }
> > >      cmd = (NvmeCmd) {
> > >          .opcode = NVME_ADM_CMD_CREATE_SQ,
> > > -        .prp1 = cpu_to_le64(q->sq.iova),
> > > +        .dptr.prp.prp1 = cpu_to_le64(q->sq.iova),
> > >          .cdw10 = cpu_to_le32(((queue_size - 1) << 16) | (n & 0xFFFF)),
> > >          .cdw11 = cpu_to_le32(0x1 | (n << 16)),
> > >      };
> > > @@ -906,16 +906,16 @@ try_map:
> > >      case 0:
> > >          abort();
> > >      case 1:
> > > -        cmd->prp1 = pagelist[0];
> > > -        cmd->prp2 = 0;
> > > +        cmd->dptr.prp.prp1 = pagelist[0];
> > > +        cmd->dptr.prp.prp2 = 0;
> > >          break;
> > >      case 2:
> > > -        cmd->prp1 = pagelist[0];
> > > -        cmd->prp2 = pagelist[1];
> > > +        cmd->dptr.prp.prp1 = pagelist[0];
> > > +        cmd->dptr.prp.prp2 = pagelist[1];
> > >          break;
> > >      default:
> > > -        cmd->prp1 = pagelist[0];
> > > -        cmd->prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> > > +        cmd->dptr.prp.prp1 = pagelist[0];
> > > +        cmd->dptr.prp.prp2 = cpu_to_le64(req->prp_list_iova + sizeof(uint64_t));
> > >          break;
> > >      }
> > >      trace_nvme_cmd_map_qiov(s, cmd, req, qiov, entries);
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 204ae1d33234..a91c60fdc111 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -75,8 +75,10 @@ static inline bool nvme_addr_is_cmb(NvmeCtrl *n, hwaddr addr)
> > >  
> > >  static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > >  {
> > > -    if (n->cmbsz && nvme_addr_is_cmb(n, addr)) {
> > > -        memcpy(buf, (void *) &n->cmbuf[addr - n->ctrl_mem.addr], size);
> > > +    hwaddr hi = addr + size;
> > 
> > Are you sure you don't want to check for overflow here?
> > It's a theoretical issue, since addr would have to be almost the full
> > 64 bits, but for things like this I check very defensively.
> > 
> 
> The use of nvme_addr_read in map_prp simply cannot overflow due to how
> the size is calculated, but for SGLs it's different. The overflow *is*
> checked in map_sgl, because we have to return a special error code in
> that case.
> 
> On the other hand, there may be other callers of nvme_addr_read in the
> future that do not check this, so I'll re-add it.
Yep. For security-sensitive things, having as many checks as possible is always good.
Thanks!
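For illustration, the overflow-guarded range check discussed above can be sketched as a small standalone model (this is not the actual QEMU code; the `cmb_base`/`cmb_size` window and the helper names are invented for the example):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical CMB window; in QEMU these come from the controller state. */
static const uint64_t cmb_base = 0x100000000ULL;
static const uint64_t cmb_size = 0x1000000ULL; /* 16 MiB */

static bool addr_is_cmb(uint64_t addr)
{
    return addr >= cmb_base && addr < cmb_base + cmb_size;
}

/*
 * Check that [addr, addr + size) lies inside the CMB, guarding against
 * wrap-around of addr + size before doing any range comparison.
 */
static bool range_is_cmb(uint64_t addr, uint64_t size)
{
    if (UINT64_MAX - addr < size) { /* addr + size would overflow */
        return false;
    }
    return addr_is_cmb(addr) && (size == 0 || addr_is_cmb(addr + size - 1));
}
```

The key point is ordering: the `UINT64_MAX - addr < size` test rejects wrap-around before any `addr + size` arithmetic is trusted, which is the same idiom the patch uses in nvme_map_sgl_data().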

> 
> > > +
> > > +    if (n->cmbsz && nvme_addr_is_cmb(n, addr) && nvme_addr_is_cmb(n, hi)) {
> > 
> > Here you fix the bug I mentioned in patch 6. I suggest moving the fix there.
> 
> Done.
> 
> > > +        memcpy(buf, nvme_addr_to_cmb(n, addr), size);
> > >          return 0;
> > >      }
> > >  
> > > @@ -159,6 +161,48 @@ static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
> > >      }
> > >  }
> > >  
> > > +static uint16_t nvme_map_addr_cmb(NvmeCtrl *n, QEMUIOVector *iov, hwaddr addr,
> > > +    size_t len)
> > > +{
> > > +    if (!nvme_addr_is_cmb(n, addr) || !nvme_addr_is_cmb(n, addr + len)) {
> > > +        return NVME_DATA_TRANSFER_ERROR;
> > > +    }
> > > +
> > > +    qemu_iovec_add(iov, nvme_addr_to_cmb(n, addr), len);
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > > +
> > > +static uint16_t nvme_map_addr(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > > +    hwaddr addr, size_t len)
> > > +{
> > > +    bool addr_is_cmb = nvme_addr_is_cmb(n, addr);
> > > +
> > > +    if (addr_is_cmb) {
> > > +        if (qsg->sg) {
> > > +            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > > +        }
> > > +
> > > +        if (!iov->iov) {
> > > +            qemu_iovec_init(iov, 1);
> > > +        }
> > > +
> > > +        return nvme_map_addr_cmb(n, iov, addr, len);
> > > +    }
> > > +
> > > +    if (iov->iov) {
> > > +        return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > > +    }
> > > +
> > > +    if (!qsg->sg) {
> > > +        pci_dma_sglist_init(qsg, &n->parent_obj, 1);
> > > +    }
> > > +
> > > +    qemu_sglist_add(qsg, addr, len);
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > 
> > Very good refactoring. I would also suggest moving this to a separate
> > patch; I always put refactoring first and then the patches that add
> > features.
> > 
> 
> Done.
Perfect, thanks!
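The rule that the refactored nvme_map_addr() enforces (a single request may gather either CMB regions via an iovec or regular guest memory via a DMA sglist, never a mix) can be modeled with a short sketch. The types and names below are simplified stand-ins, not the QEMU API:

```c
#include <stdbool.h>

enum { MAP_OK = 0, MAP_INVALID_USE_OF_CMB = 1 };

/* Simplified request state: which backing mechanism is in use, if any. */
struct req_map {
    bool uses_iov; /* CMB regions, mapped via an iovec       */
    bool uses_sg;  /* regular guest memory, via a DMA sglist */
};

static int map_addr(struct req_map *m, bool addr_in_cmb)
{
    if (addr_in_cmb) {
        if (m->uses_sg) {  /* already mapping guest RAM: mixing */
            return MAP_INVALID_USE_OF_CMB;
        }
        m->uses_iov = true;
    } else {
        if (m->uses_iov) { /* already mapping CMB: mixing */
            return MAP_INVALID_USE_OF_CMB;
        }
        m->uses_sg = true;
    }
    return MAP_OK;
}
```

Once one mechanism has been chosen for a request, any address that would require the other is rejected with the CMB-usage error, matching the NVME_INVALID_USE_OF_CMB returns in the patch.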
> 
> > > +
> > >  static uint16_t nvme_map_prp(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > >      uint64_t prp1, uint64_t prp2, uint32_t len, NvmeRequest *req)
> > >  {
> > > @@ -307,15 +351,240 @@ unmap:
> > >      return status;
> > >  }
> > >  
> > > -static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > > -    uint64_t prp1, uint64_t prp2, DMADirection dir, NvmeRequest *req)
> > > +static uint16_t nvme_map_sgl_data(NvmeCtrl *n, QEMUSGList *qsg,
> > > +    QEMUIOVector *iov, NvmeSglDescriptor *segment, uint64_t nsgld,
> > > +    uint32_t *len, NvmeRequest *req)
> > > +{
> > > +    dma_addr_t addr, trans_len;
> > > +    uint32_t length;
> > > +    uint16_t status;
> > > +
> > > +    for (int i = 0; i < nsgld; i++) {
> > > +        uint8_t type = NVME_SGL_TYPE(segment[i].type);
> > > +
> > > +        if (type != NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > > +            switch (type) {
> > > +            case NVME_SGL_DESCR_TYPE_BIT_BUCKET:
> > > +            case NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK:
> > > +                return NVME_SGL_DESCRIPTOR_TYPE_INVALID | NVME_DNR;
> > > +            default:
> > > +                break;
> > > +            }
> > > +
> > > +            return NVME_INVALID_NUM_SGL_DESCRIPTORS | NVME_DNR;
> > 
> > Since the only way to reach the above statement is through that
> > 'default', why not move it there?
> 
> True. Fixed!
Thanks!
> 
> > > +        }
> > > +
> > > +        if (*len == 0) {
> > > +            if (!NVME_CTRL_SGLS_EXCESS_LENGTH(n->id_ctrl.sgls)) {
> > > +                trace_nvme_dev_err_invalid_sgl_excess_length(nvme_cid(req));
> > > +                return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > > +            }
> > > +
> > > +            break;
> > > +        }
> > > +
> > > +        addr = le64_to_cpu(segment[i].addr);
> > > +        length = le32_to_cpu(segment[i].len);
> > > +
> > > +        if (!length) {
> > > +            continue;
> > > +        }
> > > +
> > > +        if (UINT64_MAX - addr < length) {
> > > +            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > > +        }
> > > +
> > > +        trans_len = MIN(*len, length);
> > > +
> > > +        status = nvme_map_addr(n, qsg, iov, addr, trans_len);
> > > +        if (status) {
> > > +            return status;
> > > +        }
> > > +
> > > +        *len -= trans_len;
> > > +    }
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > > +
> > > +static uint16_t nvme_map_sgl(NvmeCtrl *n, QEMUSGList *qsg, QEMUIOVector *iov,
> > > +    NvmeSglDescriptor sgl, uint32_t len, NvmeRequest *req)
> > 
> > Minor nitpick:
> > Usually structs are passed by reference (that is, by pointer in C),
> > but I see that you change 'sgl' in the function.
> > IMHO this is a bit hard to read; I usually prefer not to modify input
> > parameters.
> > 
> 
> Uhm, please help me: where am I changing it? That is unintentional, I
> think.
Ah, that is a leftover from my review of v4 (which I never sent, since you posted v5 in the meantime); I mostly copy-and-pasted the comments from it and didn't notice that you had changed this.


> 
> I *think* I prefer passing it by value, just because it fits nicely with
> how other fields of the command are passed in other places. We are
> "copying" the same amount of data as with PRPs (2x64 bits vs 1x128
> bits).

Since you don't actually change the input value, I don't have a strong opinion on this, so I don't mind leaving it passed by value.


> 
> > > +{
> > > +    const int MAX_NSGLD = 256;
> > 
> > I personally would rename that const to something like SG_CHUNK_SIZE
> > and add a comment, since it is just an arbitrary chunk size used to
> > avoid dynamic memory allocation; that way we avoid confusion with the
> > spec.
> 
> Good point. Done.
Thanks!
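The chunking pattern being named here — reading a variable-length descriptor array in fixed-size pieces so no dynamic allocation is needed — looks roughly like this (simplified: `read_descrs` is an invented stand-in for nvme_addr_read, and the reduction is just a length sum):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SG_CHUNK_SIZE 256 /* process the segment in chunks of 256 descriptors */

typedef struct { uint64_t addr; uint32_t len; uint8_t rsvd[3], type; } Descr;

/* Stand-in for nvme_addr_read(): here we just copy from a source array. */
static void read_descrs(const Descr *src, size_t off, Descr *dst, size_t n)
{
    memcpy(dst, src + off, n * sizeof(Descr));
}

/* Sum descriptor lengths, touching at most SG_CHUNK_SIZE at a time. */
static uint64_t total_len(const Descr *src, size_t nsgld)
{
    Descr chunk[SG_CHUNK_SIZE]; /* fixed 4 KiB scratch buffer on the stack */
    uint64_t total = 0;
    size_t off = 0;

    while (nsgld > 0) {
        size_t n = nsgld > SG_CHUNK_SIZE ? SG_CHUNK_SIZE : nsgld;
        read_descrs(src, off, chunk, n);
        for (size_t i = 0; i < n; i++) {
            total += chunk[i].len;
        }
        off += n;
        nsgld -= n;
    }
    return total;
}
```

With 256 descriptors of 16 bytes each, the scratch buffer is exactly one 4 KiB page, which is why the patch's comment calls it "chunks of 256 descriptors (4k)".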

> 
> > 
> > > +
> > > +    NvmeSglDescriptor segment[MAX_NSGLD], *sgld, *last_sgld;
> > > +    uint64_t nsgld;
> > > +    uint32_t length;
> > > +    uint16_t status;
> > > +    bool sgl_in_cmb = false;
> > > +    hwaddr addr;
> > > +    int ret;
> > > +
> > > +    sgld = &sgl;
> > > +    addr = le64_to_cpu(sgl.addr);
> > > +
> > > +    trace_nvme_dev_map_sgl(nvme_cid(req), NVME_SGL_TYPE(sgl.type), req->nlb,
> > > +        len);
> > > +
> > > +    /*
> > > +     * If the entire transfer can be described with a single data block it can
> > > +     * be mapped directly.
> > > +     */
> > > +    if (NVME_SGL_TYPE(sgl.type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > > +        status = nvme_map_sgl_data(n, qsg, iov, sgld, 1, &len, req);
> > > +        if (status) {
> > > +            goto unmap;
> > > +        }
> > > +
> > > +        goto out;
> > > +    }
> > > +
> > > +    /*
> > > +     * If the segment is located in the CMB, the submission queue of the
> > > +     * request must also reside there.
> > > +     */
> > > +    if (nvme_addr_is_cmb(n, addr)) {
> > > +        if (!nvme_addr_is_cmb(n, req->sq->dma_addr)) {
> > > +            return NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > > +        }
> > > +
> > > +        sgl_in_cmb = true;
> > > +    }
> > > +
> > > +    for (;;) {
> > > +        length = le32_to_cpu(sgld->len);
> > > +
> > > +        if (!length || length & 0xf) {
> > > +            return NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
> > > +        }
> > > +
> > > +        if (UINT64_MAX - addr < length) {
> > 
> > I assume you check for overflow here. This looks like a very nice way
> > to do it and should be adopted in a few more places.
> > > +            return NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > > +        }
> > > +
> > > +        nsgld = length / sizeof(NvmeSglDescriptor);
> > > +
> > > +        /* read the segment in chunks of 256 descriptors (4k) */
> > 
> > That comment would be perfect to move/copy to the definition of MAX_NSGLD
> 
> Done.
> 
> > 
> > > +        while (nsgld > MAX_NSGLD) {
> > > +            if (nvme_addr_read(n, addr, segment, sizeof(segment))) {
> > > +                trace_nvme_dev_err_addr_read(addr);
> > > +                status = NVME_DATA_TRANSFER_ERROR;
> > > +                goto unmap;
> > > +            }
> > > +
> > > +            status = nvme_map_sgl_data(n, qsg, iov, segment, MAX_NSGLD, &len,
> > > +                req);
> > > +            if (status) {
> > > +                goto unmap;
> > > +            }
> > > +
> > > +            nsgld -= MAX_NSGLD;
> > > +            addr += MAX_NSGLD * sizeof(NvmeSglDescriptor);
> > > +        }
> > > +
> > > +        ret = nvme_addr_read(n, addr, segment, nsgld *
> > > +            sizeof(NvmeSglDescriptor));
> > 
> > Reminding you to fix the line-split issues (align the sizeof arguments on the '(').
> 
> Done.
> 
> > 
> > > +        if (ret) {
> > > +            trace_nvme_dev_err_addr_read(addr);
> > > +            status = NVME_DATA_TRANSFER_ERROR;
> > > +            goto unmap;
> > > +        }
> > > +
> > > +        last_sgld = &segment[nsgld - 1];
> > > +
> > > +        /* if the segment ends with a Data Block, then we are done */
> > > +        if (NVME_SGL_TYPE(last_sgld->type) == NVME_SGL_DESCR_TYPE_DATA_BLOCK) {
> > > +            status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld, &len, req);
> > > +            if (status) {
> > > +                goto unmap;
> > > +            }
> > > +
> > > +            break;
> > > +        }
> > > +
> > > +        /* a Last Segment must end with a Data Block descriptor */
> > > +        if (NVME_SGL_TYPE(sgld->type) == NVME_SGL_DESCR_TYPE_LAST_SEGMENT) {
> > > +            status = NVME_INVALID_SGL_SEG_DESCRIPTOR | NVME_DNR;
> > > +            goto unmap;
> > > +        }
> > > +
> > > +        sgld = last_sgld;
> > > +        addr = le64_to_cpu(sgld->addr);
> > > +
> > > +        /*
> > > +         * Do not map the last descriptor; it will be a Segment or Last Segment
> > > +         * descriptor instead and handled by the next iteration.
> > > +         */
> > > +        status = nvme_map_sgl_data(n, qsg, iov, segment, nsgld - 1, &len, req);
> > > +        if (status) {
> > > +            goto unmap;
> > > +        }
> > > +
> > > +        /*
> > > +         * If the next segment is in the CMB, make sure that the sgl was
> > > +         * already located there.
> > > +         */
> > > +        if (sgl_in_cmb != nvme_addr_is_cmb(n, addr)) {
> > > +            status = NVME_INVALID_USE_OF_CMB | NVME_DNR;
> > > +            goto unmap;
> > > +        }
> > > +    }
> > > +
> > > +out:
> > > +    /* if there is any residual left in len, the SGL was too short */
> > > +    if (len) {
> > > +        status = NVME_DATA_SGL_LENGTH_INVALID | NVME_DNR;
> > > +        goto unmap;
> > > +    }
> > > +
> > > +    return NVME_SUCCESS;
> > > +
> > > +unmap:
> > > +    if (iov->iov) {
> > > +        qemu_iovec_destroy(iov);
> > > +    }
> > > +
> > > +    if (qsg->sg) {
> > > +        qemu_sglist_destroy(qsg);
> > > +    }
> > > +
> > > +    return status;
> > > +}
> > 
> > Looks good, much better than in V4
> > 
> > 
> > > +
> > > +static uint16_t nvme_dma(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > > +    NvmeCmd *cmd, DMADirection dir, NvmeRequest *req)
> > >  {
> > >      uint16_t status = NVME_SUCCESS;
> > >      size_t bytes;
> > >  
> > > -    status = nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > > -    if (status) {
> > > -        return status;
> > > +    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> > > +    case PSDT_PRP:
> > > +        status = nvme_map_prp(n, &req->qsg, &req->iov,
> > > +            le64_to_cpu(cmd->dptr.prp.prp1), le64_to_cpu(cmd->dptr.prp.prp2),
> > > +            len, req);
> > > +        if (status) {
> > > +            return status;
> > > +        }
> > > +
> > > +        break;
> > > +
> > > +    case PSDT_SGL_MPTR_CONTIGUOUS:
> > > +    case PSDT_SGL_MPTR_SGL:
> > > +        if (!req->sq->sqid) {
> > > +            /* SGLs shall not be used for Admin commands in NVMe over PCIe */
> > > +            return NVME_INVALID_FIELD;
> > > +        }
> > > +
> > > +        status = nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len,
> > > +            req);
> > > +        if (status) {
> > > +            return status;
> > > +        }
> > 
> > Minor nitpick: you can probably refactor this to an 'err' label at the end of the function.
> 
> This has been refactored in another patch.
Perfect!
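The 'err' label style being suggested is the usual C cleanup idiom: one exit path that releases whatever was acquired before the failure. A minimal sketch with generic resources (not the nvme code):

```c
#include <stdlib.h>

/* Acquire two buffers; on any failure, release whatever was taken. */
static int setup(size_t n, char **a_out, char **b_out)
{
    char *a = malloc(n);
    char *b = NULL;

    if (!a) {
        goto err;
    }
    b = malloc(n);
    if (!b) {
        goto err;
    }

    *a_out = a;
    *b_out = b;
    return 0;

err:
    free(b); /* free(NULL) is a no-op, so partial failures are safe */
    free(a);
    return -1;
}
```

Each early failure jumps to the same label instead of duplicating the cleanup at every return, which is what makes the refactoring attractive in nvme_dma() as well.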
> 
> > > +
> > > +        break;
> > > +
> > > +    default:
> > > +        return NVME_INVALID_FIELD;
> > >      }
> > 
> > 
> > >  
> > >      if (req->qsg.nsg > 0) {
> > > @@ -351,13 +620,21 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > >  
> > >  static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > > -    NvmeNamespace *ns = req->ns;
> > > +    uint32_t len = req->nlb << nvme_ns_lbads(req->ns);
> > > +    uint64_t prp1, prp2;
> > >  
> > > -    uint32_t len = req->nlb << nvme_ns_lbads(ns);
> > > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > +    switch (NVME_CMD_FLAGS_PSDT(cmd->flags)) {
> > > +    case PSDT_PRP:
> > > +        prp1 = le64_to_cpu(cmd->dptr.prp.prp1);
> > > +        prp2 = le64_to_cpu(cmd->dptr.prp.prp2);
> > >  
> > > -    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > > +        return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > > +    case PSDT_SGL_MPTR_CONTIGUOUS:
> > > +    case PSDT_SGL_MPTR_SGL:
> > > +        return nvme_map_sgl(n, &req->qsg, &req->iov, cmd->dptr.sgl, len, req);
> > > +    default:
> > > +        return NVME_INVALID_FIELD;
> > > +    }
> > >  }
> > >  
> > >  static void nvme_aio_destroy(NvmeAIO *aio)
> > > @@ -972,8 +1249,6 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> > >  static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> > >      uint32_t buf_len, uint64_t off, NvmeRequest *req)
> > >  {
> > > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> > >  
> > >      uint32_t trans_len;
> > > @@ -1023,16 +1298,14 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> > >          nvme_clear_events(n, NVME_AER_TYPE_SMART);
> > >      }
> > >  
> > > -    return nvme_dma_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > > -        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > > +    return nvme_dma(n, (uint8_t *) &smart + off, trans_len, cmd,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > >  static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > >      uint64_t off, NvmeRequest *req)
> > >  {
> > >      uint32_t trans_len;
> > > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > >      NvmeFwSlotInfoLog fw_log;
> > >  
> > >      if (off > sizeof(fw_log)) {
> > > @@ -1043,8 +1316,8 @@ static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > >  
> > >      trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > >  
> > > -    return nvme_dma_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > > -        prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > > +    return nvme_dma(n, (uint8_t *) &fw_log + off, trans_len, cmd,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > >  static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > @@ -1194,25 +1467,18 @@ static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      return NVME_SUCCESS;
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeIdentify *c,
> > > -    NvmeRequest *req)
> > > +static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > > -
> > >      trace_nvme_dev_identify_ctrl();
> > >  
> > > -    return nvme_dma_prp(n, (uint8_t *)&n->id_ctrl, sizeof(n->id_ctrl),
> > > -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > > +    return nvme_dma(n, (uint8_t *) &n->id_ctrl, sizeof(n->id_ctrl), cmd,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> > > -    NvmeRequest *req)
> > > +static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > >      NvmeNamespace *ns;
> > > -    uint32_t nsid = le32_to_cpu(c->nsid);
> > > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> > >  
> > >      trace_nvme_dev_identify_ns(nsid);
> > >  
> > > @@ -1223,17 +1489,15 @@ static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeIdentify *c,
> > >  
> > >      ns = &n->namespaces[nsid - 1];
> > >  
> > > -    return nvme_dma_prp(n, (uint8_t *)&ns->id_ns, sizeof(ns->id_ns),
> > > -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > > +    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> > > +static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
> > >      NvmeRequest *req)
> > >  {
> > >      static const int data_len = 4 * KiB;
> > > -    uint32_t min_nsid = le32_to_cpu(c->nsid);
> > > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > > +    uint32_t min_nsid = le32_to_cpu(cmd->nsid);
> > >      uint32_t *list;
> > >      uint16_t ret;
> > >      int i, j = 0;
> > > @@ -1250,13 +1514,13 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeIdentify *c,
> > >              break;
> > >          }
> > >      }
> > > -    ret = nvme_dma_prp(n, (uint8_t *)list, data_len, prp1, prp2,
> > > +    ret = nvme_dma(n, (uint8_t *) list, data_len, cmd,
> > >          DMA_DIRECTION_FROM_DEVICE, req);
> > >      g_free(list);
> > >      return ret;
> > >  }
> > >  
> > > -static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> > > +static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
> > >      NvmeRequest *req)
> > >  {
> > >      static const int len = 4096;
> > > @@ -1268,9 +1532,7 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> > >          uint8_t nid[16];
> > >      };
> > >  
> > > -    uint32_t nsid = le32_to_cpu(c->nsid);
> > > -    uint64_t prp1 = le64_to_cpu(c->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(c->prp2);
> > > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> > >  
> > >      struct ns_descr *list;
> > >      uint16_t ret;
> > > @@ -1293,8 +1555,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeIdentify *c,
> > >      list->nidl = 0x10;
> > >      *(uint32_t *) &list->nid[12] = cpu_to_be32(nsid);
> > >  
> > > -    ret = nvme_dma_prp(n, (uint8_t *) list, len, prp1, prp2,
> > > -        DMA_DIRECTION_FROM_DEVICE, req);
> > > +    ret = nvme_dma(n, (uint8_t *) list, len, cmd, DMA_DIRECTION_FROM_DEVICE,
> > > +        req);
> > >      g_free(list);
> > >      return ret;
> > >  }
> > > @@ -1305,13 +1567,13 @@ static uint16_t nvme_identify(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  
> > >      switch (le32_to_cpu(c->cns)) {
> > >      case 0x00:
> > > -        return nvme_identify_ns(n, c, req);
> > > +        return nvme_identify_ns(n, cmd, req);
> > >      case 0x01:
> > > -        return nvme_identify_ctrl(n, c, req);
> > > +        return nvme_identify_ctrl(n, cmd, req);
> > >      case 0x02:
> > > -        return nvme_identify_ns_list(n, c, req);
> > > +        return nvme_identify_ns_list(n, cmd, req);
> > >      case 0x03:
> > > -        return nvme_identify_ns_descr_list(n, c, req);
> > > +        return nvme_identify_ns_descr_list(n, cmd, req);
> > >      default:
> > >          trace_nvme_dev_err_invalid_identify_cns(le32_to_cpu(c->cns));
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > > @@ -1373,13 +1635,10 @@ static inline uint64_t nvme_get_timestamp(const NvmeCtrl *n)
> > >  static uint16_t nvme_get_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > >      NvmeRequest *req)
> > >  {
> > > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > -
> > >      uint64_t timestamp = nvme_get_timestamp(n);
> > >  
> > > -    return nvme_dma_prp(n, (uint8_t *)&timestamp, sizeof(timestamp),
> > > -        prp1, prp2, DMA_DIRECTION_FROM_DEVICE, req);
> > > +    return nvme_dma(n, (uint8_t *)&timestamp, sizeof(timestamp), cmd,
> > > +        DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > >  static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > @@ -1462,11 +1721,9 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > >  {
> > >      uint16_t ret;
> > >      uint64_t timestamp;
> > > -    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > >  
> > > -    ret = nvme_dma_prp(n, (uint8_t *) &timestamp, sizeof(timestamp),
> > > -        prp1, prp2, DMA_DIRECTION_TO_DEVICE, req);
> > > +    ret = nvme_dma(n, (uint8_t *) &timestamp, sizeof(timestamp), cmd,
> > > +        DMA_DIRECTION_TO_DEVICE, req);
> > >      if (ret != NVME_SUCCESS) {
> > >          return ret;
> > >      }
> > > @@ -2232,6 +2489,8 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >          id->vwc = 1;
> > >      }
> > >  
> > > +    id->sgls = cpu_to_le32(0x1);
> > 
> > Being part of the spec, it would be nice to #define this as well.
> 
> Done.
> 
> > > +
> > >      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> > >      pstrcat((char *) id->subnqn, sizeof(id->subnqn), n->params.serial);
> > >  
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index 09bfb3782dd0..81d69e15fc32 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -34,6 +34,7 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
> > >  nvme_dev_irq_masked(void) "IRQ is masked"
> > >  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> > >  nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1
> > > 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> > > +nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"PRIu16" type 0x%"PRIx8" nlb %"PRIu32" len %"PRIu64""
> > >  nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count
> > > %"PRIu64" opc \"%s\" req %p"
> > >  nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> > >  nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> > > @@ -85,6 +86,9 @@ nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> > >  nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p
> > > status 0x%"PRIx16""
> > >  nvme_dev_err_addr_read(uint64_t addr) "addr 0x%"PRIx64""
> > >  nvme_dev_err_addr_write(uint64_t addr) "addr 0x%"PRIx64""
> > > +nvme_dev_err_invalid_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
> > > +nvme_dev_err_invalid_num_sgld(uint16_t cid, uint8_t typ) "cid %"PRIu16" type 0x%"PRIx8""
> > > +nvme_dev_err_invalid_sgl_excess_length(uint16_t cid) "cid %"PRIu16""
> > >  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> > >  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> > >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index a873776d98b8..dbdeecf82358 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -205,15 +205,53 @@ enum NvmeCmbszMask {
> > >  #define NVME_CMBSZ_GETSIZE(cmbsz) \
> > >      (NVME_CMBSZ_SZ(cmbsz) * (1 << (12 + 4 * NVME_CMBSZ_SZU(cmbsz))))
> > >  
> > > +enum NvmeSglDescriptorType {
> > > +    NVME_SGL_DESCR_TYPE_DATA_BLOCK           = 0x0,
> > > +    NVME_SGL_DESCR_TYPE_BIT_BUCKET           = 0x1,
> > > +    NVME_SGL_DESCR_TYPE_SEGMENT              = 0x2,
> > > +    NVME_SGL_DESCR_TYPE_LAST_SEGMENT         = 0x3,
> > > +    NVME_SGL_DESCR_TYPE_KEYED_DATA_BLOCK     = 0x4,
> > > +
> > > +    NVME_SGL_DESCR_TYPE_VENDOR_SPECIFIC      = 0xf,
> > > +};
> > > +
> > > +enum NvmeSglDescriptorSubtype {
> > > +    NVME_SGL_DESCR_SUBTYPE_ADDRESS = 0x0,
> > > +};
> > > +
> > > +typedef struct NvmeSglDescriptor {
> > > +    uint64_t addr;
> > > +    uint32_t len;
> > > +    uint8_t  rsvd[3];
> > > +    uint8_t  type;
> > > +} NvmeSglDescriptor;
> > 
> > I suggest you add a build-time struct size check for this, just in
> > case the compiler tries something funny (see _nvme_check_size in
> > nvme.h).
> > 
> 
> Done.
> 
> > Also, I think the spec update that adds NvmeSglDescriptor should be
> > split into a separate patch (or, better, added in one big patch that
> > adds all the 1.3d features), which would also make it easier to see
> > the changes that touch the other NVMe driver we have.
> > 
> 
> Done.
> 
> > > +
> > > +#define NVME_SGL_TYPE(type)     ((type >> 4) & 0xf)
> > > +#define NVME_SGL_SUBTYPE(type)  (type & 0xf)
> > > +
> > > +typedef union NvmeCmdDptr {
> > > +    struct {
> > > +        uint64_t    prp1;
> > > +        uint64_t    prp2;
> > > +    } prp;
> > > +
> > > +    NvmeSglDescriptor sgl;
> > > +} NvmeCmdDptr;
> > > +
> > > +enum NvmePsdt {
> > > +    PSDT_PRP                 = 0x0,
> > > +    PSDT_SGL_MPTR_CONTIGUOUS = 0x1,
> > > +    PSDT_SGL_MPTR_SGL        = 0x2,
> > > +};
> > > +
> > >  typedef struct NvmeCmd {
> > >      uint8_t     opcode;
> > > -    uint8_t     fuse;
> > > +    uint8_t     flags;
> > >      uint16_t    cid;
> > >      uint32_t    nsid;
> > >      uint64_t    res1;
> > >      uint64_t    mptr;
> > > -    uint64_t    prp1;
> > > -    uint64_t    prp2;
> > > +    NvmeCmdDptr dptr;
> > >      uint32_t    cdw10;
> > >      uint32_t    cdw11;
> > >      uint32_t    cdw12;
> > > @@ -222,6 +260,9 @@ typedef struct NvmeCmd {
> > >      uint32_t    cdw15;
> > >  } NvmeCmd;
> > >  
> > > +#define NVME_CMD_FLAGS_FUSE(flags) (flags & 0x3)
> > > +#define NVME_CMD_FLAGS_PSDT(flags) ((flags >> 6) & 0x3)
> > > +
> > >  enum NvmeAdminCommands {
> > >      NVME_ADM_CMD_DELETE_SQ      = 0x00,
> > >      NVME_ADM_CMD_CREATE_SQ      = 0x01,
> > > @@ -427,6 +468,11 @@ enum NvmeStatusCodes {
> > >      NVME_CMD_ABORT_MISSING_FUSE = 0x000a,
> > >      NVME_INVALID_NSID           = 0x000b,
> > >      NVME_CMD_SEQ_ERROR          = 0x000c,
> > > +    NVME_INVALID_SGL_SEG_DESCRIPTOR  = 0x000d,
> > > +    NVME_INVALID_NUM_SGL_DESCRIPTORS = 0x000e,
> > > +    NVME_DATA_SGL_LENGTH_INVALID     = 0x000f,
> > > +    NVME_METADATA_SGL_LENGTH_INVALID = 0x0010,
> > > +    NVME_SGL_DESCRIPTOR_TYPE_INVALID = 0x0011,
> > >      NVME_INVALID_USE_OF_CMB     = 0x0012,
> > >      NVME_LBA_RANGE              = 0x0080,
> > >      NVME_CAP_EXCEEDED           = 0x0081,
> > > @@ -623,6 +669,16 @@ enum NvmeIdCtrlOncs {
> > >  #define NVME_CTRL_CQES_MIN(cqes) ((cqes) & 0xf)
> > >  #define NVME_CTRL_CQES_MAX(cqes) (((cqes) >> 4) & 0xf)
> > >  
> > > +#define NVME_CTRL_SGLS_SUPPORTED(sgls)                 ((sgls) & 0x3)
> > > +#define NVME_CTRL_SGLS_SUPPORTED_NO_ALIGNMENT(sgls)    ((sgls) & (0x1 <<  0))
> > > +#define NVME_CTRL_SGLS_SUPPORTED_DWORD_ALIGNMENT(sgls) ((sgls) & (0x1 <<  1))
> > > +#define NVME_CTRL_SGLS_KEYED(sgls)                     ((sgls) & (0x1 <<  2))
> > > +#define NVME_CTRL_SGLS_BITBUCKET(sgls)                 ((sgls) & (0x1 << 16))
> > > +#define NVME_CTRL_SGLS_MPTR_CONTIGUOUS(sgls)           ((sgls) & (0x1 << 17))
> > > +#define NVME_CTRL_SGLS_EXCESS_LENGTH(sgls)             ((sgls) & (0x1 << 18))
> > > +#define NVME_CTRL_SGLS_MPTR_SGL(sgls)                  ((sgls) & (0x1 << 19))
> > > +#define NVME_CTRL_SGLS_ADDR_OFFSET(sgls)               ((sgls) & (0x1 << 20))
> > > +
> > >  typedef struct NvmeFeatureVal {
> > >      uint32_t    arbitration;
> > >      uint32_t    power_mgmt;
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 


Best regards,
	Maxim Levitsky







* Re: [PATCH v5 22/26] nvme: support multiple namespaces
  2020-03-16  7:55         ` Klaus Birkelund Jensen
@ 2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:24 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:55 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 14:34, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:52 +0100, Klaus Jensen wrote:
> > > This adds support for multiple namespaces by introducing a new 'nvme-ns'
> > > device model. The nvme device creates a bus named from the device name
> > > ('id'). The nvme-ns devices then connect to this and registers
> > > themselves with the nvme device.
> > > 
> > > This changes how an nvme device is created. Example with two namespaces:
> > > 
> > >   -drive file=nvme0n1.img,if=none,id=disk1
> > >   -drive file=nvme0n2.img,if=none,id=disk2
> > >   -device nvme,serial=deadbeef,id=nvme0
> > >   -device nvme-ns,drive=disk1,bus=nvme0,nsid=1
> > >   -device nvme-ns,drive=disk2,bus=nvme0,nsid=2
> > > 
> > > The drive property is kept on the nvme device to keep the change
> > > backward compatible, but the property is now optional. Specifying a
> > > drive for the nvme device will always create the namespace with nsid 1.
> > 
> > Very reasonable way to do it. 
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > > ---
> > >  hw/block/Makefile.objs |   2 +-
> > >  hw/block/nvme-ns.c     | 158 +++++++++++++++++++++++++++
> > >  hw/block/nvme-ns.h     |  60 +++++++++++
> > >  hw/block/nvme.c        | 235 +++++++++++++++++++++++++----------------
> > >  hw/block/nvme.h        |  47 ++++-----
> > >  hw/block/trace-events  |   6 +-
> > >  6 files changed, 389 insertions(+), 119 deletions(-)
> > >  create mode 100644 hw/block/nvme-ns.c
> > >  create mode 100644 hw/block/nvme-ns.h
> > > 
> > > diff --git a/hw/block/Makefile.objs b/hw/block/Makefile.objs
> > > index 28c2495a00dc..45f463462f1e 100644
> > > --- a/hw/block/Makefile.objs
> > > +++ b/hw/block/Makefile.objs
> > > @@ -7,7 +7,7 @@ common-obj-$(CONFIG_PFLASH_CFI02) += pflash_cfi02.o
> > >  common-obj-$(CONFIG_XEN) += xen-block.o
> > >  common-obj-$(CONFIG_ECC) += ecc.o
> > >  common-obj-$(CONFIG_ONENAND) += onenand.o
> > > -common-obj-$(CONFIG_NVME_PCI) += nvme.o
> > > +common-obj-$(CONFIG_NVME_PCI) += nvme.o nvme-ns.o
> > >  common-obj-$(CONFIG_SWIM) += swim.o
> > >  
> > >  obj-$(CONFIG_SH4) += tc58128.o
> > > diff --git a/hw/block/nvme-ns.c b/hw/block/nvme-ns.c
> > > new file mode 100644
> > > index 000000000000..0e5be44486f4
> > > --- /dev/null
> > > +++ b/hw/block/nvme-ns.c
> > > @@ -0,0 +1,158 @@
> > > +#include "qemu/osdep.h"
> > > +#include "qemu/units.h"
> > > +#include "qemu/cutils.h"
> > > +#include "qemu/log.h"
> > > +#include "hw/block/block.h"
> > > +#include "hw/pci/msix.h"
> > 
> > Do you need this include?
> 
> No, I needed hw/pci/pci.h instead :)
I think it compiled without that include,
but including pci.h for a PCI device is the right thing
anyway.

> 
> > > +#include "sysemu/sysemu.h"
> > > +#include "sysemu/block-backend.h"
> > > +#include "qapi/error.h"
> > > +
> > > +#include "hw/qdev-properties.h"
> > > +#include "hw/qdev-core.h"
> > > +
> > > +#include "nvme.h"
> > > +#include "nvme-ns.h"
> > > +
> > > +static int nvme_ns_init(NvmeNamespace *ns)
> > > +{
> > > +    NvmeIdNs *id_ns = &ns->id_ns;
> > > +
> > > +    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > > +    id_ns->nuse = id_ns->ncap = id_ns->nsze =
> > > +        cpu_to_le64(nvme_ns_nlbas(ns));
> > 
> > Nitpick: to be honest I don't really like that chained assignment,
> > especially since it forces wrapping the line, but that is just my
> > personal taste.
> 
> Fixed, and also added a comment as to why they are the same.
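A minimal sketch of what the un-chained form with such a comment could look like (struct and function names are illustrative, not the actual patch; byte-order conversion is omitted):

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint64_t nsze; /* namespace size */
    uint64_t ncap; /* namespace capacity */
    uint64_t nuse; /* namespace utilization */
} IdNsSketch;

/* cpu_to_le64() conversion from the real code is omitted for brevity. */
static void id_ns_init_sizes(IdNsSketch *id_ns, uint64_t nlbas)
{
    /*
     * Without thin provisioning or namespace management support,
     * size, capacity and utilization are all the same value.
     */
    id_ns->nsze = nlbas;
    id_ns->ncap = nlbas;
    id_ns->nuse = nlbas;
}
```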
> 
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +static int nvme_ns_init_blk(NvmeCtrl *n, NvmeNamespace *ns, NvmeIdCtrl *id,
> > > +    Error **errp)
> > > +{
> > > +    uint64_t perm, shared_perm;
> > > +
> > > +    Error *local_err = NULL;
> > > +    int ret;
> > > +
> > > +    perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE;
> > > +    shared_perm = BLK_PERM_CONSISTENT_READ | BLK_PERM_WRITE_UNCHANGED |
> > > +        BLK_PERM_GRAPH_MOD;
> > > +
> > > +    ret = blk_set_perm(ns->blk, perm, shared_perm, &local_err);
> > > +    if (ret) {
> > > +        error_propagate_prepend(errp, local_err, "blk_set_perm: ");
> > > +        return ret;
> > > +    }
> > 
> > You should consider using blkconf_apply_backend_options.
> > Take a look at for example virtio_blk_device_realize.
> > That will give you support for read only block devices as well.
> 
> So, yeah. There is a reason for this. And I will add that as a comment,
> but I will write it here for posterity.
> 
> The problem is when the nvme-ns device starts getting more than just a
> single drive attached (I have patches ready that will add a "metadata"
> and a "state" drive). The blkconf_ functions work on a BlockConf that
> embeds a BlockBackend, so you can't have one BlockConf with multiple
> BlockBackend's. That is why I'm kinda copying the "good parts" of
> the blkconf_apply_backend_options code here.
All right, but I guess that eventually this code will need a review
from someone who knows the block layer better than I do.

> 
> > 
> > I personally only once grazed the area of block permissions,
> > so I prefer someone from the block layer to review this as well.
> > 
> > > +
> > > +    ns->size = blk_getlength(ns->blk);
> > > +    if (ns->size < 0) {
> > > +        error_setg_errno(errp, -ns->size, "blk_getlength");
> > > +        return 1;
> > > +    }
> > > +
> > > +    switch (n->conf.wce) {
> > > +    case ON_OFF_AUTO_ON:
> > > +        n->features.volatile_wc = 1;
> > > +        break;
> > > +    case ON_OFF_AUTO_OFF:
> > > +        n->features.volatile_wc = 0;
> > > +        break;
> > > +    case ON_OFF_AUTO_AUTO:
> > > +        n->features.volatile_wc = blk_enable_write_cache(ns->blk);
> > > +        break;
> > > +    default:
> > > +        abort();
> > > +    }
> > > +
> > > +    blk_set_enable_write_cache(ns->blk, n->features.volatile_wc);
> > > +
> > > +    return 0;
> > 
> > Nitpick: also I just noticed that you call the controller 'n'. I didn't pay attention
> > to this before. I think something like 'ctrl' or 'ctl' would be more readable.
> > 
> 
> Yeah, but using 'n' is done in all the existing code, so I think we
> should stick with it.
Or we can do a mass rename later when all the patches are merged.
Doesn't matter to me to be honest, it was a very minor nitpick after all.

> 
> > > +}
> > > +
> > > +static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
> > > +{
> > > +    if (!ns->blk) {
> > > +        error_setg(errp, "block backend not configured");
> > > +        return 1;
> > > +    }
> > > +
> > > +    return 0;
> > > +}
> > > +
> > > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > > +{
> > > +    Error *local_err = NULL;
> > > +
> > > +    if (nvme_ns_check_constraints(ns, &local_err)) {
> > > +        error_propagate_prepend(errp, local_err,
> > > +            "nvme_ns_check_constraints: ");
> > > +        return 1;
> > > +    }
> > > +
> > > +    if (nvme_ns_init_blk(n, ns, &n->id_ctrl, &local_err)) {
> > > +        error_propagate_prepend(errp, local_err, "nvme_ns_init_blk: ");
> > > +        return 1;
> > > +    }
> > > +
> > > +    nvme_ns_init(ns);
> > > +    if (nvme_register_namespace(n, ns, &local_err)) {
> > > +        error_propagate_prepend(errp, local_err, "nvme_register_namespace: ");
> > > +        return 1;
> > > +    }
> > > +
> > > +    return 0;
> > 
> > Nitpick: to be honest I am not sure we want to expose internal function names like that;
> > error hints are supposed to be readable to a user who doesn't look at the source.
> > 
> 
> Fixed.
> 
> > > +}
> > > +
> > > +static void nvme_ns_realize(DeviceState *dev, Error **errp)
> > > +{
> > > +    NvmeNamespace *ns = NVME_NS(dev);
> > > +    BusState *s = qdev_get_parent_bus(dev);
> > > +    NvmeCtrl *n = NVME(s->parent);
> > 
> > Nitpick: Don't know if you defined this or it was like that always,
> > but I would prefer something like NVME_CTL instead.
> > 
> 
> This is also grandfathered from the nvme device.
OK, fair enough.
> 
> > > +    Error *local_err = NULL;
> > > +
> > > +    if (nvme_ns_setup(n, ns, &local_err)) {
> > > +        error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> > > +        return;
> > > +    }
> > > +}
> > > +
> > > +static Property nvme_ns_props[] = {
> > > +    DEFINE_NVME_NS_PROPERTIES(NvmeNamespace, params),
> > 
> > If you go with my suggestion to use blkconf you will use here the
> > DEFINE_BLOCK_PROPERTIES_BASE
> > 
> 
> See my comment above about that.
> 
> > > +    DEFINE_PROP_END_OF_LIST(),
> > > +};
> > > +
> > > +static void nvme_ns_class_init(ObjectClass *oc, void *data)
> > > +{
> > > +    DeviceClass *dc = DEVICE_CLASS(oc);
> > > +
> > > +    set_bit(DEVICE_CATEGORY_STORAGE, dc->categories);
> > > +
> > > +    dc->bus_type = TYPE_NVME_BUS;
> > > +    dc->realize = nvme_ns_realize;
> > > +    device_class_set_props(dc, nvme_ns_props);
> > > +    dc->desc = "virtual nvme namespace";
> > > +}
> > 
> > Looks reasonable.
> > I don't know the device/bus model in depth to be honest
> > (I learned it for a few days some time ago though),
> > so a review from someone who knows this area better than I do
> > is very welcome.
> > 
> > > +
> > > +static void nvme_ns_instance_init(Object *obj)
> > > +{
> > > +    NvmeNamespace *ns = NVME_NS(obj);
> > > +    char *bootindex = g_strdup_printf("/namespace@%d,0", ns->params.nsid);
> > > +
> > > +    device_add_bootindex_property(obj, &ns->bootindex, "bootindex",
> > > +        bootindex, DEVICE(obj), &error_abort);
> > > +
> > > +    g_free(bootindex);
> > > +}
> > > +
> > > +static const TypeInfo nvme_ns_info = {
> > > +    .name = TYPE_NVME_NS,
> > > +    .parent = TYPE_DEVICE,
> > > +    .class_init = nvme_ns_class_init,
> > > +    .instance_size = sizeof(NvmeNamespace),
> > > +    .instance_init = nvme_ns_instance_init,
> > > +};
> > > +
> > > +static void nvme_ns_register_types(void)
> > > +{
> > > +    type_register_static(&nvme_ns_info);
> > > +}
> > > +
> > > +type_init(nvme_ns_register_types)
> > > diff --git a/hw/block/nvme-ns.h b/hw/block/nvme-ns.h
> > > new file mode 100644
> > > index 000000000000..b564bac25f6d
> > > --- /dev/null
> > > +++ b/hw/block/nvme-ns.h
> > > @@ -0,0 +1,60 @@
> > > +#ifndef NVME_NS_H
> > > +#define NVME_NS_H
> > > +
> > > +#define TYPE_NVME_NS "nvme-ns"
> > > +#define NVME_NS(obj) \
> > > +    OBJECT_CHECK(NvmeNamespace, (obj), TYPE_NVME_NS)
> > > +
> > > +#define DEFINE_NVME_NS_PROPERTIES(_state, _props) \
> > > +    DEFINE_PROP_DRIVE("drive", _state, blk), \
> > > +    DEFINE_PROP_UINT32("nsid", _state, _props.nsid, 0)
> > > +
> > > +typedef struct NvmeNamespaceParams {
> > > +    uint32_t nsid;
> > > +} NvmeNamespaceParams;
> > > +
> > > +typedef struct NvmeNamespace {
> > > +    DeviceState  parent_obj;
> > > +    BlockBackend *blk;
> > > +    int32_t      bootindex;
> > > +    int64_t      size;
> > > +
> > > +    NvmeIdNs            id_ns;
> > > +    NvmeNamespaceParams params;
> > > +} NvmeNamespace;
> > > +
> > > +static inline uint32_t nvme_nsid(NvmeNamespace *ns)
> > > +{
> > > +    if (ns) {
> > > +        return ns->params.nsid;
> > > +    }
> > > +
> > > +    return -1;
> > > +}
> > 
> > To be honest I would allow user to omit nsid,
> > and in this case pick a free slot out of valid namespaces.
> > 
> > Let me explain the concept of valid/allocated/active namespaces
> > from the spec as written in my summary:
> > 
> > Valid namespaces are the 1..N range of namespaces as reported in IDCTRL.NN.
> > That value is static, and it should either be set to some arbitrary large value (say 256)
> > or set using a qemu device parameter, and not changed dynamically as you currently do.
> > As I understand it, IDCTRL output should not change during the lifetime of the controller,
> > although I didn't find exact confirmation of this in the spec.
> > 
> > Allocated namespaces are not relevant to us; they are only used for namespace
> > management (these are namespaces that exist but are not attached to the controller).
> > 
> > And then you have Active namespaces which are the namespaces the user can actually address.
> > 
> > However, if I understand this correctly, the NVME 'bus' currently doesn't
> > support hotplug, so all namespaces will already be plugged in at
> > VM startup, and the issue doesn't really exist yet.
> > 
> > 
> > 
> 
> I added support for this. It's a nice addition and it makes the code
> much nicer.
Perfect!
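A minimal sketch of what that nsid auto-assignment could look like (the helper name and the slot representation are illustrative, not the actual patch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: resolve an omitted nsid (0) to the first free
 * slot among the nn valid namespaces; slot [nsid - 1] holds nsid. */
static uint32_t nvme_pick_nsid(void *slots[], uint32_t nn, uint32_t requested)
{
    if (requested != 0) {
        return requested; /* the user asked for a specific nsid */
    }
    for (uint32_t nsid = 1; nsid <= nn; nsid++) {
        if (slots[nsid - 1] == NULL) {
            return nsid; /* first unused slot */
        }
    }
    return 0; /* no free slot; the caller reports an error */
}
```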
> 
> > > +
> > > +static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> > > +{
> > > +    NvmeIdNs *id_ns = &ns->id_ns;
> > > +    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> > > +}
> > > +
> > > +static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> > > +{
> > > +    return nvme_ns_lbaf(ns).ds;
> > > +}
> > > +
> > > +static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > > +{
> > > +    return 1 << nvme_ns_lbads(ns);
> > > +}
> > > +
> > > +static inline uint64_t nvme_ns_nlbas(NvmeNamespace *ns)
> > > +{
> > > +    return ns->size >> nvme_ns_lbads(ns);
> > > +}
> > > +
> > > +typedef struct NvmeCtrl NvmeCtrl;
> > > +
> > > +int nvme_ns_setup(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
> > > +
> > > +#endif /* NVME_NS_H */
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index a91c60fdc111..3a377bc56734 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -17,10 +17,11 @@
> > >  /**
> > >   * Usage: add options:
> > >   *      -drive file=<file>,if=none,id=<drive_id>
> > > - *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
> > > + *      -device nvme,serial=<serial>,id=<bus_name>, \
> > >   *              cmb_size_mb=<cmb_size_mb[optional]>, \
> > >   *              num_queues=<N[optional]>, \
> > >   *              mdts=<mdts[optional]>
> > > + *      -device nvme-ns,drive=<drive_id>,bus=bus_name,nsid=1
> > >   *
> > >   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> > >   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> > > @@ -28,6 +29,7 @@
> > >  
> > >  #include "qemu/osdep.h"
> > >  #include "qemu/units.h"
> > > +#include "qemu/error-report.h"
> > >  #include "hw/block/block.h"
> > >  #include "hw/pci/msix.h"
> > >  #include "hw/pci/pci.h"
> > > @@ -43,6 +45,7 @@
> > >  #include "qemu/cutils.h"
> > >  #include "trace.h"
> > >  #include "nvme.h"
> > > +#include "nvme-ns.h"
> > >  
> > >  #define NVME_SPEC_VER 0x00010300
> > >  #define NVME_MAX_QS PCI_MSIX_FLAGS_QSIZE
> > > @@ -85,6 +88,17 @@ static int nvme_addr_read(NvmeCtrl *n, hwaddr addr, void *buf, int size)
> > >      return pci_dma_read(&n->parent_obj, addr, buf, size);
> > >  }
> > >  
> > > +static uint16_t nvme_nsid_err(NvmeCtrl *n, uint32_t nsid)
> > > +{
> > > +    if (nsid && nsid < n->num_namespaces) {
> > > +        trace_nvme_dev_err_inactive_ns(nsid, n->num_namespaces);
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > > +    return NVME_INVALID_NSID | NVME_DNR;
> > > +}
> > 
> > I don't like that function, to be honest.
> > This function is called when nvme_ns returns NULL.
> > IMHO it would be better to make nvme_ns return both the namespace pointer and an error code instead.
> > In the kernel we encode error values into the returned pointer.
> > 
> 
> I'm not sure how you want me to do this? I'm not familiar with the way
> the kernel does it.
The kernel doesn't map the first memory page, and uses the pointer values 0..4095
as error values. I don't know if qemu can do this as well, but
you can just have a function that returns the error value, and the pointer to the
nvme namespace can be returned through a double-pointer parameter.
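A minimal sketch of the out-parameter alternative (the function and struct shapes here are illustrative, not the series as posted; the status constants follow QEMU's hw/block/nvme.h): the NVMe status is the return value and the namespace comes back through the double pointer, so no error ever has to be decoded out of a pointer value:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Status codes as in QEMU's hw/block/nvme.h. */
#define NVME_SUCCESS       0x0000
#define NVME_INVALID_FIELD 0x0002
#define NVME_INVALID_NSID  0x000b
#define NVME_DNR           0x4000

typedef struct NvmeNamespace { uint32_t nsid; } NvmeNamespace;

typedef struct NvmeCtrl {
    uint32_t num_namespaces;
    NvmeNamespace **namespaces; /* slot [nsid - 1] holds the namespace */
} NvmeCtrl;

/* Return the NVMe status; hand the namespace back via the out-param. */
static uint16_t nvme_ns_lookup(NvmeCtrl *n, uint32_t nsid, NvmeNamespace **ns)
{
    if (nsid == 0 || nsid > n->num_namespaces) {
        return NVME_INVALID_NSID | NVME_DNR;  /* outside the valid range */
    }
    if (n->namespaces[nsid - 1] == NULL) {
        return NVME_INVALID_FIELD | NVME_DNR; /* valid but inactive */
    }
    *ns = n->namespaces[nsid - 1];
    return NVME_SUCCESS;
}
```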

> 
> > 
> > > +
> > >  static int nvme_check_sqid(NvmeCtrl *n, uint16_t sqid)
> > >  {
> > >      return sqid < n->params.num_queues && n->sq[sqid] != NULL ? 0 : -1;
> > > @@ -889,7 +903,7 @@ static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > >      uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> > >  
> > >      if (unlikely((slba + nlb) > nsze)) {
> > > -        block_acct_invalid(blk_get_stats(n->conf.blk),
> > > +        block_acct_invalid(blk_get_stats(ns->blk),
> > >              nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> > >          trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
> > >          return NVME_LBA_RANGE | NVME_DNR;
> > > @@ -924,11 +938,12 @@ static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> > >  
> > >  static void nvme_rw_cb(NvmeRequest *req, void *opaque)
> > >  {
> > > +    NvmeNamespace *ns = req->ns;
> > >      NvmeSQueue *sq = req->sq;
> > >      NvmeCtrl *n = sq->ctrl;
> > >      NvmeCQueue *cq = n->cq[sq->cqid];
> > >  
> > > -    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
> > > +    trace_nvme_dev_rw_cb(nvme_cid(req), nvme_nsid(ns));
> > >  
> > >      nvme_enqueue_req_completion(cq, req);
> > >  }
> > > @@ -1011,10 +1026,11 @@ static void nvme_aio_cb(void *opaque, int ret)
> > >  
> > >  static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > > +    NvmeNamespace *ns = req->ns;
> > >      NvmeAIO *aio = g_new0(NvmeAIO, 1);
> > >  
> > >      *aio = (NvmeAIO) {
> > > -        .blk = n->conf.blk,
> > > +        .blk = ns->blk,
> > >          .req = req,
> > >      };
> > >  
> > > @@ -1038,12 +1054,12 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      req->slba = le64_to_cpu(rw->slba);
> > >      req->nlb  = le16_to_cpu(rw->nlb) + 1;
> > >  
> > > -    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
> > > -        req->slba, req->nlb);
> > > +    trace_nvme_dev_write_zeros(nvme_cid(req), nvme_nsid(ns), req->slba,
> > > +        req->nlb);
> > >  
> > >      status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > >      if (unlikely(status)) {
> > > -        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
> > > +        block_acct_invalid(blk_get_stats(ns->blk), BLOCK_ACCT_WRITE);
> > >          return status;
> > >      }
> > >  
> > > @@ -1053,7 +1069,7 @@ static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      aio = g_new0(NvmeAIO, 1);
> > >  
> > >      *aio = (NvmeAIO) {
> > > -        .blk = n->conf.blk,
> > > +        .blk = ns->blk,
> > >          .offset = offset,
> > >          .len = count,
> > >          .req = req,
> > > @@ -1077,22 +1093,23 @@ static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      req->nlb  = le16_to_cpu(rw->nlb) + 1;
> > >      req->slba = le64_to_cpu(rw->slba);
> > >  
> > > -    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
> > > -        req->nlb << nvme_ns_lbads(req->ns), req->slba);
> > > +    trace_nvme_dev_rw(nvme_cid(req), nvme_req_is_write(req) ? "write" : "read",
> > > +        nvme_nsid(ns), req->nlb, req->nlb << nvme_ns_lbads(ns),
> > > +        req->slba);
> > >  
> > >      status = nvme_check_rw(n, req);
> > >      if (status) {
> > > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > > +        block_acct_invalid(blk_get_stats(ns->blk), acct);
> > >          return status;
> > >      }
> > >  
> > >      status = nvme_map(n, cmd, req);
> > >      if (status) {
> > > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > > +        block_acct_invalid(blk_get_stats(ns->blk), acct);
> > >          return status;
> > >      }
> > >  
> > > -    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
> > > +    nvme_rw_aio(ns->blk, req->slba << nvme_ns_lbads(ns), req);
> > >      nvme_req_set_cb(req, nvme_rw_cb, NULL);
> > >  
> > >      return NVME_NO_COMPLETE;
> > > @@ -1105,12 +1122,11 @@ static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
> > >          cmd->opcode);
> > >  
> > > -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > > -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > > -        return NVME_INVALID_NSID | NVME_DNR;
> > > -    }
> > > +    req->ns = nvme_ns(n, nsid);
> > >  
> > > -    req->ns = &n->namespaces[nsid - 1];
> > > +    if (unlikely(!req->ns)) {
> > > +        return nvme_nsid_err(n, nsid);
> > > +    }
> > >  
> > >      switch (cmd->opcode) {
> > >      case NVME_CMD_FLUSH:
> > > @@ -1256,18 +1272,24 @@ static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint8_t rae,
> > >      uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > >          write_commands = 0;
> > >      NvmeSmartLog smart;
> > > -    BlockAcctStats *s;
> > >  
> > >      if (nsid && nsid != 0xffffffff) {
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > >      }
> > >  
> > > -    s = blk_get_stats(n->conf.blk);
> > > +    for (int i = 1; i <= n->num_namespaces; i++) {
> > > +        NvmeNamespace *ns = nvme_ns(n, i);
> > > +        if (!ns) {
> > > +            continue;
> > > +        }
> > >  
> > > -    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > > -    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > > -    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > > -    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > > +        BlockAcctStats *s = blk_get_stats(ns->blk);
> > > +
> > > +        units_read += s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > > +        units_written += s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > > +        read_commands += s->nr_ops[BLOCK_ACCT_READ];
> > > +        write_commands += s->nr_ops[BLOCK_ACCT_WRITE];
> > > +    }
> > 
> > Very minor nitpick: something to do in the future would be to
> > report the statistics per namespace.
> 
In NVMe v1.4 there is no namespace-specific information in the
SMART/Health log page, so this is valid for both v1.3 and v1.4.

It does seem to be supported:

"This log page is used to provide SMART and general health information. The information provided is over
the life of the controller and is retained across power cycles. To request the controller log page, the
namespace identifier specified is FFFFFFFFh. The controller may also support requesting the log page on
a per namespace basis, as indicated by bit 0 of the LPA field in the Identify Controller data structure in
Figure 247."


However, reading the spec again, it does look like the I/O statistics are indeed
per controller, so you are right. The spec is misleading here.
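The gating described in the quoted spec paragraph can be sketched as follows (the constant name is illustrative): nsid 0 or FFFFFFFFh always addresses the controller-wide log, while a namespace-specific nsid is only acceptable when bit 0 of the Identify Controller LPA field is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NVME_LPA_NS_SMART (1 << 0) /* illustrative name for LPA bit 0 */

static bool nvme_smart_nsid_ok(uint8_t lpa, uint32_t nsid)
{
    if (nsid == 0 || nsid == 0xffffffff) {
        return true; /* controller-wide log, always allowed */
    }
    return (lpa & NVME_LPA_NS_SMART) != 0; /* per-namespace if advertised */
}
```

With LPA bit 0 clear this reduces to the patch's check that rejects any other nsid with Invalid Field.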


> 
> > >  
> > >      if (off > sizeof(smart)) {
> > >          return NVME_INVALID_FIELD | NVME_DNR;
> > > @@ -1477,19 +1499,25 @@ static uint16_t nvme_identify_ctrl(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  
> > >  static uint16_t nvme_identify_ns(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > > -    NvmeNamespace *ns;
> > > +    NvmeIdNs *id_ns, inactive = { 0 };
> > >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> > > +    NvmeNamespace *ns = nvme_ns(n, nsid);
> > >  
> > >      trace_nvme_dev_identify_ns(nsid);
> > >  
> > > -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > > -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > > -        return NVME_INVALID_NSID | NVME_DNR;
> > > +    if (unlikely(!ns)) {
> > > +        uint16_t status = nvme_nsid_err(n, nsid);
> > > +
> > > +        if (!nvme_status_is_error(status, NVME_INVALID_FIELD)) {
> > > +            return status;
> > > +        }
> > 
> > I really don't like checking the error value like that. 
> > It would be better IMHO to have something like
> > nvme_is_valid_ns, nvme_is_active_ns or something like that.
> > 
> 
> Fixed.
> 
> > > +
> > > +        id_ns = &inactive;
> > > +    } else {
> > > +        id_ns = &ns->id_ns;
> > >      }
> > >  
> > > -    ns = &n->namespaces[nsid - 1];
> > > -
> > > -    return nvme_dma(n, (uint8_t *) &ns->id_ns, sizeof(ns->id_ns), cmd,
> > > +    return nvme_dma(n, (uint8_t *) id_ns, sizeof(NvmeIdNs), cmd,
> > >          DMA_DIRECTION_FROM_DEVICE, req);
> > >  }
> > >  
> > > @@ -1505,11 +1533,11 @@ static uint16_t nvme_identify_ns_list(NvmeCtrl *n, NvmeCmd *cmd,
> > >      trace_nvme_dev_identify_ns_list(min_nsid);
> > >  
> > >      list = g_malloc0(data_len);
> > > -    for (i = 0; i < n->num_namespaces; i++) {
> > > -        if (i < min_nsid) {
> > > +    for (i = 1; i <= n->num_namespaces; i++) {
> > > +        if (i <= min_nsid || !nvme_ns(n, i)) {
> > >              continue;
> > >          }
> > > -        list[j++] = cpu_to_le32(i + 1);
> > > +        list[j++] = cpu_to_le32(i);
> > >          if (j == data_len / sizeof(uint32_t)) {
> > >              break;
> > >          }
> > 
> > The refactoring part (removing that +1), which is very nice IMHO, should be moved
> > to one of the earlier refactoring patches.
> > 
> 
> Done.
> 
> > > @@ -1539,9 +1567,8 @@ static uint16_t nvme_identify_ns_descr_list(NvmeCtrl *n, NvmeCmd *cmd,
> > >  
> > >      trace_nvme_dev_identify_ns_descr_list(nsid);
> > >  
> > > -    if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > > -        trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > > -        return NVME_INVALID_NSID | NVME_DNR;
> > > +    if (unlikely(!nvme_ns(n, nsid))) {
> > > +        return nvme_nsid_err(n, nsid);
> > >      }
> > >  
> > >      /*
> > > @@ -1681,7 +1708,7 @@ static uint16_t nvme_get_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          result = cpu_to_le32(n->features.err_rec);
> > >          break;
> > >      case NVME_VOLATILE_WRITE_CACHE:
> > > -        result = blk_enable_write_cache(n->conf.blk);
> > > +        result = cpu_to_le32(n->features.volatile_wc);
> > 
> > OK, this fixes the lack of endianness conversion I pointed out in patch 12.
> > >          trace_nvme_dev_getfeat_vwcache(result ? "enabled" : "disabled");
> > >          break;
> > >      case NVME_NUMBER_OF_QUEUES:
> > > @@ -1735,6 +1762,8 @@ static uint16_t nvme_set_feature_timestamp(NvmeCtrl *n, NvmeCmd *cmd,
> > >  
> > >  static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > > +    NvmeNamespace *ns;
> > > +
> > >      uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > >      uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > >  
> > > @@ -1766,8 +1795,19 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  
> > >          break;
> > >      case NVME_VOLATILE_WRITE_CACHE:
> > > -        blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > > +        n->features.volatile_wc = dw11;
> > > +
> > > +        for (int i = 1; i <= n->num_namespaces; i++) {
> > > +            ns = nvme_ns(n, i);
> > > +            if (!ns) {
> > > +                continue;
> > > +            }
> > > +
> > > +            blk_set_enable_write_cache(ns->blk, dw11 & 1);
> > > +        }
> > > +
> > 
> > Features are per namespace (page 79 in the spec), so this is a
> > good candidate for a per-namespace feature.
> > 
> 
> Some features are, but the Volatile Write Cache feature is actually not.
After looking again to the spec, I confirm that.

> 
> > >          break;
> > > +
> > >      case NVME_NUMBER_OF_QUEUES:
> > >          if (n->qs_created) {
> > >              return NVME_CMD_SEQ_ERROR | NVME_DNR;
> > > @@ -1890,9 +1930,17 @@ static void nvme_process_sq(void *opaque)
> > >  
> > >  static void nvme_clear_ctrl(NvmeCtrl *n)
> > >  {
> > > +    NvmeNamespace *ns;
> > >      int i;
> > >  
> > > -    blk_drain(n->conf.blk);
> > > +    for (i = 1; i <= n->num_namespaces; i++) {
> > > +        ns = nvme_ns(n, i);
> > > +        if (!ns) {
> > > +            continue;
> > > +        }
> > > +
> > > +        blk_drain(ns->blk);
> > > +    }
> > >  
> > >      for (i = 0; i < n->params.num_queues; i++) {
> > >          if (n->sq[i] != NULL) {
> > > @@ -1915,7 +1963,15 @@ static void nvme_clear_ctrl(NvmeCtrl *n)
> > >      n->outstanding_aers = 0;
> > >      n->qs_created = false;
> > >  
> > > -    blk_flush(n->conf.blk);
> > > +    for (i = 1; i <= n->num_namespaces; i++) {
> > > +        ns = nvme_ns(n, i);
> > > +        if (!ns) {
> > > +            continue;
> > > +        }
> > > +
> > > +        blk_flush(ns->blk);
> > > +    }
> > > +
> > >      n->bar.cc = 0;
> > >  }
> > >  
> > > @@ -2335,8 +2391,8 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> > >  {
> > >      NvmeParams *params = &n->params;
> > >  
> > > -    if (!n->conf.blk) {
> > > -        error_setg(errp, "nvme: block backend not configured");
> > > +    if (!n->namespace.blk && !n->parent_obj.qdev.id) {
> > > +        error_setg(errp, "nvme: invalid 'id' parameter");
> > 
> > Nitpick: I think that qemu usually allows the user to shoot himself in the foot and specify
> > a device without an ID, to which you can't attach devices, so I think this check is not needed.
> > You also probably mean 'missing ID'.
> > 
> 
> Right. I added a deprecation warning when the drive parameter is used
> instead.
Great!
> 
> > >          return 1;
> > >      }
> > >  
> > > @@ -2353,22 +2409,10 @@ static int nvme_check_constraints(NvmeCtrl *n, Error **errp)
> > >      return 0;
> > >  }
> > >  
> > > -static int nvme_init_blk(NvmeCtrl *n, Error **errp)
> > > -{
> > > -    blkconf_blocksizes(&n->conf);
> > > -    if (!blkconf_apply_backend_options(&n->conf, blk_is_read_only(n->conf.blk),
> > > -        false, errp)) {
> > > -        return 1;
> > > -    }
> > > -
> > > -    return 0;
> > > -}
> > > -
> > >  static void nvme_init_state(NvmeCtrl *n)
> > >  {
> > > -    n->num_namespaces = 1;
> > > +    n->num_namespaces = 0;
> > 
> > And to say that again: since the number of valid namespaces should remain static,
> > here you should just initialize this to NVME_MAX_NAMESPACES, and remove the code
> > that changes IDCTRL.NN dynamically.
> > 
> 
> Done.
> 
> > 
> > >      n->reg_size = pow2ceil(0x1004 + 2 * (n->params.num_queues + 1) * 4);
> > > -    n->namespaces = g_new0(NvmeNamespace, n->num_namespaces);
> > >      n->sq = g_new0(NvmeSQueue *, n->params.num_queues);
> > >      n->cq = g_new0(NvmeCQueue *, n->params.num_queues);
> > >  
> > > @@ -2483,12 +2527,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >      id->cqes = (0x4 << 4) | 0x4;
> > >      id->nn = cpu_to_le32(n->num_namespaces);
> > >      id->oncs = cpu_to_le16(NVME_ONCS_WRITE_ZEROS | NVME_ONCS_TIMESTAMP);
> > > -
> > > -
> > > -    if (blk_enable_write_cache(n->conf.blk)) {
> > > -        id->vwc = 1;
> > > -    }
> > > -
> > > +    id->vwc = 1;
> > >      id->sgls = cpu_to_le32(0x1);
> > >  
> > >      strcpy((char *) id->subnqn, "nqn.2019-08.org.qemu:");
> > > @@ -2509,22 +2548,25 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >      n->bar.intmc = n->bar.intms = 0;
> > >  }
> > >  
> > > -static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > > +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > >  {
> > > -    int64_t bs_size;
> > > -    NvmeIdNs *id_ns = &ns->id_ns;
> > > +    uint32_t nsid = nvme_nsid(ns);
> > >  
> > > -    bs_size = blk_getlength(n->conf.blk);
> > > -    if (bs_size < 0) {
> > > -        error_setg_errno(errp, -bs_size, "blk_getlength");
> > > +    if (nsid == 0 || nsid > NVME_MAX_NAMESPACES) {
> > > +        error_setg(errp, "invalid nsid");
> > >          return 1;
> > >      }
> > 
> > As I said above, it would be nice to find a valid namespace slot instead
> > of erroring out when nsid == 0.
> > Also, the error message could be improved a bit IMHO.
> > 
> 
> Done.
Thanks!

> 
> > >  
> > > -    id_ns->lbaf[0].ds = BDRV_SECTOR_BITS;
> > > -    n->ns_size = bs_size;
> > > +    if (n->namespaces[nsid - 1]) {
> > > +        error_setg(errp, "nsid must be unique");
> > > +        return 1;
> > > +    }
> > > +
> > > +    trace_nvme_dev_register_namespace(nsid);
> > >  
> > > -    id_ns->ncap = id_ns->nuse = id_ns->nsze =
> > > -        cpu_to_le64(nvme_ns_nlbas(n, ns));
> > > +    n->namespaces[nsid - 1] = ns;
> > > +    n->num_namespaces = MAX(n->num_namespaces, nsid);
> > > +    n->id_ctrl.nn = cpu_to_le32(n->num_namespaces);
> > 
> > These should be removed once you set num_namespaces to be fixed number.
> > 
> 
> Done.
> 
> > >  
> > >      return 0;
> > >  }
> > > @@ -2532,30 +2574,31 @@ static int nvme_init_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp)
> > >  static void nvme_realize(PCIDevice *pci_dev, Error **errp)
> > >  {
> > >      NvmeCtrl *n = NVME(pci_dev);
> > > +    NvmeNamespace *ns;
> > >      Error *local_err = NULL;
> > > -    int i;
> > >  
> > >      if (nvme_check_constraints(n, &local_err)) {
> > >          error_propagate_prepend(errp, local_err, "nvme_check_constraints: ");
> > >          return;
> > >      }
> > >  
> > > +    qbus_create_inplace(&n->bus, sizeof(NvmeBus), TYPE_NVME_BUS,
> > > +        &pci_dev->qdev, n->parent_obj.qdev.id);
> > > +
> > >      nvme_init_state(n);
> > > -
> > > -    if (nvme_init_blk(n, &local_err)) {
> > > -        error_propagate_prepend(errp, local_err, "nvme_init_blk: ");
> > > -        return;
> > > -    }
> > > -
> > > -    for (i = 0; i < n->num_namespaces; i++) {
> > > -        if (nvme_init_namespace(n, &n->namespaces[i], &local_err)) {
> > > -            error_propagate_prepend(errp, local_err, "nvme_init_namespace: ");
> > > -            return;
> > > -        }
> > > -    }
> > > -
> > >      nvme_init_pci(n, pci_dev);
> > >      nvme_init_ctrl(n);
> > > +
> > > +    /* setup a namespace if the controller drive property was given */
> > > +    if (n->namespace.blk) {
> > > +        ns = &n->namespace;
> > > +        ns->params.nsid = 1;
> > > +
> > > +        if (nvme_ns_setup(n, ns, &local_err)) {
> > > +            error_propagate_prepend(errp, local_err, "nvme_ns_setup: ");
> > > +            return;
> > > +        }
> > > +    }
> > >  }
> > >  
> > >  static void nvme_exit(PCIDevice *pci_dev)
> > > @@ -2576,7 +2619,8 @@ static void nvme_exit(PCIDevice *pci_dev)
> > >  }
> > >  
> > >  static Property nvme_props[] = {
> > > -    DEFINE_BLOCK_PROPERTIES(NvmeCtrl, conf),
> > > +    DEFINE_BLOCK_PROPERTIES_BASE(NvmeCtrl, conf), \
> > > +    DEFINE_PROP_DRIVE("drive", NvmeCtrl, namespace.blk), \
> > >      DEFINE_NVME_PROPERTIES(NvmeCtrl, params),
> > >      DEFINE_PROP_END_OF_LIST(),
> > >  };
> > > @@ -2608,26 +2652,35 @@ static void nvme_instance_init(Object *obj)
> > >  {
> > >      NvmeCtrl *s = NVME(obj);
> > >  
> > > -    device_add_bootindex_property(obj, &s->conf.bootindex,
> > > -                                  "bootindex", "/namespace@1,0",
> > > -                                  DEVICE(obj), &error_abort);
> > > +    if (s->namespace.blk) {
> > > +        device_add_bootindex_property(obj, &s->conf.bootindex,
> > > +                                      "bootindex", "/namespace@1,0",
> > > +                                      DEVICE(obj), &error_abort);
> > > +    }
> > >  }
> > >  
> > >  static const TypeInfo nvme_info = {
> > >      .name          = TYPE_NVME,
> > >      .parent        = TYPE_PCI_DEVICE,
> > >      .instance_size = sizeof(NvmeCtrl),
> > > -    .class_init    = nvme_class_init,
> > >      .instance_init = nvme_instance_init,
> > > +    .class_init    = nvme_class_init,
> > >      .interfaces = (InterfaceInfo[]) {
> > >          { INTERFACE_PCIE_DEVICE },
> > >          { }
> > >      },
> > >  };
> > >  
> > > +static const TypeInfo nvme_bus_info = {
> > > +    .name = TYPE_NVME_BUS,
> > > +    .parent = TYPE_BUS,
> > > +    .instance_size = sizeof(NvmeBus),
> > > +};
> > > +
> > >  static void nvme_register_types(void)
> > >  {
> > >      type_register_static(&nvme_info);
> > > +    type_register_static(&nvme_bus_info);
> > >  }
> > >  
> > >  type_init(nvme_register_types)
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index 3319f8edd7e1..c3cef0f024da 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -2,6 +2,9 @@
> > >  #define HW_NVME_H
> > >  
> > >  #include "block/nvme.h"
> > > +#include "nvme-ns.h"
> > > +
> > > +#define NVME_MAX_NAMESPACES 256
> > >  
> > >  #define DEFINE_NVME_PROPERTIES(_state, _props) \
> > >      DEFINE_PROP_STRING("serial", _state, _props.serial), \
> > > @@ -108,26 +111,6 @@ typedef struct NvmeCQueue {
> > >      QTAILQ_HEAD(, NvmeRequest) req_list;
> > >  } NvmeCQueue;
> > >  
> > > -typedef struct NvmeNamespace {
> > > -    NvmeIdNs        id_ns;
> > > -} NvmeNamespace;
> > > -
> > > -static inline NvmeLBAF nvme_ns_lbaf(NvmeNamespace *ns)
> > > -{
> > > -    NvmeIdNs *id_ns = &ns->id_ns;
> > > -    return id_ns->lbaf[NVME_ID_NS_FLBAS_INDEX(id_ns->flbas)];
> > > -}
> > > -
> > > -static inline uint8_t nvme_ns_lbads(NvmeNamespace *ns)
> > > -{
> > > -    return nvme_ns_lbaf(ns).ds;
> > > -}
> > > -
> > > -static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > > -{
> > > -    return 1 << nvme_ns_lbads(ns);
> > > -}
> > > -
> > >  typedef enum NvmeAIOOp {
> > >      NVME_AIO_OPC_NONE         = 0x0,
> > >      NVME_AIO_OPC_FLUSH        = 0x1,
> > > @@ -182,6 +165,13 @@ static inline bool nvme_req_is_write(NvmeRequest *req)
> > >      }
> > >  }
> > >  
> > > +#define TYPE_NVME_BUS "nvme-bus"
> > > +#define NVME_BUS(obj) OBJECT_CHECK(NvmeBus, (obj), TYPE_NVME_BUS)
> > > +
> > > +typedef struct NvmeBus {
> > > +    BusState parent_bus;
> > > +} NvmeBus;
> > > +
> > >  #define TYPE_NVME "nvme"
> > >  #define NVME(obj) \
> > >          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > > @@ -191,8 +181,9 @@ typedef struct NvmeCtrl {
> > >      MemoryRegion iomem;
> > >      MemoryRegion ctrl_mem;
> > >      NvmeBar      bar;
> > > -    BlockConf    conf;
> > >      NvmeParams   params;
> > > +    NvmeBus      bus;
> > > +    BlockConf    conf;
> > >  
> > >      bool        qs_created;
> > >      uint32_t    page_size;
> > > @@ -203,7 +194,6 @@ typedef struct NvmeCtrl {
> > >      uint32_t    reg_size;
> > >      uint32_t    num_namespaces;
> > >      uint32_t    max_q_ents;
> > > -    uint64_t    ns_size;
> > >      uint8_t     outstanding_aers;
> > >      uint32_t    cmbsz;
> > >      uint32_t    cmbloc;
> > > @@ -219,7 +209,8 @@ typedef struct NvmeCtrl {
> > >      QTAILQ_HEAD(, NvmeAsyncEvent) aer_queue;
> > >      int         aer_queued;
> > >  
> > > -    NvmeNamespace   *namespaces;
> > > +    NvmeNamespace   namespace;
> > > +    NvmeNamespace   *namespaces[NVME_MAX_NAMESPACES];
> > >      NvmeSQueue      **sq;
> > >      NvmeCQueue      **cq;
> > >      NvmeSQueue      admin_sq;
> > > @@ -228,9 +219,13 @@ typedef struct NvmeCtrl {
> > >      NvmeFeatureVal  features;
> > >  } NvmeCtrl;
> > >  
> > > -static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > > +static inline NvmeNamespace *nvme_ns(NvmeCtrl *n, uint32_t nsid)
> > >  {
> > > -    return n->ns_size >> nvme_ns_lbads(ns);
> > > +    if (!nsid || nsid > n->num_namespaces) {
> > > +        return NULL;
> > > +    }
> > > +
> > > +    return n->namespaces[nsid - 1];
> > >  }
> > >  
> > >  static inline uint16_t nvme_cid(NvmeRequest *req)
> > > @@ -253,4 +248,6 @@ static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
> > >      return req->sq->ctrl;
> > >  }
> > >  
> > > +int nvme_register_namespace(NvmeCtrl *n, NvmeNamespace *ns, Error **errp);
> > > +
> > >  #endif /* HW_NVME_H */
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index 81d69e15fc32..aaf1fcda7923 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -29,6 +29,7 @@ hd_geometry_guess(void *blk, uint32_t cyls, uint32_t heads, uint32_t secs, int t
> > >  
> > >  # nvme.c
> > >  # nvme traces for successful events
> > > +nvme_dev_register_namespace(uint32_t nsid) "nsid %"PRIu32""
> > >  nvme_dev_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
> > >  nvme_dev_irq_pin(void) "pulsing IRQ pin"
> > >  nvme_dev_irq_masked(void) "IRQ is masked"
> > > @@ -38,7 +39,7 @@ nvme_dev_map_sgl(uint16_t cid, uint8_t typ, uint32_t nlb, uint64_t len) "cid %"P
> > >  nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count
> > > %"PRIu64" opc \"%s\" req %p"
> > >  nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> > >  nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> > > -nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> > > +nvme_dev_rw(uint16_t cid, const char *verb, uint32_t nsid, uint32_t nlb, uint64_t count, uint64_t lba) "cid %"PRIu16" %s nsid %"PRIu32" nlb %"PRIu32" count %"PRIu64" lba 0x%"PRIx64""
> > >  nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
> > >  nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
> > >  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16",
> > > qflags=%"PRIu16""
> > > @@ -94,7 +95,8 @@ nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or no
> > >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> > >  nvme_dev_err_invalid_prp2_missing(void) "PRP2 is null and more data to be transferred"
> > >  nvme_dev_err_invalid_prp(void) "invalid PRP"
> > > -nvme_dev_err_invalid_ns(uint32_t ns, uint32_t limit) "invalid namespace %u not within 1-%u"
> > > +nvme_dev_err_invalid_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
> > > +nvme_dev_err_inactive_ns(uint32_t nsid, uint32_t nn) "nsid %"PRIu32" nn %"PRIu32""
> > >  nvme_dev_err_invalid_opc(uint8_t opc) "invalid opcode 0x%"PRIx8""
> > >  nvme_dev_err_invalid_admin_opc(uint8_t opc) "invalid admin opcode 0x%"PRIx8""
> > >  nvme_dev_err_invalid_lba_range(uint64_t start, uint64_t len, uint64_t limit) "Invalid LBA start=%"PRIu64" len=%"PRIu64" limit=%"PRIu64""
> > 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> > 
> > 
> 
> 







^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 17/26] nvme: allow multiple aios per command
  2020-03-16  7:53         ` Klaus Birkelund Jensen
@ 2020-03-25 10:24           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:24 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:53 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 13:48, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > This refactors how the device issues asynchronous block backend
> > > requests. The NvmeRequest now holds a queue of NvmeAIOs that are
> > > associated with the command. This allows multiple aios to be issued for
> > > a command. Only when all requests have been completed will the device
> > > post a completion queue entry.
> > > 
> > > Because the device is currently guaranteed to only issue a single aio
> > > request per command, the benefit is not immediately obvious. But this
> > > functionality is required to support metadata, the dataset management
> > > command and other features.
> > 
> > I don't know which strategy will be chosen for supporting metadata
> > (qemu doesn't have any notion of metadata in the block layer), but for dataset management
> > you are right. The Dataset Management command can contain a table of areas to discard
> > (although in reality I have seen no driver put more than one entry there).
> > 
> 
> The strategy is different depending on how the metadata is transferred
> between host and device. For the "separate buffer" case, metadata is
> transferred using a separate memory pointer in the nvme command (MPTR).
> In this case the metadata is kept separately on a new blockdev attached
> to the namespace.
Looks reasonable.
> 


> In the other case, metadata is transferred as part of an extended lba
> (say 512 + 8 bytes) and kept inline on the main namespace blockdev. This
> is challenging for QEMU as it breaks interoperability of the image with
> other devices. But that is a discussion for fresh RFC ;)

Yes, this one is quite problematic. IMHO even the kernel opted not to
support this kind of metadata (I know that since I played with one of Intel's enterprise
SSDs when I developed nvme-mdev, and sadly this is the only kind of metadata it supports).
I guess if we have to support this format (for the sake of making our nvme virtual device
as feature complete as possible for driver development), I would emulate this with a
separate drive as well.

> 
> Note that the support for multiple AIOs is also used for DULBE support
Is this a typo? I don't recall anything like that from the spec.

> down the line when I get around to posting those patches. So this is
> preparatory for a lot of features that require persistent state across
> device power off.
All right. Thanks again for your work. I wish I had all these features
when I developed nvme-mdev, it would make my life much easier.

> 
> > 
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > > ---
> > >  hw/block/nvme.c       | 449 +++++++++++++++++++++++++++++++++---------
> > >  hw/block/nvme.h       | 134 +++++++++++--
> > >  hw/block/trace-events |   8 +
> > >  3 files changed, 480 insertions(+), 111 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 334265efb21e..e97da35c4ca1 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -19,7 +19,8 @@
> > >   *      -drive file=<file>,if=none,id=<drive_id>
> > >   *      -device nvme,drive=<drive_id>,serial=<serial>,id=<id[optional]>, \
> > >   *              cmb_size_mb=<cmb_size_mb[optional]>, \
> > > - *              num_queues=<N[optional]>
> > > + *              num_queues=<N[optional]>, \
> > > + *              mdts=<mdts[optional]>
> > 
> > Could you split mdts checks into a separate patch? This is not related to the series.
> 
> Absolutely. Done.
Perfect, thanks!
> 
> > 
> > >   *
> > >   * Note cmb_size_mb denotes size of CMB in MB. CMB is assumed to be at
> > >   * offset 0 in BAR2 and supports only WDS, RDS and SQS for now.
> > > @@ -57,6 +58,7 @@
> > >      } while (0)
> > >  
> > >  static void nvme_process_sq(void *opaque);
> > > +static void nvme_aio_cb(void *opaque, int ret);
> > >  
> > >  static inline void *nvme_addr_to_cmb(NvmeCtrl *n, hwaddr addr)
> > >  {
> > > @@ -341,6 +343,107 @@ static uint16_t nvme_dma_prp(NvmeCtrl *n, uint8_t *ptr, uint32_t len,
> > >      return status;
> > >  }
> > >  
> > > +static uint16_t nvme_map(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > +{
> > > +    NvmeNamespace *ns = req->ns;
> > > +
> > > +    uint32_t len = req->nlb << nvme_ns_lbads(ns);
> > > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > +
> > > +    return nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, len, req);
> > > +}
> > 
> > Same here, this is another nice refactoring and it should be in separate patch.
> 
> Done.
> 
> > 
> > > +
> > > +static void nvme_aio_destroy(NvmeAIO *aio)
> > > +{
> > > +    g_free(aio);
> > > +}
> > > +
> > > +static inline void nvme_req_register_aio(NvmeRequest *req, NvmeAIO *aio,
> > > +    NvmeAIOOp opc)
> > > +{
> > > +    aio->opc = opc;
> > > +
> > > +    trace_nvme_dev_req_register_aio(nvme_cid(req), aio, blk_name(aio->blk),
> > > +        aio->offset, aio->len, nvme_aio_opc_str(aio), req);
> > > +
> > > +    if (req) {
> > > +        QTAILQ_INSERT_TAIL(&req->aio_tailq, aio, tailq_entry);
> > > +    }
> > > +}
> > > +
> > > +static void nvme_aio(NvmeAIO *aio)
> > 
> > Function name not clear to me. Maybe change this to something like nvme_submit_aio.
> 
> Fixed.
> 
> > > +{
> > > +    BlockBackend *blk = aio->blk;
> > > +    BlockAcctCookie *acct = &aio->acct;
> > > +    BlockAcctStats *stats = blk_get_stats(blk);
> > > +
> > > +    bool is_write, dma;
> > > +
> > > +    switch (aio->opc) {
> > > +    case NVME_AIO_OPC_NONE:
> > > +        break;
> > > +
> > > +    case NVME_AIO_OPC_FLUSH:
> > > +        block_acct_start(stats, acct, 0, BLOCK_ACCT_FLUSH);
> > > +        aio->aiocb = blk_aio_flush(blk, nvme_aio_cb, aio);
> > > +        break;
> > > +
> > > +    case NVME_AIO_OPC_WRITE_ZEROES:
> > > +        block_acct_start(stats, acct, aio->len, BLOCK_ACCT_WRITE);
> > > +        aio->aiocb = blk_aio_pwrite_zeroes(blk, aio->offset, aio->len,
> > > +            BDRV_REQ_MAY_UNMAP, nvme_aio_cb, aio);
> > > +        break;
> > > +
> > > +    case NVME_AIO_OPC_READ:
> > > +    case NVME_AIO_OPC_WRITE:
> > > +        dma = aio->qsg != NULL;
> > 
> > This doesn't work.
> > aio->qsg is always not null since nvme_rw_aio sets this to &req->qsg
> > which is then written to aio->qsg by nvme_aio_new.
> 
> Ouch. This is a refactoring gone awry. Very nicely spotted.
> 
> > 
> > That is yet another reason I really don't like these parallel QEMUSGList
> > and QEMUIOVector. However I see that few other qemu drivers do this,
> > thus this is probably a necessary evil.
> > 
> > What we can do maybe is to do dma_memory_map on the SG list,
> > and then deal with QEMUIOVector only. Virtio does this
> > (virtqueue_pop/virtqueue_push)
> 
> Yeah, I agree. But I really wanna use the dma helpers to not mess around
> with that complexity.
Yea, after reviewing all of the patchset, I also kind of got used to this,
so I don't mind leaving this like that for now.

> 
> > 
> > 
> > > +        is_write = (aio->opc == NVME_AIO_OPC_WRITE);
> > > +
> > > +        block_acct_start(stats, acct, aio->len,
> > > +            is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> > > +
> > > +        if (dma) {
> > > +            aio->aiocb = is_write ?
> > > +                dma_blk_write(blk, aio->qsg, aio->offset,
> > > +                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio) :
> > > +                dma_blk_read(blk, aio->qsg, aio->offset,
> > > +                    BDRV_SECTOR_SIZE, nvme_aio_cb, aio);
> > > +
> > 
> > Extra space
> > > +            return;
> > > +        }
> > > +
> > > +        aio->aiocb = is_write ?
> > > +            blk_aio_pwritev(blk, aio->offset, aio->iov, 0,
> > > +                nvme_aio_cb, aio) :
> > > +            blk_aio_preadv(blk, aio->offset, aio->iov, 0,
> > > +                nvme_aio_cb, aio);
> > > +
> > > +        break;
> > > +    }
> > > +}
> > > +
> > > +static void nvme_rw_aio(BlockBackend *blk, uint64_t offset, NvmeRequest *req)
> > > +{
> > > +    NvmeAIO *aio;
> > > +    size_t len = req->qsg.nsg > 0 ? req->qsg.size : req->iov.size;
> > > +
> > > +    aio = g_new0(NvmeAIO, 1);
> > > +
> > > +    *aio = (NvmeAIO) {
> > > +        .blk = blk,
> > > +        .offset = offset,
> > > +        .len = len,
> > > +        .req = req,
> > > +        .qsg = &req->qsg,
> > > +        .iov = &req->iov,
> > > +    };
> > > +
> > > +    nvme_req_register_aio(req, aio, nvme_req_is_write(req) ?
> > > +        NVME_AIO_OPC_WRITE : NVME_AIO_OPC_READ);
> > 
> > nitpick: I think I don't like the nvme_req_register_aio name either, but I don't think I have
> > a better name for it yet. 
> 
> If you figure out a better name, let me know ;) I thought about
> "enqueue", but that's not really what it's doing. It is just registering
> that an AIO is associated with the request. Maybe "post" or something,
> not sure.
nvme_reg_add_aio maybe (with a comment on top explaining what it does)?
> 
> > > +    nvme_aio(aio);
> > > +}
> > > +
> > >  static void nvme_post_cqes(void *opaque)
> > >  {
> > >      NvmeCQueue *cq = opaque;
> > > @@ -364,6 +467,7 @@ static void nvme_post_cqes(void *opaque)
> > >          nvme_inc_cq_tail(cq);
> > >          pci_dma_write(&n->parent_obj, addr, (void *)&req->cqe,
> > >              sizeof(req->cqe));
> > > +        nvme_req_clear(req);
> > >          QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
> > >      }
> > >      if (cq->tail != cq->head) {
> > > @@ -374,8 +478,8 @@ static void nvme_post_cqes(void *opaque)
> > >  static void nvme_enqueue_req_completion(NvmeCQueue *cq, NvmeRequest *req)
> > >  {
> > >      assert(cq->cqid == req->sq->cqid);
> > > -    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid,
> > > -        req->status);
> > > +    trace_nvme_dev_enqueue_req_completion(nvme_cid(req), cq->cqid, req->status);
> > > +
> > >      QTAILQ_REMOVE(&req->sq->out_req_list, req, entry);
> > >      QTAILQ_INSERT_TAIL(&cq->req_list, req, entry);
> > >      timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> > > @@ -460,135 +564,272 @@ static void nvme_clear_events(NvmeCtrl *n, uint8_t event_type)
> > >      }
> > >  }
> > >  
> > > -static void nvme_rw_cb(void *opaque, int ret)
> > > +static inline uint16_t nvme_check_mdts(NvmeCtrl *n, size_t len,
> > > +    NvmeRequest *req)
> > > +{
> > > +    uint8_t mdts = n->params.mdts;
> > > +
> > > +    if (mdts && len > n->page_size << mdts) {
> > > +        trace_nvme_dev_err_mdts(nvme_cid(req), n->page_size << mdts, len);
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > > +
> > > +static inline uint16_t nvme_check_prinfo(NvmeCtrl *n, NvmeRequest *req)
> > > +{
> > > +    NvmeRwCmd *rw = (NvmeRwCmd *) &req->cmd;
> > > +    NvmeNamespace *ns = req->ns;
> > > +
> > > +    uint16_t ctrl = le16_to_cpu(rw->control);
> > > +
> > > +    if ((ctrl & NVME_RW_PRINFO_PRACT) && !(ns->id_ns.dps & DPS_TYPE_MASK)) {
> > > +        trace_nvme_dev_err_prinfo(nvme_cid(req), ctrl);
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > > +
> > > +static inline uint16_t nvme_check_bounds(NvmeCtrl *n, uint64_t slba,
> > > +    uint32_t nlb, NvmeRequest *req)
> > > +{
> > > +    NvmeNamespace *ns = req->ns;
> > > +    uint64_t nsze = le64_to_cpu(ns->id_ns.nsze);
> > > +
> > > +    if (unlikely((slba + nlb) > nsze)) {
> > > +        block_acct_invalid(blk_get_stats(n->conf.blk),
> > > +            nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ);
> > > +        trace_nvme_dev_err_invalid_lba_range(slba, nlb, nsze);
> > > +        return NVME_LBA_RANGE | NVME_DNR;
> > > +    }
> > 
> > Double check this in regard to integer overflows, e.g. if slba + nlb overflows.
> > 
> > That is what I did in my nvme-mdev:
> > 
> > static inline bool check_range(u64 start, u64 size, u64 end)
> > {
> > 	u64 test = start + size;
> > 
> > 	/* check for overflow */
> > 	if (test < start || test < size)
> > 		return false;
> > 	return test <= end;
> > }
> > 
> 
> Fixed in new patch.
> 
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > > +
> > > +static uint16_t nvme_check_rw(NvmeCtrl *n, NvmeRequest *req)
> > > +{
> > > +    NvmeNamespace *ns = req->ns;
> > > +    size_t len = req->nlb << nvme_ns_lbads(ns);
> > > +    uint16_t status;
> > > +
> > > +    status = nvme_check_mdts(n, len, req);
> > > +    if (status) {
> > > +        return status;
> > > +    }
> > > +
> > > +    status = nvme_check_prinfo(n, req);
> > > +    if (status) {
> > > +        return status;
> > > +    }
> > > +
> > > +    status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > > +    if (status) {
> > > +        return status;
> > > +    }
> > > +
> > > +    return NVME_SUCCESS;
> > > +}
> > 
> > Note that there are more things to check if we don't support metadata,
> > like for instance the metadata pointer in the submission entry is NULL.
> > 
> 
> Yeah. I think these will be introduced along the way. It's a step
> towards better compliance, but it doesn't break the device.
> 
> > All these check_ functions are very good but they should move to
> > a separate patch since they just implement parts of the spec
> > and have nothing to do with the patch subject.
> > 
> 
> Done. 
> 
> > > +
> > > +static void nvme_rw_cb(NvmeRequest *req, void *opaque)
> > >  {
> > > -    NvmeRequest *req = opaque;
> > >      NvmeSQueue *sq = req->sq;
> > >      NvmeCtrl *n = sq->ctrl;
> > >      NvmeCQueue *cq = n->cq[sq->cqid];
> > >  
> > > -    if (!ret) {
> > > -        block_acct_done(blk_get_stats(n->conf.blk), &req->acct);
> > > -        req->status = NVME_SUCCESS;
> > > -    } else {
> > > -        block_acct_failed(blk_get_stats(n->conf.blk), &req->acct);
> > > -        req->status = NVME_INTERNAL_DEV_ERROR;
> > > -    }
> > > -
> > > -    if (req->qsg.nalloc) {
> > > -        qemu_sglist_destroy(&req->qsg);
> > > -    }
> > > -    if (req->iov.nalloc) {
> > > -        qemu_iovec_destroy(&req->iov);
> > > -    }
> > > +    trace_nvme_dev_rw_cb(nvme_cid(req), req->cmd.nsid);
> > >  
> > >      nvme_enqueue_req_completion(cq, req);
> > >  }
> > >  
> > > -static uint16_t nvme_flush(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > > -    NvmeRequest *req)
> > > +static void nvme_aio_cb(void *opaque, int ret)
> > >  {
> > > -    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> > > -         BLOCK_ACCT_FLUSH);
> > > -    req->aiocb = blk_aio_flush(n->conf.blk, nvme_rw_cb, req);
> > > +    NvmeAIO *aio = opaque;
> > > +    NvmeRequest *req = aio->req;
> > >  
> > > -    return NVME_NO_COMPLETE;
> > > -}
> > > +    BlockBackend *blk = aio->blk;
> > > +    BlockAcctCookie *acct = &aio->acct;
> > > +    BlockAcctStats *stats = blk_get_stats(blk);
> > >  
> > > -static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > > -    NvmeRequest *req)
> > > -{
> > > -    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> > > -    const uint8_t lba_index = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > > -    const uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
> > > -    uint64_t slba = le64_to_cpu(rw->slba);
> > > -    uint32_t nlb  = le16_to_cpu(rw->nlb) + 1;
> > > -    uint64_t offset = slba << data_shift;
> > > -    uint32_t count = nlb << data_shift;
> > > -
> > > -    if (unlikely(slba + nlb > ns->id_ns.nsze)) {
> > > -        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> > > -        return NVME_LBA_RANGE | NVME_DNR;
> > > -    }
> > > -
> > > -    block_acct_start(blk_get_stats(n->conf.blk), &req->acct, 0,
> > > -                     BLOCK_ACCT_WRITE);
> > > -    req->aiocb = blk_aio_pwrite_zeroes(n->conf.blk, offset, count,
> > > -                                        BDRV_REQ_MAY_UNMAP, nvme_rw_cb, req);
> > > -    return NVME_NO_COMPLETE;
> > > -}
> > > -
> > > -static uint16_t nvme_rw(NvmeCtrl *n, NvmeNamespace *ns, NvmeCmd *cmd,
> > > -    NvmeRequest *req)
> > > -{
> > > -    NvmeRwCmd *rw = (NvmeRwCmd *)cmd;
> > > -    uint32_t nlb  = le32_to_cpu(rw->nlb) + 1;
> > > -    uint64_t slba = le64_to_cpu(rw->slba);
> > > -    uint64_t prp1 = le64_to_cpu(rw->prp1);
> > > -    uint64_t prp2 = le64_to_cpu(rw->prp2);
> > > -
> > > -    uint8_t lba_index  = NVME_ID_NS_FLBAS_INDEX(ns->id_ns.flbas);
> > > -    uint8_t data_shift = ns->id_ns.lbaf[lba_index].ds;
> > > -    uint64_t data_size = (uint64_t)nlb << data_shift;
> > > -    uint64_t data_offset = slba << data_shift;
> > > -    int is_write = rw->opcode == NVME_CMD_WRITE ? 1 : 0;
> > > -    enum BlockAcctType acct = is_write ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> > > +    Error *local_err = NULL;
> > >  
> > > -    trace_nvme_dev_rw(is_write ? "write" : "read", nlb, data_size, slba);
> > > +    trace_nvme_dev_aio_cb(nvme_cid(req), aio, blk_name(blk), aio->offset,
> > > +        nvme_aio_opc_str(aio), req);
> > >  
> > > -    if (unlikely((slba + nlb) > ns->id_ns.nsze)) {
> > > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > > -        trace_nvme_dev_err_invalid_lba_range(slba, nlb, ns->id_ns.nsze);
> > > -        return NVME_LBA_RANGE | NVME_DNR;
> > > +    if (req) {
> > 
> > I wonder in which case the aio callback will be called without req.
> > Looking at the code it looks like that can't happen.
> > (NvmeAIO is created by nvme_aio_new and all its callers pass not null req)
> 
> Yeah, this is preparatory for a patchset I have where an AIO can be
> issued by the controller autonomously.
ok then.
> 
> > 
> > > +        QTAILQ_REMOVE(&req->aio_tailq, aio, tailq_entry);
> > >      }
> > >  
> > > -    if (nvme_map_prp(n, &req->qsg, &req->iov, prp1, prp2, data_size, req)) {
> > > -        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > > -        return NVME_INVALID_FIELD | NVME_DNR;
> > > -    }
> > > -
> > > -    if (req->qsg.nsg > 0) {
> > > -        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->qsg.size,
> > > -            acct);
> > > -
> > > -        req->aiocb = is_write ?
> > > -            dma_blk_write(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> > > -                          nvme_rw_cb, req) :
> > > -            dma_blk_read(n->conf.blk, &req->qsg, data_offset, BDRV_SECTOR_SIZE,
> > > -                         nvme_rw_cb, req);
> > > +    if (!ret) {
> > > +        block_acct_done(stats, acct);
> > >      } else {
> > > -        block_acct_start(blk_get_stats(n->conf.blk), &req->acct, req->iov.size,
> > > -            acct);
> > > +        block_acct_failed(stats, acct);
> > >  
> > > -        req->aiocb = is_write ?
> > > -            blk_aio_pwritev(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> > > -                            req) :
> > > -            blk_aio_preadv(n->conf.blk, data_offset, &req->iov, 0, nvme_rw_cb,
> > > -                           req);
> > > +        if (req) {
> > > +            uint16_t status;
> > > +
> > > +            switch (aio->opc) {
> > > +            case NVME_AIO_OPC_READ:
> > > +                status = NVME_UNRECOVERED_READ;
> > > +                break;
> > > +            case NVME_AIO_OPC_WRITE:
> > > +            case NVME_AIO_OPC_WRITE_ZEROES:
> > > +                status = NVME_WRITE_FAULT;
> > > +                break;
> > > +            default:
> > > +                status = NVME_INTERNAL_DEV_ERROR;
> > > +                break;
> > > +            }
> > > +
> > > +            trace_nvme_dev_err_aio(nvme_cid(req), aio, blk_name(blk),
> > > +                aio->offset, nvme_aio_opc_str(aio), req, status);
> > > +
> > > +            error_setg_errno(&local_err, -ret, "aio failed");
> > > +            error_report_err(local_err);
> > > +
> > > +            /*
> > > +             * An Internal Error trumps all other errors. For other errors,
> > > +             * only set the first error encountered. Any additional errors will
> > > +             * be recorded in the error information log page.
> > > +             */
> > > +            if (!req->status ||
> > > +                nvme_status_is_error(status, NVME_INTERNAL_DEV_ERROR)) {
> > > +                req->status = status;
> > > +            }
> > > +        }
> > > +    }
> > > +
> > > +    if (aio->cb) {
> > > +        aio->cb(aio, aio->cb_arg, ret);
> > > +    }
> > > +
> > > +    if (req && QTAILQ_EMPTY(&req->aio_tailq)) {
> > > +        if (req->cb) {
> > > +            req->cb(req, req->cb_arg);
> > > +        } else {
> > > +            NvmeSQueue *sq = req->sq;
> > > +            NvmeCtrl *n = sq->ctrl;
> > > +            NvmeCQueue *cq = n->cq[sq->cqid];
> > > +
> > > +            nvme_enqueue_req_completion(cq, req);
> > > +        }
> > >      }
> > >  
> > > +    nvme_aio_destroy(aio);
> > > +}
> > > +
> > > +static uint16_t nvme_flush(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > +{
> > > +    NvmeAIO *aio = g_new0(NvmeAIO, 1);
> > > +
> > > +    *aio = (NvmeAIO) {
> > > +        .blk = n->conf.blk,
> > > +        .req = req,
> > > +    };
> > > +
> > > +    nvme_req_register_aio(req, aio, NVME_AIO_OPC_FLUSH);
> > > +    nvme_aio(aio);
> > > +
> > > +    return NVME_NO_COMPLETE;
> > > +}
> > > +
> > > +static uint16_t nvme_write_zeros(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > +{
> > > +    NvmeAIO *aio;
> > > +
> > > +    NvmeNamespace *ns = req->ns;
> > > +    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> > > +
> > > +    int64_t offset;
> > > +    size_t count;
> > > +    uint16_t status;
> > > +
> > > +    req->slba = le64_to_cpu(rw->slba);
> > > +    req->nlb  = le16_to_cpu(rw->nlb) + 1;
> > > +
> > > +    trace_nvme_dev_write_zeros(nvme_cid(req), le32_to_cpu(cmd->nsid),
> > > +        req->slba, req->nlb);
> > > +
> > > +    status = nvme_check_bounds(n, req->slba, req->nlb, req);
> > > +    if (unlikely(status)) {
> > > +        block_acct_invalid(blk_get_stats(n->conf.blk), BLOCK_ACCT_WRITE);
> > > +        return status;
> > > +    }
> > 
> > This refactoring also should be in a separate patch.
> 
> Done.
> 
> > 
> > > +
> > > +    offset = req->slba << nvme_ns_lbads(ns);
> > > +    count = req->nlb << nvme_ns_lbads(ns);
> > > +
> > > +    aio = g_new0(NvmeAIO, 1);
> > > +
> > > +    *aio = (NvmeAIO) {
> > > +        .blk = n->conf.blk,
> > > +        .offset = offset,
> > > +        .len = count,
> > > +        .req = req,
> > > +    };
> > > +
> > > +    nvme_req_register_aio(req, aio, NVME_AIO_OPC_WRITE_ZEROES);
> > > +    nvme_aio(aio);
> > > +
> > > +    return NVME_NO_COMPLETE;
> > > +}
> > > +
> > > +static uint16_t nvme_rw(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > +{
> > > +    NvmeRwCmd *rw = (NvmeRwCmd *) cmd;
> > > +    NvmeNamespace *ns = req->ns;
> > > +    int status;
> > > +
> > > +    enum BlockAcctType acct =
> > > +        nvme_req_is_write(req) ? BLOCK_ACCT_WRITE : BLOCK_ACCT_READ;
> > > +
> > > +    req->nlb  = le16_to_cpu(rw->nlb) + 1;
> > > +    req->slba = le64_to_cpu(rw->slba);
> > > +
> > > +    trace_nvme_dev_rw(nvme_req_is_write(req) ? "write" : "read", req->nlb,
> > > +        req->nlb << nvme_ns_lbads(req->ns), req->slba);
> > > +
> > > +    status = nvme_check_rw(n, req);
> > > +    if (status) {
> > > +        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > > +        return status;
> > > +    }
> > > +
> > > +    status = nvme_map(n, cmd, req);
> > > +    if (status) {
> > > +        block_acct_invalid(blk_get_stats(n->conf.blk), acct);
> > > +        return status;
> > > +    }
> > > +
> > > +    nvme_rw_aio(n->conf.blk, req->slba << nvme_ns_lbads(ns), req);
> > > +    nvme_req_set_cb(req, nvme_rw_cb, NULL);
> > > +
> > >      return NVME_NO_COMPLETE;
> > >  }
> > >  
> > >  static uint16_t nvme_io_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  {
> > > -    NvmeNamespace *ns;
> > >      uint32_t nsid = le32_to_cpu(cmd->nsid);
> > >  
> > > +    trace_nvme_dev_io_cmd(nvme_cid(req), nsid, le16_to_cpu(req->sq->sqid),
> > > +        cmd->opcode);
> > > +
> > >      if (unlikely(nsid == 0 || nsid > n->num_namespaces)) {
> > >          trace_nvme_dev_err_invalid_ns(nsid, n->num_namespaces);
> > >          return NVME_INVALID_NSID | NVME_DNR;
> > >      }
> > >  
> > > -    ns = &n->namespaces[nsid - 1];
> > > +    req->ns = &n->namespaces[nsid - 1];
> > > +
> > >      switch (cmd->opcode) {
> > >      case NVME_CMD_FLUSH:
> > > -        return nvme_flush(n, ns, cmd, req);
> > > +        return nvme_flush(n, cmd, req);
> > >      case NVME_CMD_WRITE_ZEROS:
> > > -        return nvme_write_zeros(n, ns, cmd, req);
> > > +        return nvme_write_zeros(n, cmd, req);
> > >      case NVME_CMD_WRITE:
> > >      case NVME_CMD_READ:
> > > -        return nvme_rw(n, ns, cmd, req);
> > > +        return nvme_rw(n, cmd, req);
> > >      default:
> > >          trace_nvme_dev_err_invalid_opc(cmd->opcode);
> > >          return NVME_INVALID_OPCODE | NVME_DNR;
> > > @@ -612,6 +853,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      NvmeRequest *req, *next;
> > >      NvmeSQueue *sq;
> > >      NvmeCQueue *cq;
> > > +    NvmeAIO *aio;
> > >      uint16_t qid = le16_to_cpu(c->qid);
> > >  
> > >      if (unlikely(!qid || nvme_check_sqid(n, qid))) {
> > > @@ -624,8 +866,11 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      sq = n->sq[qid];
> > >      while (!QTAILQ_EMPTY(&sq->out_req_list)) {
> > >          req = QTAILQ_FIRST(&sq->out_req_list);
> > > -        assert(req->aiocb);
> > > -        blk_aio_cancel(req->aiocb);
> > > +        while (!QTAILQ_EMPTY(&req->aio_tailq)) {
> > > +            aio = QTAILQ_FIRST(&req->aio_tailq);
> > > +            assert(aio->aiocb);
> > > +            blk_aio_cancel(aio->aiocb);
> > > +        }
> > >      }
> > >      if (!nvme_check_cqid(n, sq->cqid)) {
> > >          cq = n->cq[sq->cqid];
> > > @@ -662,6 +907,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
> > >      QTAILQ_INIT(&sq->out_req_list);
> > >      for (i = 0; i < sq->size; i++) {
> > >          sq->io_req[i].sq = sq;
> > > +        QTAILQ_INIT(&(sq->io_req[i].aio_tailq));
> > >          QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
> > >      }
> > >      sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
> > > @@ -800,6 +1046,7 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >      uint32_t numdl, numdu;
> > >      uint64_t off, lpol, lpou;
> > >      size_t   len;
> > > +    uint16_t status;
> > >  
> > >      numdl = (dw10 >> 16);
> > >      numdu = (dw11 & 0xffff);
> > > @@ -815,6 +1062,11 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >  
> > >      trace_nvme_dev_get_log(nvme_cid(req), lid, lsp, rae, len, off);
> > >  
> > > +    status = nvme_check_mdts(n, len, req);
> > > +    if (status) {
> > > +        return status;
> > > +    }
> > > +
> > >      switch (lid) {
> > >      case NVME_LOG_ERROR_INFO:
> > >          if (!rae) {
> > > @@ -1348,7 +1600,7 @@ static void nvme_process_sq(void *opaque)
> > >          req = QTAILQ_FIRST(&sq->req_list);
> > >          QTAILQ_REMOVE(&sq->req_list, req, entry);
> > >          QTAILQ_INSERT_TAIL(&sq->out_req_list, req, entry);
> > > -        memset(&req->cqe, 0, sizeof(req->cqe));
> > > +
> > >          req->cqe.cid = cmd.cid;
> > >          memcpy(&req->cmd, &cmd, sizeof(NvmeCmd));
> > >  
> > > @@ -1928,6 +2180,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >      id->ieee[0] = 0x00;
> > >      id->ieee[1] = 0x02;
> > >      id->ieee[2] = 0xb3;
> > > +    id->mdts = params->mdts;
> > >      id->ver = cpu_to_le32(NVME_SPEC_VER);
> > >      id->oacs = cpu_to_le16(0);
> > >  
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index d27baa9d5391..3319f8edd7e1 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -8,7 +8,8 @@
> > >      DEFINE_PROP_UINT32("cmb_size_mb", _state, _props.cmb_size_mb, 0), \
> > >      DEFINE_PROP_UINT32("num_queues", _state, _props.num_queues, 64), \
> > >      DEFINE_PROP_UINT8("aerl", _state, _props.aerl, 3), \
> > > -    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64)
> > > +    DEFINE_PROP_UINT32("aer_max_queued", _state, _props.aer_max_queued, 64), \
> > > +    DEFINE_PROP_UINT8("mdts", _state, _props.mdts, 7)
> > >  
> > >  typedef struct NvmeParams {
> > >      char     *serial;
> > > @@ -16,6 +17,7 @@ typedef struct NvmeParams {
> > >      uint32_t cmb_size_mb;
> > >      uint8_t  aerl;
> > >      uint32_t aer_max_queued;
> > > +    uint8_t  mdts;
> > >  } NvmeParams;
> > >  
> > >  typedef struct NvmeAsyncEvent {
> > > @@ -23,17 +25,58 @@ typedef struct NvmeAsyncEvent {
> > >      NvmeAerResult result;
> > >  } NvmeAsyncEvent;
> > >  
> > > -typedef struct NvmeRequest {
> > > -    struct NvmeSQueue       *sq;
> > > -    BlockAIOCB              *aiocb;
> > > -    uint16_t                status;
> > > -    NvmeCqe                 cqe;
> > > -    BlockAcctCookie         acct;
> > > -    QEMUSGList              qsg;
> > > -    QEMUIOVector            iov;
> > > -    NvmeCmd                 cmd;
> > > -    QTAILQ_ENTRY(NvmeRequest)entry;
> > > -} NvmeRequest;
> > > +typedef struct NvmeRequest NvmeRequest;
> > > +typedef void NvmeRequestCompletionFunc(NvmeRequest *req, void *opaque);
> > > +
> > > +struct NvmeRequest {
> > > +    struct NvmeSQueue    *sq;
> > > +    struct NvmeNamespace *ns;
> > > +
> > > +    NvmeCqe  cqe;
> > > +    NvmeCmd  cmd;
> > > +    uint16_t status;
> > > +
> > > +    uint64_t slba;
> > > +    uint32_t nlb;
> > > +
> > > +    QEMUSGList   qsg;
> > > +    QEMUIOVector iov;
> > > +
> > > +    NvmeRequestCompletionFunc *cb;
> > > +    void                      *cb_arg;
> > > +
> > > +    QTAILQ_HEAD(, NvmeAIO)    aio_tailq;
> > > +    QTAILQ_ENTRY(NvmeRequest) entry;
> > > +};
> > > +
> > > +static inline void nvme_req_clear(NvmeRequest *req)
> > > +{
> > > +    req->ns = NULL;
> > > +    memset(&req->cqe, 0, sizeof(req->cqe));
> > > +    req->status = NVME_SUCCESS;
> > > +    req->slba = req->nlb = 0x0;
> > > +    req->cb = req->cb_arg = NULL;
> > > +
> > > +    if (req->qsg.sg) {
> > > +        qemu_sglist_destroy(&req->qsg);
> > > +    }
> > > +
> > > +    if (req->iov.iov) {
> > > +        qemu_iovec_destroy(&req->iov);
> > > +    }
> > > +}
> > > +
> > > +static inline void nvme_req_set_cb(NvmeRequest *req,
> > > +    NvmeRequestCompletionFunc *cb, void *cb_arg)
> > > +{
> > > +    req->cb = cb;
> > > +    req->cb_arg = cb_arg;
> > > +}
> > > +
> > > +static inline void nvme_req_clear_cb(NvmeRequest *req)
> > > +{
> > > +    req->cb = req->cb_arg = NULL;
> > > +}
> > >  
> > >  typedef struct NvmeSQueue {
> > >      struct NvmeCtrl *ctrl;
> > > @@ -85,6 +128,60 @@ static inline size_t nvme_ns_lbads_bytes(NvmeNamespace *ns)
> > >      return 1 << nvme_ns_lbads(ns);
> > >  }
> > >  
> > > +typedef enum NvmeAIOOp {
> > > +    NVME_AIO_OPC_NONE         = 0x0,
> > > +    NVME_AIO_OPC_FLUSH        = 0x1,
> > > +    NVME_AIO_OPC_READ         = 0x2,
> > > +    NVME_AIO_OPC_WRITE        = 0x3,
> > > +    NVME_AIO_OPC_WRITE_ZEROES = 0x4,
> > > +} NvmeAIOOp;
> > > +
> > > +typedef struct NvmeAIO NvmeAIO;
> > > +typedef void NvmeAIOCompletionFunc(NvmeAIO *aio, void *opaque, int ret);
> > > +
> > > +struct NvmeAIO {
> > > +    NvmeRequest *req;
> > > +
> > > +    NvmeAIOOp       opc;
> > > +    int64_t         offset;
> > > +    size_t          len;
> > > +    BlockBackend    *blk;
> > > +    BlockAIOCB      *aiocb;
> > > +    BlockAcctCookie acct;
> > > +
> > > +    NvmeAIOCompletionFunc *cb;
> > > +    void                  *cb_arg;
> > > +
> > > +    QEMUSGList   *qsg;
> > > +    QEMUIOVector *iov;
> > > +
> > > +    QTAILQ_ENTRY(NvmeAIO) tailq_entry;
> > > +};
> > > +
> > > +static inline const char *nvme_aio_opc_str(NvmeAIO *aio)
> > > +{
> > > +    switch (aio->opc) {
> > > +    case NVME_AIO_OPC_NONE:         return "NVME_AIO_OP_NONE";
> > > +    case NVME_AIO_OPC_FLUSH:        return "NVME_AIO_OP_FLUSH";
> > > +    case NVME_AIO_OPC_READ:         return "NVME_AIO_OP_READ";
> > > +    case NVME_AIO_OPC_WRITE:        return "NVME_AIO_OP_WRITE";
> > > +    case NVME_AIO_OPC_WRITE_ZEROES: return "NVME_AIO_OP_WRITE_ZEROES";
> > > +    default:                        return "NVME_AIO_OP_UNKNOWN";
> > > +    }
> > > +}
> > > +
> > > +static inline bool nvme_req_is_write(NvmeRequest *req)
> > > +{
> > > +    switch (req->cmd.opcode) {
> > > +    case NVME_CMD_WRITE:
> > > +    case NVME_CMD_WRITE_UNCOR:
> > > +    case NVME_CMD_WRITE_ZEROS:
> > > +        return true;
> > > +    default:
> > > +        return false;
> > > +    }
> > > +}
> > > +
> > >  #define TYPE_NVME "nvme"
> > >  #define NVME(obj) \
> > >          OBJECT_CHECK(NvmeCtrl, (obj), TYPE_NVME)
> > > @@ -139,10 +236,21 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > >  static inline uint16_t nvme_cid(NvmeRequest *req)
> > >  {
> > >      if (req) {
> > > -        return le16_to_cpu(req->cqe.cid);
> > > +        return le16_to_cpu(req->cmd.cid);
> > >      }
> > >  
> > >      return 0xffff;
> > >  }
> > >  
> > > +static inline bool nvme_status_is_error(uint16_t status, uint16_t err)
> > > +{
> > > +    /* strip DNR and MORE */
> > > +    return (status & 0xfff) == err;
> > > +}
> > > +
> > > +static inline NvmeCtrl *nvme_ctrl(NvmeRequest *req)
> > > +{
> > > +    return req->sq->ctrl;
> > > +}
> > > +
> > >  #endif /* HW_NVME_H */
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index 77aa0da99ee0..90a57fb6099a 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -34,7 +34,12 @@ nvme_dev_irq_pin(void) "pulsing IRQ pin"
> > >  nvme_dev_irq_masked(void) "IRQ is masked"
> > >  nvme_dev_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
> > >  nvme_dev_map_prp(uint16_t cid, uint8_t opc, uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "cid %"PRIu16" opc 0x%"PRIx8" trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
> > > +nvme_dev_req_register_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, uint64_t count, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" count %"PRIu64" opc \"%s\" req %p"
> > > +nvme_dev_aio_cb(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p"
> > > +nvme_dev_io_cmd(uint16_t cid, uint32_t nsid, uint16_t sqid, uint8_t opcode) "cid %"PRIu16" nsid %"PRIu32" sqid %"PRIu16" opc 0x%"PRIx8""
> > >  nvme_dev_rw(const char *verb, uint32_t blk_count, uint64_t byte_count, uint64_t lba) "%s %"PRIu32" blocks (%"PRIu64" bytes) from LBA %"PRIu64""
> > > +nvme_dev_rw_cb(uint16_t cid, uint32_t nsid) "cid %"PRIu16" nsid %"PRIu32""
> > > +nvme_dev_write_zeros(uint16_t cid, uint32_t nsid, uint64_t slba, uint32_t nlb) "cid %"PRIu16" nsid %"PRIu32" slba %"PRIu64" nlb %"PRIu32""
> > >  nvme_dev_create_sq(uint64_t addr, uint16_t sqid, uint16_t cqid, uint16_t qsize, uint16_t qflags) "create submission queue, addr=0x%"PRIx64", sqid=%"PRIu16", cqid=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16""
> > >  nvme_dev_create_cq(uint64_t addr, uint16_t cqid, uint16_t vector, uint16_t size, uint16_t qflags, int ien) "create completion queue, addr=0x%"PRIx64", cqid=%"PRIu16", vector=%"PRIu16", qsize=%"PRIu16", qflags=%"PRIu16", ien=%d"
> > >  nvme_dev_del_sq(uint16_t qid) "deleting submission queue sqid=%"PRIu16""
> > > @@ -75,6 +80,9 @@ nvme_dev_mmio_shutdown_set(void) "shutdown bit set"
> > >  nvme_dev_mmio_shutdown_cleared(void) "shutdown bit cleared"
> > >  
> > >  # nvme traces for error conditions
> > > +nvme_dev_err_mdts(uint16_t cid, size_t mdts, size_t len) "cid %"PRIu16" mdts %"PRIu64" len %"PRIu64""
> > > +nvme_dev_err_prinfo(uint16_t cid, uint16_t ctrl) "cid %"PRIu16" ctrl %"PRIu16""
> > > +nvme_dev_err_aio(uint16_t cid, void *aio, const char *blkname, uint64_t offset, const char *opc, void *req, uint16_t status) "cid %"PRIu16" aio %p blk \"%s\" offset %"PRIu64" opc \"%s\" req %p status 0x%"PRIx16""
> > >  nvme_dev_err_invalid_dma(void) "PRP/SGL is too small for transfer size"
> > >  nvme_dev_err_invalid_prplist_ent(uint64_t prplist) "PRP list entry is null or not page aligned: 0x%"PRIx64""
> > >  nvme_dev_err_invalid_prp2_align(uint64_t prp2) "PRP2 is not page aligned: 0x%"PRIx64""
> > 
> > 
> > 
> > The patch is large; I tried my best to spot issues, but I might have missed some.
> > Please split it as I pointed out.
> 
> Done!
> 
> > Overall I do like most of the changes.
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 
Thanks!

Best regards,
	Maxim Levitsky

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 10/26] nvme: add support for the get log page command
  2020-03-16  7:45         ` Klaus Birkelund Jensen
  2020-03-25 10:22           ` Maxim Levitsky
@ 2020-03-25 10:24           ` Maxim Levitsky
  1 sibling, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:24 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:45 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 11:35, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > Add support for the Get Log Page command and basic implementations of
> > > the mandatory Error Information, SMART / Health Information and Firmware
> > > Slot Information log pages.
> > > 
> > > In violation of the specification, the SMART / Health Information log
> > > page does not persist information over the lifetime of the controller
> > > because the device has no place to store such persistent state.
> > 
> > Yea, not the end of the world.
> > > 
> > > Note that the LPA field in the Identify Controller data structure
> > > intentionally has bit 0 cleared because there is no namespace specific
> > > information in the SMART / Health information log page.
> > 
> > Makes sense.
> > > 
> > > Required for compliance with NVMe revision 1.2.1. See NVM Express 1.2.1,
> > > Section 5.10 ("Get Log Page command").
> > > 
> > > Signed-off-by: Klaus Jensen <klaus.jensen@cnexlabs.com>
> > > ---
> > >  hw/block/nvme.c       | 122 +++++++++++++++++++++++++++++++++++++++++-
> > >  hw/block/nvme.h       |  10 ++++
> > >  hw/block/trace-events |   2 +
> > >  include/block/nvme.h  |   2 +-
> > >  4 files changed, 134 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index f72348344832..468c36918042 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -569,6 +569,123 @@ static uint16_t nvme_create_sq(NvmeCtrl *n, NvmeCmd *cmd)
> > >      return NVME_SUCCESS;
> > >  }
> > >  
> > > +static uint16_t nvme_smart_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > > +    uint64_t off, NvmeRequest *req)
> > > +{
> > > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > +    uint32_t nsid = le32_to_cpu(cmd->nsid);
> > > +
> > > +    uint32_t trans_len;
> > > +    time_t current_ms;
> > > +    uint64_t units_read = 0, units_written = 0, read_commands = 0,
> > > +        write_commands = 0;
> > > +    NvmeSmartLog smart;
> > > +    BlockAcctStats *s;
> > > +
> > > +    if (nsid && nsid != 0xffffffff) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    s = blk_get_stats(n->conf.blk);
> > > +
> > > +    units_read = s->nr_bytes[BLOCK_ACCT_READ] >> BDRV_SECTOR_BITS;
> > > +    units_written = s->nr_bytes[BLOCK_ACCT_WRITE] >> BDRV_SECTOR_BITS;
> > > +    read_commands = s->nr_ops[BLOCK_ACCT_READ];
> > > +    write_commands = s->nr_ops[BLOCK_ACCT_WRITE];
> > > +
> > > +    if (off > sizeof(smart)) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    trans_len = MIN(sizeof(smart) - off, buf_len);
> > > +
> > > +    memset(&smart, 0x0, sizeof(smart));
> > > +
> > > +    smart.data_units_read[0] = cpu_to_le64(units_read / 1000);
> > > +    smart.data_units_written[0] = cpu_to_le64(units_written / 1000);
> > > +    smart.host_read_commands[0] = cpu_to_le64(read_commands);
> > > +    smart.host_write_commands[0] = cpu_to_le64(write_commands);
> > > +
> > > +    smart.temperature[0] = n->temperature & 0xff;
> > > +    smart.temperature[1] = (n->temperature >> 8) & 0xff;
> > > +
> > > +    if ((n->temperature > n->features.temp_thresh_hi) ||
> > > +        (n->temperature < n->features.temp_thresh_low)) {
> > > +        smart.critical_warning |= NVME_SMART_TEMPERATURE;
> > > +    }
> > > +
> > > +    current_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > > +    smart.power_on_hours[0] = cpu_to_le64(
> > > +        (((current_ms - n->starttime_ms) / 1000) / 60) / 60);
> > > +
> > > +    return nvme_dma_read_prp(n, (uint8_t *) &smart + off, trans_len, prp1,
> > > +        prp2);
> > > +}
> > 
> > Looks OK.
> > > +
> > > +static uint16_t nvme_fw_log_info(NvmeCtrl *n, NvmeCmd *cmd, uint32_t buf_len,
> > > +    uint64_t off, NvmeRequest *req)
> > > +{
> > > +    uint32_t trans_len;
> > > +    uint64_t prp1 = le64_to_cpu(cmd->prp1);
> > > +    uint64_t prp2 = le64_to_cpu(cmd->prp2);
> > > +    NvmeFwSlotInfoLog fw_log;
> > > +
> > > +    if (off > sizeof(fw_log)) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +
> > > +    memset(&fw_log, 0, sizeof(NvmeFwSlotInfoLog));
> > > +
> > > +    trans_len = MIN(sizeof(fw_log) - off, buf_len);
> > > +
> > > +    return nvme_dma_read_prp(n, (uint8_t *) &fw_log + off, trans_len, prp1,
> > > +        prp2);
> > > +}
> > 
> > Looks OK
> > > +
> > > +static uint16_t nvme_get_log(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > > +{
> > > +    uint32_t dw10 = le32_to_cpu(cmd->cdw10);
> > > +    uint32_t dw11 = le32_to_cpu(cmd->cdw11);
> > > +    uint32_t dw12 = le32_to_cpu(cmd->cdw12);
> > > +    uint32_t dw13 = le32_to_cpu(cmd->cdw13);
> > > +    uint8_t  lid = dw10 & 0xff;
> > > +    uint8_t  rae = (dw10 >> 15) & 0x1;
> > > +    uint32_t numdl, numdu;
> > > +    uint64_t off, lpol, lpou;
> > > +    size_t   len;
> > > +
> > > +    numdl = (dw10 >> 16);
> > > +    numdu = (dw11 & 0xffff);
> > > +    lpol = dw12;
> > > +    lpou = dw13;
> > > +
> > > +    len = (((numdu << 16) | numdl) + 1) << 2;
> > > +    off = (lpou << 32ULL) | lpol;
> > > +
> > > +    if (off & 0x3) {
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > 
> > Good. 
> > Note that there are plenty of other places in the driver that don't honor
> > such tiny formal bits of the spec, like for instance checking for the reserved
> > bits in commands.
> 
> Yeah, I know. Do you think it's fair if we leave that for subsequent patches?
> It's not like it's breaking the device, but compliance is not complete.
I don't have a strong opinion on this one; I would just bump the spec version in the last patch.

> 
> > > +
> > > +    trace_nvme_dev_get_log(nvme_cid(req), lid, rae, len, off);
> > > +
> > > +    switch (lid) {
> > > +    case NVME_LOG_ERROR_INFO:
> > > +        if (off) {
> > > +            return NVME_INVALID_FIELD | NVME_DNR;
> > > +        }
> > 
> > I think you might want to memset the user-given buffer to zero:
> > 
> > "This is a 64-bit incrementing error count, indicating a unique identifier for this error.
> > The error count starts at 1h, is incremented for each unique error log entry, and is retained across
> > power off conditions. A value of 0h indicates an invalid entry; this value is used when there are
> > lost entries or when there are fewer errors than the maximum number of entries the controller
> > supports."
> 
> Good catch. Fixed!
> 
> > > +
> > > +        return NVME_SUCCESS;
> > > +    case NVME_LOG_SMART_INFO:
> > > +        return nvme_smart_info(n, cmd, len, off, req);
> > > +    case NVME_LOG_FW_SLOT_INFO:
> > > +        return nvme_fw_log_info(n, cmd, len, off, req);
> > > +    default:
> > > +        trace_nvme_dev_err_invalid_log_page(nvme_cid(req), lid);
> > > +        return NVME_INVALID_FIELD | NVME_DNR;
> > > +    }
> > > +}
> > 
> > 
> > > +
> > >  static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
> > >  {
> > >      n->cq[cq->cqid] = NULL;
> > > @@ -914,6 +1031,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          return nvme_del_sq(n, cmd);
> > >      case NVME_ADM_CMD_CREATE_SQ:
> > >          return nvme_create_sq(n, cmd);
> > > +    case NVME_ADM_CMD_GET_LOG_PAGE:
> > > +        return nvme_get_log(n, cmd, req);
> > >      case NVME_ADM_CMD_DELETE_CQ:
> > >          return nvme_del_cq(n, cmd);
> > >      case NVME_ADM_CMD_CREATE_CQ:
> > > @@ -1411,6 +1530,7 @@ static void nvme_init_state(NvmeCtrl *n)
> > >  
> > >      n->temperature = NVME_TEMPERATURE;
> > >      n->features.temp_thresh_hi = NVME_TEMPERATURE_WARNING;
> > > +    n->starttime_ms = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> > >  }
> > >  
> > >  static void nvme_init_cmb(NvmeCtrl *n, PCIDevice *pci_dev)
> > > @@ -1491,7 +1611,7 @@ static void nvme_init_ctrl(NvmeCtrl *n)
> > >       */
> > >      id->acl = 3;
> > >      id->frmw = 7 << 1;
> > > -    id->lpa = 1 << 0;
> > > +    id->lpa = 1 << 2;
> > >  
> > >      /* recommended default value (~70 C) */
> > >      id->wctemp = cpu_to_le16(NVME_TEMPERATURE_WARNING);
> > > diff --git a/hw/block/nvme.h b/hw/block/nvme.h
> > > index 1518f32557a3..89b0aafa02a2 100644
> > > --- a/hw/block/nvme.h
> > > +++ b/hw/block/nvme.h
> > > @@ -109,6 +109,7 @@ typedef struct NvmeCtrl {
> > >      uint64_t    host_timestamp;                 /* Timestamp sent by the host */
> > >      uint64_t    timestamp_set_qemu_clock_ms;    /* QEMU clock time */
> > >      uint16_t    temperature;
> > > +    uint64_t    starttime_ms;
> > >  
> > >      NvmeNamespace   *namespaces;
> > >      NvmeSQueue      **sq;
> > > @@ -124,4 +125,13 @@ static inline uint64_t nvme_ns_nlbas(NvmeCtrl *n, NvmeNamespace *ns)
> > >      return n->ns_size >> nvme_ns_lbads(ns);
> > >  }
> > >  
> > > +static inline uint16_t nvme_cid(NvmeRequest *req)
> > > +{
> > > +    if (req) {
> > > +        return le16_to_cpu(req->cqe.cid);
> > > +    }
> > > +
> > > +    return 0xffff;
> > > +}
> > 
> > I see that you added command ID reporting to trace events you added,
> > which makes sense.
> > I think it would be nice later to add it to existing trace events where it makes sense.
> > 
> 
> Exactly. I'm doing that as I encounter it, where it makes sense to have it
> in the patch.
OK, I don't mind.
> 
> > 
> > > +
> > >  #endif /* HW_NVME_H */
> > > diff --git a/hw/block/trace-events b/hw/block/trace-events
> > > index ade506ea2bb2..7da088479f39 100644
> > > --- a/hw/block/trace-events
> > > +++ b/hw/block/trace-events
> > > @@ -46,6 +46,7 @@ nvme_dev_getfeat_numq(int result) "get feature number of queues, result=%d"
> > >  nvme_dev_setfeat_numq(int reqcq, int reqsq, int gotcq, int gotsq) "requested cq_count=%d sq_count=%d, responding with cq_count=%d sq_count=%d"
> > >  nvme_dev_setfeat_timestamp(uint64_t ts) "set feature timestamp = 0x%"PRIx64""
> > >  nvme_dev_getfeat_timestamp(uint64_t ts) "get feature timestamp = 0x%"PRIx64""
> > > +nvme_dev_get_log(uint16_t cid, uint8_t lid, uint8_t rae, uint32_t len, uint64_t off) "cid %"PRIu16" lid 0x%"PRIx8" rae 0x%"PRIx8" len %"PRIu32" off %"PRIu64""
> > >  nvme_dev_mmio_intm_set(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask set, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> > >  nvme_dev_mmio_intm_clr(uint64_t data, uint64_t new_mask) "wrote MMIO, interrupt mask clr, data=0x%"PRIx64", new_mask=0x%"PRIx64""
> > >  nvme_dev_mmio_cfg(uint64_t data) "wrote MMIO, config controller config=0x%"PRIx64""
> > > @@ -85,6 +86,7 @@ nvme_dev_err_invalid_create_cq_qflags(uint16_t qflags) "failed creating completi
> > >  nvme_dev_err_invalid_identify_cns(uint16_t cns) "identify, invalid cns=0x%"PRIx16""
> > >  nvme_dev_err_invalid_getfeat(int dw10) "invalid get features, dw10=0x%"PRIx32""
> > >  nvme_dev_err_invalid_setfeat(uint32_t dw10) "invalid set features, dw10=0x%"PRIx32""
> > > +nvme_dev_err_invalid_log_page(uint16_t cid, uint16_t lid) "cid %"PRIu16" lid 0x%"PRIx16""
> > >  nvme_dev_err_startfail_cq(void) "nvme_start_ctrl failed because there are non-admin completion queues"
> > >  nvme_dev_err_startfail_sq(void) "nvme_start_ctrl failed because there are non-admin submission queues"
> > >  nvme_dev_err_startfail_nbarasq(void) "nvme_start_ctrl failed because the admin submission queue address is null"
> > > diff --git a/include/block/nvme.h b/include/block/nvme.h
> > > index ff31cb32117c..9a6055adeb61 100644
> > > --- a/include/block/nvme.h
> > > +++ b/include/block/nvme.h
> > > @@ -515,7 +515,7 @@ enum NvmeSmartWarn {
> > >      NVME_SMART_FAILED_VOLATILE_MEDIA  = 1 << 4,
> > >  };
> > >  
> > > -enum LogIdentifier {
> > > +enum NvmeLogIdentifier {
> > >      NVME_LOG_ERROR_INFO     = 0x01,
> > >      NVME_LOG_SMART_INFO     = 0x02,
> > >      NVME_LOG_FW_SLOT_INFO   = 0x03,
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 

Best regards,
	Maxim Levitsky
^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid
  2020-03-16  7:48         ` Klaus Birkelund Jensen
@ 2020-03-25 10:25           ` Maxim Levitsky
  0 siblings, 0 replies; 86+ messages in thread
From: Maxim Levitsky @ 2020-03-25 10:25 UTC (permalink / raw)
  To: Klaus Birkelund Jensen
  Cc: Kevin Wolf, Beata Michalska, qemu-block, Klaus Jensen,
	qemu-devel, Max Reitz, Keith Busch, Javier Gonzalez

On Mon, 2020-03-16 at 00:48 -0700, Klaus Birkelund Jensen wrote:
> On Feb 12 12:30, Maxim Levitsky wrote:
> > On Tue, 2020-02-04 at 10:51 +0100, Klaus Jensen wrote:
> > > 0xffff is not an allowed value for NCQR and NSQR in Set Features on
> > > Number of Queues.
> > > 
> > > Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
> > > ---
> > >  hw/block/nvme.c | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/hw/block/nvme.c b/hw/block/nvme.c
> > > index 30c5b3e7a67d..900732bb2f38 100644
> > > --- a/hw/block/nvme.c
> > > +++ b/hw/block/nvme.c
> > > @@ -1133,6 +1133,10 @@ static uint16_t nvme_set_feature(NvmeCtrl *n, NvmeCmd *cmd, NvmeRequest *req)
> > >          blk_set_enable_write_cache(n->conf.blk, dw11 & 1);
> > >          break;
> > >      case NVME_NUMBER_OF_QUEUES:
> > > +        if ((dw11 & 0xffff) == 0xffff || ((dw11 >> 16) & 0xffff) == 0xffff) {
> > > +            return NVME_INVALID_FIELD | NVME_DNR;
> > > +        }
> > 
> > Very minor nitpick: since this spec requirement is not obvious, a quote/reference to the spec
> > would be nice to have here. 
> > 
> 
> Added.
Thanks!
> 
> > > +
> > >          trace_nvme_dev_setfeat_numq((dw11 & 0xFFFF) + 1,
> > >              ((dw11 >> 16) & 0xFFFF) + 1, n->params.num_queues - 1,
> > >              n->params.num_queues - 1);
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 
> 
> 

Best regards,
	Maxim Levitsky
^ permalink raw reply	[flat|nested] 86+ messages in thread

end of thread, other threads:[~2020-03-25 10:31 UTC | newest]

Thread overview: 86+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CGME20200204095215eucas1p1bb0d5a3c183f7531d8b0e5e081f1ae6b@eucas1p1.samsung.com>
2020-02-04  9:51 ` [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces Klaus Jensen
     [not found]   ` <CGME20200204095216eucas1p2cb2b4772c04b92c97b0690c8e565234c@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 01/26] nvme: rename trace events to nvme_dev Klaus Jensen
2020-02-12  9:08       ` Maxim Levitsky
2020-02-12 13:08         ` Klaus Birkelund Jensen
2020-02-12 13:17           ` Maxim Levitsky
     [not found]   ` <CGME20200204095216eucas1p137a2adf666e82d490aefca96a269acd9@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 02/26] nvme: remove superfluous breaks Klaus Jensen
2020-02-12  9:09       ` Maxim Levitsky
     [not found]   ` <CGME20200204095217eucas1p1f3e1d113d5eaad4327de0158d1e480cb@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 03/26] nvme: move device parameters to separate struct Klaus Jensen
2020-02-12  9:12       ` Maxim Levitsky
     [not found]   ` <CGME20200204095218eucas1p25d4623d82b1b7db3e555f3b27ca19763@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 04/26] nvme: add missing fields in the identify data structures Klaus Jensen
2020-02-12  9:15       ` Maxim Levitsky
     [not found]   ` <CGME20200204095218eucas1p2400645e2400b3d4450386a46e71b9e9a@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 05/26] nvme: populate the mandatory subnqn and ver fields Klaus Jensen
2020-02-12  9:18       ` Maxim Levitsky
     [not found]   ` <CGME20200204095219eucas1p1a7d44c741e119939c60ff60b96c7652e@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 06/26] nvme: refactor nvme_addr_read Klaus Jensen
2020-02-12  9:23       ` Maxim Levitsky
     [not found]   ` <CGME20200204095219eucas1p1a7e88f8f4090988b3dee34d4d4bcc239@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 07/26] nvme: add support for the abort command Klaus Jensen
2020-02-12  9:25       ` Maxim Levitsky
     [not found]   ` <CGME20200204095220eucas1p186b0de598359750d49278e0226ae45fb@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 08/26] nvme: refactor device realization Klaus Jensen
2020-02-12  9:27       ` Maxim Levitsky
2020-03-16  7:43         ` Klaus Birkelund Jensen
2020-03-25 10:21           ` Maxim Levitsky
     [not found]   ` <CGME20200204095221eucas1p1d5b1c9578d79e6bcc5714976bbe7dc11@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 09/26] nvme: add temperature threshold feature Klaus Jensen
2020-02-12  9:31       ` Maxim Levitsky
2020-03-16  7:44         ` Klaus Birkelund Jensen
2020-03-25 10:21           ` Maxim Levitsky
     [not found]   ` <CGME20200204095221eucas1p216ca2452c4184eb06bff85cff3c6a82b@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 10/26] nvme: add support for the get log page command Klaus Jensen
2020-02-12  9:35       ` Maxim Levitsky
2020-03-16  7:45         ` Klaus Birkelund Jensen
2020-03-25 10:22           ` Maxim Levitsky
2020-03-25 10:24           ` Maxim Levitsky
     [not found]   ` <CGME20200204095222eucas1p2a2351bfc0930b3939927e485f1417e29@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 11/26] nvme: add support for the asynchronous event request command Klaus Jensen
2020-02-12 10:21       ` Maxim Levitsky
     [not found]   ` <CGME20200204095223eucas1p281b4ef7c8f4170d8a42da3b4aea9e166@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 12/26] nvme: add missing mandatory features Klaus Jensen
2020-02-12 10:27       ` Maxim Levitsky
2020-03-16  7:47         ` Klaus Birkelund Jensen
2020-03-25 10:22           ` Maxim Levitsky
     [not found]   ` <CGME20200204095223eucas1p2b24d674e4b201c13a5fffc6853520d9b@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 13/26] nvme: additional tracing Klaus Jensen
2020-02-12 10:28       ` Maxim Levitsky
     [not found]   ` <CGME20200204095224eucas1p10807239f5dc4aa809650c85186c426a8@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 14/26] nvme: make sure ncqr and nsqr is valid Klaus Jensen
2020-02-12 10:30       ` Maxim Levitsky
2020-03-16  7:48         ` Klaus Birkelund Jensen
2020-03-25 10:25           ` Maxim Levitsky
     [not found]   ` <CGME20200204095225eucas1p1e44b4de86afdf936e3c7f61359d529ce@eucas1p1.samsung.com>
2020-02-04  9:51     ` [PATCH v5 15/26] nvme: bump supported specification to 1.3 Klaus Jensen
2020-02-12 10:35       ` Maxim Levitsky
2020-03-16  7:50         ` Klaus Birkelund Jensen
2020-03-25 10:22           ` Maxim Levitsky
     [not found]   ` <CGME20200204095225eucas1p226336a91fb5460dddae5caa85964279f@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 16/26] nvme: refactor prp mapping Klaus Jensen
2020-02-12 11:44       ` Maxim Levitsky
2020-03-16  7:51         ` Klaus Birkelund Jensen
2020-03-25 10:23           ` Maxim Levitsky
     [not found]   ` <CGME20200204095226eucas1p2429f45a5e23fe6ed57dee293be5e1b44@eucas1p2.samsung.com>
2020-02-04  9:51     ` [PATCH v5 17/26] nvme: allow multiple aios per command Klaus Jensen
2020-02-12 11:48       ` Maxim Levitsky
2020-03-16  7:53         ` Klaus Birkelund Jensen
2020-03-25 10:24           ` Maxim Levitsky
     [not found]   ` <CGME20200204095227eucas1p2f23061d391e67f4d3bde8bab74d1e44b@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 18/26] nvme: use preallocated qsg/iov in nvme_dma_prp Klaus Jensen
2020-02-12 11:49       ` Maxim Levitsky
     [not found]   ` <CGME20200204095227eucas1p2d86cd6abcb66327dc112d58c83664139@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 19/26] pci: pass along the return value of dma_memory_rw Klaus Jensen
     [not found]   ` <CGME20200204095228eucas1p2878eb150a933bb196fe5ca10a0b76eaf@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 20/26] nvme: handle dma errors Klaus Jensen
2020-02-12 11:52       ` Maxim Levitsky
2020-03-16  7:53         ` Klaus Birkelund Jensen
2020-03-25 10:23           ` Maxim Levitsky
     [not found]   ` <CGME20200204095229eucas1p2b290e3603d73c129a4f6149805273705@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 21/26] nvme: add support for scatter gather lists Klaus Jensen
2020-02-12 12:07       ` Maxim Levitsky
2020-03-16  7:54         ` Klaus Birkelund Jensen
2020-03-25 10:24           ` Maxim Levitsky
     [not found]   ` <CGME20200204095230eucas1p27456c6c0ab3b688d2f891d0dff098821@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 22/26] nvme: support multiple namespaces Klaus Jensen
2020-02-04 16:31       ` Keith Busch
2020-02-06  7:27         ` Klaus Birkelund Jensen
2020-02-12 12:34       ` Maxim Levitsky
2020-03-16  7:55         ` Klaus Birkelund Jensen
2020-03-25 10:24           ` Maxim Levitsky
     [not found]   ` <CGME20200204095230eucas1p23f3105c4cab4aaec77a3dd42b8158c10@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 23/26] pci: allocate pci id for nvme Klaus Jensen
2020-02-12 12:36       ` Maxim Levitsky
     [not found]   ` <CGME20200204095231eucas1p21019b1d857fcda9d67950e7d01de6b6a@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 24/26] nvme: change controller pci id Klaus Jensen
2020-02-04 16:35       ` Keith Busch
2020-02-06  7:28         ` Klaus Birkelund Jensen
2020-02-12 12:37       ` Maxim Levitsky
     [not found]   ` <CGME20200204095231eucas1p1f2b78a655b1a217fe4f7006f79e37f86@eucas1p1.samsung.com>
2020-02-04  9:52     ` [PATCH v5 25/26] nvme: remove redundant NvmeCmd pointer parameter Klaus Jensen
2020-02-12 12:37       ` Maxim Levitsky
     [not found]   ` <CGME20200204095232eucas1p2b3264104447a42882f10edb06608ece5@eucas1p2.samsung.com>
2020-02-04  9:52     ` [PATCH v5 26/26] nvme: make lba data size configurable Klaus Jensen
2020-02-04 16:43       ` Keith Busch
2020-02-06  7:24         ` Klaus Birkelund Jensen
2020-02-12 12:39           ` Maxim Levitsky
2020-02-04 10:34   ` [PATCH v5 00/26] nvme: support NVMe v1.3d, SGLs and multiple namespaces no-reply
2020-02-04 16:47   ` Keith Busch
2020-02-06  7:29     ` Klaus Birkelund Jensen
