* [PATCH 0/2] hw/nvme: Add shadow doorbell buffer support
@ 2022-06-08  1:36 Jinhao Fan
  2022-06-08  1:36 ` [PATCH 1/2] hw/nvme: Implement " Jinhao Fan
  2022-06-08  1:36 ` [PATCH 2/2] hw/nvme: Add trace events for shadow doorbell buffer Jinhao Fan
  0 siblings, 2 replies; 22+ messages in thread
From: Jinhao Fan @ 2022-06-08  1:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, Jinhao Fan

This patch series adds shadow doorbell buffer support, introduced in NVMe
1.3, to the QEMU NVMe device. The Doorbell Buffer Config admin command is
implemented so that the guest can enable the shadow doorbell buffer. When
this feature is enabled, each SQ/CQ is associated with two buffers, i.e.,
a Shadow Doorbell buffer and an EventIdx buffer. According to the spec,
each queue's doorbell register is only updated when the Shadow Doorbell
buffer value changes from being less than or equal to the value of the
corresponding EventIdx buffer entry to being greater than that value.
Therefore, the number of MMIOs on the doorbell registers is greatly
reduced.

This patch is adapted from Huaicheng Li's patch[1] in 2018.

[1] https://patchwork.kernel.org/project/qemu-devel/patch/20180305194906.GA3630@gmail.com/
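The update rule above is the check that guest drivers perform before
ringing the real doorbell. A minimal sketch, modeled on the Linux
driver's nvme_dbbuf_need_event() (the helper name and values here are
illustrative, not part of this patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* After writing a new doorbell value into the shadow buffer, the guest
 * rings the real (MMIO) doorbell only if the value crossed the EventIdx,
 * i.e. it moved from <= event_idx to > event_idx.  Doing the
 * subtractions in uint16_t handles queue-index wrap-around. */
static bool nvme_dbbuf_need_event(uint16_t event_idx, uint16_t new_val,
                                  uint16_t old_val)
{
    return (uint16_t)(new_val - event_idx - 1) <
           (uint16_t)(new_val - old_val);
}
```

For example, moving the SQ tail from 3 to 4 with EventIdx at 3 crosses
the threshold and requires an MMIO; the same move with EventIdx at 10
does not, and the MMIO is skipped.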

IOPS comparison with FIO:

iodepth    1      2      4      8
  QEMU   25.1k  25.9k  24.5k  24.0k
 +dbbuf  29.1k  60.1k  99.8k  82.5k

MMIOs per IO, measured with perf-kvm:

iodepth    1      2      4      8
  QEMU   2.01   1.99   1.99   1.99
 +dbbuf  1.00   0.52   0.27   0.46

The tests were done on Ubuntu 22.04 with the 5.15.0-33 kernel, on an
Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz.

QEMU setup:

bin/x86_64-softmmu/qemu-system-x86_64 \
    -name "nvme-test" \
    -machine accel=kvm \
    -cpu host \
    -smp 4 \
    -m 8G \
    -daemonize \
    -device virtio-scsi-pci,id=scsi0 \
    -device scsi-hd,drive=hd0 \
    -drive file=$OSIMGF,if=none,aio=native,cache=none,format=qcow2,id=hd0,snapshot=on \
    -drive "id=nvm,if=none,file=null-co://,file.read-zeroes=on,format=raw" \
    -device nvme,serial=deadbeef,drive=nvm \
    -net user,hostfwd=tcp::8080-:22 \
    -net nic,model=virtio

FIO configuration:

[global]
ioengine=libaio
filename=/dev/nvme0n1
thread=1
group_reporting=1
direct=1
verify=0
time_based=1
ramp_time=0
runtime=30
;size=1G
;iodepth=1
rw=randread
bs=4k

[test]
numjobs=1

Jinhao Fan (2):
  hw/nvme: Implement shadow doorbell buffer support
  hw/nvme: Add trace events for shadow doorbell buffer

 hw/nvme/ctrl.c       | 100 +++++++++++++++++++++++++++++++++++++++++--
 hw/nvme/nvme.h       |   8 ++++
 hw/nvme/trace-events |   5 +++
 include/block/nvme.h |   2 +
 4 files changed, 112 insertions(+), 3 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-08  1:36 [PATCH 0/2] hw/nvme: Add shadow doorbell buffer support Jinhao Fan
@ 2022-06-08  1:36 ` Jinhao Fan
  2022-06-08 20:55   ` Klaus Jensen
  2022-06-08  1:36 ` [PATCH 2/2] hw/nvme: Add trace events for shadow doorbell buffer Jinhao Fan
  1 sibling, 1 reply; 22+ messages in thread
From: Jinhao Fan @ 2022-06-08  1:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, Jinhao Fan

Implement the Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
and the Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
command, the nvme_dbbuf_config function tries to associate each existing
SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
Queues created after the Doorbell Buffer Config command will have the
doorbell buffers associated with them when they are initialized.

In nvme_process_sq and nvme_post_cqes, proactively check for Shadow
Doorbell buffer changes instead of waiting for doorbell register changes.
This reduces the number of MMIOs.
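The buffer layout used by the patch follows Section 7.13 of the spec:
within each page, the SQ tail slot sits at offset 2 * QID * stride and
the CQ head slot at (2 * QID + 1) * stride, with stride = 4 << CAP.DSTRD.
A small sketch of that offset computation (helper names are illustrative):

```c
#include <stdint.h>

/* Doorbell stride in bytes: 4 << CAP.DSTRD (DSTRD = 0 gives 4 bytes). */
static uint32_t db_stride(uint8_t cap_dstrd)
{
    return 4u << cap_dstrd;
}

/* Offset of a queue's slot inside the shadow doorbell / EventIdx page. */
static uint64_t sq_slot_offset(uint16_t qid, uint32_t stride)
{
    return 2ull * qid * stride;          /* SQ tail: 2 * QID * stride */
}

static uint64_t cq_slot_offset(uint16_t qid, uint32_t stride)
{
    return (2ull * qid + 1) * stride;    /* CQ head: (2 * QID + 1) * stride */
}
```

With DSTRD = 0 (stride 4), I/O queue pair 1 uses bytes 8-11 for the SQ
tail and 12-15 for the CQ head, which matches the arithmetic in
nvme_dbbuf_config below.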

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c       | 95 ++++++++++++++++++++++++++++++++++++++++++--
 hw/nvme/nvme.h       |  8 ++++
 include/block/nvme.h |  2 +
 3 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 03760ddeae..d3f6c432df 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -223,6 +223,7 @@ static const uint32_t nvme_cse_acs[256] = {
     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
+    [NVME_ADM_CMD_DBBUF_CONFIG]     = NVME_CMD_EFF_CSUPP,
     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
 };
 
@@ -1304,6 +1305,12 @@ static inline void nvme_blk_write(BlockBackend *blk, int64_t offset,
     }
 }
 
+static void nvme_update_cq_head(NvmeCQueue *cq)
+{
+    pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
+            sizeof(cq->head));
+}
+
 static void nvme_post_cqes(void *opaque)
 {
     NvmeCQueue *cq = opaque;
@@ -1316,6 +1323,10 @@ static void nvme_post_cqes(void *opaque)
         NvmeSQueue *sq;
         hwaddr addr;
 
+        if (cq->cqid && n->dbbuf_enabled) {
+            nvme_update_cq_head(cq);
+        }
+
         if (nvme_cq_full(cq)) {
             break;
         }
@@ -4237,6 +4248,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
 static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
                          uint16_t sqid, uint16_t cqid, uint16_t size)
 {
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
     int i;
     NvmeCQueue *cq;
 
@@ -4256,6 +4268,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     }
     sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
 
+    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
+        sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
+        sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
+    }
+
     assert(n->cq[cqid]);
     cq = n->cq[cqid];
     QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
@@ -4599,6 +4616,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
                          uint16_t cqid, uint16_t vector, uint16_t size,
                          uint16_t irq_enabled)
 {
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
     int ret;
 
     if (msix_enabled(&n->parent_obj)) {
@@ -4615,6 +4633,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     cq->head = cq->tail = 0;
     QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
+    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
+        cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
+        cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
+    }
     n->cq[cqid] = cq;
     cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
 }
@@ -5767,6 +5789,43 @@ out:
     return status;
 }
 
+static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
+{
+    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
+    uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
+    uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
+    int i;
+
+    /* Address should be page aligned */
+    if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
+        return NVME_INVALID_FIELD | NVME_DNR;
+    }
+
+    /* Save shadow buffer base addr for use during queue creation */
+    n->dbbuf_dbs = dbs_addr;
+    n->dbbuf_eis = eis_addr;
+    n->dbbuf_enabled = true;
+
+    for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
+        NvmeSQueue *sq = n->sq[i];
+        NvmeCQueue *cq = n->cq[i];
+
+        if (sq) {
+            /* Submission queue tail pointer location, 2 * QID * stride */
+            sq->db_addr = dbs_addr + 2 * i * stride;
+            sq->ei_addr = eis_addr + 2 * i * stride;
+        }
+
+        if (cq) {
+            /* Completion queue head pointer location, (2 * QID + 1) * stride */
+            cq->db_addr = dbs_addr + (2 * i + 1) * stride;
+            cq->ei_addr = eis_addr + (2 * i + 1) * stride;
+        }
+    }
+
+    return NVME_SUCCESS;
+}
+
 static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
 {
     trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
@@ -5809,6 +5868,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
         return nvme_aer(n, req);
     case NVME_ADM_CMD_NS_ATTACHMENT:
         return nvme_ns_attachment(n, req);
+    case NVME_ADM_CMD_DBBUF_CONFIG:
+        return nvme_dbbuf_config(n, req);
     case NVME_ADM_CMD_FORMAT_NVM:
         return nvme_format(n, req);
     default:
@@ -5818,6 +5879,18 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
     return NVME_INVALID_OPCODE | NVME_DNR;
 }
 
+static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
+{
+    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
+                  sizeof(sq->tail));
+}
+
+static void nvme_update_sq_tail(NvmeSQueue *sq)
+{
+    pci_dma_read(&sq->ctrl->parent_obj, sq->db_addr, &sq->tail,
+                 sizeof(sq->tail));
+}
+
 static void nvme_process_sq(void *opaque)
 {
     NvmeSQueue *sq = opaque;
@@ -5829,6 +5902,10 @@ static void nvme_process_sq(void *opaque)
     NvmeCmd cmd;
     NvmeRequest *req;
 
+    if (sq->sqid && n->dbbuf_enabled) {
+        nvme_update_sq_tail(sq);
+    }
+
     while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
         addr = sq->dma_addr + sq->head * n->sqe_size;
         if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
@@ -5852,6 +5929,11 @@ static void nvme_process_sq(void *opaque)
             req->status = status;
             nvme_enqueue_req_completion(cq, req);
         }
+
+        if (sq->sqid && n->dbbuf_enabled) {
+            nvme_update_sq_eventidx(sq);
+            nvme_update_sq_tail(sq);
+        }
     }
 }
 
@@ -5889,6 +5971,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
     n->aer_queued = 0;
     n->outstanding_aers = 0;
     n->qs_created = false;
+    n->dbbuf_dbs = 0;
+    n->dbbuf_eis = 0;
+    n->dbbuf_enabled = false;
 }
 
 static void nvme_ctrl_shutdown(NvmeCtrl *n)
@@ -6397,7 +6482,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
         start_sqs = nvme_cq_full(cq) ? 1 : 0;
-        cq->head = new_head;
+        if (!cq->cqid || !n->dbbuf_enabled) {
+            cq->head = new_head;
+        }
         if (start_sqs) {
             NvmeSQueue *sq;
             QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
@@ -6454,7 +6541,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
 
-        sq->tail = new_tail;
+        if (!sq->sqid || !n->dbbuf_enabled) {
+            sq->tail = new_tail;
+        }
         timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
     }
 }
@@ -6733,7 +6822,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
 
     id->mdts = n->params.mdts;
     id->ver = cpu_to_le32(NVME_SPEC_VER);
-    id->oacs = cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
+    id->oacs = cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
     id->cntrltype = 0x1;
 
     /*
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 6773819325..4452e4b1bf 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -334,6 +334,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc)
     case NVME_ADM_CMD_GET_FEATURES:     return "NVME_ADM_CMD_GET_FEATURES";
     case NVME_ADM_CMD_ASYNC_EV_REQ:     return "NVME_ADM_CMD_ASYNC_EV_REQ";
     case NVME_ADM_CMD_NS_ATTACHMENT:    return "NVME_ADM_CMD_NS_ATTACHMENT";
+    case NVME_ADM_CMD_DBBUF_CONFIG:     return "NVME_ADM_CMD_DBBUF_CONFIG";
     case NVME_ADM_CMD_FORMAT_NVM:       return "NVME_ADM_CMD_FORMAT_NVM";
     default:                            return "NVME_ADM_CMD_UNKNOWN";
     }
@@ -365,6 +366,8 @@ typedef struct NvmeSQueue {
     uint32_t    tail;
     uint32_t    size;
     uint64_t    dma_addr;
+    uint64_t    db_addr;
+    uint64_t    ei_addr;
     QEMUTimer   *timer;
     NvmeRequest *io_req;
     QTAILQ_HEAD(, NvmeRequest) req_list;
@@ -382,6 +385,8 @@ typedef struct NvmeCQueue {
     uint32_t    vector;
     uint32_t    size;
     uint64_t    dma_addr;
+    uint64_t    db_addr;
+    uint64_t    ei_addr;
     QEMUTimer   *timer;
     QTAILQ_HEAD(, NvmeSQueue) sq_list;
     QTAILQ_HEAD(, NvmeRequest) req_list;
@@ -432,6 +437,9 @@ typedef struct NvmeCtrl {
     uint64_t    starttime_ms;
     uint16_t    temperature;
     uint8_t     smart_critical_warning;
+    uint64_t    dbbuf_dbs;
+    uint64_t    dbbuf_eis;
+    bool        dbbuf_enabled;
 
     struct {
         MemoryRegion mem;
diff --git a/include/block/nvme.h b/include/block/nvme.h
index 3737351cc8..5b522d7b0e 100644
--- a/include/block/nvme.h
+++ b/include/block/nvme.h
@@ -595,6 +595,7 @@ enum NvmeAdminCommands {
     NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
     NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
     NVME_ADM_CMD_NS_ATTACHMENT  = 0x15,
+    NVME_ADM_CMD_DBBUF_CONFIG   = 0x7c,
     NVME_ADM_CMD_FORMAT_NVM     = 0x80,
     NVME_ADM_CMD_SECURITY_SEND  = 0x81,
     NVME_ADM_CMD_SECURITY_RECV  = 0x82,
@@ -1134,6 +1135,7 @@ enum NvmeIdCtrlOacs {
     NVME_OACS_FORMAT    = 1 << 1,
     NVME_OACS_FW        = 1 << 2,
     NVME_OACS_NS_MGMT   = 1 << 3,
+    NVME_OACS_DBBUF     = 1 << 8,
 };
 
 enum NvmeIdCtrlOncs {
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH 2/2] hw/nvme: Add trace events for shadow doorbell buffer
  2022-06-08  1:36 [PATCH 0/2] hw/nvme: Add shadow doorbell buffer support Jinhao Fan
  2022-06-08  1:36 ` [PATCH 1/2] hw/nvme: Implement " Jinhao Fan
@ 2022-06-08  1:36 ` Jinhao Fan
  1 sibling, 0 replies; 22+ messages in thread
From: Jinhao Fan @ 2022-06-08  1:36 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, Jinhao Fan

When the shadow doorbell buffer is enabled, doorbell registers are lazily
updated. The actual queue head and tail pointers are stored in the Shadow
Doorbell buffers.

Add trace events for updates on the Shadow Doorbell buffers and EventIdx
buffers. Also add a trace event for the Doorbell Buffer Config command.

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c       | 5 +++++
 hw/nvme/trace-events | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index d3f6c432df..7df52d204d 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1309,6 +1309,7 @@ static void nvme_update_cq_head(NvmeCQueue *cq)
 {
     pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
             sizeof(cq->head));
+    trace_pci_nvme_shadow_doorbell_cq(cq->cqid, cq->head);
 }
 
 static void nvme_post_cqes(void *opaque)
@@ -5823,6 +5824,8 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
         }
     }
 
+    trace_pci_nvme_dbbuf_config(dbs_addr, eis_addr);
+
     return NVME_SUCCESS;
 }
 
@@ -5883,12 +5886,14 @@ static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
 {
     pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
                   sizeof(sq->tail));
+    trace_pci_nvme_eventidx_sq(sq->sqid, sq->tail);
 }
 
 static void nvme_update_sq_tail(NvmeSQueue *sq)
 {
     pci_dma_read(&sq->ctrl->parent_obj, sq->db_addr, &sq->tail,
                  sizeof(sq->tail));
+    trace_pci_nvme_shadow_doorbell_sq(sq->sqid, sq->tail);
 }
 
 static void nvme_process_sq(void *opaque)
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index ff1b458969..00ee42f475 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -3,6 +3,7 @@ pci_nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
 pci_nvme_irq_pin(void) "pulsing IRQ pin"
 pci_nvme_irq_masked(void) "IRQ is masked"
 pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
+pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64" eis_addr=0x%"PRIx64""
 pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
 pci_nvme_map_addr_cmb(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
 pci_nvme_map_prp(uint64_t trans_len, uint32_t len, uint64_t prp1, uint64_t prp2, int num_prps) "trans_len %"PRIu64" len %"PRIu32" prp1 0x%"PRIx64" prp2 0x%"PRIx64" num_prps %d"
@@ -81,6 +82,8 @@ pci_nvme_enqueue_event_noqueue(int queued) "queued %d"
 pci_nvme_enqueue_event_masked(uint8_t typ) "type 0x%"PRIx8""
 pci_nvme_no_outstanding_aers(void) "ignoring event; no outstanding AERs"
 pci_nvme_enqueue_req_completion(uint16_t cid, uint16_t cqid, uint32_t dw0, uint32_t dw1, uint16_t status) "cid %"PRIu16" cqid %"PRIu16" dw0 0x%"PRIx32" dw1 0x%"PRIx32" status 0x%"PRIx16""
+pci_nvme_eventidx_cq(uint16_t cqid, uint16_t new_eventidx) "cqid %"PRIu16" new_eventidx %"PRIu16""
+pci_nvme_eventidx_sq(uint16_t sqid, uint16_t new_eventidx) "sqid %"PRIu16" new_eventidx %"PRIu16""
 pci_nvme_mmio_read(uint64_t addr, unsigned size) "addr 0x%"PRIx64" size %d"
 pci_nvme_mmio_write(uint64_t addr, uint64_t data, unsigned size) "addr 0x%"PRIx64" data 0x%"PRIx64" size %d"
 pci_nvme_mmio_doorbell_cq(uint16_t cqid, uint16_t new_head) "cqid %"PRIu16" new_head %"PRIu16""
@@ -97,6 +100,8 @@ pci_nvme_mmio_start_success(void) "setting controller enable bit succeeded"
 pci_nvme_mmio_stopped(void) "cleared controller enable bit"
 pci_nvme_mmio_shutdown_set(void) "shutdown bit set"
 pci_nvme_mmio_shutdown_cleared(void) "shutdown bit cleared"
+pci_nvme_shadow_doorbell_cq(uint16_t cqid, uint16_t new_shadow_doorbell) "cqid %"PRIu16" new_shadow_doorbell %"PRIu16""
+pci_nvme_shadow_doorbell_sq(uint16_t sqid, uint16_t new_shadow_doorbell) "sqid %"PRIu16" new_shadow_doorbell %"PRIu16""
 pci_nvme_open_zone(uint64_t slba, uint32_t zone_idx, int all) "open zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
 pci_nvme_close_zone(uint64_t slba, uint32_t zone_idx, int all) "close zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
 pci_nvme_finish_zone(uint64_t slba, uint32_t zone_idx, int all) "finish zone, slba=%"PRIu64", idx=%"PRIu32", all=%"PRIi32""
-- 
2.25.1



^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-08  1:36 ` [PATCH 1/2] hw/nvme: Implement " Jinhao Fan
@ 2022-06-08 20:55   ` Klaus Jensen
  2022-06-09  1:49     ` Jinhao Fan
  2022-06-09 14:29     ` Keith Busch
  0 siblings, 2 replies; 22+ messages in thread
From: Klaus Jensen @ 2022-06-08 20:55 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch


On Jun  8 09:36, Jinhao Fan wrote:
> Implement the Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
> and the Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
> in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
> command, the nvme_dbbuf_config function tries to associate each existing
> SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
> Queues created after the Doorbell Buffer Config command will have the
> doorbell buffers associated with them when they are initialized.
> 
> In nvme_process_sq and nvme_post_cqes, proactively check for Shadow
> Doorbell buffer changes instead of waiting for doorbell register changes.
> This reduces the number of MMIOs.
> 
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c       | 95 ++++++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h       |  8 ++++
>  include/block/nvme.h |  2 +
>  3 files changed, 102 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 03760ddeae..d3f6c432df 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -223,6 +223,7 @@ static const uint32_t nvme_cse_acs[256] = {
>      [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
>      [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
>      [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
> +    [NVME_ADM_CMD_DBBUF_CONFIG]     = NVME_CMD_EFF_CSUPP,
>      [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
>  };
>  
> @@ -1304,6 +1305,12 @@ static inline void nvme_blk_write(BlockBackend *blk, int64_t offset,
>      }
>  }
>  
> +static void nvme_update_cq_head(NvmeCQueue *cq)
> +{
> +    pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
> +            sizeof(cq->head));
> +}
> +
>  static void nvme_post_cqes(void *opaque)
>  {
>      NvmeCQueue *cq = opaque;
> @@ -1316,6 +1323,10 @@ static void nvme_post_cqes(void *opaque)
>          NvmeSQueue *sq;
>          hwaddr addr;
>  
> +        if (cq->cqid && n->dbbuf_enabled) {
> +            nvme_update_cq_head(cq);

Shouldn't we update the cq head eventidx here (prior to reading the
doorbell buffer)? Like we do for the sq tail?

> +        }
> +
>          if (nvme_cq_full(cq)) {
>              break;
>          }
> @@ -4237,6 +4248,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
>  static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>                           uint16_t sqid, uint16_t cqid, uint16_t size)
>  {
> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>      int i;
>      NvmeCQueue *cq;
>  
> @@ -4256,6 +4268,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>      }
>      sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
>  
> +    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
> +        sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
> +        sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
> +    }
> +
>      assert(n->cq[cqid]);
>      cq = n->cq[cqid];
>      QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
> @@ -4599,6 +4616,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>                           uint16_t cqid, uint16_t vector, uint16_t size,
>                           uint16_t irq_enabled)
>  {
> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>      int ret;
>  
>      if (msix_enabled(&n->parent_obj)) {
> @@ -4615,6 +4633,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>      cq->head = cq->tail = 0;
>      QTAILQ_INIT(&cq->req_list);
>      QTAILQ_INIT(&cq->sq_list);
> +    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
> +        cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
> +        cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
> +    }
>      n->cq[cqid] = cq;
>      cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
>  }
> @@ -5767,6 +5789,43 @@ out:
>      return status;
>  }
>  
> +static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
> +{
> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
> +    uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
> +    uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
> +    int i;
> +
> +    /* Address should be page aligned */
> +    if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
> +        return NVME_INVALID_FIELD | NVME_DNR;
> +    }
> +
> +    /* Save shadow buffer base addr for use during queue creation */
> +    n->dbbuf_dbs = dbs_addr;
> +    n->dbbuf_eis = eis_addr;
> +    n->dbbuf_enabled = true;
> +
> +    for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
> +        NvmeSQueue *sq = n->sq[i];
> +        NvmeCQueue *cq = n->cq[i];
> +
> +        if (sq) {
> +            /* Submission queue tail pointer location, 2 * QID * stride */
> +            sq->db_addr = dbs_addr + 2 * i * stride;
> +            sq->ei_addr = eis_addr + 2 * i * stride;
> +        }
> +
> +        if (cq) {
> +            /* Completion queue head pointer location, (2 * QID + 1) * stride */
> +            cq->db_addr = dbs_addr + (2 * i + 1) * stride;
> +            cq->ei_addr = eis_addr + (2 * i + 1) * stride;
> +        }
> +    }

Why no love for the admin queue? :)

You are special-casing the admin queue below in process_sq() and
process_db(), as well as above in post_cqes(). As I'm reading the spec,
I do not see why the Admin queue should be treated differently wrt.
doorbell buffer configuration. Could this be a left-over from the
behavior in the original Google extension (prior to going into NVMe)?

I peeked in to the kernel and it looks like it doesnt enable doorbell
buffers for admin queue, only for subsequently created I/O queues.

Keith, is this a bug in the kernel? If the code here would expect the
doorbell buffer to be updated for the admin queue as well, would we
break?

> +
> +    return NVME_SUCCESS;
> +}
> +
>  static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>  {
>      trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
> @@ -5809,6 +5868,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>          return nvme_aer(n, req);
>      case NVME_ADM_CMD_NS_ATTACHMENT:
>          return nvme_ns_attachment(n, req);
> +    case NVME_ADM_CMD_DBBUF_CONFIG:
> +        return nvme_dbbuf_config(n, req);
>      case NVME_ADM_CMD_FORMAT_NVM:
>          return nvme_format(n, req);
>      default:
> @@ -5818,6 +5879,18 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>      return NVME_INVALID_OPCODE | NVME_DNR;
>  }
>  
> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
> +{
> +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
> +                  sizeof(sq->tail));
> +}
> +
> +static void nvme_update_sq_tail(NvmeSQueue *sq)
> +{
> +    pci_dma_read(&sq->ctrl->parent_obj, sq->db_addr, &sq->tail,
> +                 sizeof(sq->tail));
> +}
> +
>  static void nvme_process_sq(void *opaque)
>  {
>      NvmeSQueue *sq = opaque;
> @@ -5829,6 +5902,10 @@ static void nvme_process_sq(void *opaque)
>      NvmeCmd cmd;
>      NvmeRequest *req;
>  
> +    if (sq->sqid && n->dbbuf_enabled) {
> +        nvme_update_sq_tail(sq);
> +    }
> +
>      while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
>          addr = sq->dma_addr + sq->head * n->sqe_size;
>          if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
> @@ -5852,6 +5929,11 @@ static void nvme_process_sq(void *opaque)
>              req->status = status;
>              nvme_enqueue_req_completion(cq, req);
>          }
> +
> +        if (sq->sqid && n->dbbuf_enabled) {
> +            nvme_update_sq_eventidx(sq);
> +            nvme_update_sq_tail(sq);
> +        }
>      }
>  }
>  
> @@ -5889,6 +5971,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
>      n->aer_queued = 0;
>      n->outstanding_aers = 0;
>      n->qs_created = false;
> +    n->dbbuf_dbs = 0;
> +    n->dbbuf_eis = 0;
> +    n->dbbuf_enabled = false;
>  }
>  
>  static void nvme_ctrl_shutdown(NvmeCtrl *n)
> @@ -6397,7 +6482,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>  
>          start_sqs = nvme_cq_full(cq) ? 1 : 0;
> -        cq->head = new_head;
> +        if (!cq->cqid || !n->dbbuf_enabled) {
> +            cq->head = new_head;
> +        }
>          if (start_sqs) {
>              NvmeSQueue *sq;
>              QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
> @@ -6454,7 +6541,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>  
>          trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
>  
> -        sq->tail = new_tail;
> +        if (!sq->sqid || !n->dbbuf_enabled) {
> +            sq->tail = new_tail;
> +        }
>          timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>      }
>  }
> @@ -6733,7 +6822,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>  
>      id->mdts = n->params.mdts;
>      id->ver = cpu_to_le32(NVME_SPEC_VER);
> -    id->oacs = cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
> +    id->oacs = cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
>      id->cntrltype = 0x1;
>  
>      /*
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 6773819325..4452e4b1bf 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -334,6 +334,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc)
>      case NVME_ADM_CMD_GET_FEATURES:     return "NVME_ADM_CMD_GET_FEATURES";
>      case NVME_ADM_CMD_ASYNC_EV_REQ:     return "NVME_ADM_CMD_ASYNC_EV_REQ";
>      case NVME_ADM_CMD_NS_ATTACHMENT:    return "NVME_ADM_CMD_NS_ATTACHMENT";
> +    case NVME_ADM_CMD_DBBUF_CONFIG:     return "NVME_ADM_CMD_DBBUF_CONFIG";
>      case NVME_ADM_CMD_FORMAT_NVM:       return "NVME_ADM_CMD_FORMAT_NVM";
>      default:                            return "NVME_ADM_CMD_UNKNOWN";
>      }
> @@ -365,6 +366,8 @@ typedef struct NvmeSQueue {
>      uint32_t    tail;
>      uint32_t    size;
>      uint64_t    dma_addr;
> +    uint64_t    db_addr;
> +    uint64_t    ei_addr;
>      QEMUTimer   *timer;
>      NvmeRequest *io_req;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
> @@ -382,6 +385,8 @@ typedef struct NvmeCQueue {
>      uint32_t    vector;
>      uint32_t    size;
>      uint64_t    dma_addr;
> +    uint64_t    db_addr;
> +    uint64_t    ei_addr;
>      QEMUTimer   *timer;
>      QTAILQ_HEAD(, NvmeSQueue) sq_list;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
> @@ -432,6 +437,9 @@ typedef struct NvmeCtrl {
>      uint64_t    starttime_ms;
>      uint16_t    temperature;
>      uint8_t     smart_critical_warning;
> +    uint64_t    dbbuf_dbs;
> +    uint64_t    dbbuf_eis;
> +    bool        dbbuf_enabled;
>  
>      struct {
>          MemoryRegion mem;
> diff --git a/include/block/nvme.h b/include/block/nvme.h
> index 3737351cc8..5b522d7b0e 100644
> --- a/include/block/nvme.h
> +++ b/include/block/nvme.h
> @@ -595,6 +595,7 @@ enum NvmeAdminCommands {
>      NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
>      NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
>      NVME_ADM_CMD_NS_ATTACHMENT  = 0x15,
> +    NVME_ADM_CMD_DBBUF_CONFIG   = 0x7c,
>      NVME_ADM_CMD_FORMAT_NVM     = 0x80,
>      NVME_ADM_CMD_SECURITY_SEND  = 0x81,
>      NVME_ADM_CMD_SECURITY_RECV  = 0x82,
> @@ -1134,6 +1135,7 @@ enum NvmeIdCtrlOacs {
>      NVME_OACS_FORMAT    = 1 << 1,
>      NVME_OACS_FW        = 1 << 2,
>      NVME_OACS_NS_MGMT   = 1 << 3,
> +    NVME_OACS_DBBUF     = 1 << 8,
>  };
>  
>  enum NvmeIdCtrlOncs {
> -- 
> 2.25.1
> 

-- 
One of us - No more doubt, silence or taboo about mental illness.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-08 20:55   ` Klaus Jensen
@ 2022-06-09  1:49     ` Jinhao Fan
  2022-06-09 14:29     ` Keith Busch
  1 sibling, 0 replies; 22+ messages in thread
From: Jinhao Fan @ 2022-06-09  1:49 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, Keith Busch




> On Jun 9, 2022, at 4:55 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> On Jun  8 09:36, Jinhao Fan wrote:
>> Implement the Doorbell Buffer Config command (Section 5.7 in NVMe Spec 1.3)
>> and the Shadow Doorbell buffer & EventIdx buffer handling logic (Section 7.13
>> in NVMe Spec 1.3). For queues created before the Doorbell Buffer Config
>> command, the nvme_dbbuf_config function tries to associate each existing
>> SQ and CQ with its Shadow Doorbell buffer and EventIdx buffer address.
>> Queues created after the Doorbell Buffer Config command will have the
>> doorbell buffers associated with them when they are initialized.
>> 
>> In nvme_process_sq and nvme_post_cqes, proactively check for Shadow
>> Doorbell buffer changes instead of waiting for doorbell register changes.
>> This reduces the number of MMIOs.
>> 
>> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
>> ---
>> hw/nvme/ctrl.c       | 95 ++++++++++++++++++++++++++++++++++++++++++--
>> hw/nvme/nvme.h       |  8 ++++
>> include/block/nvme.h |  2 +
>> 3 files changed, 102 insertions(+), 3 deletions(-)
>> 
>> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
>> index 03760ddeae..d3f6c432df 100644
>> --- a/hw/nvme/ctrl.c
>> +++ b/hw/nvme/ctrl.c
>> @@ -223,6 +223,7 @@ static const uint32_t nvme_cse_acs[256] = {
>>     [NVME_ADM_CMD_GET_FEATURES]     = NVME_CMD_EFF_CSUPP,
>>     [NVME_ADM_CMD_ASYNC_EV_REQ]     = NVME_CMD_EFF_CSUPP,
>>     [NVME_ADM_CMD_NS_ATTACHMENT]    = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_NIC,
>> +    [NVME_ADM_CMD_DBBUF_CONFIG]     = NVME_CMD_EFF_CSUPP,
>>     [NVME_ADM_CMD_FORMAT_NVM]       = NVME_CMD_EFF_CSUPP | NVME_CMD_EFF_LBCC,
>> };
>> 
>> @@ -1304,6 +1305,12 @@ static inline void nvme_blk_write(BlockBackend *blk, int64_t offset,
>>     }
>> }
>> 
>> +static void nvme_update_cq_head(NvmeCQueue *cq)
>> +{
>> +    pci_dma_read(&cq->ctrl->parent_obj, cq->db_addr, &cq->head,
>> +            sizeof(cq->head));
>> +}
>> +
>> static void nvme_post_cqes(void *opaque)
>> {
>>     NvmeCQueue *cq = opaque;
>> @@ -1316,6 +1323,10 @@ static void nvme_post_cqes(void *opaque)
>>         NvmeSQueue *sq;
>>         hwaddr addr;
>> 
>> +        if (cq->cqid && n->dbbuf_enabled) {
>> +            nvme_update_cq_head(cq);
> 
> Shouldn't we update the cq head eventidx here (prior to reading the
> doorbell buffer)? Like we do for the sq tail?

I’m not sure whether updating cq head eventidx is necessary. My understanding is that the purpose of updating eventidx is for the guest to notify the host about cq head or sq tail changes with MMIO. For sq tail this is necessary because other than MMIO, there is no way to trigger the host to check for sq tail updates (shadow doorbell here). For cq head this is different. The host will read cq head shadow doorbell every time it wants to post a cqe. Therefore, letting the guest notify the host on cq head changes seems unnecessary. 

Please correct me if I am missing something.
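For reference, the "crossing" rule the spec describes can be sketched as follows. This mirrors the convention used by the Linux driver's dbbuf helper; the function name and the modulo-2^16 arithmetic are illustrative, not taken from this patch:

```c
#include <stdint.h>

/* Returns 1 when moving a doorbell value from `old_val` to `new_val`
 * crosses the event index, i.e. the guest must also write the MMIO
 * doorbell register (NVMe 1.3, section 7.13). All arithmetic is modulo
 * 2^16 so the comparison keeps working across queue wrap-around. */
static int dbbuf_need_event(uint16_t event_idx, uint16_t new_val,
                            uint16_t old_val)
{
    return (uint16_t)(new_val - event_idx - 1) <
           (uint16_t)(new_val - old_val);
}
```

Under this rule an sq tail moving from at-or-below the eventidx to above it triggers exactly one MMIO, while a cq head that the device re-reads on demand never needs one, which is the argument above.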

> 
>> +        }
>> +
>>         if (nvme_cq_full(cq)) {
>>             break;
>>         }
>> @@ -4237,6 +4248,7 @@ static uint16_t nvme_del_sq(NvmeCtrl *n, NvmeRequest *req)
>> static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>>                          uint16_t sqid, uint16_t cqid, uint16_t size)
>> {
>> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>>     int i;
>>     NvmeCQueue *cq;
>> 
>> @@ -4256,6 +4268,11 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
>>     }
>>     sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
>> 
>> +    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
>> +        sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
>> +        sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
>> +    }
>> +
>>     assert(n->cq[cqid]);
>>     cq = n->cq[cqid];
>>     QTAILQ_INSERT_TAIL(&(cq->sq_list), sq, entry);
>> @@ -4599,6 +4616,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>>                          uint16_t cqid, uint16_t vector, uint16_t size,
>>                          uint16_t irq_enabled)
>> {
>> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>>     int ret;
>> 
>>     if (msix_enabled(&n->parent_obj)) {
>> @@ -4615,6 +4633,10 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>>     cq->head = cq->tail = 0;
>>     QTAILQ_INIT(&cq->req_list);
>>     QTAILQ_INIT(&cq->sq_list);
>> +    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
>> +        cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
>> +        cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
>> +    }
>>     n->cq[cqid] = cq;
>>     cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
>> }
>> @@ -5767,6 +5789,43 @@ out:
>>     return status;
>> }
>> 
>> +static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
>> +{
>> +    uint32_t stride = 4 << NVME_CAP_DSTRD(n->bar.cap);
>> +    uint64_t dbs_addr = le64_to_cpu(req->cmd.dptr.prp1);
>> +    uint64_t eis_addr = le64_to_cpu(req->cmd.dptr.prp2);
>> +    int i;
>> +
>> +    /* Address should be page aligned */
>> +    if (dbs_addr & (n->page_size - 1) || eis_addr & (n->page_size - 1)) {
>> +        return NVME_INVALID_FIELD | NVME_DNR;
>> +    }
>> +
>> +    /* Save shadow buffer base addr for use during queue creation */
>> +    n->dbbuf_dbs = dbs_addr;
>> +    n->dbbuf_eis = eis_addr;
>> +    n->dbbuf_enabled = true;
>> +
>> +    for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
>> +        NvmeSQueue *sq = n->sq[i];
>> +        NvmeCQueue *cq = n->cq[i];
>> +
>> +        if (sq) {
>> +            /* Submission queue tail pointer location, 2 * QID * stride */
>> +            sq->db_addr = dbs_addr + 2 * i * stride;
>> +            sq->ei_addr = eis_addr + 2 * i * stride;
>> +        }
>> +
>> +        if (cq) {
>> +            /* Completion queue head pointer location, (2 * QID + 1) * stride */
>> +            cq->db_addr = dbs_addr + (2 * i + 1) * stride;
>> +            cq->ei_addr = eis_addr + (2 * i + 1) * stride;
>> +        }
>> +    }
> 
> Why no love for the admin queue? :)
> 
> You are special-casing the admin queue below in process_sq() and
> process_db(), as well as above in post_cqes(). As I'm reading the spec,
> I do not see why the Admin queue should be treated differently wrt.
> doorbell buffer configuration. Could this be a left-over from the
> behavior in the original Google extension (prior to going into NVMe)?

Yes, this is “inherited” from the Google extension code. I also checked revision 1.3d of the spec and found that Table 64 even lists the addresses for the admin queue’s shadow doorbell buffer and eventidx buffer.
It seems we need to take the admin queue into account.
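For reference, the per-queue slot layout from Table 64 (admin queue included at QID 0) can be sketched like this; the helper names are made up for illustration:

```c
#include <stdint.h>

/* Byte offsets into the Shadow Doorbell / EventIdx pages for a queue,
 * per NVMe 1.3 Table 64. `stride` is 4 << CAP.DSTRD. */
static uint64_t sq_slot_off(uint16_t qid, uint32_t stride)
{
    return 2 * (uint64_t)qid * stride;          /* SQ tail slot */
}

static uint64_t cq_slot_off(uint16_t qid, uint32_t stride)
{
    return (2 * (uint64_t)qid + 1) * stride;    /* CQ head slot */
}
```

With the default stride of 4 (CAP.DSTRD = 0), the admin SQ tail sits at offset 0 and the admin CQ head at offset 4, ahead of the I/O queues, so nothing in the layout itself excludes the admin queue.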

> 
> I peeked into the kernel and it looks like it doesn't enable doorbell
> buffers for the admin queue, only for subsequently created I/O queues.
> 
> Keith, is this a bug in the kernel? If the code here would expect the
> doorbell buffer to be updated for the admin queue as well, would we
> break?
> 
>> +
>> +    return NVME_SUCCESS;
>> +}
>> +
>> static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>> {
>>     trace_pci_nvme_admin_cmd(nvme_cid(req), nvme_sqid(req), req->cmd.opcode,
>> @@ -5809,6 +5868,8 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>>         return nvme_aer(n, req);
>>     case NVME_ADM_CMD_NS_ATTACHMENT:
>>         return nvme_ns_attachment(n, req);
>> +    case NVME_ADM_CMD_DBBUF_CONFIG:
>> +        return nvme_dbbuf_config(n, req);
>>     case NVME_ADM_CMD_FORMAT_NVM:
>>         return nvme_format(n, req);
>>     default:
>> @@ -5818,6 +5879,18 @@ static uint16_t nvme_admin_cmd(NvmeCtrl *n, NvmeRequest *req)
>>     return NVME_INVALID_OPCODE | NVME_DNR;
>> }
>> 
>> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
>> +{
>> +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
>> +                  sizeof(sq->tail));
>> +}
>> +
>> +static void nvme_update_sq_tail(NvmeSQueue *sq)
>> +{
>> +    pci_dma_read(&sq->ctrl->parent_obj, sq->db_addr, &sq->tail,
>> +                 sizeof(sq->tail));
>> +}
>> +
>> static void nvme_process_sq(void *opaque)
>> {
>>     NvmeSQueue *sq = opaque;
>> @@ -5829,6 +5902,10 @@ static void nvme_process_sq(void *opaque)
>>     NvmeCmd cmd;
>>     NvmeRequest *req;
>> 
>> +    if (sq->sqid && n->dbbuf_enabled) {
>> +        nvme_update_sq_tail(sq);
>> +    }
>> +
>>     while (!(nvme_sq_empty(sq) || QTAILQ_EMPTY(&sq->req_list))) {
>>         addr = sq->dma_addr + sq->head * n->sqe_size;
>>         if (nvme_addr_read(n, addr, (void *)&cmd, sizeof(cmd))) {
>> @@ -5852,6 +5929,11 @@ static void nvme_process_sq(void *opaque)
>>             req->status = status;
>>             nvme_enqueue_req_completion(cq, req);
>>         }
>> +
>> +        if (sq->sqid && n->dbbuf_enabled) {
>> +            nvme_update_sq_eventidx(sq);
>> +            nvme_update_sq_tail(sq);
>> +        }
>>     }
>> }
>> 
>> @@ -5889,6 +5971,9 @@ static void nvme_ctrl_reset(NvmeCtrl *n)
>>     n->aer_queued = 0;
>>     n->outstanding_aers = 0;
>>     n->qs_created = false;
>> +    n->dbbuf_dbs = 0;
>> +    n->dbbuf_eis = 0;
>> +    n->dbbuf_enabled = false;
>> }
>> 
>> static void nvme_ctrl_shutdown(NvmeCtrl *n)
>> @@ -6397,7 +6482,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>>         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
>> 
>>         start_sqs = nvme_cq_full(cq) ? 1 : 0;
>> -        cq->head = new_head;
>> +        if (!cq->cqid || !n->dbbuf_enabled) {
>> +            cq->head = new_head;
>> +        }
>>         if (start_sqs) {
>>             NvmeSQueue *sq;
>>             QTAILQ_FOREACH(sq, &cq->sq_list, entry) {
>> @@ -6454,7 +6541,9 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>> 
>>         trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
>> 
>> -        sq->tail = new_tail;
>> +        if (!sq->sqid || !n->dbbuf_enabled) {
>> +            sq->tail = new_tail;
>> +        }
>>         timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>>     }
>> }
>> @@ -6733,7 +6822,7 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
>> 
>>     id->mdts = n->params.mdts;
>>     id->ver = cpu_to_le32(NVME_SPEC_VER);
>> -    id->oacs = cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT);
>> +    id->oacs = cpu_to_le16(NVME_OACS_NS_MGMT | NVME_OACS_FORMAT | NVME_OACS_DBBUF);
>>     id->cntrltype = 0x1;
>> 
>>     /*
>> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
>> index 6773819325..4452e4b1bf 100644
>> --- a/hw/nvme/nvme.h
>> +++ b/hw/nvme/nvme.h
>> @@ -334,6 +334,7 @@ static inline const char *nvme_adm_opc_str(uint8_t opc)
>>     case NVME_ADM_CMD_GET_FEATURES:     return "NVME_ADM_CMD_GET_FEATURES";
>>     case NVME_ADM_CMD_ASYNC_EV_REQ:     return "NVME_ADM_CMD_ASYNC_EV_REQ";
>>     case NVME_ADM_CMD_NS_ATTACHMENT:    return "NVME_ADM_CMD_NS_ATTACHMENT";
>> +    case NVME_ADM_CMD_DBBUF_CONFIG:     return "NVME_ADM_CMD_DBBUF_CONFIG";
>>     case NVME_ADM_CMD_FORMAT_NVM:       return "NVME_ADM_CMD_FORMAT_NVM";
>>     default:                            return "NVME_ADM_CMD_UNKNOWN";
>>     }
>> @@ -365,6 +366,8 @@ typedef struct NvmeSQueue {
>>     uint32_t    tail;
>>     uint32_t    size;
>>     uint64_t    dma_addr;
>> +    uint64_t    db_addr;
>> +    uint64_t    ei_addr;
>>     QEMUTimer   *timer;
>>     NvmeRequest *io_req;
>>     QTAILQ_HEAD(, NvmeRequest) req_list;
>> @@ -382,6 +385,8 @@ typedef struct NvmeCQueue {
>>     uint32_t    vector;
>>     uint32_t    size;
>>     uint64_t    dma_addr;
>> +    uint64_t    db_addr;
>> +    uint64_t    ei_addr;
>>     QEMUTimer   *timer;
>>     QTAILQ_HEAD(, NvmeSQueue) sq_list;
>>     QTAILQ_HEAD(, NvmeRequest) req_list;
>> @@ -432,6 +437,9 @@ typedef struct NvmeCtrl {
>>     uint64_t    starttime_ms;
>>     uint16_t    temperature;
>>     uint8_t     smart_critical_warning;
>> +    uint64_t    dbbuf_dbs;
>> +    uint64_t    dbbuf_eis;
>> +    bool        dbbuf_enabled;
>> 
>>     struct {
>>         MemoryRegion mem;
>> diff --git a/include/block/nvme.h b/include/block/nvme.h
>> index 3737351cc8..5b522d7b0e 100644
>> --- a/include/block/nvme.h
>> +++ b/include/block/nvme.h
>> @@ -595,6 +595,7 @@ enum NvmeAdminCommands {
>>     NVME_ADM_CMD_ACTIVATE_FW    = 0x10,
>>     NVME_ADM_CMD_DOWNLOAD_FW    = 0x11,
>>     NVME_ADM_CMD_NS_ATTACHMENT  = 0x15,
>> +    NVME_ADM_CMD_DBBUF_CONFIG   = 0x7c,
>>     NVME_ADM_CMD_FORMAT_NVM     = 0x80,
>>     NVME_ADM_CMD_SECURITY_SEND  = 0x81,
>>     NVME_ADM_CMD_SECURITY_RECV  = 0x82,
>> @@ -1134,6 +1135,7 @@ enum NvmeIdCtrlOacs {
>>     NVME_OACS_FORMAT    = 1 << 1,
>>     NVME_OACS_FW        = 1 << 2,
>>     NVME_OACS_NS_MGMT   = 1 << 3,
>> +    NVME_OACS_DBBUF     = 1 << 8,
>> };
>> 
>> enum NvmeIdCtrlOncs {
>> -- 
>> 2.25.1
>> 
> 




* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-08 20:55   ` Klaus Jensen
  2022-06-09  1:49     ` Jinhao Fan
@ 2022-06-09 14:29     ` Keith Busch
  2022-06-09 15:52       ` John Levon
  1 sibling, 1 reply; 22+ messages in thread
From: Keith Busch @ 2022-06-09 14:29 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Jinhao Fan, qemu-devel

On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> 
> Keith, is this a bug in the kernel? If the code here would expect the
> doorbell buffer to be updated for the admin queue as well, would we
> break?

The admin queue has to be created before the doorbell buffer can be set up, so
we can't rely on it. And this is a performance feature: admin commands have
never been considered performance-critical, so we decided not to support it,
even though the spec allows it.



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-09 14:29     ` Keith Busch
@ 2022-06-09 15:52       ` John Levon
  2022-06-09 17:27         ` Klaus Jensen
  0 siblings, 1 reply; 22+ messages in thread
From: John Levon @ 2022-06-09 15:52 UTC (permalink / raw)
  To: Keith Busch; +Cc: Klaus Jensen, Jinhao Fan, qemu-devel

On Thu, Jun 09, 2022 at 08:29:30AM -0600, Keith Busch wrote:

> On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> > 
> > Keith, is this a bug in the kernel? If the code here would expect the
> > doorbell buffer to be updated for the admin queue as well, would we
> > break?
> 
> The admin queue has to be created before db buffer can be set up, so we can't
> rely on it. And this is a performance feature. Admin commands have never been
> considered performant, so we decided not to consider it though the spec allows
> it.

It's not just unnecessary, but enabling shadow doorbells on admin queues will
actively break device implementations (such as SPDK), which have had to presume
that driver implementations don't use shadow doorbells for the admin queue.

regards
john



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-09 15:52       ` John Levon
@ 2022-06-09 17:27         ` Klaus Jensen
  2022-06-09 17:50           ` John Levon
  2022-06-12 11:40           ` Jinhao Fan
  0 siblings, 2 replies; 22+ messages in thread
From: Klaus Jensen @ 2022-06-09 17:27 UTC (permalink / raw)
  To: John Levon; +Cc: Keith Busch, Jinhao Fan, qemu-devel


On Jun  9 16:52, John Levon wrote:
> On Thu, Jun 09, 2022 at 08:29:30AM -0600, Keith Busch wrote:
> 
> > On Wed, Jun 08, 2022 at 10:55:30PM +0200, Klaus Jensen wrote:
> > > 
> > > Keith, is this a bug in the kernel? If the code here would expect the
> > > doorbell buffer to be updated for the admin queue as well, would we
> > > break?
> > 
> > The admin queue has to be created before db buffer can be set up, so we can't
> > rely on it. And this is a performance feature. Admin commands have never been
> > considered performant, so we decided not to consider it though the spec allows
> > it.

It's not really a question of whether or not the spec *allows* it. If
our device chooses to be compliant here, then it will read invalid
doorbell values if drivers don't update the admin doorbell buffer.

> 
> It's not just unnecessary, but enabling shadow doorbells on admin queues will
> actively break device implementations (such as SPDK), which have had to presume
> that driver implementations don't use shadow doorbells for the admin queue.
> 

I'm ok with following the consensus here, but we all agree that this is
a blatant spec violation that ended up manifesting itself down the
stack, right?

So... if QEMU wants to be compliant here, I guess we could ask the
kernel to introduce a quirk for *compliant* controllers. Now, THAT would
be a first! Not sure if I am being serious or not here ;)



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-09 17:27         ` Klaus Jensen
@ 2022-06-09 17:50           ` John Levon
  2022-06-12 11:40           ` Jinhao Fan
  1 sibling, 0 replies; 22+ messages in thread
From: John Levon @ 2022-06-09 17:50 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Keith Busch, Jinhao Fan, qemu-devel

On Thu, Jun 09, 2022 at 07:27:57PM +0200, Klaus Jensen wrote:

> > It's not just unnecessary, but enabling shadow doorbells on admin queues will
> > actively break device implementations (such as SPDK), which have had to presume
> > that driver implementations don't use shadow doorbells for the admin queue.
> 
> > I'm ok with following the consensus here, but we all agree that this is
> a blatant spec violation that ended up manifesting itself down the
> stack, right?

Yup. I've been meaning to try following up to get the spec updated to reflect
reality, but haven't made progress there yet.

> So... if QEMU wants to be compliant here, I guess we could ask the
> kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> be a first! Not sure if I am being serious or not here ;)

:)

regards
john



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-09 17:27         ` Klaus Jensen
  2022-06-09 17:50           ` John Levon
@ 2022-06-12 11:40           ` Jinhao Fan
  2022-06-13 21:15             ` Keith Busch
  1 sibling, 1 reply; 22+ messages in thread
From: Jinhao Fan @ 2022-06-12 11:40 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: John Levon, Keith Busch, qemu-devel


> On Jun 10, 2022, at 1:27 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> I'm ok with following the consensus here, but we all agree that this is
> a blatant spec violation that ended up manifesting itself down the
> stack, right?
> 
> So... if QEMU wants to be compliant here, I guess we could ask the
> kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> be a first! Not sure if I am being serious or not here ;)

Hi all,

Is this our final decision?

Thanks,
Jinhao Fan




* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-12 11:40           ` Jinhao Fan
@ 2022-06-13 21:15             ` Keith Busch
  2022-06-14  7:24               ` Jinhao Fan
  0 siblings, 1 reply; 22+ messages in thread
From: Keith Busch @ 2022-06-13 21:15 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: Klaus Jensen, John Levon, qemu-devel

On Sun, Jun 12, 2022 at 07:40:55PM +0800, Jinhao Fan wrote:
> 
> > On Jun 10, 2022, at 1:27 AM, Klaus Jensen <its@irrelevant.dk> wrote:
> > 
> > I'm ok with following the consensus here, but we all agree that this is
> > a blatant spec violation that ended up manifesting itself down the
> > stack, right?
> > 
> > So... if QEMU wants to be compliant here, I guess we could ask the
> > kernel to introduce a quirk for *compliant* controllers. Now, THAT would
> > be a first! Not sure if I am being serious or not here ;)
> 
> Hi all,
> 
> Is this our final decision?

What a mess...

The spec should have gone into more details on initializing the shadow and
event buffers if they really intended it to be run on a live queue.

Anyway, the following hack on top of your patch should allow the host to use
admin shadow queues, and also remain backward compatible for the "broken"
hosts, like Linux and SPDK.

---
diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index a0a9208c0f..03d84feecf 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4267,7 +4267,7 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
     }
     sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
 
-    if (sqid && n->dbbuf_dbs && n->dbbuf_eis) {
+    if (n->dbbuf_dbs && n->dbbuf_eis) {
         sq->db_addr = n->dbbuf_dbs + 2 * sqid * stride;
         sq->ei_addr = n->dbbuf_eis + 2 * sqid * stride;
     }
@@ -4632,7 +4632,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     cq->head = cq->tail = 0;
     QTAILQ_INIT(&cq->req_list);
     QTAILQ_INIT(&cq->sq_list);
-    if (cqid && n->dbbuf_dbs && n->dbbuf_eis) {
+    if (n->dbbuf_dbs && n->dbbuf_eis) {
         cq->db_addr = n->dbbuf_dbs + (2 * cqid + 1) * stride;
         cq->ei_addr = n->dbbuf_eis + (2 * cqid + 1) * stride;
     }
@@ -5805,7 +5805,7 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
     n->dbbuf_dbs = dbs_addr;
     n->dbbuf_eis = eis_addr;
 
-    for (i = 1; i < n->params.max_ioqpairs + 1; i++) {
+    for (i = 0; i < n->params.max_ioqpairs + 1; i++) {
         NvmeSQueue *sq = n->sq[i];
         NvmeCQueue *cq = n->cq[i];
 
@@ -5813,12 +5813,16 @@ static uint16_t nvme_dbbuf_config(NvmeCtrl *n, const NvmeRequest *req)
             /* Submission queue tail pointer location, 2 * QID * stride */
             sq->db_addr = dbs_addr + 2 * i * stride;
             sq->ei_addr = eis_addr + 2 * i * stride;
+            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
+                    sizeof(sq->tail));
         }
 
         if (cq) {
             /* Completion queue head pointer location, (2 * QID + 1) * stride */
             cq->db_addr = dbs_addr + (2 * i + 1) * stride;
             cq->ei_addr = eis_addr + (2 * i + 1) * stride;
+            pci_dma_write(&n->parent_obj, cq->db_addr, &cq->head,
+                    sizeof(cq->head));
         }
     }
 
@@ -6479,8 +6483,10 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         trace_pci_nvme_mmio_doorbell_cq(cq->cqid, new_head);
 
         start_sqs = nvme_cq_full(cq) ? 1 : 0;
-        if (!cq->db_addr) {
         cq->head = new_head;
+        if (cq->db_addr) {
+            pci_dma_write(&n->parent_obj, cq->db_addr, &cq->head,
+                    sizeof(cq->head));
         }
         if (start_sqs) {
             NvmeSQueue *sq;
@@ -6538,9 +6544,25 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
 
         trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
 
-        if (!sq->db_addr) {
         sq->tail = new_tail;
+        if (sq->db_addr) {
+            /*
+             * The spec states "the host shall also update the controller's
+             * corresponding doorbell property to match the value of that entry
+             * in the Shadow Doorbell buffer."
+             *
+             * Since this context is currently a VM trap, we can safely enforce
+             * the requirement from the device side in case the host is
+             * misbehaving.
+             *
+             * Note, we shouldn't have to do this, but various drivers
+             * including ones that run on Linux, are not updating Admin Queues,
+             * so we can't trust reading it for an appropriate sq tail.
+             */
+            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
+                    sizeof(sq->tail));
         }
+
         timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
     }
 }
--



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-13 21:15             ` Keith Busch
@ 2022-06-14  7:24               ` Jinhao Fan
  2022-06-14 15:41                 ` Keith Busch
  0 siblings, 1 reply; 22+ messages in thread
From: Jinhao Fan @ 2022-06-14  7:24 UTC (permalink / raw)
  To: Keith Busch; +Cc: Klaus Jensen, John Levon, qemu-devel



> On Jun 14, 2022, at 5:15 AM, Keith Busch <kbusch@kernel.org> wrote:
> 
> 
> @@ -6538,9 +6544,25 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
> 
>         trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
> 
> -        if (!sq->db_addr) {
>         sq->tail = new_tail;
> +        if (sq->db_addr) {
> +            /*
> +             * The spec states "the host shall also update the controller's
> +             * corresponding doorbell property to match the value of that entry
> +             * in the Shadow Doorbell buffer."
> +             *
> +             * Since this context is currently a VM trap, we can safely enforce
> +             * the requirement from the device side in case the host is
> +             * misbehaving.
> +             *
> +             * Note, we shouldn't have to do this, but various drivers
> +             * including ones that run on Linux, are not updating Admin Queues,
> +             * so we can't trust reading it for an appropriate sq tail.
> +             */
> +            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
> +                    sizeof(sq->tail));
>         }
> +
>         timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
>     }
> }
> --

Thanks Keith,

This is an interesting hack. I wonder how I should incorporate your changes into my patch. I guess I can modify the code in PATCH 1/2 and add a “Proposed-by” tag. Is this the correct way?

Regards,
Jinhao Fan




* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-14  7:24               ` Jinhao Fan
@ 2022-06-14 15:41                 ` Keith Busch
  2022-06-15  3:58                   ` Jinhao Fan
  2022-06-15  8:48                   ` Klaus Jensen
  0 siblings, 2 replies; 22+ messages in thread
From: Keith Busch @ 2022-06-14 15:41 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: Klaus Jensen, John Levon, qemu-devel

On Tue, Jun 14, 2022 at 03:24:37PM +0800, Jinhao Fan wrote:
> > On Jun 14, 2022, at 5:15 AM, Keith Busch <kbusch@kernel.org> wrote:
> > @@ -6538,9 +6544,25 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
> > 
> >         trace_pci_nvme_mmio_doorbell_sq(sq->sqid, new_tail);
> > 
> > -        if (!sq->db_addr) {
> >         sq->tail = new_tail;
> > +        if (sq->db_addr) {
> > +            /*
> > +             * The spec states "the host shall also update the controller's
> > +             * corresponding doorbell property to match the value of that entry
> > +             * in the Shadow Doorbell buffer."
> > +             *
> > +             * Since this context is currently a VM trap, we can safely enforce
> > +             * the requirement from the device side in case the host is
> > +             * misbehaving.
> > +             *
> > +             * Note, we shouldn't have to do this, but various drivers
> > +             * including ones that run on Linux, are not updating Admin Queues,
> > +             * so we can't trust reading it for an appropriate sq tail.
> > +             */
> > +            pci_dma_write(&n->parent_obj, sq->db_addr, &sq->tail,
> > +                    sizeof(sq->tail));
> >         }
> > +
> >         timer_mod(sq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
> >     }
> > }
> > --
> 
> Thanks Keith,
> 
> This is an interesting hack. I wonder how should I incorporate your changes in my patch. I guess I can modify the code in PATCH 1/2 and add a “Proposed-by” tag. Is this the correct way?

It's a pretty nasty hack, and definitely not in compliance with the spec: the
db_addr is supposed to be read-only from the device side, though I do think
it's safe for this environment. Unless Klaus or anyone finds something I'm
missing, I feel this is an acceptable compromise to address this odd
discrepancy.

I believe the recommended tag for something like this is "Suggested-by:", but
no need to credit me. Just fold it into your first patch and send a v2.

By the way, I noticed that the patch never updates the cq's ei_addr value. Is
that on purpose?



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-14 15:41                 ` Keith Busch
@ 2022-06-15  3:58                   ` Jinhao Fan
  2022-06-15  9:38                     ` Klaus Jensen
  2022-06-15  8:48                   ` Klaus Jensen
  1 sibling, 1 reply; 22+ messages in thread
From: Jinhao Fan @ 2022-06-15  3:58 UTC (permalink / raw)
  To: Keith Busch; +Cc: Klaus Jensen, John Levon, qemu-devel


> On Jun 14, 2022, at 11:41 PM, Keith Busch <kbusch@kernel.org> wrote:
> 
> It's a pretty nasty hack, and definitely not in compliance with the spec: the
> db_addr is supposed to be read-only from the device side, though I do think
> it's safe for this environment. Unless Klaus or anyone finds something I'm
> missing, I feel this is an acceptable compromise to address this odd
> discrepency.

:) In my next patch I will check the performance numbers with this hack. Not
sure if updating the db_addr value from the host side will have any performance
implications, but I guess it should be OK.

> 
> I believe the recommended tag for something like this is "Suggested-by:", but
> no need to credit me. Just fold it into your first patch and send a v2.

Sure. Thanks!

> 
> By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> that on purpose?

Klaus also raised a similar question in a prior comment. I think we need to figure
this out before we move on to the v2 patch. I did this because the original Google
extension patch did not update cq’s ei_addr. This seems to make sense because
the purpose of cq’s ei_addr is for the guest to notify the host about cq head
changes when necessary. However, the host does not need this notification 
because we let the host proactively check the cq’s db_addr value when it wants
to post a new cqe. This is also the only point where the host uses the cq’s
db_addr. Therefore, it is OK to postpone the check for cq’s db_addr to this point,
instead of getting timely but not useful notifications by updating cq’s ei_addr.
This helps to reduce the number of MMIO’s on the cq’s doorbell register.
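To make this concrete, here is a toy model of the lazy-read design (struct fields and function names are hypothetical, not the QEMU structures): the device refreshes its view of the CQ head from the shadow doorbell only at the single point where it matters, when checking for a free CQE slot.

```c
#include <stdint.h>

struct cq_model {
    uint16_t head, tail, size;
    uint16_t shadow_head;   /* guest-written shadow doorbell value */
};

/* Returns 1 if a CQE can be posted. The shadow head is consulted lazily,
 * only when the queue looks full, so the guest never needs to signal
 * head updates through the eventidx. */
static int cq_has_room(struct cq_model *cq)
{
    if ((uint16_t)((cq->tail + 1) % cq->size) == cq->head) {
        cq->head = cq->shadow_head;   /* lazy refresh from shadow buffer */
    }
    return (uint16_t)((cq->tail + 1) % cq->size) != cq->head;
}
```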

Klaus, Keith, do you think this design makes sense?



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-14 15:41                 ` Keith Busch
  2022-06-15  3:58                   ` Jinhao Fan
@ 2022-06-15  8:48                   ` Klaus Jensen
  2022-06-15  9:07                     ` John Levon
  1 sibling, 1 reply; 22+ messages in thread
From: Klaus Jensen @ 2022-06-15  8:48 UTC (permalink / raw)
  To: Keith Busch; +Cc: Jinhao Fan, John Levon, qemu-devel


On Jun 14 08:41, Keith Busch wrote:
> It's a pretty nasty hack, and definitely not in compliance with the spec: the
> db_addr is supposed to be read-only from the device side, though I do think
> it's safe for this environment. Unless Klaus or anyone finds something I'm
> missing, I feel this is an acceptable compromise to address this odd
> discrepency.
> 

No, I love this hack! :D

I have tested your hack against a dbbuf-enabled driver that enables
shadow doorbells on the admin queue by default. I can confirm that this
works as well as on "broken" (or, let's call them "reasonable") drivers.

> By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> that on purpose?

Yeah, I also mentioned this previously[1] and I still think we need to
update the event index. Otherwise (and my testing confirms this), we end
up in a situation where the driver skips the mmio, leaving a completion
queue entry "in use" on the device until some other completion comes
along.

I have folded these changes into a patch for testing[2]. Note, your
patch was missing equivalent changes in nvme_post_cqes(), so I added
that as well as updating of the event index.


  [1]: https://lore.kernel.org/qemu-devel/YqEMwsclktptJvQI@apples/
  [2]: http://git.infradead.org/qemu-nvme.git/commitdiff/60712930e441b684490a792b00ef6698cc85f116


Cheers,
Klaus



* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15  8:48                   ` Klaus Jensen
@ 2022-06-15  9:07                     ` John Levon
  2022-06-15  9:33                       ` Klaus Jensen
  0 siblings, 1 reply; 22+ messages in thread
From: John Levon @ 2022-06-15  9:07 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Keith Busch, Jinhao Fan, qemu-devel

On Wed, Jun 15, 2022 at 10:48:26AM +0200, Klaus Jensen wrote:

> > By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> > that on purpose?
> 
> Yeah, I also mentioned this previously[1] and I still think we need to
> update the event index. Otherwise (and my testing confirms this), we end
> up in a situation where the driver skips the mmio, leaving a completion
> queue entry "in use" on the device until some other completion comes
> along.

Hmm, can you expand on this a little bit? We don't touch the cq eventidx in
SPDK either, on the basis that mmio exits are expensive, and we only ever need
to look at cq_head when we're checking for room when posting a completion - and
in that case, we can just look directly at the shadow cq_head value.

Can you clarify the exact circumstance that needs an mmio write when the driver
updates cq_head?

BTW I'm surprised that this patch has just this:

+static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
+{
+    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
+                  sizeof(sq->tail));
+}

Isn't this racy against the driver? Compare
https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317

thanks
john


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15  9:07                     ` John Levon
@ 2022-06-15  9:33                       ` Klaus Jensen
  2022-06-15 10:11                         ` John Levon
  0 siblings, 1 reply; 22+ messages in thread
From: Klaus Jensen @ 2022-06-15  9:33 UTC (permalink / raw)
  To: John Levon; +Cc: Keith Busch, Jinhao Fan, qemu-devel


On Jun 15 10:07, John Levon wrote:
> On Wed, Jun 15, 2022 at 10:48:26AM +0200, Klaus Jensen wrote:
> 
> > > By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> > > that on purpose?
> > 
> > Yeah, I also mentioned this previously[1] and I still think we need to
> > update the event index. Otherwise (and my testing confirms this), we end
> > up in a situation where the driver skips the mmio, leaving a completion
> > queue entry "in use" on the device until some other completion comes
> > along.
> 
> Hmm, can you expand on this a little bit? We don't touch the cq eventidx in
> SPDK either, on the basis that mmio exits are expensive, and we only ever need
> to look at cq_head when we're checking for room when posting a completion - and
> in that case, we can just look directly at the shadow cq_head value.
> 
> Can you clarify the exact circumstance that needs an mmio write when the driver
> updates cq_head?
> 

No, I see; you are correct that not updating the eventidx reduces MMIO
and that we read the cq head anyway prior to posting completions. I
guess it's a perfectly reasonable device-side optimization in this
case. We can safely drop that addition again, I think.

> BTW I'm surprised that this patch has just this:
> 
> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
> +{
> +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
> +                  sizeof(sq->tail));
> +}
> 
> Isn't this racy against the driver? Compare
> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> 
> thanks
> john

QEMU has full memory barriers on dma read/write, so I believe this is
safe?


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15  3:58                   ` Jinhao Fan
@ 2022-06-15  9:38                     ` Klaus Jensen
  2022-06-15 14:52                       ` Jinhao Fan
  0 siblings, 1 reply; 22+ messages in thread
From: Klaus Jensen @ 2022-06-15  9:38 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: Keith Busch, John Levon, qemu-devel


On Jun 15 11:58, Jinhao Fan wrote:
> 
> > On Jun 14, 2022, at 11:41 PM, Keith Busch <kbusch@kernel.org> wrote:
> > 
> > It's a pretty nasty hack, and definitely not in compliance with the spec: the
> > db_addr is supposed to be read-only from the device side, though I do think
> > it's safe for this environment. Unless Klaus or anyone finds something I'm
> > missing, I feel this is an acceptable compromise to address this odd
> > discrepancy.
> 
> :) In my next patch I will check the performance numbers with this hack. Not
> sure if updating the db_addr value from the host will have any performance
> implications, but I guess it should be OK.
> 

I prefer we use the NVMe terminology to minimize misunderstandings, so
"host" means the driver and "device" means the QEMU side of things.

> > By the way, I noticed that the patch never updates the cq's ei_addr value. Is
> > that on purpose?
> 
> Klaus also raised a similar question in a prior comment. I think we need to figure
> this out before we move on to the v2 patch. I did this because the original Google
> extension patch did not update cq’s ei_addr. This seems to make sense because
> the purpose of cq’s ei_addr is for the guest to notify the host about cq head
> changes when necessary. However, the host does not need this notification 
> because we let the host proactively check for cq’s db_addr value when it wants
> to post a new cqe.
> This is also the only point where the host uses the cq’s
> db_addr. Therefore, it is OK to postpone the check for cq’s db_addr to this point,
> instead of getting timely but not useful notifications by updating cq’s ei_addr.
> This helps to reduce the number of MMIO’s on the cq’s doorbell register.
> 

True, it does reduce it, but it may leave CQEs "lingering" on the device
side (since the device has not been notified that the host has consumed
them).

> Klaus, Keith, do you think this design makes sense?

As I mentioned in my reply to John, I can see why this is a perfectly
reasonable optimization; we don't really care about the lingering CQEs
since we read the head anyway prior to posting completions. I jumped the
gun here in my eagerness to be "spec compliant" ;)
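
A minimal self-contained sketch of that device-side behaviour (illustrative
only: the queue/helper names are made up, and the shadow doorbell entry in
guest memory is simulated by a plain variable):

```c
#include <stdbool.h>
#include <stdint.h>

enum { CQ_SIZE = 8 };

/* Simulated shadow doorbell entry the driver writes its cq head into. */
uint32_t shadow_cq_head;

struct cq {
    uint32_t head;  /* device's cached copy of the consumer index */
    uint32_t tail;  /* device's producer index */
};

/* Stand-in for the DMA read of the shadow doorbell buffer. */
void cq_refresh_head(struct cq *cq)
{
    cq->head = shadow_cq_head;
}

bool cq_full(const struct cq *cq)
{
    return (cq->tail + 1) % CQ_SIZE == cq->head;
}

/* The device re-reads the shadow head each time it wants to post a CQE,
 * so head updates the driver never doorbelled are still picked up. */
bool cq_try_post(struct cq *cq)
{
    cq_refresh_head(cq);
    if (cq_full(cq)) {
        return false;
    }
    cq->tail = (cq->tail + 1) % CQ_SIZE;
    return true;
}
```

The "lingering" CQEs are the entries between the stale cached head and the
shadow head; they stay unreclaimed only until the next completion forces a
refresh.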


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15  9:33                       ` Klaus Jensen
@ 2022-06-15 10:11                         ` John Levon
  2022-06-15 11:22                           ` Jinhao Fan
  0 siblings, 1 reply; 22+ messages in thread
From: John Levon @ 2022-06-15 10:11 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Keith Busch, Jinhao Fan, qemu-devel

On Wed, Jun 15, 2022 at 11:33:02AM +0200, Klaus Jensen wrote:

> > BTW I'm surprised that this patch has just this:
> > 
> > +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
> > +{
> > +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
> > +                  sizeof(sq->tail));
> > +}
> > 
> > Isn't this racy against the driver? Compare
> > https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> > 
> > thanks
> > john
> 
> QEMU has full memory barriers on dma read/write, so I believe this is
> safe?

But don't you need to re-read the tail still, for example:


driver 			device

			eventidx is 3

write 4 to tail
			read tail of 4
write 5 to tail
read eventidx of 3
nvme_dbbuf_need_event (1)

			set eventidx to 4
			go to sleep

at (1), our tail update of 4->5 doesn't straddle the eventidx, so we don't send
any MMIO, and the device won't wake up. This is why the above code checks the
tail twice for any concurrent update.
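
For reference, the straddle check in question can be sketched like the
helper in the Linux driver (illustrative, adapted from
nvme_dbbuf_need_event() in drivers/nvme/host/pci.c):

```c
#include <stdbool.h>
#include <stdint.h>

/* True iff the update old_idx -> new_idx crossed event_idx, computed
 * modulo 2^16 so queue-index wraparound is handled. */
bool nvme_dbbuf_need_event(uint16_t event_idx, uint16_t new_idx,
                           uint16_t old_idx)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old_idx);
}
```

Plugging in the timeline above: with eventidx 3, the 3->4 tail update
straddles it and triggers an mmio, but the 4->5 update does not
(need_event(3, 5, 4) is false), so the device gets no wakeup for it.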

regards
john




^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15 10:11                         ` John Levon
@ 2022-06-15 11:22                           ` Jinhao Fan
  2022-06-15 11:45                             ` John Levon
  0 siblings, 1 reply; 22+ messages in thread
From: Jinhao Fan @ 2022-06-15 11:22 UTC (permalink / raw)
  To: John Levon; +Cc: Klaus Jensen, Keith Busch, qemu-devel



> On Jun 15, 2022, at 6:11 PM, John Levon <levon@movementarian.org> wrote:
> 
> On Wed, Jun 15, 2022 at 11:33:02AM +0200, Klaus Jensen wrote:
> 
>>> BTW I'm surprised that this patch has just this:
>>> 
>>> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq)
>>> +{
>>> +    pci_dma_write(&sq->ctrl->parent_obj, sq->ei_addr, &sq->tail,
>>> +                  sizeof(sq->tail));
>>> +}
>>> 
>>> Isn't this racy against the driver? Compare
>>> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
>>> 
>>> thanks
>>> john
>> 
>> QEMU has full memory barriers on dma read/write, so I believe this is
>> safe?
> 
> But don't you need to re-read the tail still, for example:


Hi John,

I think we also have a check for a concurrent update of the tail. After writing the eventidx, we read the tail again. It is here:

@@ -5854,6 +5943,11 @@ static void nvme_process_sq(void *opaque)
             req->status = status;
             nvme_enqueue_req_completion(cq, req);
         }
+
+        if (n->dbbuf_enabled) {
+            nvme_update_sq_eventidx(sq);
+            nvme_update_sq_tail(sq);
+        }

> 
> 
> driver 			device
> 
> 			eventidx is 3
> 
> write 4 to tail
> 			read tail of 4
> write 5 to tail
> read eventidx of 3
> nvme_dbbuf_need_event (1)
> 
> 			set eventidx to 4

Therefore, at this point, we read the tail of 5.

> 			go to sleep
> 
> at (1), our tail update of 4->5 doesn't straddle the eventidx, so we don't send
> any MMIO, and the device won't wake up. This is why the above code checks the
> tail twice for any concurrent update.

Thanks,
Jinhao Fan

> 
> regards
> john



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15 11:22                           ` Jinhao Fan
@ 2022-06-15 11:45                             ` John Levon
  0 siblings, 0 replies; 22+ messages in thread
From: John Levon @ 2022-06-15 11:45 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: Klaus Jensen, Keith Busch, qemu-devel

On Wed, Jun 15, 2022 at 07:22:22PM +0800, Jinhao Fan wrote:

> >>> Isn't this racy against the driver? Compare
> >>> https://github.com/spdk/spdk/blob/master/lib/nvmf/vfio_user.c#L1317
> >> 
> >> QEMU has full memory barriers on dma read/write, so I believe this is
> >> safe?
> > 
> > But don't you need to re-read the tail still, for example:
> 
> I think we also have a check for concurrent update on the tail. After writing eventidx, we read the tail again. It is here:
> 
> @@ -5854,6 +5943,11 @@ static void nvme_process_sq(void *opaque)
>              req->status = status;
>              nvme_enqueue_req_completion(cq, req);
>          }
> +
> +        if (n->dbbuf_enabled) {
> +            nvme_update_sq_eventidx(sq);
> +            nvme_update_sq_tail(sq);
> +        }

Ah, and we go around the loop another time in this case.

> > driver 			device
> > 
> > 			eventidx is 3
> > 
> > write 4 to tail
> > 			read tail of 4
> > write 5 to tail
> > read eventidx of 3
> > nvme_dbbuf_need_event (1)
> > 
> > 			set eventidx to 4
> 
> Therefore, at this point, we read the tail of 5.

The driver could still update the tail after the nvme_update_sq_tail() above.
However, the driver ordering (read tail, then eventidx) does mean that it
would then do an mmio write, so yes, this looks safe. Thank you.
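
The device-side sequence the thread converged on can be sketched
self-contained (the names loosely mirror the patch, but everything here is
illustrative, with shadow buffers simulated by plain variables):

```c
#include <stdint.h>

uint32_t shadow_sq_tail;   /* driver-written shadow doorbell */
uint32_t shadow_eventidx;  /* device-written event index */

struct sq {
    uint32_t head;
    uint32_t tail;  /* device's cached copy of the shadow tail */
};

/* Drain the queue; returns the number of commands consumed.  After
 * publishing the new eventidx, the device re-reads the shadow tail, so a
 * tail bump that raced with the eventidx write either shows up here (and
 * the loop goes around again) or straddles the new eventidx and makes the
 * driver fall back to an mmio doorbell write. */
int process_sq(struct sq *sq)
{
    int n = 0;

    sq->tail = shadow_sq_tail;
    while (sq->head != sq->tail) {
        while (sq->head != sq->tail) {
            sq->head++;  /* "execute" one command */
            n++;
        }
        shadow_eventidx = sq->tail;  /* cf. nvme_update_sq_eventidx() */
        /* QEMU's pci_dma_read/pci_dma_write imply full barriers here */
        sq->tail = shadow_sq_tail;   /* cf. nvme_update_sq_tail() */
    }
    return n;
}
```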

regards
john


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH 1/2] hw/nvme: Implement shadow doorbell buffer support
  2022-06-15  9:38                     ` Klaus Jensen
@ 2022-06-15 14:52                       ` Jinhao Fan
  0 siblings, 0 replies; 22+ messages in thread
From: Jinhao Fan @ 2022-06-15 14:52 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: Keith Busch, John Levon, qemu-devel



> On Jun 15, 2022, at 5:38 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> I prefer we use the NVMe terminology to minimize misunderstandings, so
> "host" means the driver and "device" means the QEMU side of things.
> 

Thanks for helping me disambiguate this!

Now that we have resolved all issues in v1, I’ve submitted a v2 patch.



^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2022-06-15 14:56 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-08  1:36 [PATCH 0/2] hw/nvme: Add shadow doorbell buffer support Jinhao Fan
2022-06-08  1:36 ` [PATCH 1/2] hw/nvme: Implement " Jinhao Fan
2022-06-08 20:55   ` Klaus Jensen
2022-06-09  1:49     ` Jinhao Fan
2022-06-09 14:29     ` Keith Busch
2022-06-09 15:52       ` John Levon
2022-06-09 17:27         ` Klaus Jensen
2022-06-09 17:50           ` John Levon
2022-06-12 11:40           ` Jinhao Fan
2022-06-13 21:15             ` Keith Busch
2022-06-14  7:24               ` Jinhao Fan
2022-06-14 15:41                 ` Keith Busch
2022-06-15  3:58                   ` Jinhao Fan
2022-06-15  9:38                     ` Klaus Jensen
2022-06-15 14:52                       ` Jinhao Fan
2022-06-15  8:48                   ` Klaus Jensen
2022-06-15  9:07                     ` John Levon
2022-06-15  9:33                       ` Klaus Jensen
2022-06-15 10:11                         ` John Levon
2022-06-15 11:22                           ` Jinhao Fan
2022-06-15 11:45                             ` John Levon
2022-06-08  1:36 ` [PATCH 2/2] hw/nvme: Add trace events for shadow doorbell buffer Jinhao Fan
