* [PATCH v3 0/4] irqfd, iothread and polling support
@ 2022-08-27  9:12 Jinhao Fan
  2022-08-27  9:12 ` [PATCH v3 1/4] hw/nvme: support irq(de)assertion with eventfd Jinhao Fan
                   ` (4 more replies)
  0 siblings, 5 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-27  9:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan

This series of performance enhancements brings iothread and polling
capabilities to qemu-nvme. The first two patches implement support
for irqfd, which solves thread safety problems for interrupt emulation
outside the main loop thread. The third patch adds support for
emulating NVMe IO queues in a dedicated iothread, which avoids
interference from other devices running on the main loop thread.
The fourth patch implements SQ and CQ pollers, thus enabling polling
on SQ and CQ.

After all these optimizations, our performance becomes similar to
virtio-blk.

Comparison (KIOPS):

QD                    1   4  16  64
virtio-blk           59 185 260 256
nvme                 53 155 245 309
virtio-blk-polling   88 212 210 213
nvme-polling        123 165 189 191
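
For reference, everything in this series can be enabled together with
something like the following (a hypothetical invocation; the irq-eventfd,
ioeventfd and iothread properties are the ones added or used by the
patches below, while the drive and serial details are placeholders):

-object iothread,id=nvme0 \
-drive id=nvm,file=nvme.img,format=raw,if=none \
-device nvme,serial=deadbeef,drive=nvm,irq-eventfd=on,ioeventfd=on,iothread=nvme0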

Changes since v2:
 - Add polling support
 - Do not set up MSI-X masking handlers when irq-eventfd is off

Changes since v1:
 - Avoid duplicate initialization of cq timer

Jinhao Fan (4):
  hw/nvme: support irq(de)assertion with eventfd
  hw/nvme: use KVM irqfd when available
  hw/nvme: add iothread support
  hw/nvme: add polling support

 hw/nvme/ctrl.c       | 394 ++++++++++++++++++++++++++++++++++++++++---
 hw/nvme/ns.c         |  21 ++-
 hw/nvme/nvme.h       |  13 +-
 hw/nvme/trace-events |   3 +
 4 files changed, 401 insertions(+), 30 deletions(-)

-- 
2.25.1




* [PATCH v3 1/4] hw/nvme: support irq(de)assertion with eventfd
  2022-08-27  9:12 [PATCH v3 0/4] irqfd, iothread and polling support Jinhao Fan
@ 2022-08-27  9:12 ` Jinhao Fan
  2022-08-27  9:12 ` [PATCH v3 2/4] hw/nvme: use KVM irqfd when available Jinhao Fan
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-27  9:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: its, kbusch, stefanha, Jinhao Fan, Klaus Jensen, open list:nvme

When the new option 'irq-eventfd' is turned on, the IO emulation code
signals an eventfd when it wants to (de)assert an irq. The main loop
eventfd handler does the actual irq (de)assertion.  This paves the way
for iothread support since QEMU's interrupt emulation is not thread
safe.

Asserting and deasserting an irq with an eventfd has some performance
implications. For small queue depth it increases request latency but
for large queue depth it effectively coalesces irqs.

Comparison (KIOPS):

QD            1   4  16  64
QEMU         38 123 210 329
irq-eventfd  32 106 240 364

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c | 120 ++++++++++++++++++++++++++++++++++++++++++-------
 hw/nvme/nvme.h |   3 ++
 2 files changed, 106 insertions(+), 17 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..51792f3955 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -526,34 +526,57 @@ static void nvme_irq_check(NvmeCtrl *n)
     }
 }
 
+static void nvme_irq_do_assert(NvmeCtrl *n, NvmeCQueue *cq)
+{
+    if (msix_enabled(&(n->parent_obj))) {
+        trace_pci_nvme_irq_msix(cq->vector);
+        msix_notify(&(n->parent_obj), cq->vector);
+    } else {
+        trace_pci_nvme_irq_pin();
+        assert(cq->vector < 32);
+        n->irq_status |= 1 << cq->vector;
+        nvme_irq_check(n);
+    }
+}
+
 static void nvme_irq_assert(NvmeCtrl *n, NvmeCQueue *cq)
 {
     if (cq->irq_enabled) {
-        if (msix_enabled(&(n->parent_obj))) {
-            trace_pci_nvme_irq_msix(cq->vector);
-            msix_notify(&(n->parent_obj), cq->vector);
+        if (cq->assert_notifier.initialized) {
+            event_notifier_set(&cq->assert_notifier);
         } else {
-            trace_pci_nvme_irq_pin();
-            assert(cq->vector < 32);
-            n->irq_status |= 1 << cq->vector;
-            nvme_irq_check(n);
+            nvme_irq_do_assert(n, cq);
         }
     } else {
         trace_pci_nvme_irq_masked();
     }
 }
 
+static void nvme_irq_do_deassert(NvmeCtrl *n, NvmeCQueue *cq)
+{
+    if (msix_enabled(&(n->parent_obj))) {
+        return;
+    } else {
+        assert(cq->vector < 32);
+        if (!n->cq_pending) {
+            n->irq_status &= ~(1 << cq->vector);
+        }
+        nvme_irq_check(n);
+    }
+}
+
 static void nvme_irq_deassert(NvmeCtrl *n, NvmeCQueue *cq)
 {
     if (cq->irq_enabled) {
-        if (msix_enabled(&(n->parent_obj))) {
-            return;
+        if (cq->deassert_notifier.initialized) {
+            /*
+             * The deassert notifier will only be initialized when MSI-X is NOT
+             * in use. Therefore no need to worry about extra eventfd syscall
+             * for pin-based interrupts.
+             */
+            event_notifier_set(&cq->deassert_notifier);
         } else {
-            assert(cq->vector < 32);
-            if (!n->cq_pending) {
-                n->irq_status &= ~(1 << cq->vector);
-            }
-            nvme_irq_check(n);
+            nvme_irq_do_deassert(n, cq);
         }
     }
 }
@@ -1338,6 +1361,50 @@ static void nvme_update_cq_head(NvmeCQueue *cq)
     trace_pci_nvme_shadow_doorbell_cq(cq->cqid, cq->head);
 }
 
+static void nvme_assert_notifier_read(EventNotifier *e)
+{
+    NvmeCQueue *cq = container_of(e, NvmeCQueue, assert_notifier);
+    if (event_notifier_test_and_clear(e)) {
+        nvme_irq_do_assert(cq->ctrl, cq);
+    }
+}
+
+static void nvme_deassert_notifier_read(EventNotifier *e)
+{
+    NvmeCQueue *cq = container_of(e, NvmeCQueue, deassert_notifier);
+    if (event_notifier_test_and_clear(e)) {
+        nvme_irq_do_deassert(cq->ctrl, cq);
+    }
+}
+
+static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
+{
+    int ret;
+
+    ret = event_notifier_init(&cq->assert_notifier, 0);
+    if (ret < 0) {
+        return;
+    }
+
+    event_notifier_set_handler(&cq->assert_notifier,
+                                nvme_assert_notifier_read);
+
+    if (!msix_enabled(&n->parent_obj)) {
+        ret = event_notifier_init(&cq->deassert_notifier, 0);
+        if (ret < 0) {
+            event_notifier_set_handler(&cq->assert_notifier, NULL);
+            event_notifier_cleanup(&cq->assert_notifier);
+
+            return;
+        }
+
+        event_notifier_set_handler(&cq->deassert_notifier,
+                                   nvme_deassert_notifier_read);
+    }
+
+    return;
+}
+
 static void nvme_post_cqes(void *opaque)
 {
     NvmeCQueue *cq = opaque;
@@ -1377,8 +1444,10 @@ static void nvme_post_cqes(void *opaque)
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
-        if (cq->irq_enabled && !pending) {
-            n->cq_pending++;
+        if (cq->irq_enabled) {
+            if (!pending) {
+                n->cq_pending++;
+            }
         }
 
         nvme_irq_assert(n, cq);
@@ -4705,6 +4774,14 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
         event_notifier_set_handler(&cq->notifier, NULL);
         event_notifier_cleanup(&cq->notifier);
     }
+    if (cq->assert_notifier.initialized) {
+        event_notifier_set_handler(&cq->assert_notifier, NULL);
+        event_notifier_cleanup(&cq->assert_notifier);
+    }
+    if (cq->deassert_notifier.initialized) {
+        event_notifier_set_handler(&cq->deassert_notifier, NULL);
+        event_notifier_cleanup(&cq->deassert_notifier);
+    }
     if (msix_enabled(&n->parent_obj)) {
         msix_vector_unuse(&n->parent_obj, cq->vector);
     }
@@ -4734,7 +4811,7 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
         n->cq_pending--;
     }
 
-    nvme_irq_deassert(n, cq);
+    nvme_irq_do_deassert(n, cq);
     trace_pci_nvme_del_cq(qid);
     nvme_free_cq(cq, n);
     return NVME_SUCCESS;
@@ -4772,6 +4849,14 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     }
     n->cq[cqid] = cq;
     cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+
+    /*
+     * Only enable irq eventfd for IO queues since we always emulate admin
+     * queue in main loop thread.
+     */
+    if (cqid && n->params.irq_eventfd) {
+        nvme_init_irq_notifier(n, cq);
+    }
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
@@ -7671,6 +7756,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
     DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
     DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
+    DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),
     DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 79f5c281c2..4850d3e965 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -398,6 +398,8 @@ typedef struct NvmeCQueue {
     uint64_t    ei_addr;
     QEMUTimer   *timer;
     EventNotifier notifier;
+    EventNotifier assert_notifier;
+    EventNotifier deassert_notifier;
     bool        ioeventfd_enabled;
     QTAILQ_HEAD(, NvmeSQueue) sq_list;
     QTAILQ_HEAD(, NvmeRequest) req_list;
@@ -422,6 +424,7 @@ typedef struct NvmeParams {
     bool     auto_transition_zones;
     bool     legacy_cmb;
     bool     ioeventfd;
+    bool     irq_eventfd;
     uint8_t  sriov_max_vfs;
     uint16_t sriov_vq_flexible;
     uint16_t sriov_vi_flexible;
-- 
2.25.1




* [PATCH v3 2/4] hw/nvme: use KVM irqfd when available
  2022-08-27  9:12 [PATCH v3 0/4] irqfd, iothread and polling support Jinhao Fan
  2022-08-27  9:12 ` [PATCH v3 1/4] hw/nvme: support irq(de)assertion with eventfd Jinhao Fan
@ 2022-08-27  9:12 ` Jinhao Fan
  2022-08-27  9:12 ` [PATCH v3 3/4] hw/nvme: add iothread support Jinhao Fan
                   ` (2 subsequent siblings)
  4 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-27  9:12 UTC (permalink / raw)
  To: qemu-devel
  Cc: its, kbusch, stefanha, Jinhao Fan, Klaus Jensen, open list:nvme

Use KVM's irqfd to send interrupts when possible. This approach is
thread safe. Moreover, it does not have the inter-thread communication
overhead of plain event notifiers since handler callbacks are called
in the same system call as the irqfd write.
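
In rough terms, the two notification paths differ as follows (a sketch
built from the calls in the diff below, not a complete code path):

    /* without irqfd: the eventfd wakes up the main loop, which then runs
     * a handler that performs the actual MSI-X notify */
    event_notifier_set_handler(&cq->assert_notifier,
                               nvme_assert_notifier_read);

    /* with irqfd: KVM consumes the eventfd and injects the MSI directly
     * in the kernel; no handler callback runs in QEMU */
    kvm_irqchip_add_irqfd_notifier_gsi(kvm_state, &cq->assert_notifier,
                                       NULL, cq->virq);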

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
Signed-off-by: Klaus Jensen <k.jensen@samsung.com>
---
 hw/nvme/ctrl.c       | 145 ++++++++++++++++++++++++++++++++++++++++++-
 hw/nvme/nvme.h       |   3 +
 hw/nvme/trace-events |   3 +
 3 files changed, 149 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 51792f3955..e11328967f 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -192,6 +192,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/kvm.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/hostmem.h"
 #include "hw/pci/msix.h"
@@ -1377,8 +1378,115 @@ static void nvme_deassert_notifier_read(EventNotifier *e)
     }
 }
 
+static int nvme_kvm_vector_use(NvmeCtrl *n, NvmeCQueue *cq, uint32_t vector)
+{
+    KVMRouteChange c = kvm_irqchip_begin_route_changes(kvm_state);
+    int ret;
+
+    ret = kvm_irqchip_add_msi_route(&c, vector, &n->parent_obj);
+    if (ret < 0) {
+        return ret;
+    }
+
+    kvm_irqchip_commit_route_changes(&c);
+
+    cq->virq = ret;
+
+    return 0;
+}
+
+static int nvme_kvm_vector_unmask(PCIDevice *pci_dev, unsigned vector,
+                                  MSIMessage msg)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+    int ret;
+
+    trace_pci_nvme_irq_unmask(vector, msg.address, msg.data);
+
+    for (uint32_t i = 1; i <= n->params.max_ioqpairs; i++) {
+        NvmeCQueue *cq = n->cq[i];
+
+        if (!cq) {
+            continue;
+        }
+
+        if (cq->vector == vector) {
+            if (cq->msg.data != msg.data || cq->msg.address != msg.address) {
+                ret = kvm_irqchip_update_msi_route(kvm_state, cq->virq, msg,
+                                                   pci_dev);
+                if (ret < 0) {
+                    return ret;
+                }
+
+                kvm_irqchip_commit_routes(kvm_state);
+
+                cq->msg = msg;
+            }
+
+            ret = kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                                     &cq->assert_notifier,
+                                                     NULL, cq->virq);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+    }
+
+    return 0;
+}
+
+static void nvme_kvm_vector_mask(PCIDevice *pci_dev, unsigned vector)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+
+    trace_pci_nvme_irq_mask(vector);
+
+    for (uint32_t i = 1; i <= n->params.max_ioqpairs; i++) {
+        NvmeCQueue *cq = n->cq[i];
+
+        if (!cq) {
+            continue;
+        }
+
+        if (cq->vector == vector) {
+            kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+                                                  &cq->assert_notifier,
+                                                  cq->virq);
+        }
+    }
+}
+
+static void nvme_kvm_vector_poll(PCIDevice *pci_dev, unsigned int vector_start,
+                                 unsigned int vector_end)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+
+    trace_pci_nvme_irq_poll(vector_start, vector_end);
+
+    for (uint32_t i = 1; i <= n->params.max_ioqpairs; i++) {
+        NvmeCQueue *cq = n->cq[i];
+
+        if (!cq) {
+            continue;
+        }
+
+        if (!msix_is_masked(pci_dev, cq->vector)) {
+            continue;
+        }
+
+        if (cq->vector >= vector_start && cq->vector <= vector_end) {
+            if (event_notifier_test_and_clear(&cq->assert_notifier)) {
+                msix_set_pending(pci_dev, i);
+            }
+        }
+    }
+}
+
+
 static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
 {
+    bool with_irqfd = msix_enabled(&n->parent_obj) &&
+                      kvm_msi_via_irqfd_enabled();
     int ret;
 
     ret = event_notifier_init(&cq->assert_notifier, 0);
@@ -1386,12 +1494,27 @@ static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
         return;
     }
 
-    event_notifier_set_handler(&cq->assert_notifier,
-                                nvme_assert_notifier_read);
+    if (with_irqfd) {
+        ret = nvme_kvm_vector_use(n, cq, cq->vector);
+        if (ret < 0) {
+            event_notifier_cleanup(&cq->assert_notifier);
+
+            return;
+        }
+    } else {
+        event_notifier_set_handler(&cq->assert_notifier,
+                                   nvme_assert_notifier_read);
+    }
 
     if (!msix_enabled(&n->parent_obj)) {
         ret = event_notifier_init(&cq->deassert_notifier, 0);
         if (ret < 0) {
+            if (with_irqfd) {
+                kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+                                                      &cq->assert_notifier,
+                                                      cq->virq);
+            }
+
             event_notifier_set_handler(&cq->assert_notifier, NULL);
             event_notifier_cleanup(&cq->assert_notifier);
 
@@ -4764,6 +4887,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 
 static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 {
+    bool with_irqfd = msix_enabled(&n->parent_obj) &&
+                      kvm_msi_via_irqfd_enabled();
     uint16_t offset = (cq->cqid << 3) + (1 << 2);
 
     n->cq[cq->cqid] = NULL;
@@ -4775,6 +4900,12 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
         event_notifier_cleanup(&cq->notifier);
     }
     if (cq->assert_notifier.initialized) {
+        if (with_irqfd) {
+            kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+                                                  &cq->assert_notifier,
+                                                  cq->virq);
+            kvm_irqchip_release_virq(kvm_state, cq->virq);
+        }
         event_notifier_set_handler(&cq->assert_notifier, NULL);
         event_notifier_cleanup(&cq->assert_notifier);
     }
@@ -6528,6 +6659,9 @@ static int nvme_start_ctrl(NvmeCtrl *n)
     uint32_t page_size = 1 << page_bits;
     NvmeSecCtrlEntry *sctrl = nvme_sctrl(n);
 
+    bool with_irqfd = msix_enabled(&n->parent_obj) &&
+                      kvm_msi_via_irqfd_enabled();
+
     if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
         trace_pci_nvme_err_startfail_virt_state(le16_to_cpu(sctrl->nvi),
                                                 le16_to_cpu(sctrl->nvq),
@@ -6617,6 +6751,12 @@ static int nvme_start_ctrl(NvmeCtrl *n)
 
     nvme_select_iocs(n);
 
+    if (n->params.irq_eventfd && with_irqfd) {
+        return msix_set_vector_notifiers(PCI_DEVICE(n), nvme_kvm_vector_unmask,
+                                         nvme_kvm_vector_mask,
+                                         nvme_kvm_vector_poll);
+    }
+
     return 0;
 }
 
@@ -7734,6 +7874,7 @@ static void nvme_exit(PCIDevice *pci_dev)
         pcie_sriov_pf_exit(pci_dev);
     }
 
+    msix_unset_vector_notifiers(pci_dev);
     msix_uninit(pci_dev, &n->bar0, &n->bar0);
     memory_region_del_subregion(&n->bar0, &n->iomem);
 }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 4850d3e965..b0b986b024 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -20,6 +20,7 @@
 
 #include "qemu/uuid.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/msi.h"
 #include "hw/block/block.h"
 
 #include "block/nvme.h"
@@ -396,10 +397,12 @@ typedef struct NvmeCQueue {
     uint64_t    dma_addr;
     uint64_t    db_addr;
     uint64_t    ei_addr;
+    int         virq;
     QEMUTimer   *timer;
     EventNotifier notifier;
     EventNotifier assert_notifier;
     EventNotifier deassert_notifier;
+    MSIMessage  msg;
     bool        ioeventfd_enabled;
     QTAILQ_HEAD(, NvmeSQueue) sq_list;
     QTAILQ_HEAD(, NvmeRequest) req_list;
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index fccb79f489..b11fcf4a65 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -2,6 +2,9 @@
 pci_nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
 pci_nvme_irq_pin(void) "pulsing IRQ pin"
 pci_nvme_irq_masked(void) "IRQ is masked"
+pci_nvme_irq_mask(uint32_t vector) "IRQ %u gets masked"
+pci_nvme_irq_unmask(uint32_t vector, uint64_t addr, uint32_t data) "IRQ %u gets unmasked, addr=0x%"PRIx64" data=0x%"PRIu32""
+pci_nvme_irq_poll(uint32_t vector_start, uint32_t vector_end) "IRQ poll, start=0x%"PRIu32" end=0x%"PRIu32""
 pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
 pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64" eis_addr=0x%"PRIx64""
 pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
-- 
2.25.1




* [PATCH v3 3/4] hw/nvme: add iothread support
  2022-08-27  9:12 [PATCH v3 0/4] irqfd, iothread and polling support Jinhao Fan
  2022-08-27  9:12 ` [PATCH v3 1/4] hw/nvme: support irq(de)assertion with eventfd Jinhao Fan
  2022-08-27  9:12 ` [PATCH v3 2/4] hw/nvme: use KVM irqfd when available Jinhao Fan
@ 2022-08-27  9:12 ` Jinhao Fan
  2022-10-20 11:13   ` Klaus Jensen
  2022-08-27  9:12 ` [PATCH v3 4/4] hw/nvme: add polling support Jinhao Fan
  2022-10-20 11:16 ` [PATCH v3 0/4] irqfd, iothread and " Klaus Jensen
  4 siblings, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-08-27  9:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme

Add an option "iothread=x" to do emulation in a separate iothread.
This improves the performance because QEMU's main loop is responsible
for a lot of other work while the iothread is dedicated to NVMe emulation.
Moreover, emulating in an iothread brings the potential of polling on
SQ/CQ doorbells, which I will bring up in a following patch.

The iothread can be enabled by:
-object iothread,id=nvme0 \
-device nvme,iothread=nvme0 \

Performance comparisons (KIOPS):

QD         1   4  16  64
QEMU      41 136 242 338
iothread  53 155 245 309

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c | 67 ++++++++++++++++++++++++++++++++++++++++++++------
 hw/nvme/ns.c   | 21 +++++++++++++---
 hw/nvme/nvme.h |  6 ++++-
 3 files changed, 82 insertions(+), 12 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index e11328967f..869565d77b 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -4458,7 +4458,13 @@ static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
         return ret;
     }
 
-    event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
+    if (cq->cqid) {
+        aio_set_event_notifier(n->ctx, &cq->notifier, true, nvme_cq_notifier,
+                               NULL, NULL);
+    } else {
+        event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
+    }
+
     memory_region_add_eventfd(&n->iomem,
                               0x1000 + offset, 4, false, 0, &cq->notifier);
 
@@ -4487,7 +4493,13 @@ static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
         return ret;
     }
 
-    event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
+    if (sq->sqid) {
+        aio_set_event_notifier(n->ctx, &sq->notifier, true, nvme_sq_notifier,
+                               NULL, NULL);
+    } else {
+        event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
+    }
+
     memory_region_add_eventfd(&n->iomem,
                               0x1000 + offset, 4, false, 0, &sq->notifier);
 
@@ -4503,7 +4515,12 @@ static void nvme_free_sq(NvmeSQueue *sq, NvmeCtrl *n)
     if (sq->ioeventfd_enabled) {
         memory_region_del_eventfd(&n->iomem,
                                   0x1000 + offset, 4, false, 0, &sq->notifier);
-        event_notifier_set_handler(&sq->notifier, NULL);
+        if (sq->sqid) {
+            aio_set_event_notifier(n->ctx, &sq->notifier, true, NULL, NULL,
+                                   NULL);
+        } else {
+            event_notifier_set_handler(&sq->notifier, NULL);
+        }
         event_notifier_cleanup(&sq->notifier);
     }
     g_free(sq->io_req);
@@ -4573,7 +4590,13 @@ static void nvme_init_sq(NvmeSQueue *sq, NvmeCtrl *n, uint64_t dma_addr,
         sq->io_req[i].sq = sq;
         QTAILQ_INSERT_TAIL(&(sq->req_list), &sq->io_req[i], entry);
     }
-    sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
+
+    if (sq->sqid) {
+        sq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
+                                  nvme_process_sq, sq);
+    } else {
+        sq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_process_sq, sq);
+    }
 
     if (n->dbbuf_enabled) {
         sq->db_addr = n->dbbuf_dbs + (sqid << 3);
@@ -4896,7 +4919,12 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
     if (cq->ioeventfd_enabled) {
         memory_region_del_eventfd(&n->iomem,
                                   0x1000 + offset, 4, false, 0, &cq->notifier);
-        event_notifier_set_handler(&cq->notifier, NULL);
+        if (cq->cqid) {
+            aio_set_event_notifier(n->ctx, &cq->notifier, true, NULL, NULL,
+                                   NULL);
+        } else {
+            event_notifier_set_handler(&cq->notifier, NULL);
+        }
         event_notifier_cleanup(&cq->notifier);
     }
     if (cq->assert_notifier.initialized) {
@@ -4979,7 +5007,13 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
         }
     }
     n->cq[cqid] = cq;
-    cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+
+    if (cq->cqid) {
+        cq->timer = aio_timer_new(n->ctx, QEMU_CLOCK_VIRTUAL, SCALE_NS,
+                                  nvme_post_cqes, cq);
+    } else {
+        cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+    }
 
     /*
      * Only enable irq eventfd for IO queues since we always emulate admin
@@ -7759,6 +7793,14 @@ static void nvme_init_ctrl(NvmeCtrl *n, PCIDevice *pci_dev)
     if (pci_is_vf(&n->parent_obj) && !sctrl->scs) {
         stl_le_p(&n->bar.csts, NVME_CSTS_FAILED);
     }
+
+    if (n->params.iothread) {
+        n->iothread = n->params.iothread;
+        object_ref(OBJECT(n->iothread));
+        n->ctx = iothread_get_aio_context(n->iothread);
+    } else {
+        n->ctx = qemu_get_aio_context();
+    }
 }
 
 static int nvme_init_subsys(NvmeCtrl *n, Error **errp)
@@ -7831,7 +7873,7 @@ static void nvme_realize(PCIDevice *pci_dev, Error **errp)
         ns = &n->namespace;
         ns->params.nsid = 1;
 
-        if (nvme_ns_setup(ns, errp)) {
+        if (nvme_ns_setup(ns, n->ctx, errp)) {
             return;
         }
 
@@ -7862,6 +7904,15 @@ static void nvme_exit(PCIDevice *pci_dev)
     g_free(n->sq);
     g_free(n->aer_reqs);
 
+    aio_context_acquire(n->ctx);
+    blk_set_aio_context(n->namespace.blkconf.blk, qemu_get_aio_context(), NULL);
+    aio_context_release(n->ctx);
+
+    if (n->iothread) {
+        object_unref(OBJECT(n->iothread));
+        n->iothread = NULL;
+    }
+
     if (n->params.cmb_size_mb) {
         g_free(n->cmb.buf);
     }
@@ -7885,6 +7936,8 @@ static Property nvme_props[] = {
                      HostMemoryBackend *),
     DEFINE_PROP_LINK("subsys", NvmeCtrl, subsys, TYPE_NVME_SUBSYS,
                      NvmeSubsystem *),
+    DEFINE_PROP_LINK("iothread", NvmeCtrl, params.iothread, TYPE_IOTHREAD,
+                     IOThread *),
     DEFINE_PROP_STRING("serial", NvmeCtrl, params.serial),
     DEFINE_PROP_UINT32("cmb_size_mb", NvmeCtrl, params.cmb_size_mb, 0),
     DEFINE_PROP_UINT32("num_queues", NvmeCtrl, params.num_queues, 0),
diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index 62a1f97be0..eb9141a67b 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -146,9 +146,11 @@ lbaf_found:
     return 0;
 }
 
-static int nvme_ns_init_blk(NvmeNamespace *ns, Error **errp)
+static int nvme_ns_init_blk(NvmeNamespace *ns, AioContext *ctx, Error **errp)
 {
     bool read_only;
+    AioContext *old_context;
+    int ret;
 
     if (!blkconf_blocksizes(&ns->blkconf, errp)) {
         return -1;
@@ -170,6 +172,17 @@ static int nvme_ns_init_blk(NvmeNamespace *ns, Error **errp)
         return -1;
     }
 
+    old_context = blk_get_aio_context(ns->blkconf.blk);
+    aio_context_acquire(old_context);
+    ret = blk_set_aio_context(ns->blkconf.blk, ctx, errp);
+    aio_context_release(old_context);
+
+    if (ret) {
+        error_setg(errp, "Set AioContext on BlockBackend failed");
+        return ret;
+    }
+
+
     return 0;
 }
 
@@ -482,13 +495,13 @@ static int nvme_ns_check_constraints(NvmeNamespace *ns, Error **errp)
     return 0;
 }
 
-int nvme_ns_setup(NvmeNamespace *ns, Error **errp)
+int nvme_ns_setup(NvmeNamespace *ns, AioContext *ctx, Error **errp)
 {
     if (nvme_ns_check_constraints(ns, errp)) {
         return -1;
     }
 
-    if (nvme_ns_init_blk(ns, errp)) {
+    if (nvme_ns_init_blk(ns, ctx, errp)) {
         return -1;
     }
 
@@ -563,7 +576,7 @@ static void nvme_ns_realize(DeviceState *dev, Error **errp)
         }
     }
 
-    if (nvme_ns_setup(ns, errp)) {
+    if (nvme_ns_setup(ns, n->ctx, errp)) {
         return;
     }
 
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index b0b986b024..224b73e6c4 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -22,6 +22,7 @@
 #include "hw/pci/pci.h"
 #include "hw/pci/msi.h"
 #include "hw/block/block.h"
+#include "sysemu/iothread.h"
 
 #include "block/nvme.h"
 
@@ -276,7 +277,7 @@ static inline void nvme_aor_dec_active(NvmeNamespace *ns)
 }
 
 void nvme_ns_init_format(NvmeNamespace *ns);
-int nvme_ns_setup(NvmeNamespace *ns, Error **errp);
+int nvme_ns_setup(NvmeNamespace *ns, AioContext *ctx, Error **errp);
 void nvme_ns_drain(NvmeNamespace *ns);
 void nvme_ns_shutdown(NvmeNamespace *ns);
 void nvme_ns_cleanup(NvmeNamespace *ns);
@@ -433,6 +434,7 @@ typedef struct NvmeParams {
     uint16_t sriov_vi_flexible;
     uint8_t  sriov_max_vq_per_vf;
     uint8_t  sriov_max_vi_per_vf;
+    IOThread *iothread;
 } NvmeParams;
 
 typedef struct NvmeCtrl {
@@ -464,6 +466,8 @@ typedef struct NvmeCtrl {
     uint64_t    dbbuf_dbs;
     uint64_t    dbbuf_eis;
     bool        dbbuf_enabled;
+    IOThread    *iothread;
+    AioContext  *ctx;
 
     struct {
         MemoryRegion mem;
-- 
2.25.1




* [PATCH v3 4/4] hw/nvme: add polling support
  2022-08-27  9:12 [PATCH v3 0/4] irqfd, iothread and polling support Jinhao Fan
                   ` (2 preceding siblings ...)
  2022-08-27  9:12 ` [PATCH v3 3/4] hw/nvme: add iothread support Jinhao Fan
@ 2022-08-27  9:12 ` Jinhao Fan
  2022-10-20 11:10   ` Klaus Jensen
  2022-10-20 11:16 ` [PATCH v3 0/4] irqfd, iothread and " Klaus Jensen
  4 siblings, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-08-27  9:12 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme

Add AioContext polling handlers for NVMe SQ and CQ. By employing polling,
the latency of NVMe IO emulation is greatly reduced. The SQ polling
handler checks for updates on the SQ tail shadow doorbell buffer. The CQ
polling handler is an empty function because we proactively poll the CQ
head shadow doorbell buffer when we want to post a cqe. Updates on the
SQ eventidx buffer are stopped during polling to avoid the host doing
unnecessary doorbell buffer writes.

Comparison (KIOPS):

QD        1   4  16  64
QEMU     53 155 245 309
polling 123 165 189 191

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++----
 hw/nvme/nvme.h |  1 +
 2 files changed, 70 insertions(+), 5 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 869565d77b..a7f8a4220e 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -298,6 +298,8 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
 
 static void nvme_process_sq(void *opaque);
 static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
+static void nvme_update_sq_eventidx(const NvmeSQueue *sq);
+static void nvme_update_sq_tail(NvmeSQueue *sq);
 
 static uint16_t nvme_sqid(NvmeRequest *req)
 {
@@ -4447,6 +4449,21 @@ static void nvme_cq_notifier(EventNotifier *e)
     nvme_post_cqes(cq);
 }
 
+static bool nvme_cq_notifier_aio_poll(void *opaque)
+{
+    /*
+     * We already "poll" the CQ tail shadow doorbell value in nvme_post_cqes(),
+     * so we do not need to check the value here. However, QEMU's AioContext
+     * polling requires us to provide io_poll and io_poll_ready handlers, so
+     * use dummy functions for CQ.
+     */
+    return false;
+}
+
+static void nvme_cq_notifier_aio_poll_ready(EventNotifier *n)
+{
+}
+
 static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
 {
     NvmeCtrl *n = cq->ctrl;
@@ -4459,8 +4476,10 @@ static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
     }
 
     if (cq->cqid) {
-        aio_set_event_notifier(n->ctx, &cq->notifier, true, nvme_cq_notifier,
-                               NULL, NULL);
+        aio_set_event_notifier(n->ctx, &cq->notifier, true,
+                               nvme_cq_notifier,
+                               nvme_cq_notifier_aio_poll,
+                               nvme_cq_notifier_aio_poll_ready);
     } else {
         event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
     }
@@ -4482,6 +4501,44 @@ static void nvme_sq_notifier(EventNotifier *e)
     nvme_process_sq(sq);
 }
 
+static void nvme_sq_notifier_aio_poll_begin(EventNotifier *n)
+{
+    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
+
+    nvme_update_sq_eventidx(sq);
+
+    /* Stop host doorbell writes by stopping eventidx updates */
+    sq->suppress_db = true;
+}
+
+static bool nvme_sq_notifier_aio_poll(void *opaque)
+{
+    EventNotifier *n = opaque;
+    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
+    uint32_t old_tail = sq->tail;
+
+    nvme_update_sq_tail(sq);
+
+    return sq->tail != old_tail;
+}
+
+static void nvme_sq_notifier_aio_poll_ready(EventNotifier *n)
+{
+    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
+
+    nvme_process_sq(sq);
+}
+
+static void nvme_sq_notifier_aio_poll_end(EventNotifier *n)
+{
+    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
+
+    nvme_update_sq_eventidx(sq);
+
+    /* Resume host doorbell writes */
+    sq->suppress_db = false;
+}
+
 static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
 {
     NvmeCtrl *n = sq->ctrl;
@@ -4494,8 +4551,13 @@ static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
     }
 
     if (sq->sqid) {
-        aio_set_event_notifier(n->ctx, &sq->notifier, true, nvme_sq_notifier,
-                               NULL, NULL);
+        aio_set_event_notifier(n->ctx, &sq->notifier, true,
+                               nvme_sq_notifier,
+                               nvme_sq_notifier_aio_poll,
+                               nvme_sq_notifier_aio_poll_ready);
+        aio_set_event_notifier_poll(n->ctx, &sq->notifier,
+                                    nvme_sq_notifier_aio_poll_begin,
+                                    nvme_sq_notifier_aio_poll_end);
     } else {
         event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
     }
@@ -6530,7 +6592,9 @@ static void nvme_process_sq(void *opaque)
         }
 
         if (n->dbbuf_enabled) {
-            nvme_update_sq_eventidx(sq);
+            if (!sq->suppress_db) {
+                nvme_update_sq_eventidx(sq);
+            }
             nvme_update_sq_tail(sq);
         }
     }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 224b73e6c4..bd486a8e15 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -380,6 +380,7 @@ typedef struct NvmeSQueue {
     QEMUTimer   *timer;
     EventNotifier notifier;
     bool        ioeventfd_enabled;
+    bool        suppress_db;
     NvmeRequest *io_req;
     QTAILQ_HEAD(, NvmeRequest) req_list;
     QTAILQ_HEAD(, NvmeRequest) out_req_list;
-- 
2.25.1




* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-08-27  9:12 ` [PATCH v3 4/4] hw/nvme: add polling support Jinhao Fan
@ 2022-10-20 11:10   ` Klaus Jensen
  2022-11-03  2:18     ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-10-20 11:10 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme


On Aug 27 17:12, Jinhao Fan wrote:
> Add AioContext polling handlers for NVMe SQ and CQ. By employing polling,
> the latency of NVMe IO emulation is greatly reduced. The SQ polling
> handler checks for updates on the SQ tail shadow doorbell buffer. The CQ
> polling handler is an empty function because we proactively poll the CQ
> head shadow doorbell buffer when we want to post a cqe. Updates on the
> SQ eventidx buffer are stopped during polling to avoid the host doing
> unnecessary doorbell buffer writes.
> 
> Comparison (KIOPS):
> 
> QD        1   4  16  64
> QEMU     53 155 245 309
> polling 123 165 189 191
> 
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 74 ++++++++++++++++++++++++++++++++++++++++++++++----
>  hw/nvme/nvme.h |  1 +
>  2 files changed, 70 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 869565d77b..a7f8a4220e 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -298,6 +298,8 @@ static const uint32_t nvme_cse_iocs_zoned[256] = {
>  
>  static void nvme_process_sq(void *opaque);
>  static void nvme_ctrl_reset(NvmeCtrl *n, NvmeResetType rst);
> +static void nvme_update_sq_eventidx(const NvmeSQueue *sq);
> +static void nvme_update_sq_tail(NvmeSQueue *sq);
>  
>  static uint16_t nvme_sqid(NvmeRequest *req)
>  {
> @@ -4447,6 +4449,21 @@ static void nvme_cq_notifier(EventNotifier *e)
>      nvme_post_cqes(cq);
>  }
>  
> +static bool nvme_cq_notifier_aio_poll(void *opaque)
> +{
> +    /*
> +     * We already "poll" the CQ tail shadow doorbell value in nvme_post_cqes(),
> +     * so we do not need to check the value here. However, QEMU's AioContext
> +     * polling requires us to provide io_poll and io_poll_ready handlers, so
> +     * use dummy functions for CQ.
> +     */
> +    return false;
> +}
> +
> +static void nvme_cq_notifier_aio_poll_ready(EventNotifier *n)
> +{
> +}
> +
>  static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
>  {
>      NvmeCtrl *n = cq->ctrl;
> @@ -4459,8 +4476,10 @@ static int nvme_init_cq_ioeventfd(NvmeCQueue *cq)
>      }
>  
>      if (cq->cqid) {
> -        aio_set_event_notifier(n->ctx, &cq->notifier, true, nvme_cq_notifier,
> -                               NULL, NULL);
> +        aio_set_event_notifier(n->ctx, &cq->notifier, true,
> +                               nvme_cq_notifier,
> +                               nvme_cq_notifier_aio_poll,
> +                               nvme_cq_notifier_aio_poll_ready);

There is no reason to set up these polling handlers (since they don't do
anything).

>      } else {
>          event_notifier_set_handler(&cq->notifier, nvme_cq_notifier);
>      }
> @@ -4482,6 +4501,44 @@ static void nvme_sq_notifier(EventNotifier *e)
>      nvme_process_sq(sq);
>  }
>  
> +static void nvme_sq_notifier_aio_poll_begin(EventNotifier *n)
> +{
> +    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +
> +    nvme_update_sq_eventidx(sq);
> +
> +    /* Stop host doorbell writes by stopping eventidx updates */
> +    sq->suppress_db = true;

This doesn't do what you expect it to. By not updating the eventidx it
will fall behind the actual head, causing the host to think that the
device is not processing events (but it is!), resulting in doorbell
ringing.

> +}
> +
> +static bool nvme_sq_notifier_aio_poll(void *opaque)
> +{
> +    EventNotifier *n = opaque;
> +    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +    uint32_t old_tail = sq->tail;
> +
> +    nvme_update_sq_tail(sq);
> +
> +    return sq->tail != old_tail;
> +}
> +
> +static void nvme_sq_notifier_aio_poll_ready(EventNotifier *n)
> +{
> +    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +
> +    nvme_process_sq(sq);
> +}
> +
> +static void nvme_sq_notifier_aio_poll_end(EventNotifier *n)
> +{
> +    NvmeSQueue *sq = container_of(n, NvmeSQueue, notifier);
> +
> +    nvme_update_sq_eventidx(sq);
> +
> +    /* Resume host doorbell writes */
> +    sq->suppress_db = false;
> +}
> +
>  static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
>  {
>      NvmeCtrl *n = sq->ctrl;
> @@ -4494,8 +4551,13 @@ static int nvme_init_sq_ioeventfd(NvmeSQueue *sq)
>      }
>  
>      if (sq->sqid) {
> -        aio_set_event_notifier(n->ctx, &sq->notifier, true, nvme_sq_notifier,
> -                               NULL, NULL);
> +        aio_set_event_notifier(n->ctx, &sq->notifier, true,
> +                               nvme_sq_notifier,
> +                               nvme_sq_notifier_aio_poll,
> +                               nvme_sq_notifier_aio_poll_ready);
> +        aio_set_event_notifier_poll(n->ctx, &sq->notifier,
> +                                    nvme_sq_notifier_aio_poll_begin,
> +                                    nvme_sq_notifier_aio_poll_end);

You can remove the call to aio_set_event_notifier_poll() since the
suppress_db "trick" shouldn't be used anyway.

>      } else {
>          event_notifier_set_handler(&sq->notifier, nvme_sq_notifier);
>      }
> @@ -6530,7 +6592,9 @@ static void nvme_process_sq(void *opaque)
>          }
>  
>          if (n->dbbuf_enabled) {
> -            nvme_update_sq_eventidx(sq);
> +            if (!sq->suppress_db) {
> +                nvme_update_sq_eventidx(sq);
> +            }

Remove this change.

>              nvme_update_sq_tail(sq);
>          }
>      }
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 224b73e6c4..bd486a8e15 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -380,6 +380,7 @@ typedef struct NvmeSQueue {
>      QEMUTimer   *timer;
>      EventNotifier notifier;
>      bool        ioeventfd_enabled;
> +    bool        suppress_db;

Remove this.

>      NvmeRequest *io_req;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
>      QTAILQ_HEAD(, NvmeRequest) out_req_list;
> -- 
> 2.25.1
> 

I tested with the patch modified for the above and I am not seeing any
difference in performance. But without the suppress_db, this seems more
"correct".

-- 
One of us - No more doubt, silence or taboo about mental illness.


* Re: [PATCH v3 3/4] hw/nvme: add iothread support
  2022-08-27  9:12 ` [PATCH v3 3/4] hw/nvme: add iothread support Jinhao Fan
@ 2022-10-20 11:13   ` Klaus Jensen
  2022-11-03  1:51     ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-10-20 11:13 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme


On Aug 27 17:12, Jinhao Fan wrote:
> Add an option "iothread=x" to do emulation in a separate iothread.
> This improves the performance because QEMU's main loop is responsible
> for a lot of other work while the iothread is dedicated to NVMe emulation.
> Moreover, emulating in an iothread brings the potential of polling on
> SQ/CQ doorbells, which I will bring up in a following patch.
> 
> Iothread can be enabled by:
> -object iothread,id=nvme0 \
> -device nvme,iothread=nvme0 \
> 
> Performance comparisons (KIOPS):
> 
> QD         1   4  16  64
> QEMU      41 136 242 338
> iothread  53 155 245 309
> 
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 67 ++++++++++++++++++++++++++++++++++++++++++++------
>  hw/nvme/ns.c   | 21 +++++++++++++---
>  hw/nvme/nvme.h |  6 ++++-
>  3 files changed, 82 insertions(+), 12 deletions(-)
> 

In hw/nvme/ns.c you need to guard the blk_flush and blk_drain calls with
an aio_context_acquire and aio_context_release pair.

diff --git a/hw/nvme/ns.c b/hw/nvme/ns.c
index eb9141a67b5c..dcf889f6d5ce 100644
--- a/hw/nvme/ns.c
+++ b/hw/nvme/ns.c
@@ -520,12 +520,21 @@ int nvme_ns_setup(NvmeNamespace *ns, AioContext *ctx, Error **errp)

 void nvme_ns_drain(NvmeNamespace *ns)
 {
+    AioContext *ctx = blk_get_aio_context(ns->blkconf.blk);
+
+    aio_context_acquire(ctx);
     blk_drain(ns->blkconf.blk);
+    aio_context_release(ctx);
 }

 void nvme_ns_shutdown(NvmeNamespace *ns)
 {
+    AioContext *ctx = blk_get_aio_context(ns->blkconf.blk);
+
+    aio_context_acquire(ctx);
     blk_flush(ns->blkconf.blk);
+    aio_context_release(ctx);
+
     if (ns->params.zoned) {
         nvme_zoned_ns_shutdown(ns);
     }

Otherwise, it all looks fine. I'm still seeing the weird slowdown when
an iothread is enabled. I have yet to figure out why that is... But it
scales! :)


* Re: [PATCH v3 0/4] irqfd, iothread and polling support
  2022-08-27  9:12 [PATCH v3 0/4] irqfd, iothread and polling support Jinhao Fan
                   ` (3 preceding siblings ...)
  2022-08-27  9:12 ` [PATCH v3 4/4] hw/nvme: add polling support Jinhao Fan
@ 2022-10-20 11:16 ` Klaus Jensen
  4 siblings, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2022-10-20 11:16 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha


On Aug 27 17:12, Jinhao Fan wrote:
> This series of performance enhancements brings iothread and polling
> capabilities to qemu-nvme. The first two patches implement support
> for irqfd, which solves thread safety problems for interrupt emulation
> outside the main loop thread. The third patch adds support for
> emulating NVMe IO queues in a dedicated iothread, which avoids
> interference from other devices running on the main loop thread.
> The fourth patch implements SQ and CQ pollers, thus enabling polling
> on SQ and CQ.
> 
> After all these optimizations, our performance becomes similar to
> virtio-blk.
> 
> Comparison (KIOPS):
> 
> QD                    1   4  16  64
> virtio-blk           59 185 260 256
> nvme                 53 155 245 309
> virtio-blk-polling   88 212 210 213
> nvme-polling        123 165 189 191
> 

While reviewing this I noticed that in the commit that introduced
ioeventfd, we ended up disabling batching of cqes (that is, we post an
interrupt for *each* cqe).

I'll post a patch to rectify that. With that, I am seeing progressive
performance improvements with the patches in this series.


* Re: [PATCH v3 3/4] hw/nvme: add iothread support
  2022-10-20 11:13   ` Klaus Jensen
@ 2022-11-03  1:51     ` Jinhao Fan
  2022-11-03 12:11       ` Klaus Jensen
  0 siblings, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-11-03  1:51 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

at 7:13 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> Otherwise, it all looks fine. I'm still seeing the weird slowdown when
> an iothread is enabled. I have yet to figure out why that is... But it
> scales! :)

How much slowdown do you observe on your machine? 




* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-10-20 11:10   ` Klaus Jensen
@ 2022-11-03  2:18     ` Jinhao Fan
  2022-11-03 12:10       ` Klaus Jensen
  0 siblings, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-11-03  2:18 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme

at 7:10 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> This doesn't do what you expect it to. By not updaring the eventidx it
> will fall behind the actual head, causing the host to think that the
> device is not processing events (but it is!), resulting in doorbell
> ringing.

I’m not sure I understand this correctly. 

In 7.13.1 in NVMe Spec 1.4c it says "If updating an entry in the Shadow
Doorbell buffer **changes** the value from being less than or equal to the
value of the corresponding EventIdx buffer entry to being greater than that
value, then the host shall also update the controller's corresponding
doorbell register to match the value of that entry in the Shadow Doorbell
buffer.”

So my understanding is that once the eventidx falls behind the actual head,
the host will only ring the doorbell once but *not* for future submissions.

Is this not what real hosts are doing?



* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-11-03  2:18     ` Jinhao Fan
@ 2022-11-03 12:10       ` Klaus Jensen
  2022-11-03 13:19         ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-11-03 12:10 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme


On Nov  3 10:18, Jinhao Fan wrote:
> at 7:10 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> > This doesn't do what you expect it to. By not updaring the eventidx it
> > will fall behind the actual head, causing the host to think that the
> > device is not processing events (but it is!), resulting in doorbell
> > ringing.
> 
> I’m not sure I understand this correctly. 
> 
> In 7.13.1 in NVMe Spec 1.4c it says "If updating an entry in the Shadow
> Doorbell buffer **changes** the value from being less than or equal to the
> value of the corresponding EventIdx buffer entry to being greater than that
> value, then the host shall also update the controller's corresponding
> doorbell register to match the value of that entry in the Shadow Doorbell
> buffer.”
> 
> So my understanding is that once the eventidx falls behind the actual head,
> the host will only ring the doorbell once but *not* for future submissions.
> 
> Is this not what real hosts are doing?

I agree that the spec is a little unclear on this point. In any case, in
Linux, when the driver has decided that the sq tail must be updated,
it will use this check:

  (new_idx - event_idx - 1) < (new_idx - old)

So it doesn't account for whether or not eventidx was already behind.


* Re: [PATCH v3 3/4] hw/nvme: add iothread support
  2022-11-03  1:51     ` Jinhao Fan
@ 2022-11-03 12:11       ` Klaus Jensen
  2022-11-03 13:10         ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-11-03 12:11 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme


On Nov  3 09:51, Jinhao Fan wrote:
> at 7:13 PM, Klaus Jensen <its@irrelevant.dk> wrote:
> 
> > Otherwise, it all looks fine. I'm still seeing the weird slowdown when
> > an iothread is enabled. I have yet to figure out why that is... But it
> > scales! :)
> 
> How much slowdown do you observe on your machine? 
> 

I'll rerun some experiments and get back to you :)


* Re: [PATCH v3 3/4] hw/nvme: add iothread support
  2022-11-03 12:11       ` Klaus Jensen
@ 2022-11-03 13:10         ` Jinhao Fan
  0 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-11-03 13:10 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

On 11/3/2022 8:11 PM, Klaus Jensen wrote:
> I'll rerun some experiments and get back to you 😄

Thanks Klaus. I'll also run some experiments on my machine, to see if the
patch re-enabling cqe batching solves this problem.




* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-11-03 12:10       ` Klaus Jensen
@ 2022-11-03 13:19         ` Jinhao Fan
  2022-11-04  6:32           ` Klaus Jensen
  0 siblings, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-11-03 13:19 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme

On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> I agree that the spec is a little unclear on this point. In any case, in
> Linux, when the driver has decided that the sq tail must be updated,
> it will use this check:
> 
>    (new_idx - event_idx - 1) < (new_idx - old)

When eventidx is already behind, it's like:

  0
  1 <- event_idx
  2 <- old
  3 <- new_idx
  4
  .
  .
  .

In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) = 
3-2=1, so the host won't update sq tail. Where am I wrong in this example?

> 
> So it doesn't account for if or not eventidx was already behind.




* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-11-03 13:19         ` Jinhao Fan
@ 2022-11-04  6:32           ` Klaus Jensen
  2022-11-08 12:39             ` John Levon
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-11-04  6:32 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme


On Nov  3 21:19, Jinhao Fan wrote:
> On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> > I agree that the spec is a little unclear on this point. In any case, in
> > Linux, when the driver has decided that the sq tail must be updated,
> > it will use this check:
> > 
> >    (new_idx - event_idx - 1) < (new_idx - old)
> 
> When eventidx is already behind, it's like:
> 
>  0
>  1 <- event_idx
>  2 <- old
>  3 <- new_idx
>  4
>  .
>  .
>  .
> 
> In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
> 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
> 

That becomes 1 >= 1, i.e. "true". So this will result in the driver
doing an mmio doorbell write.


* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-11-04  6:32           ` Klaus Jensen
@ 2022-11-08 12:39             ` John Levon
  2022-11-08 14:11               ` Klaus Jensen
  0 siblings, 1 reply; 18+ messages in thread
From: John Levon @ 2022-11-08 12:39 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: Jinhao Fan, qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme

On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:

> On Nov  3 21:19, Jinhao Fan wrote:
> > On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> > > I agree that the spec is a little unclear on this point. In any case, in
> > > Linux, when the driver has decided that the sq tail must be updated,
> > > it will use this check:
> > > 
> > >    (new_idx - event_idx - 1) < (new_idx - old)
> > 
> > When eventidx is already behind, it's like:
> > 
> >  0
> >  1 <- event_idx
> >  2 <- old
> >  3 <- new_idx
> >  4
> >  .
> >  .
> >  .
> > 
> > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
> > 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
> 
> That becomes 1 >= 1, i.e. "true". So this will result in the driver
> doing an mmio doorbell write.

The code is:

static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)         
{                                                                                
        return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);            
}                                                                                

which per the above is "return 1 < 1;", or false. So the above case does *not*
do an mmio write. No?

regards
john



* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-11-08 12:39             ` John Levon
@ 2022-11-08 14:11               ` Klaus Jensen
  2022-11-09  4:35                 ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-11-08 14:11 UTC (permalink / raw)
  To: John Levon
  Cc: Jinhao Fan, qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme


On Nov  8 12:39, John Levon wrote:
> On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:
> 
> > On Nov  3 21:19, Jinhao Fan wrote:
> > > On 11/3/2022 8:10 PM, Klaus Jensen wrote:
> > > > I agree that the spec is a little unclear on this point. In any case, in
> > > > Linux, when the driver has decided that the sq tail must be updated,
> > > > it will use this check:
> > > > 
> > > >    (new_idx - event_idx - 1) < (new_idx - old)
> > > 
> > > When eventidx is already behind, it's like:
> > > 
> > >  0
> > >  1 <- event_idx
> > >  2 <- old
> > >  3 <- new_idx
> > >  4
> > >  .
> > >  .
> > >  .
> > > 
> > > In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
> > > 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
> > 
> > That becomes 1 >= 1, i.e. "true". So this will result in the driver
> > doing an mmio doorbell write.
> 
> The code is:
> 
> static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)         
> {                                                                                
>         return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);            
> }                                                                                
> 
> which per the above is "return 1 < 1;", or false. So the above case does *not*
> do an mmio write. No?
> 

Whelp.

Looks like I'm in the wrong here, apologies!
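
For reference, the u16 arithmetic is easy to check with a small
standalone snippet (a hypothetical test harness, not part of this
series; it just re-declares the driver helper quoted above):

    #include <stdint.h>
    #include <stdio.h>

    /* the check quoted from the Linux driver, with u16 spelled out */
    static int nvme_dbbuf_need_event(uint16_t event_idx, uint16_t new_idx,
                                     uint16_t old)
    {
        return (uint16_t)(new_idx - event_idx - 1) <
               (uint16_t)(new_idx - old);
    }

    int main(void)
    {
        /* the example above: event_idx = 1, old = 2, new_idx = 3 */
        printf("%d\n", nvme_dbbuf_need_event(1, 3, 2)); /* 0: no mmio write */
        return 0;
    }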


* Re: [PATCH v3 4/4] hw/nvme: add polling support
  2022-11-08 14:11               ` Klaus Jensen
@ 2022-11-09  4:35                 ` Jinhao Fan
  0 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-11-09  4:35 UTC (permalink / raw)
  To: Klaus Jensen
  Cc: John Levon, qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme

at 10:11 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> On Nov  8 12:39, John Levon wrote:
>> On Fri, Nov 04, 2022 at 07:32:12AM +0100, Klaus Jensen wrote:
>> 
>>> On Nov  3 21:19, Jinhao Fan wrote:
>>>> On 11/3/2022 8:10 PM, Klaus Jensen wrote:
>>>>> I agree that the spec is a little unclear on this point. In any case, in
>>>>> Linux, when the driver has decided that the sq tail must be updated,
>>>>> it will use this check:
>>>>> 
>>>>>   (new_idx - event_idx - 1) < (new_idx - old)
>>>> 
>>>> When eventidx is already behind, it's like:
>>>> 
>>>> 0
>>>> 1 <- event_idx
>>>> 2 <- old
>>>> 3 <- new_idx
>>>> 4
>>>> .
>>>> .
>>>> .
>>>> 
>>>> In this case, (new_idx - event_idx - 1) = 3-1-1 = 1 >= (new_idx - old) =
>>>> 3-2=1, so the host won't update sq tail. Where am I wrong in this example?
>>> 
>>> That becomes 1 >= 1, i.e. "true". So this will result in the driver
>>> doing an mmio doorbell write.
>> 
>> The code is:
>> 
>> static inline int nvme_dbbuf_need_event(u16 event_idx, u16 new_idx, u16 old)         
>> {                                                                                
>>        return (u16)(new_idx - event_idx - 1) < (u16)(new_idx - old);            
>> }                                                                                
>> 
>> which per the above is "return 1 < 1;", or false. So the above case does *not*
>> do an mmio write. No?
> 
> Whelp.
> 
> Looks like I'm in the wrong here, apologies!

So disabling eventidx updates during polling has the potential to reduce
doorbell writes. But as Klaus observed, removing this logic does not make
a difference in performance. So I guess only one command is processed
during each polling iteration.


