* [PATCH 0/4] hw/nvme: add irqfd support
@ 2022-08-11 15:37 Jinhao Fan
  2022-08-11 15:37 ` [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions Jinhao Fan
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-11 15:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan

This patch series changes qemu-nvme's interrupt emulation to use event
notifiers, which ensures thread-safe interrupt delivery when an iothread
is used. In the first two patches, I convert qemu-nvme's IO emulation
logic to send irqs via eventfd, so that the actual assertion and
deassertion are always done in the main loop thread with the BQL held. In
the third patch, support is added for sending irqs via KVM irqfd,
bypassing QEMU's MSI-X emulation. In the last patch, I add MSI-X mask
handlers for when irqfd is enabled, so that qemu-nvme knows which vectors
are masked even though QEMU's MSI-X emulation is bypassed.

Jinhao Fan (4):
  hw/nvme: avoid unnecessary call to irq (de)assertion functions
  hw/nvme: add option to (de)assert irq with eventfd
  hw/nvme: use irqfd to send interrupts
  hw/nvme: add MSI-x mask handlers for irqfd

 hw/nvme/ctrl.c       | 251 +++++++++++++++++++++++++++++++++++++++----
 hw/nvme/nvme.h       |   7 ++
 hw/nvme/trace-events |   3 +
 3 files changed, 243 insertions(+), 18 deletions(-)

-- 
2.25.1



^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions
  2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
@ 2022-08-11 15:37 ` Jinhao Fan
  2022-08-16 15:24   ` Stefan Hajnoczi
  2022-08-11 15:37 ` [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd Jinhao Fan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-08-11 15:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme

nvme_irq_assert() only does useful work when cq->irq_enabled is true.
nvme_irq_deassert() only works for pin-based interrupts. Avoid calls
into these functions if we are sure they will not do useful work.

This will be most useful when we use eventfd to send interrupts, since
it lets us avoid the unnecessary overhead of signalling the eventfd.

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c | 40 ++++++++++++++++++++++------------------
 1 file changed, 22 insertions(+), 18 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 87aeba0564..bd3350d7e0 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1377,11 +1377,13 @@ static void nvme_post_cqes(void *opaque)
         QTAILQ_INSERT_TAIL(&sq->req_list, req, entry);
     }
     if (cq->tail != cq->head) {
-        if (cq->irq_enabled && !pending) {
-            n->cq_pending++;
-        }
+        if (cq->irq_enabled) {
+            if (!pending) {
+                n->cq_pending++;
+            }
 
-        nvme_irq_assert(n, cq);
+            nvme_irq_assert(n, cq);
+        }
     }
 }
 
@@ -4244,12 +4246,11 @@ static void nvme_cq_notifier(EventNotifier *e)
 
     nvme_update_cq_head(cq);
 
-    if (cq->tail == cq->head) {
-        if (cq->irq_enabled) {
-            n->cq_pending--;
+    if (cq->irq_enabled && cq->tail == cq->head) {
+        n->cq_pending--;
+        if (!msix_enabled(&n->parent_obj)) {
+            nvme_irq_deassert(n, cq);
         }
-
-        nvme_irq_deassert(n, cq);
     }
 
     nvme_post_cqes(cq);
@@ -4730,11 +4731,15 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
         return NVME_INVALID_QUEUE_DEL;
     }
 
-    if (cq->irq_enabled && cq->tail != cq->head) {
-        n->cq_pending--;
-    }
+    if (cq->irq_enabled) {
+        if (cq->tail != cq->head) {
+            n->cq_pending--;
+        }
 
-    nvme_irq_deassert(n, cq);
+        if (!msix_enabled(&n->parent_obj)) {
+            nvme_irq_deassert(n, cq);
+        }
+    }
     trace_pci_nvme_del_cq(qid);
     nvme_free_cq(cq, n);
     return NVME_SUCCESS;
@@ -6918,12 +6923,11 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
             timer_mod(cq->timer, qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + 500);
         }
 
-        if (cq->tail == cq->head) {
-            if (cq->irq_enabled) {
-                n->cq_pending--;
+        if (cq->irq_enabled && cq->tail == cq->head) {
+            n->cq_pending--;
+            if (!msix_enabled(&n->parent_obj)) {
+                nvme_irq_deassert(n, cq);
             }
-
-            nvme_irq_deassert(n, cq);
         }
     } else {
         /* Submission queue doorbell write */
-- 
2.25.1




* [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd
  2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
  2022-08-11 15:37 ` [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions Jinhao Fan
@ 2022-08-11 15:37 ` Jinhao Fan
  2022-08-16 11:20   ` Klaus Jensen
  2022-08-23 10:58   ` Klaus Jensen
  2022-08-11 15:37 ` [PATCH 3/4] hw/nvme: use irqfd to send interrupts Jinhao Fan
                   ` (3 subsequent siblings)
  5 siblings, 2 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-11 15:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme

When the new option 'irq-eventfd' is turned on, the IO emulation code
signals an eventfd when it wants to (de)assert an irq. The main loop
eventfd handler does the actual irq (de)assertion. This paves the way
for iothread support, since QEMU's interrupt emulation is not thread
safe.

Asserting and deasserting irqs with eventfd has some performance
implications. For small queue depths it increases request latency, but
for large queue depths it effectively coalesces irqs.

Comparison (KIOPS):

QD            1   4  16  64
QEMU         38 123 210 329
irq-eventfd  32 106 240 364

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/nvme/nvme.h |  4 +++
 2 files changed, 90 insertions(+), 3 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index bd3350d7e0..8a1c5ce3e1 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -1338,6 +1338,54 @@ static void nvme_update_cq_head(NvmeCQueue *cq)
     trace_pci_nvme_shadow_doorbell_cq(cq->cqid, cq->head);
 }
 
+static void nvme_assert_notifier_read(EventNotifier *e)
+{
+    NvmeCQueue *cq = container_of(e, NvmeCQueue, assert_notifier);
+    if (event_notifier_test_and_clear(e)) {
+        nvme_irq_assert(cq->ctrl, cq);
+    }
+}
+
+static void nvme_deassert_notifier_read(EventNotifier *e)
+{
+    NvmeCQueue *cq = container_of(e, NvmeCQueue, deassert_notifier);
+    if (event_notifier_test_and_clear(e)) {
+        nvme_irq_deassert(cq->ctrl, cq);
+    }
+}
+
+static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
+{
+    int ret;
+
+    ret = event_notifier_init(&cq->assert_notifier, 0);
+    if (ret < 0) {
+        goto fail_assert_handler;
+    }
+
+    event_notifier_set_handler(&cq->assert_notifier,
+                                nvme_assert_notifier_read);
+
+    if (!msix_enabled(&n->parent_obj)) {
+        ret = event_notifier_init(&cq->deassert_notifier, 0);
+        if (ret < 0) {
+            goto fail_deassert_handler;
+        }
+
+        event_notifier_set_handler(&cq->deassert_notifier,
+                                   nvme_deassert_notifier_read);
+    }
+
+    return;
+
+fail_deassert_handler:
+    event_notifier_set_handler(&cq->deassert_notifier, NULL);
+    event_notifier_cleanup(&cq->deassert_notifier);
+fail_assert_handler:
+    event_notifier_set_handler(&cq->assert_notifier, NULL);
+    event_notifier_cleanup(&cq->assert_notifier);
+}
+
 static void nvme_post_cqes(void *opaque)
 {
     NvmeCQueue *cq = opaque;
@@ -1382,7 +1430,23 @@ static void nvme_post_cqes(void *opaque)
                 n->cq_pending++;
             }
 
-            nvme_irq_assert(n, cq);
+            if (unlikely(cq->first_io_cqe)) {
+                /*
+                 * Initialize event notifier when first cqe is posted. For irqfd
+                 * support we need to register the MSI message in KVM. We
+                 * can not do this registration at CQ creation time because
+                 * Linux's NVMe driver changes the MSI message after CQ creation.
+                 */
+                cq->first_io_cqe = false;
+
+                nvme_init_irq_notifier(n, cq);
+            }
+
+            if (cq->assert_notifier.initialized) {
+                event_notifier_set(&cq->assert_notifier);
+            } else {
+                nvme_irq_assert(n, cq);
+            }
         }
     }
 }
@@ -4249,7 +4313,11 @@ static void nvme_cq_notifier(EventNotifier *e)
     if (cq->irq_enabled && cq->tail == cq->head) {
         n->cq_pending--;
         if (!msix_enabled(&n->parent_obj)) {
-            nvme_irq_deassert(n, cq);
+            if (cq->deassert_notifier.initialized) {
+                event_notifier_set(&cq->deassert_notifier);
+            } else {
+                nvme_irq_deassert(n, cq);
+            }
         }
     }
 
@@ -4706,6 +4774,14 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
         event_notifier_set_handler(&cq->notifier, NULL);
         event_notifier_cleanup(&cq->notifier);
     }
+    if (cq->assert_notifier.initialized) {
+        event_notifier_set_handler(&cq->assert_notifier, NULL);
+        event_notifier_cleanup(&cq->assert_notifier);
+    }
+    if (cq->deassert_notifier.initialized) {
+        event_notifier_set_handler(&cq->deassert_notifier, NULL);
+        event_notifier_cleanup(&cq->deassert_notifier);
+    }
     if (msix_enabled(&n->parent_obj)) {
         msix_vector_unuse(&n->parent_obj, cq->vector);
     }
@@ -4737,6 +4813,7 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
         }
 
         if (!msix_enabled(&n->parent_obj)) {
+            /* Do not use eventfd since this is always called in main loop */
             nvme_irq_deassert(n, cq);
         }
     }
@@ -4777,6 +4854,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
     }
     n->cq[cqid] = cq;
     cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
+    cq->first_io_cqe = cqid != 0;
 }
 
 static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
@@ -6926,7 +7004,11 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
         if (cq->irq_enabled && cq->tail == cq->head) {
             n->cq_pending--;
             if (!msix_enabled(&n->parent_obj)) {
-                nvme_irq_deassert(n, cq);
+                if (cq->deassert_notifier.initialized) {
+                    event_notifier_set(&cq->deassert_notifier);
+                } else {
+                    nvme_irq_deassert(n, cq);
+                }
             }
         }
     } else {
@@ -7675,6 +7757,7 @@ static Property nvme_props[] = {
     DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
     DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
     DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
+    DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),
     DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
     DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
                      params.auto_transition_zones, true),
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 79f5c281c2..759d0ecd7c 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -398,6 +398,9 @@ typedef struct NvmeCQueue {
     uint64_t    ei_addr;
     QEMUTimer   *timer;
     EventNotifier notifier;
+    EventNotifier assert_notifier;
+    EventNotifier deassert_notifier;
+    bool        first_io_cqe;
     bool        ioeventfd_enabled;
     QTAILQ_HEAD(, NvmeSQueue) sq_list;
     QTAILQ_HEAD(, NvmeRequest) req_list;
@@ -422,6 +425,7 @@ typedef struct NvmeParams {
     bool     auto_transition_zones;
     bool     legacy_cmb;
     bool     ioeventfd;
+    bool     irq_eventfd;
     uint8_t  sriov_max_vfs;
     uint16_t sriov_vq_flexible;
     uint16_t sriov_vi_flexible;
-- 
2.25.1




* [PATCH 3/4] hw/nvme: use irqfd to send interrupts
  2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
  2022-08-11 15:37 ` [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions Jinhao Fan
  2022-08-11 15:37 ` [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd Jinhao Fan
@ 2022-08-11 15:37 ` Jinhao Fan
  2022-08-11 15:37 ` [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd Jinhao Fan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-11 15:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme

Use KVM's irqfd to send interrupts when possible. This approach is
thread safe. Moreover, it does not have the inter-thread communication
overhead of plain event notifiers, since KVM injects the interrupt in
the context of the same system call that writes the irqfd.

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++--
 hw/nvme/nvme.h |  1 +
 2 files changed, 49 insertions(+), 2 deletions(-)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 8a1c5ce3e1..63f988f2f9 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -192,6 +192,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/sysemu.h"
+#include "sysemu/kvm.h"
 #include "sysemu/block-backend.h"
 #include "sysemu/hostmem.h"
 #include "hw/pci/msix.h"
@@ -1354,8 +1355,26 @@ static void nvme_deassert_notifier_read(EventNotifier *e)
     }
 }
 
+static int nvme_kvm_msix_vector_use(NvmeCtrl *n,
+                                    NvmeCQueue *cq,
+                                    uint32_t vector)
+{
+    int ret;
+
+    KVMRouteChange c = kvm_irqchip_begin_route_changes(kvm_state);
+    ret = kvm_irqchip_add_msi_route(&c, vector, &n->parent_obj);
+    if (ret < 0) {
+        return ret;
+    }
+    kvm_irqchip_commit_route_changes(&c);
+    cq->virq = ret;
+    return 0;
+}
+
 static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
 {
+    bool with_irqfd = msix_enabled(&n->parent_obj) &&
+                      kvm_msi_via_irqfd_enabled();
     int ret;
 
     ret = event_notifier_init(&cq->assert_notifier, 0);
@@ -1363,8 +1382,21 @@ static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
         goto fail_assert_handler;
     }
 
-    event_notifier_set_handler(&cq->assert_notifier,
-                                nvme_assert_notifier_read);
+    if (with_irqfd) {
+        ret = nvme_kvm_msix_vector_use(n, cq, cq->vector);
+        if (ret < 0) {
+            goto fail_assert_handler;
+        }
+        ret = kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                                 &cq->assert_notifier, NULL, 
+                                                 cq->virq);
+        if (ret < 0) {
+            goto fail_kvm;
+        }
+    } else {
+        event_notifier_set_handler(&cq->assert_notifier,
+                                   nvme_assert_notifier_read);
+    }
 
     if (!msix_enabled(&n->parent_obj)) {
         ret = event_notifier_init(&cq->deassert_notifier, 0);
@@ -1381,6 +1413,12 @@ static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
 fail_deassert_handler:
     event_notifier_set_handler(&cq->deassert_notifier, NULL);
     event_notifier_cleanup(&cq->deassert_notifier);
+    if (with_irqfd) {
+        kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &cq->assert_notifier,
+                                              cq->virq);
+fail_kvm:
+        kvm_irqchip_release_virq(kvm_state, cq->virq);
+    }
 fail_assert_handler:
     event_notifier_set_handler(&cq->assert_notifier, NULL);
     event_notifier_cleanup(&cq->assert_notifier);
@@ -4764,6 +4802,8 @@ static uint16_t nvme_get_log(NvmeCtrl *n, NvmeRequest *req)
 
 static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
 {
+    bool with_irqfd = msix_enabled(&n->parent_obj) &&
+                      kvm_msi_via_irqfd_enabled();
     uint16_t offset = (cq->cqid << 3) + (1 << 2);
 
     n->cq[cq->cqid] = NULL;
@@ -4775,6 +4815,12 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
         event_notifier_cleanup(&cq->notifier);
     }
     if (cq->assert_notifier.initialized) {
+        if (with_irqfd) {
+            kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+                                                  &cq->assert_notifier, 
+                                                  cq->virq);
+            kvm_irqchip_release_virq(kvm_state, cq->virq);
+        }
         event_notifier_set_handler(&cq->assert_notifier, NULL);
         event_notifier_cleanup(&cq->assert_notifier);
     }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 759d0ecd7c..85fd9cd0e2 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -396,6 +396,7 @@ typedef struct NvmeCQueue {
     uint64_t    dma_addr;
     uint64_t    db_addr;
     uint64_t    ei_addr;
+    int         virq;
     QEMUTimer   *timer;
     EventNotifier notifier;
     EventNotifier assert_notifier;
-- 
2.25.1




* [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
                   ` (2 preceding siblings ...)
  2022-08-11 15:37 ` [PATCH 3/4] hw/nvme: use irqfd to send interrupts Jinhao Fan
@ 2022-08-11 15:37 ` Jinhao Fan
  2022-08-16 10:46   ` Klaus Jensen
  2022-08-23 11:04   ` Klaus Jensen
  2022-08-16  1:54 ` [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
  2022-08-24 20:15 ` Klaus Jensen
  5 siblings, 2 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-11 15:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, kbusch, stefanha, Jinhao Fan, open list:nvme

When irqfd is enabled, we bypass QEMU's irq emulation and let KVM
assert the irq directly. However, KVM is not aware of the device's MSI-X
masking status. Add MSI-X mask bookkeeping to the NVMe emulation and
detach the corresponding irqfd when a vector is masked.

Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
---
 hw/nvme/ctrl.c       | 82 ++++++++++++++++++++++++++++++++++++++++++++
 hw/nvme/nvme.h       |  2 ++
 hw/nvme/trace-events |  3 ++
 3 files changed, 87 insertions(+)

diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
index 63f988f2f9..ac5460c7c8 100644
--- a/hw/nvme/ctrl.c
+++ b/hw/nvme/ctrl.c
@@ -7478,10 +7478,84 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
 
     return 0;
 }
+static int nvme_vector_unmask(PCIDevice *pci_dev, unsigned vector,
+                               MSIMessage msg)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+    int ret;
+
+    trace_pci_nvme_irq_unmask(vector, msg.address, msg.data);
+    
+    for (uint32_t i = 0; i < n->params.max_ioqpairs + 1; i++) {
+        NvmeCQueue *cq = n->cq[i];
+        /* 
+         * If this function is called, then irqfd must be available. Therefore,
+         * irqfd must be in use if cq->assert_notifier.initialized is true.
+         */
+        if (cq && cq->vector == vector && cq->assert_notifier.initialized) {
+            if (cq->msg.data != msg.data || cq->msg.address != msg.address) {
+                ret = kvm_irqchip_update_msi_route(kvm_state, cq->virq, msg,
+                                                   pci_dev);
+                if (ret < 0) {
+                    return ret;
+                }
+                kvm_irqchip_commit_routes(kvm_state);
+                cq->msg = msg;
+            }
+
+            ret = kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                                     &cq->assert_notifier,
+                                                     NULL, cq->virq);
+            if (ret < 0) {
+                return ret;
+            }
+        }
+    }
+
+    return 0;
+}
+
+static void nvme_vector_mask(PCIDevice *pci_dev, unsigned vector)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+
+    trace_pci_nvme_irq_mask(vector);
+    
+    for (uint32_t i = 0; i < n->params.max_ioqpairs + 1; i++) {
+        NvmeCQueue *cq = n->cq[i];
+        if (cq && cq->vector == vector && cq->assert_notifier.initialized) {
+            kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
+                                                  &cq->assert_notifier,
+                                                  cq->virq);
+        }
+    }
+}
+
+static void nvme_vector_poll(PCIDevice *pci_dev,
+                             unsigned int vector_start,
+                             unsigned int vector_end)
+{
+    NvmeCtrl *n = NVME(pci_dev);
+
+    trace_pci_nvme_irq_poll(vector_start, vector_end);
+
+    for (uint32_t i = 0; i < n->params.max_ioqpairs + 1; i++) {
+        NvmeCQueue *cq = n->cq[i];
+        if (cq && cq->vector >= vector_start && cq->vector <= vector_end 
+            && msix_is_masked(pci_dev, cq->vector) 
+            && cq->assert_notifier.initialized) {
+            if (event_notifier_test_and_clear(&cq->assert_notifier)) {
+                msix_set_pending(pci_dev, cq->vector);
+            }
+        }
+    }
+}
 
 static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
 {
     uint8_t *pci_conf = pci_dev->config;
+    bool with_irqfd = msix_enabled(&n->parent_obj) &&
+                      kvm_msi_via_irqfd_enabled();
     uint64_t bar_size;
     unsigned msix_table_offset, msix_pba_offset;
     int ret;
@@ -7534,6 +7608,13 @@ static int nvme_init_pci(NvmeCtrl *n, PCIDevice *pci_dev, Error **errp)
         }
     }
 
+    if (with_irqfd) {
+        msix_set_vector_notifiers(pci_dev,
+                                  nvme_vector_unmask,
+                                  nvme_vector_mask,
+                                  nvme_vector_poll);
+    }
+
     nvme_update_msixcap_ts(pci_dev, n->conf_msix_qsize);
 
     if (n->params.cmb_size_mb) {
@@ -7781,6 +7862,7 @@ static void nvme_exit(PCIDevice *pci_dev)
         pcie_sriov_pf_exit(pci_dev);
     }
 
+    msix_unset_vector_notifiers(pci_dev);
     msix_uninit(pci_dev, &n->bar0, &n->bar0);
     memory_region_del_subregion(&n->bar0, &n->iomem);
 }
diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
index 85fd9cd0e2..707a55ebfc 100644
--- a/hw/nvme/nvme.h
+++ b/hw/nvme/nvme.h
@@ -20,6 +20,7 @@
 
 #include "qemu/uuid.h"
 #include "hw/pci/pci.h"
+#include "hw/pci/msi.h"
 #include "hw/block/block.h"
 
 #include "block/nvme.h"
@@ -401,6 +402,7 @@ typedef struct NvmeCQueue {
     EventNotifier notifier;
     EventNotifier assert_notifier;
     EventNotifier deassert_notifier;
+    MSIMessage  msg;
     bool        first_io_cqe;
     bool        ioeventfd_enabled;
     QTAILQ_HEAD(, NvmeSQueue) sq_list;
diff --git a/hw/nvme/trace-events b/hw/nvme/trace-events
index fccb79f489..b11fcf4a65 100644
--- a/hw/nvme/trace-events
+++ b/hw/nvme/trace-events
@@ -2,6 +2,9 @@
 pci_nvme_irq_msix(uint32_t vector) "raising MSI-X IRQ vector %u"
 pci_nvme_irq_pin(void) "pulsing IRQ pin"
 pci_nvme_irq_masked(void) "IRQ is masked"
+pci_nvme_irq_mask(uint32_t vector) "IRQ %u gets masked"
+pci_nvme_irq_unmask(uint32_t vector, uint64_t addr, uint32_t data) "IRQ %u gets unmasked, addr=0x%"PRIx64" data=0x%"PRIx32""
+pci_nvme_irq_poll(uint32_t vector_start, uint32_t vector_end) "IRQ poll, start=%"PRIu32" end=%"PRIu32""
 pci_nvme_dma_read(uint64_t prp1, uint64_t prp2) "DMA read, prp1=0x%"PRIx64" prp2=0x%"PRIx64""
 pci_nvme_dbbuf_config(uint64_t dbs_addr, uint64_t eis_addr) "dbs_addr=0x%"PRIx64" eis_addr=0x%"PRIx64""
 pci_nvme_map_addr(uint64_t addr, uint64_t len) "addr 0x%"PRIx64" len %"PRIu64""
-- 
2.25.1




* Re: [PATCH 0/4] hw/nvme: add irqfd support
  2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
                   ` (3 preceding siblings ...)
  2022-08-11 15:37 ` [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd Jinhao Fan
@ 2022-08-16  1:54 ` Jinhao Fan
  2022-08-24 20:15 ` Klaus Jensen
  5 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-16  1:54 UTC (permalink / raw)
  To: qemu-devel; +Cc: its, Keith Busch, stefanha

at 11:37 PM, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:

> This patch series changes qemu-nvme's interrupt emulation to use event
> notifiers, which ensures thread-safe interrupt delivery when an iothread
> is used. In the first two patches, I convert qemu-nvme's IO emulation
> logic to send irqs via eventfd, so that the actual assertion and
> deassertion are always done in the main loop thread with the BQL held. In
> the third patch, support is added for sending irqs via KVM irqfd,
> bypassing QEMU's MSI-X emulation. In the last patch, I add MSI-X mask
> handlers for when irqfd is enabled, so that qemu-nvme knows which vectors
> are masked even though QEMU's MSI-X emulation is bypassed.
> 
> Jinhao Fan (4):
>  hw/nvme: avoid unnecessary call to irq (de)assertion functions
>  hw/nvme: add option to (de)assert irq with eventfd
>  hw/nvme: use irqfd to send interrupts
>  hw/nvme: add MSI-x mask handlers for irqfd
> 
> hw/nvme/ctrl.c       | 251 +++++++++++++++++++++++++++++++++++++++----
> hw/nvme/nvme.h       |   7 ++
> hw/nvme/trace-events |   3 +
> 3 files changed, 243 insertions(+), 18 deletions(-)
> 
> -- 
> 2.25.1

Ping~

Thanks!




* Re: [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-11 15:37 ` [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd Jinhao Fan
@ 2022-08-16 10:46   ` Klaus Jensen
  2022-08-17  5:35     ` Jinhao Fan
  2022-08-23 14:43     ` Jinhao Fan
  2022-08-23 11:04   ` Klaus Jensen
  1 sibling, 2 replies; 18+ messages in thread
From: Klaus Jensen @ 2022-08-16 10:46 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme


On Aug 11 23:37, Jinhao Fan wrote:
> When irqfd is enabled, we bypass QEMU's irq emulation and let KVM
> assert the irq directly. However, KVM is not aware of the device's MSI-X
> masking status. Add MSI-X mask bookkeeping to the NVMe emulation and
> detach the corresponding irqfd when a vector is masked.
> 

Did qtest work out for you for testing? If so, it would be nice to add a
simple test case as well.



* Re: [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd
  2022-08-11 15:37 ` [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd Jinhao Fan
@ 2022-08-16 11:20   ` Klaus Jensen
  2022-08-17  5:36     ` Jinhao Fan
  2022-08-23 10:58   ` Klaus Jensen
  1 sibling, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-08-16 11:20 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme


On Aug 11 23:37, Jinhao Fan wrote:
> When the new option 'irq-eventfd' is turned on, the IO emulation code
> signals an eventfd when it wants to (de)assert an irq. The main loop
> eventfd handler does the actual irq (de)assertion. This paves the way
> for iothread support, since QEMU's interrupt emulation is not thread
> safe.
> 
> Asserting and deasserting irqs with eventfd has some performance
> implications. For small queue depths it increases request latency, but
> for large queue depths it effectively coalesces irqs.
> 
> Comparison (KIOPS):
> 
> QD            1   4  16  64
> QEMU         38 123 210 329
> irq-eventfd  32 106 240 364
> 
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h |  4 +++
>  2 files changed, 90 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index bd3350d7e0..8a1c5ce3e1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -7675,6 +7757,7 @@ static Property nvme_props[] = {
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
>      DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
>      DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
> +    DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),

This option does not seem to change anything - the value is never used
;)



* Re: [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions
  2022-08-11 15:37 ` [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions Jinhao Fan
@ 2022-08-16 15:24   ` Stefan Hajnoczi
  2022-08-17  5:42     ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Stefan Hajnoczi @ 2022-08-16 15:24 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, its, kbusch, open list:nvme

On Thu, 11 Aug 2022 at 11:38, Jinhao Fan <fanjinhao21s@ict.ac.cn> wrote:
>
> nvme_irq_assert() only does useful work when cq->irq_enabled is true.
> nvme_irq_deassert() only works for pin-based interrupts. Avoid calls
> into these functions if we are sure they will not do useful work.
>
> This will be most useful when we use eventfd to send interrupts, since
> it lets us avoid the unnecessary overhead of signalling the eventfd.
>
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 40 ++++++++++++++++++++++------------------
>  1 file changed, 22 insertions(+), 18 deletions(-)

There is code duplication and nvme_irq_assert/deassert() check
->irq_enabled and msix_enabled() again.

Can the logic be moved into assert()/deassert() so callers don't need
to duplicate the checks?

(I assume the optimization is that eventfd syscalls are avoided, not
that the function call is avoided.)

Stefan



* Re: [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-16 10:46   ` Klaus Jensen
@ 2022-08-17  5:35     ` Jinhao Fan
  2022-08-23 14:43     ` Jinhao Fan
  1 sibling, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-17  5:35 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, Keith Busch, Stefan Hajnoczi, open list:nvme

at 6:46 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> 
> Did qtest work out for you for testing? If so, it would be nice to add a
> simple test case as well.

I found MSI-X masking harder to test than we imagined. My plan is to only
emulate IO queues in the iothread and leave admin queue emulation in the
main loop, since some admin commands require the BQL. So I didn’t enable
irqfd on admin queues. Therefore we can only test MSI-X masking on IO
queues. This makes qtest complicated, since we need to handle IO queue
creation.

But I’m not sure my understanding is correct. Is it true that the admin
queue does not need irqfd as long as it runs in the main loop thread?



* Re: [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd
  2022-08-16 11:20   ` Klaus Jensen
@ 2022-08-17  5:36     ` Jinhao Fan
  0 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-17  5:36 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

at 7:20 PM, Klaus Jensen <its@irrelevant.dk> wrote:

> This option does not seem to change anything - the value is never used
> ;)

What a stupid mistake. I’ll fix this in the next version.



* Re: [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions
  2022-08-16 15:24   ` Stefan Hajnoczi
@ 2022-08-17  5:42     ` Jinhao Fan
  0 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-17  5:42 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, its, kbusch, open list:nvme

at 11:24 PM, Stefan Hajnoczi <stefanha@gmail.com> wrote:

> Can the logic be moved into assert()/deassert() so callers don't need
> to duplicate the checks?
> 
> (I assume the optimization is that eventfd syscalls are avoided, not
> that the function call is avoided.)

I guess I can move the eventfd syscall into assert()/deassert().




* Re: [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd
  2022-08-11 15:37 ` [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd Jinhao Fan
  2022-08-16 11:20   ` Klaus Jensen
@ 2022-08-23 10:58   ` Klaus Jensen
  1 sibling, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2022-08-23 10:58 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

[-- Attachment #1: Type: text/plain, Size: 8333 bytes --]

On Aug 11 23:37, Jinhao Fan wrote:
> When the new option 'irq-eventfd' is turned on, the IO emulation code
> signals an eventfd when it wants to (de)assert an irq. The main loop
> eventfd handler does the actual irq (de)assertion.  This paves the way
> for iothread support since QEMU's interrupt emulation is not thread
> safe.
> 
> Asserting and deasserting irq with eventfd has some performance
> implications. For small queue depth it increases request latency but
> for large queue depth it effectively coalesces irqs.
> 
> Comparison (KIOPS):
> 
> QD            1   4  16  64
> QEMU         38 123 210 329
> irq-eventfd  32 106 240 364
> 
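The coalescing effect claimed here falls out of eventfd semantics themselves:
each signal increments a kernel-side counter, and a single read drains the
whole count. A minimal standalone sketch (using raw eventfd(2) rather than
QEMU's EventNotifier wrapper, so the `notify`/`drain` names are purely
illustrative):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal the notifier: each call adds 1 to the kernel-side counter,
 * the way event_notifier_set() would. */
static void notify(int efd)
{
    uint64_t one = 1;
    if (write(efd, &one, sizeof(one)) != sizeof(one)) {
        /* error handling omitted in this sketch */
    }
}

/* Drain the notifier: one read returns the accumulated count and
 * resets it to zero, so N signals posted before the main loop runs
 * collapse into a single handler invocation. */
static uint64_t drain(int efd)
{
    uint64_t count = 0;
    if (read(efd, &count, sizeof(count)) != sizeof(count)) {
        return 0;
    }
    return count;
}
```

At high queue depth many completions land between main-loop iterations and the
handler fires once for the whole batch; at low queue depth each request pays
the extra eventfd round-trip, which matches the latency/throughput trade-off
in the table above.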
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++--
>  hw/nvme/nvme.h |  4 +++
>  2 files changed, 90 insertions(+), 3 deletions(-)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index bd3350d7e0..8a1c5ce3e1 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -1338,6 +1338,54 @@ static void nvme_update_cq_head(NvmeCQueue *cq)
>      trace_pci_nvme_shadow_doorbell_cq(cq->cqid, cq->head);
>  }
>  
> +static void nvme_assert_notifier_read(EventNotifier *e)
> +{
> +    NvmeCQueue *cq = container_of(e, NvmeCQueue, assert_notifier);
> +    if (event_notifier_test_and_clear(e)) {
> +        nvme_irq_assert(cq->ctrl, cq);
> +    }
> +}
> +
> +static void nvme_deassert_notifier_read(EventNotifier *e)
> +{
> +    NvmeCQueue *cq = container_of(e, NvmeCQueue, deassert_notifier);
> +    if (event_notifier_test_and_clear(e)) {
> +        nvme_irq_deassert(cq->ctrl, cq);
> +    }
> +}
> +
> +static void nvme_init_irq_notifier(NvmeCtrl *n, NvmeCQueue *cq)
> +{
> +    int ret;
> +
> +    ret = event_notifier_init(&cq->assert_notifier, 0);
> +    if (ret < 0) {
> +        goto fail_assert_handler;
> +    }
> +
> +    event_notifier_set_handler(&cq->assert_notifier,
> +                                nvme_assert_notifier_read);
> +
> +    if (!msix_enabled(&n->parent_obj)) {
> +        ret = event_notifier_init(&cq->deassert_notifier, 0);
> +        if (ret < 0) {
> +            goto fail_deassert_handler;
> +        }
> +
> +        event_notifier_set_handler(&cq->deassert_notifier,
> +                                   nvme_deassert_notifier_read);
> +    }
> +
> +    return;
> +
> +fail_deassert_handler:
> +    event_notifier_set_handler(&cq->deassert_notifier, NULL);
> +    event_notifier_cleanup(&cq->deassert_notifier);
> +fail_assert_handler:
> +    event_notifier_set_handler(&cq->assert_notifier, NULL);
> +    event_notifier_cleanup(&cq->assert_notifier);
> +}
> +
>  static void nvme_post_cqes(void *opaque)
>  {
>      NvmeCQueue *cq = opaque;
> @@ -1382,7 +1430,23 @@ static void nvme_post_cqes(void *opaque)
>                  n->cq_pending++;
>              }
>  
> -            nvme_irq_assert(n, cq);
> +            if (unlikely(cq->first_io_cqe)) {
> +                /*
> +                 * Initialize event notifier when first cqe is posted. For
> +                 * irqfd support we need to register the MSI message in KVM. We
> +                 * cannot do this registration at CQ creation time because
> +                 * Linux's NVMe driver changes the MSI message after CQ creation.
> +                 */
> +                cq->first_io_cqe = false;
> +
> +                nvme_init_irq_notifier(n, cq);
> +            }

It is really unfortunate that we have to do this. From what I can tell
in the kernel driver, even if it were to change to set up the irq prior
to creating the completion queue, we'd still have issues making this
work on earlier versions and there is no way to quirk our way out of
this.

We can't even move this upon creation of the submission queue since the
kernel also creates *that* prior to allocating the interrupt line.

In conclusion I don't see any way around this other than asking the NVMe
TWG to add some kind of bit indicating that the host sets up the
interrupt line prior to creating the cq.

Meh.

> +
> +            if (cq->assert_notifier.initialized) {
> +                event_notifier_set(&cq->assert_notifier);
> +            } else {
> +                nvme_irq_assert(n, cq);
> +            }

There is a lot of duplication below, checking if the notifier is
initialized and then choosing what to do. Can we move this to
nvme_irq_assert/deassert()?
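Something along these lines, perhaps (a hypothetical sketch against stubbed
stand-ins for the QEMU types, just to illustrate the dispatch; the real
helpers also take the NvmeCtrl and do actual pin/MSI-X delivery):

```c
#include <assert.h>
#include <stdbool.h>

/* Stubbed stand-ins for the QEMU types -- only what the dispatch needs. */
typedef struct { bool initialized; int pending; } EventNotifier;
typedef struct { EventNotifier assert_notifier; int irq_level; } NvmeCQueue;

static void event_notifier_set(EventNotifier *e)
{
    e->pending++; /* stands in for the eventfd write */
}

/* With the check folded into the helper itself, every caller can invoke
 * nvme_irq_assert() unconditionally; the eventfd-vs-direct decision then
 * lives in exactly one place instead of being duplicated at each call site. */
static void nvme_irq_assert(NvmeCQueue *cq)
{
    if (cq->assert_notifier.initialized) {
        event_notifier_set(&cq->assert_notifier); /* defer to main loop */
        return;
    }
    cq->irq_level = 1; /* direct assertion, main loop thread with BQL held */
}
```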

>          }
>      }
>  }
> @@ -4249,7 +4313,11 @@ static void nvme_cq_notifier(EventNotifier *e)
>      if (cq->irq_enabled && cq->tail == cq->head) {
>          n->cq_pending--;
>          if (!msix_enabled(&n->parent_obj)) {
> -            nvme_irq_deassert(n, cq);
> +            if (cq->deassert_notifier.initialized) {
> +                event_notifier_set(&cq->deassert_notifier);
> +            } else {
> +                nvme_irq_deassert(n, cq);
> +            }
>          }
>      }
>  
> @@ -4706,6 +4774,14 @@ static void nvme_free_cq(NvmeCQueue *cq, NvmeCtrl *n)
>          event_notifier_set_handler(&cq->notifier, NULL);
>          event_notifier_cleanup(&cq->notifier);
>      }
> +    if (cq->assert_notifier.initialized) {
> +        event_notifier_set_handler(&cq->assert_notifier, NULL);
> +        event_notifier_cleanup(&cq->assert_notifier);
> +    }
> +    if (cq->deassert_notifier.initialized) {
> +        event_notifier_set_handler(&cq->deassert_notifier, NULL);
> +        event_notifier_cleanup(&cq->deassert_notifier);
> +    }
>      if (msix_enabled(&n->parent_obj)) {
>          msix_vector_unuse(&n->parent_obj, cq->vector);
>      }
> @@ -4737,6 +4813,7 @@ static uint16_t nvme_del_cq(NvmeCtrl *n, NvmeRequest *req)
>          }
>  
>          if (!msix_enabled(&n->parent_obj)) {
> +            /* Do not use eventfd since this is always called in main loop */
>              nvme_irq_deassert(n, cq);
>          }
>      }
> @@ -4777,6 +4854,7 @@ static void nvme_init_cq(NvmeCQueue *cq, NvmeCtrl *n, uint64_t dma_addr,
>      }
>      n->cq[cqid] = cq;
>      cq->timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, nvme_post_cqes, cq);
> +    cq->first_io_cqe = cqid != 0;
>  }
>  
>  static uint16_t nvme_create_cq(NvmeCtrl *n, NvmeRequest *req)
> @@ -6926,7 +7004,11 @@ static void nvme_process_db(NvmeCtrl *n, hwaddr addr, int val)
>          if (cq->irq_enabled && cq->tail == cq->head) {
>              n->cq_pending--;
>              if (!msix_enabled(&n->parent_obj)) {
> -                nvme_irq_deassert(n, cq);
> +                if (cq->deassert_notifier.initialized) {
> +                    event_notifier_set(&cq->deassert_notifier);
> +                } else {
> +                    nvme_irq_deassert(n, cq);
> +                }
>              }
>          }
>      } else {
> @@ -7675,6 +7757,7 @@ static Property nvme_props[] = {
>      DEFINE_PROP_BOOL("use-intel-id", NvmeCtrl, params.use_intel_id, false),
>      DEFINE_PROP_BOOL("legacy-cmb", NvmeCtrl, params.legacy_cmb, false),
>      DEFINE_PROP_BOOL("ioeventfd", NvmeCtrl, params.ioeventfd, false),
> +    DEFINE_PROP_BOOL("irq-eventfd", NvmeCtrl, params.irq_eventfd, false),
>      DEFINE_PROP_UINT8("zoned.zasl", NvmeCtrl, params.zasl, 0),
>      DEFINE_PROP_BOOL("zoned.auto_transition", NvmeCtrl,
>                       params.auto_transition_zones, true),
> diff --git a/hw/nvme/nvme.h b/hw/nvme/nvme.h
> index 79f5c281c2..759d0ecd7c 100644
> --- a/hw/nvme/nvme.h
> +++ b/hw/nvme/nvme.h
> @@ -398,6 +398,9 @@ typedef struct NvmeCQueue {
>      uint64_t    ei_addr;
>      QEMUTimer   *timer;
>      EventNotifier notifier;
> +    EventNotifier assert_notifier;
> +    EventNotifier deassert_notifier;
> +    bool        first_io_cqe;
>      bool        ioeventfd_enabled;
>      QTAILQ_HEAD(, NvmeSQueue) sq_list;
>      QTAILQ_HEAD(, NvmeRequest) req_list;
> @@ -422,6 +425,7 @@ typedef struct NvmeParams {
>      bool     auto_transition_zones;
>      bool     legacy_cmb;
>      bool     ioeventfd;
> +    bool     irq_eventfd;
>      uint8_t  sriov_max_vfs;
>      uint16_t sriov_vq_flexible;
>      uint16_t sriov_vi_flexible;
> -- 
> 2.25.1
> 

-- 
One of us - No more doubt, silence or taboo about mental illness.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

* Re: [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-11 15:37 ` [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd Jinhao Fan
  2022-08-16 10:46   ` Klaus Jensen
@ 2022-08-23 11:04   ` Klaus Jensen
  1 sibling, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2022-08-23 11:04 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

On Aug 11 23:37, Jinhao Fan wrote:
> When irqfd is enabled, we bypass QEMU's irq emulation and let KVM to
> directly assert the irq. However, KVM is not aware of the device's MSI-x
> masking status. Add MSI-x mask bookkeeping in NVMe emulation and
> detach the corresponding irqfd when the certain vector is masked.
> 
> Signed-off-by: Jinhao Fan <fanjinhao21s@ict.ac.cn>
> ---
>  hw/nvme/ctrl.c       | 82 ++++++++++++++++++++++++++++++++++++++++++++
>  hw/nvme/nvme.h       |  2 ++
>  hw/nvme/trace-events |  3 ++
>  3 files changed, 87 insertions(+)
> 
> diff --git a/hw/nvme/ctrl.c b/hw/nvme/ctrl.c
> index 63f988f2f9..ac5460c7c8 100644
> --- a/hw/nvme/ctrl.c
> +++ b/hw/nvme/ctrl.c
> @@ -7478,10 +7478,84 @@ static int nvme_add_pm_capability(PCIDevice *pci_dev, uint8_t offset)
>  
>      return 0;
>  }
> +static int nvme_vector_unmask(PCIDevice *pci_dev, unsigned vector,
> +                               MSIMessage msg)
> +{
> +    NvmeCtrl *n = NVME(pci_dev);
> +    int ret;
> +
> +    trace_pci_nvme_irq_unmask(vector, msg.address, msg.data);
> +    
> +    for (uint32_t i = 0; i < n->params.max_ioqpairs + 1; i++) {

This loop is OK for now, but could be done better with a reverse
mapping.

This is just an observation. Don't spend time on doing it since we are
not aware of any host actually doing masking/unmasking.
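For the record, such a reverse mapping could look roughly like this
(hypothetical sketch only; the intrusive `next_same_vector` link and the
`VectorMap` helpers are made-up names for illustration):

```c
#include <assert.h>
#include <stddef.h>

#define NUM_VECTORS 8

/* Minimal stand-in for the real NvmeCQueue, extended with an intrusive
 * link so that queues sharing a vector form a singly linked list. */
typedef struct NvmeCQueue {
    unsigned vector;
    struct NvmeCQueue *next_same_vector;
} NvmeCQueue;

/* Reverse map: vector number -> head of the list of CQs using it. */
typedef struct {
    NvmeCQueue *head[NUM_VECTORS];
} VectorMap;

static void vector_map_add(VectorMap *m, NvmeCQueue *cq)
{
    cq->next_same_vector = m->head[cq->vector];
    m->head[cq->vector] = cq;
}

/* The mask/unmask handlers would then walk only this list instead of
 * scanning all max_ioqpairs + 1 queues on every callback. */
static NvmeCQueue *cqs_for_vector(VectorMap *m, unsigned vector)
{
    return m->head[vector];
}
```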

> +        NvmeCQueue *cq = n->cq[i];
> +        /* 
> +         * If this function is called, then irqfd must be available. Therefore,
> +         * irqfd must be in use if cq->assert_notifier.initialized is true.
> +         */

The wording indicates that you want to assert() that assumption instead.

> +        if (cq && cq->vector == vector && cq->assert_notifier.initialized) {
> +            if (cq->msg.data != msg.data || cq->msg.address != msg.address) {
> +                ret = kvm_irqchip_update_msi_route(kvm_state, cq->virq, msg,
> +                                                   pci_dev);
> +                if (ret < 0) {
> +                    return ret;
> +                }
> +                kvm_irqchip_commit_routes(kvm_state);
> +                cq->msg = msg;
> +            }
> +
> +            ret = kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
> +                                                     &cq->assert_notifier,
> +                                                     NULL, cq->virq);
> +            if (ret < 0) {
> +                return ret;
> +            }
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static void nvme_vector_mask(PCIDevice *pci_dev, unsigned vector)
> +{
> +    NvmeCtrl *n = NVME(pci_dev);
> +
> +    trace_pci_nvme_irq_mask(vector);
> +    
> +    for (uint32_t i = 0; i < n->params.max_ioqpairs + 1; i++) {
> +        NvmeCQueue *cq = n->cq[i];
> +        if (cq && cq->vector == vector && cq->assert_notifier.initialized) {
> +            kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state,
> +                                                  &cq->assert_notifier,
> +                                                  cq->virq);
> +        }
> +    }
> +}
> +
> +static void nvme_vector_poll(PCIDevice *pci_dev,
> +                             unsigned int vector_start,
> +                             unsigned int vector_end)
> +{
> +    NvmeCtrl *n = NVME(pci_dev);
> +
> +    trace_pci_nvme_irq_poll(vector_start, vector_end);
> +
> +    for (uint32_t i = 0; i < n->params.max_ioqpairs + 1; i++) {
> +        NvmeCQueue *cq = n->cq[i];
> +        if (cq && cq->vector >= vector_start && cq->vector <= vector_end 
> +            && msix_is_masked(pci_dev, cq->vector) 
> +            && cq->assert_notifier.initialized) {
> +            if (event_notifier_test_and_clear(&cq->assert_notifier)) {
> +                msix_set_pending(pci_dev, i);
> +            }
> +        }
> +    }
> +}

I'm a little fuzzy on this - what causes this function to be invoked?


* Re: [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-16 10:46   ` Klaus Jensen
  2022-08-17  5:35     ` Jinhao Fan
@ 2022-08-23 14:43     ` Jinhao Fan
  2022-08-24 11:22       ` Klaus Jensen
  1 sibling, 1 reply; 18+ messages in thread
From: Jinhao Fan @ 2022-08-23 14:43 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

On 8/16/2022 6:46 PM, Klaus Jensen wrote:
> Did qtest work out for you for testing? If so, it would be nice to add a
> simple test case as well.

Since MSI-x masking handlers are only implemented for IO queues, if we 
want to use qtest we need to implement utilities for controller 
initialization and IO queue creation. After that we can actually test 
the MSI-x masking feature. Although we may reuse some code from virtio's 
tests, that is still a large amount of work.

Is it possible to get this patch merged without testing? If not, I guess 
I'll have to take the hard work to implement something like 
qtest/libqos/nvme.c



* Re: [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-23 14:43     ` Jinhao Fan
@ 2022-08-24 11:22       ` Klaus Jensen
  2022-08-24 13:16         ` Jinhao Fan
  0 siblings, 1 reply; 18+ messages in thread
From: Klaus Jensen @ 2022-08-24 11:22 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

On Aug 23 22:43, Jinhao Fan wrote:
> On 8/16/2022 6:46 PM, Klaus Jensen wrote:
> > Did qtest work out for you for testing? If so, it would be nice to add a
> > simple test case as well.
> 
> Since MSI-x masking handlers are only implemented for IO queues, if we want
> to use qtest we need to implement utilities for controller initialization
> and IO queue creation. After that we can actually test the MSI-x masking
> feature. Although we may reuse some code from virtio's tests, that is still
> a large amount of work.
> 
> Is it possible to get this patch merged without testing? If not, I guess
> I'll have to take the hard work to implement something like
> qtest/libqos/nvme.c
> 

I'm not too happy about code that is completely untestable (worse, right
now it is actually not even runnable).

What are the implications if we drop it? That is, if we go back to your
version that did not include this? If it doesn't impact the kvm irqchip
logic, then I'd rather that we rip it out and leave the device without
masking/unmasking support, keeping irqfd support as an experimental
feature until we can sort this out.


* Re: [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd
  2022-08-24 11:22       ` Klaus Jensen
@ 2022-08-24 13:16         ` Jinhao Fan
  0 siblings, 0 replies; 18+ messages in thread
From: Jinhao Fan @ 2022-08-24 13:16 UTC (permalink / raw)
  To: Klaus Jensen; +Cc: qemu-devel, kbusch, stefanha, open list:nvme

On 8/24/2022 7:22 PM, Klaus Jensen wrote:
> What are the implications if we drop it? That is, if we go back to your
> version that did not include this? If it doesnt impact the kvm irqchip
> logic, then I'd rather that we rip it out and leave the device without
> masking/unmasking support, keeping irqfd support as an experimental
> feature until we can sort this out.

As far as I can think of, the implication is that MSI-X 
masking/unmasking does not work when irqfd and (in the future) iothread 
is enabled. Considering that we do not find any driver making use of 
this feature, it seems OK to drop this support for now.



* Re: [PATCH 0/4] hw/nvme: add irqfd support
  2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
                   ` (4 preceding siblings ...)
  2022-08-16  1:54 ` [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
@ 2022-08-24 20:15 ` Klaus Jensen
  5 siblings, 0 replies; 18+ messages in thread
From: Klaus Jensen @ 2022-08-24 20:15 UTC (permalink / raw)
  To: Jinhao Fan; +Cc: qemu-devel, kbusch, stefanha

On Aug 11 23:37, Jinhao Fan wrote:
> This patch series changes qemu-nvme's interrupt emulation to use event
> notifiers, which can ensure thread-safe interrupt delivery when iothread
> is used. In the first two patches, I convert qemu-nvme's IO emulation
> logic to send irq via eventfd, so that the actual assertion and
> deassertion is always done in the main loop thread with BQL held. In the
> third patch, support is added to send irq via KVM irqfd, bypassing
> qemu's MSI-x emulation. In the last patch, I add MSI-x mask handlers
> when irqfd is enabled so that qemu-nvme knows which vector is masked
> even when qemu's MSI-x emulation is bypassed.
> 
> Jinhao Fan (4):
>   hw/nvme: avoid unnecessary call to irq (de)assertion functions
>   hw/nvme: add option to (de)assert irq with eventfd
>   hw/nvme: use irqfd to send interrupts
>   hw/nvme: add MSI-x mask handlers for irqfd
> 
>  hw/nvme/ctrl.c       | 251 +++++++++++++++++++++++++++++++++++++++----
>  hw/nvme/nvme.h       |   7 ++
>  hw/nvme/trace-events |   3 +
>  3 files changed, 243 insertions(+), 18 deletions(-)
> 
> -- 
> 2.25.1
> 

Hi Jinhao,

This series all looks pretty good to me. And, incidentally, it is also
super cool work :)

It can use a bit of cleanup (the code duplication mentioned previously
by both Stefan and me) - but all the logic seems sound to me and my
testing is happy. Following up on my suggestion to drop the MSI-X
mask/unmasking, I gave the logic a thorough look and it looks sound to me.
I'm gonna see what I can come up with for qtest. I suggest you just keep
going on iothread support.

Please post a v2 (all 4 patches) with suggested cleanups and we take it
from there.


end of thread, other threads:[~2022-08-24 20:17 UTC | newest]

Thread overview: 18+ messages
2022-08-11 15:37 [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
2022-08-11 15:37 ` [PATCH 1/4] hw/nvme: avoid unnecessary call to irq (de)assertion functions Jinhao Fan
2022-08-16 15:24   ` Stefan Hajnoczi
2022-08-17  5:42     ` Jinhao Fan
2022-08-11 15:37 ` [PATCH 2/4] hw/nvme: add option to (de)assert irq with eventfd Jinhao Fan
2022-08-16 11:20   ` Klaus Jensen
2022-08-17  5:36     ` Jinhao Fan
2022-08-23 10:58   ` Klaus Jensen
2022-08-11 15:37 ` [PATCH 3/4] hw/nvme: use irqfd to send interrupts Jinhao Fan
2022-08-11 15:37 ` [PATCH 4/4] hw/nvme: add MSI-x mask handlers for irqfd Jinhao Fan
2022-08-16 10:46   ` Klaus Jensen
2022-08-17  5:35     ` Jinhao Fan
2022-08-23 14:43     ` Jinhao Fan
2022-08-24 11:22       ` Klaus Jensen
2022-08-24 13:16         ` Jinhao Fan
2022-08-23 11:04   ` Klaus Jensen
2022-08-16  1:54 ` [PATCH 0/4] hw/nvme: add irqfd support Jinhao Fan
2022-08-24 20:15 ` Klaus Jensen
