[PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
@ 2020-02-26 22:50 Peter Xu
  2020-02-26 22:50 ` [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-26 22:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, peterx, Paolo Bonzini

VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
will directly fail with resamplefd attached so QEMU will automatically
fallback to the INTx slow path.  However on old kernels it's still
broken.

Only until recently I noticed that this could also break PXE boot for
assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
using INTx as well, which means we can't bypass that even if we
enables MSI for the guest kernel.

This series tries to first fix this issue function-wise, then speed up
for the INTx again with resamplefd (mostly following the ideas
proposed by Paolo one year ago [2]).  My TCP_RR test shows that:

  - Before this series: this is broken, no number to show

  - After patch 1 (enable slow path): get 63% perf comparing to full
    kernel irqchip

  - After whole series (enable fast path partly, irq injection will be
    the same as fast path, however userspace needs to intercept for
    EOI broadcast to resamplefd, though should still be faster than
    the MMIO trick for intx eoi): get 93% perf comparing to full
    kernel irqchip, which is a 46% performance boost

I think we can consider to apply patch 1 even sooner than the rest of
the series to unbreak intx+split first.

The whole test matrix for reference:

  |----------+---------+-------------------+--------------+--------------------|
  | IRQ type | irqchip | TCP_STREAM (Gbps) | TCP_RR (pps) | note               |
  |----------+---------+-------------------+--------------+--------------------|
  | msi      | on      |              9.39 |        17567 |                    |
  | nomsi    | on      |              9.29 |        14056 |                    |
  | msi      | split   |              9.36 |        17330 |                    |
  | nomsi    | split   |                 / |            / | currently broken   |
  | nomsi    | split   |              8.98 |         8977 | after patch 1      |
  | nomsi    | split   |              9.21 |        13142 | after whole series |
  |----------+---------+-------------------+--------------+--------------------|

Any review comment is welcomed.  Thanks,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1786404
[2] https://patchwork.kernel.org/patch/10738541/#22609933

Peter Xu (5):
  vfio/pci: Disable INTx fast path if using split irqchip
  vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  KVM: Kick resamplefd for split kernel irqchip
  Revert "vfio/pci: Disable INTx fast path if using split irqchip"

 accel/kvm/kvm-all.c    | 83 +++++++++++++++++++++++++++++++++++++++---
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 11 +++++-
 hw/vfio/pci.c          | 37 ++++++++-----------
 include/sysemu/kvm.h   |  4 ++
 5 files changed, 106 insertions(+), 30 deletions(-)

-- 
2.24.1



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
@ 2020-02-26 22:50 ` Peter Xu
  2020-02-27 16:53   ` Auger Eric
  2020-02-26 22:50 ` [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-26 22:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, peterx, Paolo Bonzini

It's currently broken.  Let's use the slow path to at least make it
functional.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/pci.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129..98e0e0c994 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -128,6 +128,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
+    if (kvm_irqchip_is_split()) {
+        /*
+         * VFIO INTx is currently not working with split kernel
+         * irqchip for level triggered interrupts.  Go the slow path
+         * as long as split is enabled so we can be at least
+         * functional (even with poor performance).
+         *
+         * TODO: Remove this after all things fixed up.
+         */
+        return;
+    }
+
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
     vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
  2020-02-26 22:50 ` [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
@ 2020-02-26 22:50 ` Peter Xu
  2020-02-27 11:04   ` Auger Eric
  2020-02-27 16:41   ` Cornelia Huck
  2020-02-26 22:54 ` [PATCH 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-26 22:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, peterx, Paolo Bonzini

VFIO is currently the only one left that is not using the generic
function (kvm_irqchip_add_irqfd_notifier_gsi()) to register irqfds.
Let VFIO use the common framework too.

Follow up patches will introduce extra features for kvm irqfd, so that
VFIO can easily leverage that after the switch.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/pci.c | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 98e0e0c994..09703362df 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -115,11 +115,7 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
 static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
 {
 #ifdef CONFIG_KVM
-    struct kvm_irqfd irqfd = {
-        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
-        .gsi = vdev->intx.route.irq,
-        .flags = KVM_IRQFD_FLAG_RESAMPLE,
-    };
+    int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
     Error *err = NULL;
 
     if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
@@ -141,7 +137,7 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     }
 
     /* Get to a known interrupt state */
-    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
     vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
@@ -152,17 +148,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         goto fail;
     }
 
-    /* KVM triggers it, VFIO listens for it */
-    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
-
-    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                           &vdev->intx.interrupt,
+                                           &vdev->intx.unmask,
+                                           vdev->intx.route.irq)) {
         error_setg_errno(errp, errno, "failed to setup resample irqfd");
         goto fail_irqfd;
     }
 
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_UNMASK,
-                               irqfd.resamplefd, &err)) {
+                               event_notifier_get_fd(&vdev->intx.unmask),
+                               &err)) {
         error_propagate(errp, err);
         goto fail_vfio;
     }
@@ -177,12 +174,12 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     return;
 
 fail_vfio:
-    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
-    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
+                                          vdev->intx.route.irq);
 fail_irqfd:
     event_notifier_cleanup(&vdev->intx.unmask);
 fail:
-    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+    qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 #endif
 }
@@ -190,12 +187,6 @@ fail:
 static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
 {
 #ifdef CONFIG_KVM
-    struct kvm_irqfd irqfd = {
-        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
-        .gsi = vdev->intx.route.irq,
-        .flags = KVM_IRQFD_FLAG_DEASSIGN,
-    };
-
     if (!vdev->intx.kvm_accel) {
         return;
     }
@@ -209,7 +200,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     pci_irq_deassert(&vdev->pdev);
 
     /* Tell KVM to stop listening for an INTx irqfd */
-    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+    if (kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
+                                              vdev->intx.route.irq)) {
         error_report("vfio: Error: Failed to disable INTx irqfd: %m");
     }
 
@@ -217,7 +209,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     event_notifier_cleanup(&vdev->intx.unmask);
 
     /* QEMU starts listening for interrupt events. */
-    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+    qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
+                        vfio_intx_interrupt, NULL, vdev);
 
     vdev->intx.kvm_accel = false;
 
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
  2020-02-26 22:50 ` [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
  2020-02-26 22:50 ` [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
@ 2020-02-26 22:54 ` Peter Xu
  2020-02-27 17:01   ` Auger Eric
  2020-02-26 22:55 ` [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-26 22:54 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, Peter Xu, Paolo Bonzini

So that kvm_irqchip_assign_irqfd() can have access to the
EventNotifiers, especially the resample event.  It is needed in follow
up patch to cache and kick resamplefds from QEMU.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 439a4efe52..d49b74512a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1628,9 +1628,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg,
     return kvm_update_routing_entry(s, &kroute);
 }
 
-static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
+static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
+                                    EventNotifier *resample, int virq,
                                     bool assign)
 {
+    int fd = event_notifier_get_fd(event);
+    int rfd = resample ? event_notifier_get_fd(resample) : -1;
+
     struct kvm_irqfd irqfd = {
         .fd = fd,
         .gsi = virq,
@@ -1735,7 +1739,9 @@ int kvm_irqchip_add_hv_sint_route(KVMState *s, uint32_t vcpu, uint32_t sint)
     return -ENOSYS;
 }
 
-static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
+static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
+                                    EventNotifier *resample, int virq,
+                                    bool assign)
 {
     abort();
 }
@@ -1749,15 +1755,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 int kvm_irqchip_add_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
                                        EventNotifier *rn, int virq)
 {
-    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n),
-           rn ? event_notifier_get_fd(rn) : -1, virq, true);
+    return kvm_irqchip_assign_irqfd(s, n, rn, virq, true);
 }
 
 int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
                                           int virq)
 {
-    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq,
-           false);
+    return kvm_irqchip_assign_irqfd(s, n, NULL, virq, false);
 }
 
 int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (2 preceding siblings ...)
  2020-02-26 22:54 ` [PATCH 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
@ 2020-02-26 22:55 ` Peter Xu
  2020-02-27 17:00   ` [PATCH v1.1 " Peter Xu
  2020-02-28 10:34   ` [PATCH " Paolo Bonzini
  2020-02-26 22:55 ` [PATCH 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-26 22:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, Peter Xu, Paolo Bonzini

This is majorly only for X86 because that's the only one that supports
split irqchip for now.

When the irqchip is split, we face a dilemma that KVM irqfd will be
enabled, however the slow irqchip is still running in the userspace.
It means that the resamplefd in the kernel irqfds won't take any
effect and it can miss to ack INTx interrupts on EOIs.

One example is split irqchip with VFIO INTx, which will break if we
use the VFIO INTx fast path.

This patch can potentially supports the VFIO fast path again for INTx,
that the IRQ delivery will still use the fast path, while we don't
need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
callers of vfio_eoi() hook).  However the EOI of the INTx will still
need to be done from the userspace by caching all the resamplefds in
QEMU and kick properly for IOAPIC EOI broadcast.

https://patchwork.kernel.org/patch/10738541/#22609933

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c    | 67 ++++++++++++++++++++++++++++++++++++++++++
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 11 +++++--
 include/sysemu/kvm.h   |  4 +++
 4 files changed, 81 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d49b74512a..c7a863552b 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
 static NotifierList kvm_irqchip_change_notifiers =
     NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
 
+struct KVMResampleFd {
+    int gsi;
+    EventNotifier *resample_event;
+    QLIST_ENTRY(KVMResampleFd) node;
+};
+typedef struct KVMResampleFd KVMResampleFd;
+
+/*
+ * Only used with split irqchip where we need to do the resample fd
+ * kick for the kernel from userspace.
+ */
+static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
+    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
+
 #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
 #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
 
+static inline void kvm_resample_fd_remove(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            QLIST_REMOVE(rfd, node);
+            break;
+        }
+    }
+}
+
+static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
+{
+    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
+
+    rfd->gsi = gsi;
+    rfd->resample_event = event;
+
+    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
+}
+
+void kvm_resample_fd_notify(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    if (!kvm_irqchip_is_split()) {
+        return;
+    }
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            event_notifier_set(rfd->resample_event);
+            trace_kvm_resample_fd_notify(gsi);
+            break;
+        }
+    }
+}
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_accel());
@@ -1644,6 +1697,20 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
     if (rfd != -1) {
         irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
         irqfd.resamplefd = rfd;
+
+        /*
+         * When the slow irqchip (e.g. IOAPIC) is in the userspace,
+         * resamplefd will not work because the EOI of the level
+         * triggered interrupt will be delivered to userspace
+         * instead.  The userspace needs to remember the resamplefd
+         * too and kick it when we receive EOI of this IRQ.
+         */
+        assert(assign);
+        kvm_resample_fd_insert(virq, resample);
+    }
+
+    if (!assign) {
+        kvm_resample_fd_remove(virq);
     }
 
     if (!kvm_irqfds_enabled()) {
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 4fb6e59d19..a68eb66534 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
+kvm_resample_fd_notify(int gsi) "gsi %d"
 
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 15747fe2c2..8c75465c62 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
         for (n = 0; n < IOAPIC_NUM_PINS; n++) {
             entry = s->ioredtbl[n];
 
-            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
-                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
+            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
+                continue;
+            }
+
+            /* Kick resamplefd if KVM is bypassed */
+            kvm_resample_fd_notify(n);
+
+            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
+                IOAPIC_TRIGGER_LEVEL) {
                 continue;
             }
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 141342de98..b67552c047 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
 int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
 struct ppc_radix_page_info *kvm_get_radix_page_info(void);
 int kvm_get_max_memslots(void);
+
+/* Notify resamplefd for EOI of specific level triggered interrupts */
+void kvm_resample_fd_notify(int gsi);
+
 #endif
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip"
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (3 preceding siblings ...)
  2020-02-26 22:55 ` [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
@ 2020-02-26 22:55 ` Peter Xu
  2020-02-27 15:32 ` [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Auger Eric
  2020-02-28 10:36 ` Paolo Bonzini
  6 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-26 22:55 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, Peter Xu, Paolo Bonzini

With the resamplefd list introduced, we can savely enable VFIO INTx
fast path again with split irqchip so it can still be faster than the
complete slow path.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/pci.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 09703362df..1c0aa27386 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -124,18 +124,6 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
-    if (kvm_irqchip_is_split()) {
-        /*
-         * VFIO INTx is currently not working with split kernel
-         * irqchip for level triggered interrupts.  Go the slow path
-         * as long as split is enabled so we can be at least
-         * functional (even with poor performance).
-         *
-         * TODO: Remove this after all things fixed up.
-         */
-        return;
-    }
-
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
     vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  2020-02-26 22:50 ` [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
@ 2020-02-27 11:04   ` Auger Eric
  2020-02-27 16:41   ` Cornelia Huck
  1 sibling, 0 replies; 30+ messages in thread
From: Auger Eric @ 2020-02-27 11:04 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck

Hi Peter,

On 2/26/20 11:50 PM, Peter Xu wrote:
> VFIO is currently the only one left that is not using the generic
> function (kvm_irqchip_add_irqfd_notifier_gsi()) to register irqfds.
> Let VFIO use the common framework too.
> 
> Follow up patches will introduce extra features for kvm irqfd, so that
> VFIO can easily leverage that after the switch.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
Looks good to me

Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
>  hw/vfio/pci.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 98e0e0c994..09703362df 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -115,11 +115,7 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
>  static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>  {
>  #ifdef CONFIG_KVM
> -    struct kvm_irqfd irqfd = {
> -        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
> -        .gsi = vdev->intx.route.irq,
> -        .flags = KVM_IRQFD_FLAG_RESAMPLE,
> -    };
> +    int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
>      Error *err = NULL;
>  
>      if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
> @@ -141,7 +137,7 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>      }
>  
>      /* Get to a known interrupt state */
> -    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
> +    qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
>      vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>      vdev->intx.pending = false;
>      pci_irq_deassert(&vdev->pdev);
> @@ -152,17 +148,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>          goto fail;
>      }
>  
> -    /* KVM triggers it, VFIO listens for it */
> -    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
> -
> -    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> +    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
> +                                           &vdev->intx.interrupt,
> +                                           &vdev->intx.unmask,
> +                                           vdev->intx.route.irq)) {
>          error_setg_errno(errp, errno, "failed to setup resample irqfd");
>          goto fail_irqfd;
>      }
>  
>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>                                 VFIO_IRQ_SET_ACTION_UNMASK,
> -                               irqfd.resamplefd, &err)) {
> +                               event_notifier_get_fd(&vdev->intx.unmask),
> +                               &err)) {
>          error_propagate(errp, err);
>          goto fail_vfio;
>      }
> @@ -177,12 +174,12 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>      return;
>  
>  fail_vfio:
> -    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
> -    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
> +    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
> +                                          vdev->intx.route.irq);
>  fail_irqfd:
>      event_notifier_cleanup(&vdev->intx.unmask);
>  fail:
> -    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> +    qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
>      vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>  #endif
>  }
> @@ -190,12 +187,6 @@ fail:
>  static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>  {
>  #ifdef CONFIG_KVM
> -    struct kvm_irqfd irqfd = {
> -        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
> -        .gsi = vdev->intx.route.irq,
> -        .flags = KVM_IRQFD_FLAG_DEASSIGN,
> -    };
> -
>      if (!vdev->intx.kvm_accel) {
>          return;
>      }
> @@ -209,7 +200,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>      pci_irq_deassert(&vdev->pdev);
>  
>      /* Tell KVM to stop listening for an INTx irqfd */
> -    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> +    if (kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
> +                                              vdev->intx.route.irq)) {
>          error_report("vfio: Error: Failed to disable INTx irqfd: %m");
>      }
>  
> @@ -217,7 +209,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>      event_notifier_cleanup(&vdev->intx.unmask);
>  
>      /* QEMU starts listening for interrupt events. */
> -    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
> +                        vfio_intx_interrupt, NULL, vdev);
>  
>      vdev->intx.kvm_accel = false;
>  
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (4 preceding siblings ...)
  2020-02-26 22:55 ` [PATCH 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
@ 2020-02-27 15:32 ` Auger Eric
  2020-02-27 15:51   ` Peter Xu
  2020-02-28 10:36 ` Paolo Bonzini
  6 siblings, 1 reply; 30+ messages in thread
From: Auger Eric @ 2020-02-27 15:32 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck

Hi Peter,

On 2/26/20 11:50 PM, Peter Xu wrote:
> VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> will directly fail with resamplefd attached so QEMU will automatically
> fallback to the INTx slow path.  However on old kernels it's still
> broken.
I think you should quote the commit that changes the behavior:
654f1f13ea56  kvm: Check irqchip mode before assign irqfd?
so that one can understand what it fixed.

So on kernels that do not feature that commit, the ioctl succeeds and we
pretend the resamplefd was properly attached whereas in practice it was
never triggered.

Thanks

Eric

> 
> Only until recently I noticed that this could also break PXE boot for
> assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
> using INTx as well, which means we can't bypass that even if we
> enables MSI for the guest kernel.
> 
> This series tries to first fix this issue function-wise, then speed up
> for the INTx again with resamplefd (mostly following the ideas
> proposed by Paolo one year ago [2]).  My TCP_RR test shows that:

> 
>   - Before this series: this is broken, no number to show
> 
>   - After patch 1 (enable slow path): get 63% perf comparing to full
>     kernel irqchip
> 
>   - After whole series (enable fast path partly, irq injection will be
>     the same as fast path, however userspace needs to intercept for
>     EOI broadcast to resamplefd, though should still be faster than
>     the MMIO trick for intx eoi): get 93% perf comparing to full
>     kernel irqchip, which is a 46% performance boost
> 
> I think we can consider to apply patch 1 even sooner than the rest of
> the series to unbreak intx+split first.
> 
> The whole test matrix for reference:
> 
>   |----------+---------+-------------------+--------------+--------------------|
>   | IRQ type | irqchip | TCP_STREAM (Gbps) | TCP_RR (pps) | note               |
>   |----------+---------+-------------------+--------------+--------------------|
>   | msi      | on      |              9.39 |        17567 |                    |
>   | nomsi    | on      |              9.29 |        14056 |                    |
>   | msi      | split   |              9.36 |        17330 |                    |
>   | nomsi    | split   |                 / |            / | currently broken   |
>   | nomsi    | split   |              8.98 |         8977 | after patch 1      |
>   | nomsi    | split   |              9.21 |        13142 | after whole series |
>   |----------+---------+-------------------+--------------+--------------------|
> 
> Any review comment is welcomed.  Thanks,
> 
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1786404
> [2] https://patchwork.kernel.org/patch/10738541/#22609933
> 
> Peter Xu (5):
>   vfio/pci: Disable INTx fast path if using split irqchip
>   vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
>   KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
>   KVM: Kick resamplefd for split kernel irqchip
>   Revert "vfio/pci: Disable INTx fast path if using split irqchip"
> 
>  accel/kvm/kvm-all.c    | 83 +++++++++++++++++++++++++++++++++++++++---
>  accel/kvm/trace-events |  1 +
>  hw/intc/ioapic.c       | 11 +++++-
>  hw/vfio/pci.c          | 37 ++++++++-----------
>  include/sysemu/kvm.h   |  4 ++
>  5 files changed, 106 insertions(+), 30 deletions(-)
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-02-27 15:32 ` [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Auger Eric
@ 2020-02-27 15:51   ` Peter Xu
  2020-02-27 17:02     ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-27 15:51 UTC (permalink / raw)
  To: Auger Eric; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

On Thu, Feb 27, 2020 at 04:32:24PM +0100, Auger Eric wrote:
> Hi Peter,

Hi, Eric,

> 
> On 2/26/20 11:50 PM, Peter Xu wrote:
> > VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> > will directly fail with resamplefd attached so QEMU will automatically
> > fallback to the INTx slow path.  However on old kernels it's still
> > broken.
> I think you should quote the commit that changes the behavior:
> 654f1f13ea56  kvm: Check irqchip mode before assign irqfd?
> so that one can understand what it fixed.

Right I should mention that.

Actually I mentioned it in an old cover letter but unluckily something
wrong happened with git-publish and merely my whole cover letter is
gone...  So what you saw is actually a simplified version and I think
I must have lost something like this... :(

Thanks for bringing that up.

> 
> So on kernels that do not feature that commit, the ioctl succeeds and we
> pretend the resamplefd was properly attached whereas in practice it was
> never triggered.

Right.  Another thing I just noticed is that we must _not_ attach the
resamplefd to the KVM_IRQFD ioctl when we registered this in the
userspace.  One thing is it's for sure not needed at all.  More
importantly, on new kernels (after commit 654f1f13ea56) the KVM_IRQFD
could fail (because it sees both split irqchip and resamplefd) and
QEMU will fallback again to the slow path even if the fast path can
work.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  2020-02-26 22:50 ` [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
  2020-02-27 11:04   ` Auger Eric
@ 2020-02-27 16:41   ` Cornelia Huck
  1 sibling, 0 replies; 30+ messages in thread
From: Cornelia Huck @ 2020-02-27 16:41 UTC (permalink / raw)
  To: Peter Xu; +Cc: Eric Auger, Alex Williamson, qemu-devel, Paolo Bonzini

On Wed, 26 Feb 2020 17:50:45 -0500
Peter Xu <peterx@redhat.com> wrote:

> VFIO is currently the only one left that is not using the generic
> function (kvm_irqchip_add_irqfd_notifier_gsi()) to register irqfds.
> Let VFIO use the common framework too.
> 
> Follow up patches will introduce extra features for kvm irqfd, so that
> VFIO can easily leverage that after the switch.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/pci.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)

Reviewed-by: Cornelia Huck <cohuck@redhat.com>



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip
  2020-02-26 22:50 ` [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
@ 2020-02-27 16:53   ` Auger Eric
  2020-02-27 17:10     ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Auger Eric @ 2020-02-27 16:53 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck

Hi Peter,

On 2/26/20 11:50 PM, Peter Xu wrote:
> It's currently broken.  Let's use the slow path to at least make it
> functional.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>

> ---
>  hw/vfio/pci.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5e75a95129..98e0e0c994 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -128,6 +128,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>          return;
>      }
>  
> +    if (kvm_irqchip_is_split()) {
> +        /*
> +         * VFIO INTx is currently not working with split kernel
> +         * irqchip for level triggered interrupts.  Go the slow path
> +         * as long as split is enabled so we can be at least
> +         * functional (even with poor performance).
> +         *
> +         * TODO: Remove this after all things fixed up.
If this patch were to be applied sooner than the other patches as
suggested in the cover letter, We may emit a warning message saying that
slow path is selected due to split irqchip and this will induce perf
downgrade
> +         */
> +        return;
> +    }
> +
>      /* Get to a known interrupt state */
>      qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
>      vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
> 


Tested with a 5.2-rc1 kernel with reverted "654f1f13ea56  kvm: Check
irqchip mode before assign irqfd" and guest booting with nomsi.

Without this patch the assigned NIC does not work. This patch fixes the
issue.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Tested-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric



^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-26 22:55 ` [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
@ 2020-02-27 17:00   ` Peter Xu
  2020-02-27 17:18     ` Peter Xu
  2020-02-27 17:42     ` Auger Eric
  2020-02-28 10:34   ` [PATCH " Paolo Bonzini
  1 sibling, 2 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-27 17:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: Eric Auger, Alex Williamson, Cornelia Huck, Peter Xu, Paolo Bonzini

This is majorly only for X86 because that's the only one that supports
split irqchip for now.

When the irqchip is split, we face a dilemma that KVM irqfd will be
enabled, however the slow irqchip is still running in the userspace.
It means that the resamplefd in the kernel irqfds won't take any
effect and it can miss to ack INTx interrupts on EOIs.

One example is split irqchip with VFIO INTx, which will break if we
use the VFIO INTx fast path.

This patch can potentially supports the VFIO fast path again for INTx,
that the IRQ delivery will still use the fast path, while we don't
need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
callers of vfio_eoi() hook).  However the EOI of the INTx will still
need to be done from the userspace by caching all the resamplefds in
QEMU and kick properly for IOAPIC EOI broadcast.

When the userspace is responsible for the resamplefd kickup, don't
register it on the kvm_irqfd anymore, because on newer kernels (after
commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
irqchip and resamplefd.  This will make sure that the fast path will
work for all supported kernels.

https://patchwork.kernel.org/patch/10738541/#22609933

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v1.1 changelog:
- when resamplefd is going to be kicked from userspace, don't register
  it again in KVM_IRQFD.  Tested against upstream kernel.

 accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 11 +++++--
 include/sysemu/kvm.h   |  4 +++
 4 files changed, 86 insertions(+), 4 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d49b74512a..b766b6e93c 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
 static NotifierList kvm_irqchip_change_notifiers =
     NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
 
+struct KVMResampleFd {
+    int gsi;
+    EventNotifier *resample_event;
+    QLIST_ENTRY(KVMResampleFd) node;
+};
+typedef struct KVMResampleFd KVMResampleFd;
+
+/*
+ * Only used with split irqchip where we need to do the resample fd
+ * kick for the kernel from userspace.
+ */
+static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
+    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
+
 #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
 #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
 
+static inline void kvm_resample_fd_remove(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            QLIST_REMOVE(rfd, node);
+            break;
+        }
+    }
+}
+
+static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
+{
+    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
+
+    rfd->gsi = gsi;
+    rfd->resample_event = event;
+
+    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
+}
+
+void kvm_resample_fd_notify(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    if (!kvm_irqchip_is_split()) {
+        return;
+    }
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            event_notifier_set(rfd->resample_event);
+            trace_kvm_resample_fd_notify(gsi);
+            break;
+        }
+    }
+}
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_accel());
@@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
     };
 
     if (rfd != -1) {
-        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
-        irqfd.resamplefd = rfd;
+        assert(assign);
+        if (kvm_irqchip_is_split()) {
+            /*
+             * When the slow irqchip (e.g. IOAPIC) is in the
+             * userspace, resamplefd will not work because the EOI of
+             * the interrupt will be delivered to userspace instead,
+             * the KVM resample fd kick is skipped.  The userspace
+             * needs to remember the resamplefd and kick it when we
+             * receive EOI of this IRQ.
+             */
+            kvm_resample_fd_insert(virq, resample);
+        } else {
+            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+            irqfd.resamplefd = rfd;
+        }
+    } else if (!assign) {
+        if (kvm_irqchip_is_split()) {
+            kvm_resample_fd_remove(virq);
+        }
     }
 
     if (!kvm_irqfds_enabled()) {
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 4fb6e59d19..a68eb66534 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
+kvm_resample_fd_notify(int gsi) "gsi %d"
 
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 15747fe2c2..8c75465c62 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
         for (n = 0; n < IOAPIC_NUM_PINS; n++) {
             entry = s->ioredtbl[n];
 
-            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
-                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
+            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
+                continue;
+            }
+
+            /* Kick resamplefd if KVM is bypassed */
+            kvm_resample_fd_notify(n);
+
+            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
+                IOAPIC_TRIGGER_LEVEL) {
                 continue;
             }
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 141342de98..b67552c047 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
 int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
 struct ppc_radix_page_info *kvm_get_radix_page_info(void);
 int kvm_get_max_memslots(void);
+
+/* Notify resamplefd for EOI of specific level triggered interrupts */
+void kvm_resample_fd_notify(int gsi);
+
 #endif
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  2020-02-26 22:54 ` [PATCH 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
@ 2020-02-27 17:01   ` Auger Eric
  0 siblings, 0 replies; 30+ messages in thread
From: Auger Eric @ 2020-02-27 17:01 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck

Hi Peter,

On 2/26/20 11:54 PM, Peter Xu wrote:
> So that kvm_irqchip_assign_irqfd() can have access to the
> EventNotifiers, especially the resample event.  It is needed in follow
> up patch to cache and kick resamplefds from QEMU.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric
> ---
>  accel/kvm/kvm-all.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 439a4efe52..d49b74512a 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1628,9 +1628,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg,
>      return kvm_update_routing_entry(s, &kroute);
>  }
>  
> -static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
> +static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
> +                                    EventNotifier *resample, int virq,
>                                      bool assign)
>  {
> +    int fd = event_notifier_get_fd(event);
> +    int rfd = resample ? event_notifier_get_fd(resample) : -1;
> +
>      struct kvm_irqfd irqfd = {
>          .fd = fd,
>          .gsi = virq,
> @@ -1735,7 +1739,9 @@ int kvm_irqchip_add_hv_sint_route(KVMState *s, uint32_t vcpu, uint32_t sint)
>      return -ENOSYS;
>  }
>  
> -static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
> +static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
> +                                    EventNotifier *resample, int virq,
> +                                    bool assign)
>  {
>      abort();
>  }
> @@ -1749,15 +1755,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
>  int kvm_irqchip_add_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
>                                         EventNotifier *rn, int virq)
>  {
> -    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n),
> -           rn ? event_notifier_get_fd(rn) : -1, virq, true);
> +    return kvm_irqchip_assign_irqfd(s, n, rn, virq, true);
>  }
>  
>  int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
>                                            int virq)
>  {
> -    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq,
> -           false);
> +    return kvm_irqchip_assign_irqfd(s, n, NULL, virq, false);
>  }
>  
>  int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-02-27 15:51   ` Peter Xu
@ 2020-02-27 17:02     ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-27 17:02 UTC (permalink / raw)
  To: Auger Eric; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

On Thu, Feb 27, 2020 at 10:51:46AM -0500, Peter Xu wrote:
> Right.  Another thing I just noticed is that we must _not_ attach the
> resamplefd to the KVM_IRQFD ioctl when we registered this in the
> userspace.  One thing is it's for sure not needed at all.  More
> importantly, on new kernels (after commit 654f1f13ea56) the KVM_IRQFD
> could fail (because it sees both split irqchip and resamplefd) and
> QEMU will fallback again to the slow path even if the fast path can
> work.

I've just posted a v1.1 version of patch 4 only, which should fix this
problem on new kernels, so fast path will always be preferred.  The
only changed function is kvm_irqchip_assign_irqfd().

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip
  2020-02-27 16:53   ` Auger Eric
@ 2020-02-27 17:10     ` Peter Xu
  0 siblings, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-27 17:10 UTC (permalink / raw)
  To: Auger Eric; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

On Thu, Feb 27, 2020 at 05:53:32PM +0100, Auger Eric wrote:
> Hi Peter,
> 
> On 2/26/20 11:50 PM, Peter Xu wrote:
> > It's currently broken.  Let's use the slow path to at least make it
> > functional.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> > ---
> >  hw/vfio/pci.c | 12 ++++++++++++
> >  1 file changed, 12 insertions(+)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index 5e75a95129..98e0e0c994 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -128,6 +128,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
> >          return;
> >      }
> >  
> > +    if (kvm_irqchip_is_split()) {
> > +        /*
> > +         * VFIO INTx is currently not working with split kernel
> > +         * irqchip for level triggered interrupts.  Go the slow path
> > +         * as long as split is enabled so we can be at least
> > +         * functional (even with poor performance).
> > +         *
> > +         * TODO: Remove this after all things fixed up.
> If this patch were to be applied sooner than the other patches as
> suggested in the cover letter, We may emit a warning message saying that
> slow path is selected due to split irqchip and this will induce perf
> downgrade

Yes we can.  Here I followed the rest of the cases where we'll
silently fallback to slow path if e.g. we used an even older kernel
that does not support resamplefd at all.  IMHO it's the same as that
(feature not supported yet, silent fallback, just in case it scares
the user a bit, which makes some sense).

> > +         */
> > +        return;
> > +    }
> > +
> >      /* Get to a known interrupt state */
> >      qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
> >      vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
> > 
> 
> 
> Tested with a 5.2-rc1 kernel with reverted "654f1f13ea56  kvm: Check
> irqchip mode before assign irqfd" and guest booting with nomsi.
> 
> Without this patch the assigned NIC does not work. This patch fixes the
> issue.
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Tested-by: Eric Auger <eric.auger@redhat.com>

Thanks Eric!

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 17:00   ` [PATCH v1.1 " Peter Xu
@ 2020-02-27 17:18     ` Peter Xu
  2020-02-27 17:42     ` Auger Eric
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-27 17:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: Eric Auger, Alex Williamson, Cornelia Huck, Paolo Bonzini

On Thu, Feb 27, 2020 at 12:00:48PM -0500, Peter Xu wrote:
> +static inline void kvm_resample_fd_remove(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            QLIST_REMOVE(rfd, node);

Oops, rfd is leaked...  Will fix that in v2.

> +            break;
> +        }
> +    }
> +}

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 17:00   ` [PATCH v1.1 " Peter Xu
  2020-02-27 17:18     ` Peter Xu
@ 2020-02-27 17:42     ` Auger Eric
  2020-02-27 18:00       ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Auger Eric @ 2020-02-27 17:42 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck

Hi Peter,

On 2/27/20 6:00 PM, Peter Xu wrote:
> This is majorly only for X86 because that's the only one that supports
> split irqchip for now.
> 
> When the irqchip is split, we face a dilemma that KVM irqfd will be
> enabled, however the slow irqchip is still running in the userspace.
> It means that the resamplefd in the kernel irqfds won't take any
> effect and it can miss to ack INTx interrupts on EOIs.
Won't it always fail to ack INTx? With the above sentence I understand
it can work sometimes?
> 
> One example is split irqchip with VFIO INTx, which will break if we
> use the VFIO INTx fast path.
> 
> This patch can potentially supports the VFIO fast path again for INTx,
> that the IRQ delivery will still use the fast path, while we don't
> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
> callers of vfio_eoi() hook).  However the EOI of the INTx will still
> need to be done from the userspace by caching all the resamplefds in
> QEMU and kick properly for IOAPIC EOI broadcast.
If I understand correctly this is a one way fast path? Fast path is on
the trigger side only: VFIO -> KVM but not on the deactivation side,
trapped by the userspace IOAPIC where you directly notify the UNMASK
eventfd from userspace. Is that correct?
> 
> When the userspace is responsible for the resamplefd kickup, don't
> register it on the kvm_irqfd anymore, because on newer kernels (after
> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
> irqchip and resamplefd.  This will make sure that the fast path will
> work for all supported kernels.
> 
> https://patchwork.kernel.org/patch/10738541/#22609933
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v1.1 changelog:
> - when resamplefd is going to be kicked from userspace, don't register
>   it again in KVM_IRQFD.  Tested against upstream kernel.
> 
>  accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
>  accel/kvm/trace-events |  1 +
>  hw/intc/ioapic.c       | 11 +++++--
>  include/sysemu/kvm.h   |  4 +++
>  4 files changed, 86 insertions(+), 4 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index d49b74512a..b766b6e93c 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>  static NotifierList kvm_irqchip_change_notifiers =
>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
>  
> +struct KVMResampleFd {
> +    int gsi;
> +    EventNotifier *resample_event;
> +    QLIST_ENTRY(KVMResampleFd) node;
> +};
> +typedef struct KVMResampleFd KVMResampleFd;
> +
> +/*
> + * Only used with split irqchip where we need to do the resample fd
> + * kick for the kernel from userspace.
> + */
> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
> +
>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
>  
> +static inline void kvm_resample_fd_remove(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            QLIST_REMOVE(rfd, node);
> +            break;
> +        }
> +    }
> +}
> +
> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
> +{
> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
> +
> +    rfd->gsi = gsi;
> +    rfd->resample_event = event;
> +
> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
> +}
> +
> +void kvm_resample_fd_notify(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    if (!kvm_irqchip_is_split()) {
> +        return;
> +    }
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            event_notifier_set(rfd->resample_event);
> +            trace_kvm_resample_fd_notify(gsi);
> +            break;
> +        }
> +    }
> +}
> +
>  int kvm_get_max_memslots(void)
>  {
>      KVMState *s = KVM_STATE(current_accel());
> @@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>      };
>  
>      if (rfd != -1) {
> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> -        irqfd.resamplefd = rfd;
> +        assert(assign);
> +        if (kvm_irqchip_is_split()) {
> +            /*
> +             * When the slow irqchip (e.g. IOAPIC) is in the
> +             * userspace, resamplefd will not work because the EOI of
> +             * the interrupt will be delivered to userspace instead,
s/delivered to userspace/handled in userspace
> +             * the KVM resample fd kick is skipped.  The userspace
> +             * needs to remember the resamplefd and kick it when we
> +             * receive EOI of this IRQ.
Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
As such isn't it a bit weird to handle those normal UNMASK eventfds in
the KVM code?


> +             */
> +            kvm_resample_fd_insert(virq, resample);
> +        } else {
> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> +            irqfd.resamplefd = rfd;
> +        }
> +    } else if (!assign) {
> +        if (kvm_irqchip_is_split()) {
> +            kvm_resample_fd_remove(virq);
> +        }
>      }
>  
>      if (!kvm_irqfds_enabled()) {
> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> index 4fb6e59d19..a68eb66534 100644
> --- a/accel/kvm/trace-events
> +++ b/accel/kvm/trace-events
> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> +kvm_resample_fd_notify(int gsi) "gsi %d"
>  
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> index 15747fe2c2..8c75465c62 100644
> --- a/hw/intc/ioapic.c
> +++ b/hw/intc/ioapic.c
> @@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
>          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
>              entry = s->ioredtbl[n];
>  
> -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> +                continue;
> +            }
> +
> +            /* Kick resamplefd if KVM is bypassed */
> +            kvm_resample_fd_notify(n);
KVM is bypassed on the deactivation path but still we call
kvm_resample_fd_notify().
> +
> +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> +                IOAPIC_TRIGGER_LEVEL) {
>                  continue;
>              }
>  
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 141342de98..b67552c047 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
>  int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
>  struct ppc_radix_page_info *kvm_get_radix_page_info(void);
>  int kvm_get_max_memslots(void);
> +
> +/* Notify resamplefd for EOI of specific level triggered interrupts */
> +void kvm_resample_fd_notify(int gsi);
> +
>  #endif
> 
Thanks

Eric



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 17:42     ` Auger Eric
@ 2020-02-27 18:00       ` Peter Xu
  2020-02-27 18:22         ` Auger Eric
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-27 18:00 UTC (permalink / raw)
  To: Auger Eric; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

On Thu, Feb 27, 2020 at 06:42:09PM +0100, Auger Eric wrote:
> Hi Peter,
> 
> On 2/27/20 6:00 PM, Peter Xu wrote:
> > This is majorly only for X86 because that's the only one that supports
> > split irqchip for now.
> > 
> > When the irqchip is split, we face a dilemma that KVM irqfd will be
> > enabled, however the slow irqchip is still running in the userspace.
> > It means that the resamplefd in the kernel irqfds won't take any
> > effect and it can miss to ack INTx interrupts on EOIs.
> Won't it always fail to ack INTx? With the above sentence I understand
> it can work sometimes?

I wanted to mean that it will fail.  How about s/can/will/?  Or even
better wordings that you'd suggest?

> > 
> > One example is split irqchip with VFIO INTx, which will break if we
> > use the VFIO INTx fast path.
> > 
> > This patch can potentially supports the VFIO fast path again for INTx,
> > that the IRQ delivery will still use the fast path, while we don't
> > need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
> > callers of vfio_eoi() hook).  However the EOI of the INTx will still
> > need to be done from the userspace by caching all the resamplefds in
> > QEMU and kick properly for IOAPIC EOI broadcast.
> If I understand correctly this is a one way fast path? Fast path is on
> the trigger side only: VFIO -> KVM but not on the deactivation side,
> trapped by the userspace IOAPIC where you directly notify the UNMASK
> eventfd from userspace. Is that correct?

Right, the injection is still using the whole fast path.  However
AFAIU even for the EOI path it should still be faster than the pure
slow path of vfio INTx EIO.  From what I got from reading the code,
the slow path will conditionally unmap MMIO regions (with a timer to
delay the recovery) so all MMIOs will be slowed down.  For what this
patch is doing, it will need to exit to userspace for sure for each
EOI (after all IOAPIC is in userspace), however for the whole
lifecycle of the device, the MMIO regions should always be mapped so
no unwanted MMIO traps.

> > 
> > When the userspace is responsible for the resamplefd kickup, don't
> > register it on the kvm_irqfd anymore, because on newer kernels (after
> > commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
> > irqchip and resamplefd.  This will make sure that the fast path will
> > work for all supported kernels.
> > 
> > https://patchwork.kernel.org/patch/10738541/#22609933
> > 
> > Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> > v1.1 changelog:
> > - when resamplefd is going to be kicked from userspace, don't register
> >   it again in KVM_IRQFD.  Tested against upstream kernel.
> > 
> >  accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
> >  accel/kvm/trace-events |  1 +
> >  hw/intc/ioapic.c       | 11 +++++--
> >  include/sysemu/kvm.h   |  4 +++
> >  4 files changed, 86 insertions(+), 4 deletions(-)
> > 
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> > index d49b74512a..b766b6e93c 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
> >  static NotifierList kvm_irqchip_change_notifiers =
> >      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
> >  
> > +struct KVMResampleFd {
> > +    int gsi;
> > +    EventNotifier *resample_event;
> > +    QLIST_ENTRY(KVMResampleFd) node;
> > +};
> > +typedef struct KVMResampleFd KVMResampleFd;
> > +
> > +/*
> > + * Only used with split irqchip where we need to do the resample fd
> > + * kick for the kernel from userspace.
> > + */
> > +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
> > +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
> > +
> >  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
> >  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
> >  
> > +static inline void kvm_resample_fd_remove(int gsi)
> > +{
> > +    KVMResampleFd *rfd;
> > +
> > +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> > +        if (rfd->gsi == gsi) {
> > +            QLIST_REMOVE(rfd, node);
> > +            break;
> > +        }
> > +    }
> > +}
> > +
> > +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
> > +{
> > +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
> > +
> > +    rfd->gsi = gsi;
> > +    rfd->resample_event = event;
> > +
> > +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
> > +}
> > +
> > +void kvm_resample_fd_notify(int gsi)
> > +{
> > +    KVMResampleFd *rfd;
> > +
> > +    if (!kvm_irqchip_is_split()) {
> > +        return;
> > +    }
> > +
> > +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> > +        if (rfd->gsi == gsi) {
> > +            event_notifier_set(rfd->resample_event);
> > +            trace_kvm_resample_fd_notify(gsi);
> > +            break;
> > +        }
> > +    }
> > +}
> > +
> >  int kvm_get_max_memslots(void)
> >  {
> >      KVMState *s = KVM_STATE(current_accel());
> > @@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
> >      };
> >  
> >      if (rfd != -1) {
> > -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> > -        irqfd.resamplefd = rfd;
> > +        assert(assign);
> > +        if (kvm_irqchip_is_split()) {
> > +            /*
> > +             * When the slow irqchip (e.g. IOAPIC) is in the
> > +             * userspace, resamplefd will not work because the EOI of
> > +             * the interrupt will be delivered to userspace instead,
> s/delivered to userspace/handled in userspace

It will be delivered to userspace by KVM_EXIT_IOAPIC_EOI, so
maybe... "delivered and handled"?

> > +             * the KVM resample fd kick is skipped.  The userspace
> > +             * needs to remember the resamplefd and kick it when we
> > +             * receive EOI of this IRQ.
> Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
> As such isn't it a bit weird to handle those normal UNMASK eventfds in
> the KVM code?

I'm not sure I completely get the question, but this should be
something general to KVM resamplefd support.  In other words, this
should also fix other devices (besides VFIO) when they're using the
KVM resamplefd, because IMHO it's the resamplefd and split irqchip
which is really broken here.

With that in mind, I think KVM should not need to even know what's
behind the resamplefd (in VFIO's case, it's the UNMASK eventfd).  It
just needs to kick it when IOAPIC EOI comes for the specific IRQ.

> 
> 
> > +             */
> > +            kvm_resample_fd_insert(virq, resample);
> > +        } else {
> > +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> > +            irqfd.resamplefd = rfd;
> > +        }
> > +    } else if (!assign) {
> > +        if (kvm_irqchip_is_split()) {
> > +            kvm_resample_fd_remove(virq);
> > +        }
> >      }
> >  
> >      if (!kvm_irqfds_enabled()) {
> > diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> > index 4fb6e59d19..a68eb66534 100644
> > --- a/accel/kvm/trace-events
> > +++ b/accel/kvm/trace-events
> > @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
> >  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
> >  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
> >  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> > +kvm_resample_fd_notify(int gsi) "gsi %d"
> >  
> > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> > index 15747fe2c2..8c75465c62 100644
> > --- a/hw/intc/ioapic.c
> > +++ b/hw/intc/ioapic.c
> > @@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
> >          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
> >              entry = s->ioredtbl[n];
> >  
> > -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> > -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> > +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> > +                continue;
> > +            }
> > +
> > +            /* Kick resamplefd if KVM is bypassed */
> > +            kvm_resample_fd_notify(n);
> KVM is bypassed on the deactivation path but still we call
> kvm_resample_fd_notify().

Yes I wanted to say that the kernel won't be able to kick the
resamplefd.  How about:

  When IOAPIC is in the userspace (while APIC is still in the kernel),
  we need to kick the resamplefd to deactivate the IRQ for KVM.

Better suggestion on the wording is always welcomed.

Thanks,

> > +
> > +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> > +                IOAPIC_TRIGGER_LEVEL) {
> >                  continue;
> >              }
> >  
> > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> > index 141342de98..b67552c047 100644
> > --- a/include/sysemu/kvm.h
> > +++ b/include/sysemu/kvm.h
> > @@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
> >  int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
> >  struct ppc_radix_page_info *kvm_get_radix_page_info(void);
> >  int kvm_get_max_memslots(void);
> > +
> > +/* Notify resamplefd for EOI of specific level triggered interrupts */
> > +void kvm_resample_fd_notify(int gsi);
> > +
> >  #endif
> > 
> Thanks
> 
> Eric
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 18:00       ` Peter Xu
@ 2020-02-27 18:22         ` Auger Eric
  2020-02-27 19:19           ` Peter Xu
  0 siblings, 1 reply; 30+ messages in thread
From: Auger Eric @ 2020-02-27 18:22 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

Hi Peter,

On 2/27/20 7:00 PM, Peter Xu wrote:
> On Thu, Feb 27, 2020 at 06:42:09PM +0100, Auger Eric wrote:
>> Hi Peter,
>>
>> On 2/27/20 6:00 PM, Peter Xu wrote:
>>> This is majorly only for X86 because that's the only one that supports
>>> split irqchip for now.
>>>
>>> When the irqchip is split, we face a dilemma that KVM irqfd will be
>>> enabled, however the slow irqchip is still running in the userspace.
>>> It means that the resamplefd in the kernel irqfds won't take any
>>> effect and it can miss to ack INTx interrupts on EOIs.
>> Won't it always fail to ack INTx? With the above sentence I understand
>> it can work sometimes?
> 
> I wanted to mean that it will fail.  How about s/can/will/?  Or even
> better wordings that you'd suggest?
yes: s/can/will
> 
>>>
>>> One example is split irqchip with VFIO INTx, which will break if we
>>> use the VFIO INTx fast path.
>>>
>>> This patch can potentially supports the VFIO fast path again for INTx,
>>> that the IRQ delivery will still use the fast path, while we don't
>>> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
>>> callers of vfio_eoi() hook).  However the EOI of the INTx will still
>>> need to be done from the userspace by caching all the resamplefds in
>>> QEMU and kick properly for IOAPIC EOI broadcast.
>> If I understand correctly this is a one way fast path? Fast path is on
>> the trigger side only: VFIO -> KVM but not on the deactivation side,
>> trapped by the userspace IOAPIC where you directly notify the UNMASK
>> eventfd from userspace. Is that correct?
> 
> Right, the injection is still using the whole fast path.  However
> AFAIU even for the EOI path it should still be faster than the pure
> slow path of vfio INTx EIO.  From what I got from reading the code,
> the slow path will conditionally unmap MMIO regions (with a timer to
> delay the recovery) so all MMIOs will be slowed down.  For what this
> patch is doing, it will need to exit to userspace for sure for each
> EOI (after all IOAPIC is in userspace), however for the whole
> lifecycle of the device, the MMIO regions should always be mapped so
> no unwanted MMIO traps.
Yes the EOI is trapped on IOAPIC side and not at the BAR level. So it
should be more efficient and more precise.
> 
>>>
>>> When the userspace is responsible for the resamplefd kickup, don't
>>> register it on the kvm_irqfd anymore, because on newer kernels (after
>>> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
>>> irqchip and resamplefd.  This will make sure that the fast path will
>>> work for all supported kernels.
>>>
>>> https://patchwork.kernel.org/patch/10738541/#22609933
>>>
>>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>> ---
>>> v1.1 changelog:
>>> - when resamplefd is going to be kicked from userspace, don't register
>>>   it again in KVM_IRQFD.  Tested against upstream kernel.
>>>
>>>  accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
>>>  accel/kvm/trace-events |  1 +
>>>  hw/intc/ioapic.c       | 11 +++++--
>>>  include/sysemu/kvm.h   |  4 +++
>>>  4 files changed, 86 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>> index d49b74512a..b766b6e93c 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>>>  static NotifierList kvm_irqchip_change_notifiers =
>>>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
>>>  
>>> +struct KVMResampleFd {
>>> +    int gsi;
>>> +    EventNotifier *resample_event;
>>> +    QLIST_ENTRY(KVMResampleFd) node;
>>> +};
>>> +typedef struct KVMResampleFd KVMResampleFd;
>>> +
>>> +/*
>>> + * Only used with split irqchip where we need to do the resample fd
>>> + * kick for the kernel from userspace.
>>> + */
>>> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
>>> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
>>> +
>>>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
>>>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
>>>  
>>> +static inline void kvm_resample_fd_remove(int gsi)
>>> +{
>>> +    KVMResampleFd *rfd;
>>> +
>>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
>>> +        if (rfd->gsi == gsi) {
>>> +            QLIST_REMOVE(rfd, node);
>>> +            break;
>>> +        }
>>> +    }
>>> +}
>>> +
>>> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
>>> +{
>>> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
>>> +
>>> +    rfd->gsi = gsi;
>>> +    rfd->resample_event = event;
>>> +
>>> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
>>> +}
>>> +
>>> +void kvm_resample_fd_notify(int gsi)
>>> +{
>>> +    KVMResampleFd *rfd;
>>> +
>>> +    if (!kvm_irqchip_is_split()) {
>>> +        return;
>>> +    }
>>> +
>>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
>>> +        if (rfd->gsi == gsi) {
>>> +            event_notifier_set(rfd->resample_event);
>>> +            trace_kvm_resample_fd_notify(gsi);
>>> +            break;
>>> +        }
>>> +    }
>>> +}
>>> +
>>>  int kvm_get_max_memslots(void)
>>>  {
>>>      KVMState *s = KVM_STATE(current_accel());
>>> @@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>>>      };
>>>  
>>>      if (rfd != -1) {
>>> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
>>> -        irqfd.resamplefd = rfd;
>>> +        assert(assign);
>>> +        if (kvm_irqchip_is_split()) {
>>> +            /*
>>> +             * When the slow irqchip (e.g. IOAPIC) is in the
>>> +             * userspace, resamplefd will not work because the EOI of
>>> +             * the interrupt will be delivered to userspace instead,
>> s/delivered to userspace/handled in userspace
> 
> It will be delivered to userspace by KVM_EXIT_IOAPIC_EOI, so
> maybe... "delivered and handled"?
ah ok. TBH I don't really know how the split irqchip works and that may
explain below misunderstandings.
> 
>>> +             * the KVM resample fd kick is skipped.  The userspace
>>> +             * needs to remember the resamplefd and kick it when we
>>> +             * receive EOI of this IRQ.
>> Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
>> As such isn't it a bit weird to handle those normal UNMASK eventfds in
>> the KVM code?
> 
> I'm not sure I completely get the question, but this should be
> something general to KVM resamplefd support.  In other words, this
> should also fix other devices (besides VFIO) when they're using the
> KVM resamplefd, because IMHO it's the resamplefd and split irqchip
> which is really broken here.
Here is my understanding (& memories): the KVM resamplefd is an eventfd
you register to KVM so that KVM triggers the resamplefd when KVM traps
the EOI. Here I understand this is the userspace IOAPIC that traps the
EOI and not the in-kernel virtual interrupt controller. So I would have
expected you just need to signal the VFIO UNMASK eventfd to re-enable
the physical IRQ (which was automasked). This is no more a KVM
resamplefd strictly speaking as KVM is not involved anymore in the
deactivation process.
> 
> With that in mind, I think KVM should not need to even know what's
> behind the resamplefd (in VFIO's case, it's the UNMASK eventfd).  It
> just needs to kick it when IOAPIC EOI comes for the specific IRQ
But above the userspace directly calls
event_notifier_set(rfd->resample_event);

This is not KVM anymore that "kicks it". Or maybe I miss something. So
my comment was, why is it handled in the QEMU KVM layer?
.
> 
>>
>>
>>> +             */
>>> +            kvm_resample_fd_insert(virq, resample);
>>> +        } else {
>>> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
>>> +            irqfd.resamplefd = rfd;
>>> +        }
>>> +    } else if (!assign) {
>>> +        if (kvm_irqchip_is_split()) {
>>> +            kvm_resample_fd_remove(virq);
>>> +        }
>>>      }
>>>  
>>>      if (!kvm_irqfds_enabled()) {
>>> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
>>> index 4fb6e59d19..a68eb66534 100644
>>> --- a/accel/kvm/trace-events
>>> +++ b/accel/kvm/trace-events
>>> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
>>>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>>>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
>>>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
>>> +kvm_resample_fd_notify(int gsi) "gsi %d"
>>>  
>>> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
>>> index 15747fe2c2..8c75465c62 100644
>>> --- a/hw/intc/ioapic.c
>>> +++ b/hw/intc/ioapic.c
>>> @@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
>>>          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
>>>              entry = s->ioredtbl[n];
>>>  
>>> -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
>>> -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
>>> +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
>>> +                continue;
>>> +            }
>>> +
>>> +            /* Kick resamplefd if KVM is bypassed */
>>> +            kvm_resample_fd_notify(n);
>> KVM is bypassed on the deactivation path but still we call
>> kvm_resample_fd_notify().
> 
> Yes I wanted to say that the kernel won't be able to kick the
> resamplefd.  How about:
> 
>   When IOAPIC is in the userspace (while APIC is still in the kernel),
>   we need to kick the resamplefd to deactivate the IRQ for KVM.
This fd "just" aims at unmasking the IRQ at physical level (UNMASK VFIO
event)? Does it perform anything related to the virtual interrupt
controller?

Thanks

Eric
> 
> Better suggestion on the wording is always welcomed.
> 
> Thanks,
> 
>>> +
>>> +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
>>> +                IOAPIC_TRIGGER_LEVEL) {
>>>                  continue;
>>>              }
>>>  
>>> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
>>> index 141342de98..b67552c047 100644
>>> --- a/include/sysemu/kvm.h
>>> +++ b/include/sysemu/kvm.h
>>> @@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
>>>  int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
>>>  struct ppc_radix_page_info *kvm_get_radix_page_info(void);
>>>  int kvm_get_max_memslots(void);
>>> +
>>> +/* Notify resamplefd for EOI of specific level triggered interrupts */
>>> +void kvm_resample_fd_notify(int gsi);
>>> +
>>>  #endif
>>>
>> Thanks
>>
>> Eric
>>
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 18:22         ` Auger Eric
@ 2020-02-27 19:19           ` Peter Xu
  2020-02-27 21:14             ` Auger Eric
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-27 19:19 UTC (permalink / raw)
  To: Auger Eric; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

On Thu, Feb 27, 2020 at 07:22:08PM +0100, Auger Eric wrote:
> Hi Peter,
> 
> On 2/27/20 7:00 PM, Peter Xu wrote:
> > On Thu, Feb 27, 2020 at 06:42:09PM +0100, Auger Eric wrote:
> >> Hi Peter,
> >>
> >> On 2/27/20 6:00 PM, Peter Xu wrote:
> >>> This is majorly only for X86 because that's the only one that supports
> >>> split irqchip for now.
> >>>
> >>> When the irqchip is split, we face a dilemma that KVM irqfd will be
> >>> enabled, however the slow irqchip is still running in the userspace.
> >>> It means that the resamplefd in the kernel irqfds won't take any
> >>> effect and it can miss to ack INTx interrupts on EOIs.
> >> Won't it always fail to ack INTx? With the above sentence I understand
> >> it can work sometimes?
> > 
> > I wanted to mean that it will fail.  How about s/can/will/?  Or even
> > better wordings that you'd suggest?
> yes: s/can/will
> > 
> >>>
> >>> One example is split irqchip with VFIO INTx, which will break if we
> >>> use the VFIO INTx fast path.
> >>>
> >>> This patch can potentially supports the VFIO fast path again for INTx,
> >>> that the IRQ delivery will still use the fast path, while we don't
> >>> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
> >>> callers of vfio_eoi() hook).  However the EOI of the INTx will still
> >>> need to be done from the userspace by caching all the resamplefds in
> >>> QEMU and kick properly for IOAPIC EOI broadcast.
> >> If I understand correctly this is a one way fast path? Fast path is on
> >> the trigger side only: VFIO -> KVM but not on the deactivation side,
> >> trapped by the userspace IOAPIC where you directly notify the UNMASK
> >> eventfd from userspace. Is that correct?
> > 
> > Right, the injection is still using the whole fast path.  However
> > AFAIU even for the EOI path it should still be faster than the pure
> > slow path of vfio INTx EIO.  From what I got from reading the code,
> > the slow path will conditionally unmap MMIO regions (with a timer to
> > delay the recovery) so all MMIOs will be slowed down.  For what this
> > patch is doing, it will need to exit to userspace for sure for each
> > EOI (after all IOAPIC is in userspace), however for the whole
> > lifecycle of the device, the MMIO regions should always be mapped so
> > no unwanted MMIO traps.
> Yes the EOI is trapped on IOAPIC side and not at the BAR level. So it
> should be more efficient and more precise.

Yes.

> > 
> >>>
> >>> When the userspace is responsible for the resamplefd kickup, don't
> >>> register it on the kvm_irqfd anymore, because on newer kernels (after
> >>> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
> >>> irqchip and resamplefd.  This will make sure that the fast path will
> >>> work for all supported kernels.
> >>>
> >>> https://patchwork.kernel.org/patch/10738541/#22609933
> >>>
> >>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> >>> Signed-off-by: Peter Xu <peterx@redhat.com>
> >>> ---
> >>> v1.1 changelog:
> >>> - when resamplefd is going to be kicked from userspace, don't register
> >>>   it again in KVM_IRQFD.  Tested against upstream kernel.
> >>>
> >>>  accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
> >>>  accel/kvm/trace-events |  1 +
> >>>  hw/intc/ioapic.c       | 11 +++++--
> >>>  include/sysemu/kvm.h   |  4 +++
> >>>  4 files changed, 86 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> >>> index d49b74512a..b766b6e93c 100644
> >>> --- a/accel/kvm/kvm-all.c
> >>> +++ b/accel/kvm/kvm-all.c
> >>> @@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
> >>>  static NotifierList kvm_irqchip_change_notifiers =
> >>>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
> >>>  
> >>> +struct KVMResampleFd {
> >>> +    int gsi;
> >>> +    EventNotifier *resample_event;
> >>> +    QLIST_ENTRY(KVMResampleFd) node;
> >>> +};
> >>> +typedef struct KVMResampleFd KVMResampleFd;
> >>> +
> >>> +/*
> >>> + * Only used with split irqchip where we need to do the resample fd
> >>> + * kick for the kernel from userspace.
> >>> + */
> >>> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
> >>> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
> >>> +
> >>>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
> >>>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
> >>>  
> >>> +static inline void kvm_resample_fd_remove(int gsi)
> >>> +{
> >>> +    KVMResampleFd *rfd;
> >>> +
> >>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> >>> +        if (rfd->gsi == gsi) {
> >>> +            QLIST_REMOVE(rfd, node);
> >>> +            break;
> >>> +        }
> >>> +    }
> >>> +}
> >>> +
> >>> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
> >>> +{
> >>> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
> >>> +
> >>> +    rfd->gsi = gsi;
> >>> +    rfd->resample_event = event;
> >>> +
> >>> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
> >>> +}
> >>> +
> >>> +void kvm_resample_fd_notify(int gsi)
> >>> +{
> >>> +    KVMResampleFd *rfd;
> >>> +
> >>> +    if (!kvm_irqchip_is_split()) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> >>> +        if (rfd->gsi == gsi) {
> >>> +            event_notifier_set(rfd->resample_event);
> >>> +            trace_kvm_resample_fd_notify(gsi);
> >>> +            break;
> >>> +        }
> >>> +    }
> >>> +}
> >>> +
> >>>  int kvm_get_max_memslots(void)
> >>>  {
> >>>      KVMState *s = KVM_STATE(current_accel());
> >>> @@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
> >>>      };
> >>>  
> >>>      if (rfd != -1) {
> >>> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> >>> -        irqfd.resamplefd = rfd;
> >>> +        assert(assign);
> >>> +        if (kvm_irqchip_is_split()) {
> >>> +            /*
> >>> +             * When the slow irqchip (e.g. IOAPIC) is in the
> >>> +             * userspace, resamplefd will not work because the EOI of
> >>> +             * the interrupt will be delivered to userspace instead,
> >> s/delivered to userspace/handled in userspace
> > 
> > It will be delivered to userspace by KVM_EXIT_IOAPIC_EOI, so
> > maybe... "delivered and handled"?
> ah ok. TBH I don't really know how the split irqchip works and that may
> explain below misunderstandings.
> > 
> >>> +             * the KVM resample fd kick is skipped.  The userspace
> >>> +             * needs to remember the resamplefd and kick it when we
> >>> +             * receive EOI of this IRQ.
> >> Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
> >> As such isn't it a bit weird to handle those normal UNMASK eventfds in
> >> the KVM code?
> > 
> > I'm not sure I completely get the question, but this should be
> > something general to KVM resamplefd support.  In other words, this
> > should also fix other devices (besides VFIO) when they're using the
> > KVM resamplefd, because IMHO it's the resamplefd and split irqchip
> > which is really broken here.
> Here is my understanding (& memories): the KVM resamplefd is an eventfd
> you register to KVM so that KVM triggers the resamplefd when KVM traps
> the EOI. Here I understand this is the userspace IOAPIC that traps the
> EOI and not the in-kernel virtual interrupt controller. So I would have
> expected you just need to signal the VFIO UNMASK eventfd to re-enable
> the physical IRQ (which was automasked). This is no more a KVM
> resamplefd strictly speaking as KVM is not involved anymore in the
> deactivation process.

Yes KVM kernel side should not be involed when we're using split
irqchip in this case.  However it should still belongs to the work of
the userspace KVM module (kvm-all.c) so that it can still "mimic" the
resamplefd feature that KVM_IRQFD provides.

> > 
> > With that in mind, I think KVM should not need to even know what's
> > behind the resamplefd (in VFIO's case, it's the UNMASK eventfd).  It
> > just needs to kick it when IOAPIC EOI comes for the specific IRQ
> But above the userspace directly calls
> event_notifier_set(rfd->resample_event);
> 
> This is not KVM anymore that "kicks it". Or maybe I miss something. So
> my comment was, why is it handled in the QEMU KVM layer?

It's my fault to be unclear on using "KVM" above.  I should really say
it as kvm-all.c, say, the QEMU layer for the kernel KVM module.

Indeed this problem is complicated... let me try to summarize.

Firstly KVM split irqchip and resamplefd is not really going to work
in the kernel (I think we just overlooked that when introducing the
2nd feature, no matter which one comes first), because the resample
operation should be part of IOAPIC EOI, nevertheless when using split
irqchip IOAPIC is in userspace.

After we noticed this, Alex somewhere proposed to disable that in KVM,
which is actually the 1st kernel patch (654f1f13ea56).

We should (at the same time) propose patch 1 too in this series but I
guess everybody just forgot this afterwards (Paolo actually proposed
mostly the whole solution but I guess it got forgotten too)...

About the fast path speedup: the main logic should be to mimic the
same resamplefd feature as provided by KVM_IRQFD but this time only in
the userspace.  However now we're implementing the same logic only
within userspace kvm-all.c, and the kernel KVM should be totally not
aware of this.  Doing that benefits us in that the KVM interface in
QEMU does not need to be changed (majorly kvm_irqchip_assign_irqfd()).
What we need to do is just to wire up the userspace IOAPIC with these
resamplefds.  And the idea is actually the same too - someone (VFIO)
wants to have one fd (which is the resamplefd) kicked when EOI comes
when requesting for a KVM irqfd, no matter who's going to kick it
(kernel KVM or userspace).  That's all.

> .
> > 
> >>
> >>
> >>> +             */
> >>> +            kvm_resample_fd_insert(virq, resample);
> >>> +        } else {
> >>> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> >>> +            irqfd.resamplefd = rfd;
> >>> +        }
> >>> +    } else if (!assign) {
> >>> +        if (kvm_irqchip_is_split()) {
> >>> +            kvm_resample_fd_remove(virq);
> >>> +        }
> >>>      }
> >>>  
> >>>      if (!kvm_irqfds_enabled()) {
> >>> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> >>> index 4fb6e59d19..a68eb66534 100644
> >>> --- a/accel/kvm/trace-events
> >>> +++ b/accel/kvm/trace-events
> >>> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
> >>>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
> >>>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
> >>>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> >>> +kvm_resample_fd_notify(int gsi) "gsi %d"
> >>>  
> >>> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> >>> index 15747fe2c2..8c75465c62 100644
> >>> --- a/hw/intc/ioapic.c
> >>> +++ b/hw/intc/ioapic.c
> >>> @@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
> >>>          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
> >>>              entry = s->ioredtbl[n];
> >>>  
> >>> -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> >>> -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> >>> +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            /* Kick resamplefd if KVM is bypassed */
> >>> +            kvm_resample_fd_notify(n);
> >> KVM is bypassed on the deactivation path but still we call
> >> kvm_resample_fd_notify().
> > 
> > Yes I wanted to say that the kernel won't be able to kick the
> > resamplefd.  How about:
> > 
> >   When IOAPIC is in the userspace (while APIC is still in the kernel),
> >   we need to kick the resamplefd to deactivate the IRQ for KVM.
> This fd "just" aims at unmasking the IRQ at physical level (UNMASK VFIO
> event)? Does it perform anything related to the virtual interrupt
> controller?

It should not.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 19:19           ` Peter Xu
@ 2020-02-27 21:14             ` Auger Eric
  2020-02-27 21:52               ` Peter Xu
  2020-02-28 10:34               ` Paolo Bonzini
  0 siblings, 2 replies; 30+ messages in thread
From: Auger Eric @ 2020-02-27 21:14 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

Hi Peter,

On 2/27/20 8:19 PM, Peter Xu wrote:
> On Thu, Feb 27, 2020 at 07:22:08PM +0100, Auger Eric wrote:
>> Hi Peter,
>>
>> On 2/27/20 7:00 PM, Peter Xu wrote:
>>> On Thu, Feb 27, 2020 at 06:42:09PM +0100, Auger Eric wrote:
>>>> Hi Peter,
>>>>
>>>> On 2/27/20 6:00 PM, Peter Xu wrote:
>>>>> This is majorly only for X86 because that's the only one that supports
>>>>> split irqchip for now.
>>>>>
>>>>> When the irqchip is split, we face a dilemma that KVM irqfd will be
>>>>> enabled, however the slow irqchip is still running in the userspace.
>>>>> It means that the resamplefd in the kernel irqfds won't take any
>>>>> effect and it can miss to ack INTx interrupts on EOIs.
>>>> Won't it always fail to ack INTx? With the above sentence I understand
>>>> it can work sometimes?
>>>
>>> I wanted to mean that it will fail.  How about s/can/will/?  Or even
>>> better wordings that you'd suggest?
>> yes: s/can/will
>>>
>>>>>
>>>>> One example is split irqchip with VFIO INTx, which will break if we
>>>>> use the VFIO INTx fast path.
>>>>>
>>>>> This patch can potentially supports the VFIO fast path again for INTx,
>>>>> that the IRQ delivery will still use the fast path, while we don't
>>>>> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
>>>>> callers of vfio_eoi() hook).  However the EOI of the INTx will still
>>>>> need to be done from the userspace by caching all the resamplefds in
>>>>> QEMU and kick properly for IOAPIC EOI broadcast.
>>>> If I understand correctly this is a one way fast path? Fast path is on
>>>> the trigger side only: VFIO -> KVM but not on the deactivation side,
>>>> trapped by the userspace IOAPIC where you directly notify the UNMASK
>>>> eventfd from userspace. Is that correct?
>>>
>>> Right, the injection is still using the whole fast path.  However
>>> AFAIU even for the EOI path it should still be faster than the pure
>>> slow path of vfio INTx EIO.  From what I got from reading the code,
>>> the slow path will conditionally unmap MMIO regions (with a timer to
>>> delay the recovery) so all MMIOs will be slowed down.  For what this
>>> patch is doing, it will need to exit to userspace for sure for each
>>> EOI (after all IOAPIC is in userspace), however for the whole
>>> lifecycle of the device, the MMIO regions should always be mapped so
>>> no unwanted MMIO traps.
>> Yes the EOI is trapped on IOAPIC side and not at the BAR level. So it
>> should be more efficient and more precise.
> 
> Yes.
> 
>>>
>>>>>
>>>>> When the userspace is responsible for the resamplefd kickup, don't
>>>>> register it on the kvm_irqfd anymore, because on newer kernels (after
>>>>> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
>>>>> irqchip and resamplefd.  This will make sure that the fast path will
>>>>> work for all supported kernels.
>>>>>
>>>>> https://patchwork.kernel.org/patch/10738541/#22609933
>>>>>
>>>>> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
>>>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>>>>> ---
>>>>> v1.1 changelog:
>>>>> - when resamplefd is going to be kicked from userspace, don't register
>>>>>   it again in KVM_IRQFD.  Tested against upstream kernel.
>>>>>
>>>>>  accel/kvm/kvm-all.c    | 74 ++++++++++++++++++++++++++++++++++++++++--
>>>>>  accel/kvm/trace-events |  1 +
>>>>>  hw/intc/ioapic.c       | 11 +++++--
>>>>>  include/sysemu/kvm.h   |  4 +++
>>>>>  4 files changed, 86 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>>>>> index d49b74512a..b766b6e93c 100644
>>>>> --- a/accel/kvm/kvm-all.c
>>>>> +++ b/accel/kvm/kvm-all.c
>>>>> @@ -159,9 +159,62 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>>>>>  static NotifierList kvm_irqchip_change_notifiers =
>>>>>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
>>>>>  
>>>>> +struct KVMResampleFd {
>>>>> +    int gsi;
>>>>> +    EventNotifier *resample_event;
>>>>> +    QLIST_ENTRY(KVMResampleFd) node;
>>>>> +};
>>>>> +typedef struct KVMResampleFd KVMResampleFd;
>>>>> +
>>>>> +/*
>>>>> + * Only used with split irqchip where we need to do the resample fd
>>>>> + * kick for the kernel from userspace.
>>>>> + */
>>>>> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
>>>>> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
>>>>> +
>>>>>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
>>>>>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
>>>>>  
>>>>> +static inline void kvm_resample_fd_remove(int gsi)
>>>>> +{
>>>>> +    KVMResampleFd *rfd;
>>>>> +
>>>>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
>>>>> +        if (rfd->gsi == gsi) {
>>>>> +            QLIST_REMOVE(rfd, node);
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +}
>>>>> +
>>>>> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
>>>>> +{
>>>>> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
>>>>> +
>>>>> +    rfd->gsi = gsi;
>>>>> +    rfd->resample_event = event;
>>>>> +
>>>>> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
>>>>> +}
>>>>> +
>>>>> +void kvm_resample_fd_notify(int gsi)
>>>>> +{
>>>>> +    KVMResampleFd *rfd;
>>>>> +
>>>>> +    if (!kvm_irqchip_is_split()) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
>>>>> +        if (rfd->gsi == gsi) {
>>>>> +            event_notifier_set(rfd->resample_event);
>>>>> +            trace_kvm_resample_fd_notify(gsi);
>>>>> +            break;
>>>>> +        }
>>>>> +    }
>>>>> +}
>>>>> +
>>>>>  int kvm_get_max_memslots(void)
>>>>>  {
>>>>>      KVMState *s = KVM_STATE(current_accel());
>>>>> @@ -1642,8 +1695,25 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>>>>>      };
>>>>>  
>>>>>      if (rfd != -1) {
>>>>> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
>>>>> -        irqfd.resamplefd = rfd;
>>>>> +        assert(assign);
>>>>> +        if (kvm_irqchip_is_split()) {
>>>>> +            /*
>>>>> +             * When the slow irqchip (e.g. IOAPIC) is in the
>>>>> +             * userspace, resamplefd will not work because the EOI of
>>>>> +             * the interrupt will be delivered to userspace instead,
>>>> s/delivered to userspace/handled in userspace
>>>
>>> It will be delivered to userspace by KVM_EXIT_IOAPIC_EOI, so
>>> maybe... "delivered and handled"?
>> ah ok. TBH I don't really know how the split irqchip works and that may
>> explain below misunderstandings.
>>>
>>>>> +             * the KVM resample fd kick is skipped.  The userspace
>>>>> +             * needs to remember the resamplefd and kick it when we
>>>>> +             * receive EOI of this IRQ.
>>>> Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
>>>> As such isn't it a bit weird to handle those normal UNMASK eventfds in
>>>> the KVM code?
>>>
>>> I'm not sure I completely get the question, but this should be
>>> something general to KVM resamplefd support.  In other words, this
>>> should also fix other devices (besides VFIO) when they're using the
>>> KVM resamplefd, because IMHO it's the resamplefd and split irqchip
>>> which is really broken here.
>> Here is my understanding (& memories): the KVM resamplefd is an eventfd
>> you register to KVM so that KVM triggers the resamplefd when KVM traps
>> the EOI. Here I understand this is the userspace IOAPIC that traps the
>> EOI and not the in-kernel virtual interrupt controller. So I would have
>> expected you just need to signal the VFIO UNMASK eventfd to re-enable
>> the physical IRQ (which was automasked). This is no more a KVM
>> resamplefd strictly speaking as KVM is not involved anymore in the
>> deactivation process.
> 
> Yes KVM kernel side should not be involed when we're using split
> irqchip in this case.  However it should still belongs to the work of
> the userspace KVM module (kvm-all.c) so that it can still "mimic" the
> resamplefd feature that KVM_IRQFD provides.
OK. So that what my actual question. Should this be handled by kvm-all.c?
> 
>>>
>>> With that in mind, I think KVM should not need to even know what's
>>> behind the resamplefd (in VFIO's case, it's the UNMASK eventfd).  It
>>> just needs to kick it when IOAPIC EOI comes for the specific IRQ
>> But above the userspace directly calls
>> event_notifier_set(rfd->resample_event);
>>
>> This is not KVM anymore that "kicks it". Or maybe I miss something. So
>> my comment was, why is it handled in the QEMU KVM layer?
> 
> It's my fault to be unclear on using "KVM" above.  I should really say
> it as kvm-all.c, say, the QEMU layer for the kernel KVM module.
> 
> Indeed this problem is complicated... let me try to summarize.
> 
> Firstly KVM split irqchip and resamplefd is not really going to work
> in the kernel (I think we just overlooked that when introducing the
> 2nd feature, no matter which one comes first), because the resample
> operation should be part of IOAPIC EOI, nevertheless when using split
> irqchip IOAPIC is in userspace.
> 
> After we noticed this, Alex somewhere proposed to disable that in KVM,
> which is actually the 1st kernel patch (654f1f13ea56).
> 
> We should (at the same time) propose patch 1 too in this series but I
> guess everybody just forgot this afterwards (Paolo actually proposed
> mostly the whole solution but I guess it got forgotten too)...
> 
> About the fast path speedup: the main logic should be to mimic the
> same resamplefd feature as provided by KVM_IRQFD but this time only in
> the userspace.  However now we're implementing the same logic only
> within userspace kvm-all.c, and the kernel KVM should be totally not
> aware of this.  Doing that benefits us in that the KVM interface in
> QEMU does not need to be changed (majorly kvm_irqchip_assign_irqfd()).
> What we need to do is just to wire up the userspace IOAPIC with these
> resamplefds.  And the idea is actually the same too - someone (VFIO)
> wants to have one fd (which is the resamplefd) kicked when EOI comes
> when requesting for a KVM irqfd, no matter who's going to kick it
> (kernel KVM or userspace).  That's all.

Yep I think it makes sense to accelerate the trigger path. And for the
EOI path if you have means to trap this on the userspace irqchip it
looks better than doing the map/unmap dance. So it looks a good iead to
me. Now shall it be in kvm-all.c or elsewhere, to me it is not the most
important, as long as we reach a consensus and the scheme gets
documented somewhere.

Thanks

Eric
> 
>> .
>>>
>>>>
>>>>
>>>>> +             */
>>>>> +            kvm_resample_fd_insert(virq, resample);
>>>>> +        } else {
>>>>> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
>>>>> +            irqfd.resamplefd = rfd;
>>>>> +        }
>>>>> +    } else if (!assign) {
>>>>> +        if (kvm_irqchip_is_split()) {
>>>>> +            kvm_resample_fd_remove(virq);
>>>>> +        }
>>>>>      }
>>>>>  
>>>>>      if (!kvm_irqfds_enabled()) {
>>>>> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
>>>>> index 4fb6e59d19..a68eb66534 100644
>>>>> --- a/accel/kvm/trace-events
>>>>> +++ b/accel/kvm/trace-events
>>>>> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
>>>>>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>>>>>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
>>>>>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
>>>>> +kvm_resample_fd_notify(int gsi) "gsi %d"
>>>>>  
>>>>> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
>>>>> index 15747fe2c2..8c75465c62 100644
>>>>> --- a/hw/intc/ioapic.c
>>>>> +++ b/hw/intc/ioapic.c
>>>>> @@ -236,8 +236,15 @@ void ioapic_eoi_broadcast(int vector)
>>>>>          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
>>>>>              entry = s->ioredtbl[n];
>>>>>  
>>>>> -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
>>>>> -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
>>>>> +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
>>>>> +                continue;
>>>>> +            }
>>>>> +
>>>>> +            /* Kick resamplefd if KVM is bypassed */
>>>>> +            kvm_resample_fd_notify(n);
>>>> KVM is bypassed on the deactivation path but still we call
>>>> kvm_resample_fd_notify().
>>>
>>> Yes I wanted to say that the kernel won't be able to kick the
>>> resamplefd.  How about:
>>>
>>>   When IOAPIC is in the userspace (while APIC is still in the kernel),
>>>   we need to kick the resamplefd to deactivate the IRQ for KVM.
>> This fd "just" aims at unmasking the IRQ at physical level (UNMASK VFIO
>> event)? Does it perform anything related to the virtual interrupt
>> controller?
> 
> It should not.
> 
> Thanks,
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 21:14             ` Auger Eric
@ 2020-02-27 21:52               ` Peter Xu
  2020-02-28 10:34               ` Paolo Bonzini
  1 sibling, 0 replies; 30+ messages in thread
From: Peter Xu @ 2020-02-27 21:52 UTC (permalink / raw)
  To: Auger Eric; +Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, qemu-devel

On Thu, Feb 27, 2020 at 10:14:47PM +0100, Auger Eric wrote:
> Hi Peter,

Hi, Eric,

[...]

> >>>>> +             * the KVM resample fd kick is skipped.  The userspace
> >>>>> +             * needs to remember the resamplefd and kick it when we
> >>>>> +             * receive EOI of this IRQ.
> >>>> Practically we now talk about a VFIO ACTION_UNMASK classical eventfd
> >>>> As such isn't it a bit weird to handle those normal UNMASK eventfds in
> >>>> the KVM code?
> >>>
> >>> I'm not sure I completely get the question, but this should be
> >>> something general to KVM resamplefd support.  In other words, this
> >>> should also fix other devices (besides VFIO) when they're using the
> >>> KVM resamplefd, because IMHO it's the resamplefd and split irqchip
> >>> which is really broken here.
> >> Here is my understanding (& memories): the KVM resamplefd is an eventfd
> >> you register to KVM so that KVM triggers the resamplefd when KVM traps
> >> the EOI. Here I understand this is the userspace IOAPIC that traps the
> >> EOI and not the in-kernel virtual interrupt controller. So I would have
> >> expected you just need to signal the VFIO UNMASK eventfd to re-enable
> >> the physical IRQ (which was automasked). This is no more a KVM
> >> resamplefd strictly speaking as KVM is not involved anymore in the
> >> deactivation process.
> > 
> > Yes KVM kernel side should not be involed when we're using split
> > irqchip in this case.  However it should still belongs to the work of
> > the userspace KVM module (kvm-all.c) so that it can still "mimic" the
> > resamplefd feature that KVM_IRQFD provides.
> OK. So that what my actual question. Should this be handled by kvm-all.c?

It should fix KVM split irqchip with resamplefd, so I think it's
natural to do this in kvm-all.c (I'm a bit puzzled on where else we
can put this... :).  Or did I misunderstood your question?

> > 
> >>>
> >>> With that in mind, I think KVM should not need to even know what's
> >>> behind the resamplefd (in VFIO's case, it's the UNMASK eventfd).  It
> >>> just needs to kick it when IOAPIC EOI comes for the specific IRQ
> >> But above the userspace directly calls
> >> event_notifier_set(rfd->resample_event);
> >>
> >> This is not KVM anymore that "kicks it". Or maybe I miss something. So
> >> my comment was, why is it handled in the QEMU KVM layer?
> > 
> > It's my fault to be unclear on using "KVM" above.  I should really say
> > it as kvm-all.c, say, the QEMU layer for the kernel KVM module.
> > 
> > Indeed this problem is complicated... let me try to summarize.
> > 
> > Firstly KVM split irqchip and resamplefd is not really going to work
> > in the kernel (I think we just overlooked that when introducing the
> > 2nd feature, no matter which one comes first), because the resample
> > operation should be part of IOAPIC EOI, nevertheless when using split
> > irqchip IOAPIC is in userspace.
> > 
> > After we noticed this, Alex somewhere proposed to disable that in KVM,
> > which is actually the 1st kernel patch (654f1f13ea56).
> > 
> > We should (at the same time) propose patch 1 too in this series but I
> > guess everybody just forgot this afterwards (Paolo actually proposed
> > mostly the whole solution but I guess it got forgotten too)...
> > 
> > About the fast path speedup: the main logic should be to mimic the
> > same resamplefd feature as provided by KVM_IRQFD but this time only in
> > the userspace.  However now we're implementing the same logic only
> > within userspace kvm-all.c, and the kernel KVM should be totally not
> > aware of this.  Doing that benefits us in that the KVM interface in
> > QEMU does not need to be changed (majorly kvm_irqchip_assign_irqfd()).
> > What we need to do is just to wire up the userspace IOAPIC with these
> > resamplefds.  And the idea is actually the same too - someone (VFIO)
> > wants to have one fd (which is the resamplefd) kicked when EOI comes
> > when requesting for a KVM irqfd, no matter who's going to kick it
> > (kernel KVM or userspace).  That's all.
> 
> Yep I think it makes sense to accelerate the trigger path. And for the
> EOI path if you have means to trap this on the userspace irqchip it
> looks better than doing the map/unmap dance. So it looks a good iead to
> me. Now shall it be in kvm-all.c or elsewhere, to me it is not the most
> important, as long as we reach a consensus and the scheme gets
> documented somewhere.

Sure.

For documentation: as mentioned above, I think the irqfd users will
always use the interface just like before, and the resamplefd should
work exactly like what KVM_IRQFD and kvm_irqchip_assign_irqfd() was
offering before this patch too.  IMO it'll just start to work even for
split irqchips which was silently broken without being noticed.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-27 21:14             ` Auger Eric
  2020-02-27 21:52               ` Peter Xu
@ 2020-02-28 10:34               ` Paolo Bonzini
  2020-02-28 10:36                 ` Auger Eric
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-28 10:34 UTC (permalink / raw)
  To: Auger Eric, Peter Xu; +Cc: Alex Williamson, Cornelia Huck, qemu-devel

On 27/02/20 22:14, Auger Eric wrote:
>> Yes KVM kernel side should not be involed when we're using split
>> irqchip in this case.  However it should still belongs to the work of
>> the userspace KVM module (kvm-all.c) so that it can still "mimic" the
>> resamplefd feature that KVM_IRQFD provides.
> OK. So that what my actual question. Should this be handled by kvm-all.c?

I think it should; kvm-all.c in this case is providing the API to enable
irqfds (including resamplefds).

You could have a generic file descriptor<->interrupt routing subsystem,
but for now that only exists for KVM so that's where Peter's code need
to go.

Paolo



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-26 22:55 ` [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
  2020-02-27 17:00   ` [PATCH v1.1 " Peter Xu
@ 2020-02-28 10:34   ` Paolo Bonzini
  2020-02-28 14:58     ` Peter Xu
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-28 10:34 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Eric Auger, Alex Williamson, Cornelia Huck

On 26/02/20 23:55, Peter Xu wrote:
> +
> +            /* Kick resamplefd if KVM is bypassed */
> +            kvm_resample_fd_notify(n);

This is only needed for level-triggered interrupts, so it can be placed
below the test.

Paolo

> +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> +                IOAPIC_TRIGGER_LEVEL) {
>                  continue;



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH v1.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-28 10:34               ` Paolo Bonzini
@ 2020-02-28 10:36                 ` Auger Eric
  0 siblings, 0 replies; 30+ messages in thread
From: Auger Eric @ 2020-02-28 10:36 UTC (permalink / raw)
  To: Paolo Bonzini, Peter Xu; +Cc: Alex Williamson, Cornelia Huck, qemu-devel

Hi Paolo,

On 2/28/20 11:34 AM, Paolo Bonzini wrote:
> On 27/02/20 22:14, Auger Eric wrote:
>>> Yes KVM kernel side should not be involed when we're using split
>>> irqchip in this case.  However it should still belongs to the work of
>>> the userspace KVM module (kvm-all.c) so that it can still "mimic" the
>>> resamplefd feature that KVM_IRQFD provides.
>> OK. So that what my actual question. Should this be handled by kvm-all.c?
> 
> I think it should; kvm-all.c in this case is providing the API to enable
> irqfds (including resamplefds).
> 
> You could have a generic file descriptor<->interrupt routing subsystem,
> but for now that only exists for KVM so that's where Peter's code need
> to go.

OK

Thanks

Eric
> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (5 preceding siblings ...)
  2020-02-27 15:32 ` [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Auger Eric
@ 2020-02-28 10:36 ` Paolo Bonzini
  2020-02-28 15:25   ` Peter Xu
  6 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-28 10:36 UTC (permalink / raw)
  To: Peter Xu, qemu-devel; +Cc: Eric Auger, Alex Williamson, Cornelia Huck

On 26/02/20 23:50, Peter Xu wrote:
> VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> will directly fail with resamplefd attached so QEMU will automatically
> fallback to the INTx slow path.  However on old kernels it's still
> broken.
> 
> Only until recently I noticed that this could also break PXE boot for
> assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
> using INTx as well, which means we can't bypass that even if we
> enables MSI for the guest kernel.
> 
> This series tries to first fix this issue function-wise, then speed up
> for the INTx again with resamplefd (mostly following the ideas
> proposed by Paolo one year ago [2]).  My TCP_RR test shows that:
> 
>   - Before this series: this is broken, no number to show
> 
>   - After patch 1 (enable slow path): get 63% perf comparing to full
>     kernel irqchip

Oh, I thought something like patch 1 had already been applied.

One comment: because you're bypassing IOAPIC when raising the irq, the
IOAPIC's remote_irr for example will not be set.  Most OSes probably
don't care, but it's at least worth a comment.

Paolo

>   - After whole series (enable fast path partly, irq injection will be
>     the same as fast path, however userspace needs to intercept for
>     EOI broadcast to resamplefd, though should still be faster than
>     the MMIO trick for intx eoi): get 93% perf comparing to full
>     kernel irqchip, which is a 46% performance boost



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-28 10:34   ` [PATCH " Paolo Bonzini
@ 2020-02-28 14:58     ` Peter Xu
  2020-02-28 15:24       ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-28 14:58 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Eric Auger, Alex Williamson, Cornelia Huck, qemu-devel

On Fri, Feb 28, 2020 at 11:34:46AM +0100, Paolo Bonzini wrote:
> On 26/02/20 23:55, Peter Xu wrote:
> > +
> > +            /* Kick resamplefd if KVM is bypassed */
> > +            kvm_resample_fd_notify(n);
> 
> This is only needed for level-triggered interrupts, so it can be placed
> below the test.

Yes I was thinking it the same way, however...  I noticed that VFIO
will even mask edge INTx after getting IRQ (vfio_intx_handler).  And
if look into current KVM kernel implementation, it has done the same
logic to ack edge triggerred irqs (kvm_ioapic_update_eoi_one), we
should logically ack it here too even for edge-triggered?

Thanks,

> 
> Paolo
> 
> > +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> > +                IOAPIC_TRIGGER_LEVEL) {
> >                  continue;
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-02-28 14:58     ` Peter Xu
@ 2020-02-28 15:24       ` Paolo Bonzini
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-28 15:24 UTC (permalink / raw)
  To: Peter Xu; +Cc: Eric Auger, Alex Williamson, Cornelia Huck, qemu-devel

On 28/02/20 15:58, Peter Xu wrote:
> Yes I was thinking it the same way, however...  I noticed that VFIO
> will even mask edge INTx after getting IRQ (vfio_intx_handler).

INTx is always level-triggered so that's okay.

> And if look into current KVM kernel implementation, it has done the
> same logic to ack edge triggerred irqs (kvm_ioapic_update_eoi_one),
> we should logically ack it here too even for edge-triggered?

Right, that's because it's using ack notifiers also for in-kernel PIT
and for RTC.  But yeah, it makes sense to have userspace IOAPIC
resamplefd match the functionality of the kernel ones.  Thanks!

Paolo



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-02-28 10:36 ` Paolo Bonzini
@ 2020-02-28 15:25   ` Peter Xu
  2020-02-28 15:32     ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: Peter Xu @ 2020-02-28 15:25 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: Eric Auger, Alex Williamson, Cornelia Huck, qemu-devel

On Fri, Feb 28, 2020 at 11:36:55AM +0100, Paolo Bonzini wrote:
> On 26/02/20 23:50, Peter Xu wrote:
> > VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
> > will directly fail with resamplefd attached so QEMU will automatically
> > fallback to the INTx slow path.  However on old kernels it's still
> > broken.
> > 
> > Only until recently I noticed that this could also break PXE boot for
> > assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
> > using INTx as well, which means we can't bypass that even if we
> > enables MSI for the guest kernel.
> > 
> > This series tries to first fix this issue function-wise, then speed up
> > for the INTx again with resamplefd (mostly following the ideas
> > proposed by Paolo one year ago [2]).  My TCP_RR test shows that:
> > 
> >   - Before this series: this is broken, no number to show
> > 
> >   - After patch 1 (enable slow path): get 63% perf comparing to full
> >     kernel irqchip
> 
> Oh, I thought something like patch 1 had already been applied.
> 
> One comment: because you're bypassing IOAPIC when raising the irq, the
> IOAPIC's remote_irr for example will not be set.  Most OSes probably
> don't care, but it's at least worth a comment.

Ouch I should definitely do that...  How about something like this
(in ioapic_eoi_broadcast(), I even changed kvm_resample_fd_notify to
return a boolean to show whether some GSI is kicked so for this case
we don't need to proceed on checking irr and remote irr):

            /*
             * When IOAPIC is in the userspace while APIC is still in
             * the kernel (i.e., split irqchip), we have a trick to
             * kick the resamplefd logic for registered irqfds from
             * userspace to deactivate the IRQ.  When that happens, it
             * means the irq bypassed userspace IOAPIC (so the irr and
             * remote-irr of the table entry should be bypassed too
             * even if interrupt come), then we don't need to clear
             * the remote-IRR and check irr again because they'll
             * always be zeros.
             */
            if (kvm_resample_fd_notify(n)) {
                continue;
            }

I confess this is still tricky, and actually after some careful read I
noticed you've proposed a similar kernel fix for the problem too which
I overlooked (https://patchwork.kernel.org/patch/10738541/#22609933).
My current thought is that we keep this hackery in userspace only so
we keep split+resamplefd forbidden in the kernel and be clean there.

What's your opinion?

(I should have marked this series as RFC when post)

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-02-28 15:25   ` Peter Xu
@ 2020-02-28 15:32     ` Paolo Bonzini
  0 siblings, 0 replies; 30+ messages in thread
From: Paolo Bonzini @ 2020-02-28 15:32 UTC (permalink / raw)
  To: Peter Xu; +Cc: Eric Auger, Alex Williamson, Cornelia Huck, qemu-devel

On 28/02/20 16:25, Peter Xu wrote:
> My current thought is that we keep this hackery in userspace only so
> we keep split+resamplefd forbidden in the kernel and be clean there.

It is better, yes.  The kernel solution makes some sense because split
irqchip _does_ have a concept of ioapic GSIs and routes.  But the kernel
patch would not fix the fact that remote-irr remains zero, which would
need a userspace change anyway.

Paolo



^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2020-02-28 15:33 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-02-26 22:50 [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
2020-02-26 22:50 ` [PATCH 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
2020-02-27 16:53   ` Auger Eric
2020-02-27 17:10     ` Peter Xu
2020-02-26 22:50 ` [PATCH 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
2020-02-27 11:04   ` Auger Eric
2020-02-27 16:41   ` Cornelia Huck
2020-02-26 22:54 ` [PATCH 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
2020-02-27 17:01   ` Auger Eric
2020-02-26 22:55 ` [PATCH 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
2020-02-27 17:00   ` [PATCH v1.1 " Peter Xu
2020-02-27 17:18     ` Peter Xu
2020-02-27 17:42     ` Auger Eric
2020-02-27 18:00       ` Peter Xu
2020-02-27 18:22         ` Auger Eric
2020-02-27 19:19           ` Peter Xu
2020-02-27 21:14             ` Auger Eric
2020-02-27 21:52               ` Peter Xu
2020-02-28 10:34               ` Paolo Bonzini
2020-02-28 10:36                 ` Auger Eric
2020-02-28 10:34   ` [PATCH " Paolo Bonzini
2020-02-28 14:58     ` Peter Xu
2020-02-28 15:24       ` Paolo Bonzini
2020-02-26 22:55 ` [PATCH 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
2020-02-27 15:32 ` [PATCH 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Auger Eric
2020-02-27 15:51   ` Peter Xu
2020-02-27 17:02     ` Peter Xu
2020-02-28 10:36 ` Paolo Bonzini
2020-02-28 15:25   ` Peter Xu
2020-02-28 15:32     ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.