All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
@ 2020-03-17 19:50 Peter Xu
  2020-03-17 19:50 ` [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
                   ` (5 more replies)
  0 siblings, 6 replies; 17+ messages in thread
From: Peter Xu @ 2020-03-17 19:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, peterx, Eric Auger

I believe I missed 5.0... Anyway still good to post it before being
forgotten...

v3:
- collect r-bs for Eric
- unconditionally call kvm_resample_fd_notify(), change comment [Alex]
- remove the split irqchip check in kvm_resample_fd_notify(), then let
  it return nothing [Alex]
- test against shared irq to make sure it won't break

v2:
- pick tags
- don't register resamplefd with KVM kernel when the userspace
  resamplefd path is enabled (should enable fast path on new kernels)
- fix resamplefd mem leak
- fix commit message of patch 4 [Eric]
- let kvm_resample_fd_notify() return a boolean, skip ioapic check if
  returned true
- more comments here and there in the code to state the fact that
  userspace ioapic irr & remote-irr are bypassed [Paolo]

VFIO INTx is not working with split irqchip.  On new kernels KVM_IRQFD
will directly fail with resamplefd attached so QEMU will automatically
fallback to the INTx slow path.  However on old kernels it's still
broken.

Only until recently I noticed that this could also break PXE boot for
assigned NICs [1].  My wild guess is that the PXE ROM will be mostly
using INTx as well, which means we can't bypass that even if we
enables MSI for the guest kernel.

This series tries to first fix this issue function-wise, then speed up
for the INTx again with resamplefd (mostly following the ideas
proposed by Paolo one year ago [2]).  My TCP_RR test shows that:

  - Before this series: this is broken, no number to show

  - After patch 1 (enable slow path): get 63% perf comparing to full
    kernel irqchip

  - After whole series (enable fast path partly, irq injection will be
    the same as fast path, however userspace needs to intercept for
    EOI broadcast to resamplefd, though should still be faster than
    the MMIO trick for intx eoi): get 93% perf comparing to full
    kernel irqchip, which is a 46% performance boost

I think we can consider to apply patch 1 even sooner than the rest of
the series to unbreak intx+split first.

The whole test matrix for reference:

  |----------+---------+-------------------+--------------+--------------------|
  | IRQ type | irqchip | TCP_STREAM (Gbps) | TCP_RR (pps) | note               |
  |----------+---------+-------------------+--------------+--------------------|
  | msi      | on      |              9.39 |        17567 |                    |
  | nomsi    | on      |              9.29 |        14056 |                    |
  | msi      | split   |              9.36 |        17330 |                    |
  | nomsi    | split   |                 / |            / | currently broken   |
  | nomsi    | split   |              8.98 |         8977 | after patch 1      |
  | nomsi    | split   |              9.21 |        13142 | after whole series |
  |----------+---------+-------------------+--------------+--------------------|

Any review comment is welcomed.  Thanks,

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1786404
[2] https://patchwork.kernel.org/patch/10738541/#22609933

Peter Xu (5):
  vfio/pci: Disable INTx fast path if using split irqchip
  vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  KVM: Kick resamplefd for split kernel irqchip
  Revert "vfio/pci: Disable INTx fast path if using split irqchip"

 accel/kvm/kvm-all.c    | 95 ++++++++++++++++++++++++++++++++++++++----
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 25 ++++++++++-
 hw/vfio/pci.c          | 37 +++++++---------
 include/sysemu/kvm.h   |  4 ++
 5 files changed, 130 insertions(+), 32 deletions(-)

-- 
2.24.1



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip
  2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
@ 2020-03-17 19:50 ` Peter Xu
  2020-03-17 22:57   ` Alex Williamson
  2020-03-17 19:50 ` [PATCH v3 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2020-03-17 19:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, peterx, Eric Auger

It's currently broken.  Let's use the slow path to at least make it
functional.

Tested-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/pci.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 5e75a95129..98e0e0c994 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -128,6 +128,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
+    if (kvm_irqchip_is_split()) {
+        /*
+         * VFIO INTx is currently not working with split kernel
+         * irqchip for level triggered interrupts.  Go the slow path
+         * as long as split is enabled so we can be at least
+         * functional (even with poor performance).
+         *
+         * TODO: Remove this after all things fixed up.
+         */
+        return;
+    }
+
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
     vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
  2020-03-17 19:50 ` [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
@ 2020-03-17 19:50 ` Peter Xu
  2020-03-17 22:57   ` Alex Williamson
  2020-03-17 19:50 ` [PATCH v3 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2020-03-17 19:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, peterx, Eric Auger

VFIO is currently the only one left that is not using the generic
function (kvm_irqchip_add_irqfd_notifier_gsi()) to register irqfds.
Let VFIO use the common framework too.

Follow up patches will introduce extra features for kvm irqfd, so that
VFIO can easily leverage that after the switch.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Reviewed-by: Cornelia Huck <cohuck@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/pci.c | 37 +++++++++++++++----------------------
 1 file changed, 15 insertions(+), 22 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 98e0e0c994..09703362df 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -115,11 +115,7 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
 static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
 {
 #ifdef CONFIG_KVM
-    struct kvm_irqfd irqfd = {
-        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
-        .gsi = vdev->intx.route.irq,
-        .flags = KVM_IRQFD_FLAG_RESAMPLE,
-    };
+    int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
     Error *err = NULL;
 
     if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
@@ -141,7 +137,7 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     }
 
     /* Get to a known interrupt state */
-    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
+    qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
     vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
     vdev->intx.pending = false;
     pci_irq_deassert(&vdev->pdev);
@@ -152,17 +148,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         goto fail;
     }
 
-    /* KVM triggers it, VFIO listens for it */
-    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
-
-    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
+                                           &vdev->intx.interrupt,
+                                           &vdev->intx.unmask,
+                                           vdev->intx.route.irq)) {
         error_setg_errno(errp, errno, "failed to setup resample irqfd");
         goto fail_irqfd;
     }
 
     if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
                                VFIO_IRQ_SET_ACTION_UNMASK,
-                               irqfd.resamplefd, &err)) {
+                               event_notifier_get_fd(&vdev->intx.unmask),
+                               &err)) {
         error_propagate(errp, err);
         goto fail_vfio;
     }
@@ -177,12 +174,12 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
     return;
 
 fail_vfio:
-    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
-    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
+    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
+                                          vdev->intx.route.irq);
 fail_irqfd:
     event_notifier_cleanup(&vdev->intx.unmask);
 fail:
-    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+    qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
     vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
 #endif
 }
@@ -190,12 +187,6 @@ fail:
 static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
 {
 #ifdef CONFIG_KVM
-    struct kvm_irqfd irqfd = {
-        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
-        .gsi = vdev->intx.route.irq,
-        .flags = KVM_IRQFD_FLAG_DEASSIGN,
-    };
-
     if (!vdev->intx.kvm_accel) {
         return;
     }
@@ -209,7 +200,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     pci_irq_deassert(&vdev->pdev);
 
     /* Tell KVM to stop listening for an INTx irqfd */
-    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
+    if (kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
+                                              vdev->intx.route.irq)) {
         error_report("vfio: Error: Failed to disable INTx irqfd: %m");
     }
 
@@ -217,7 +209,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
     event_notifier_cleanup(&vdev->intx.unmask);
 
     /* QEMU starts listening for interrupt events. */
-    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
+    qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
+                        vfio_intx_interrupt, NULL, vdev);
 
     vdev->intx.kvm_accel = false;
 
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
  2020-03-17 19:50 ` [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
  2020-03-17 19:50 ` [PATCH v3 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
@ 2020-03-17 19:50 ` Peter Xu
  2020-03-17 22:58   ` Alex Williamson
  2020-03-17 19:50 ` [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2020-03-17 19:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, peterx, Eric Auger

So that kvm_irqchip_assign_irqfd() can have access to the
EventNotifiers, especially the resample event.  It is needed in follow
up patch to cache and kick resamplefds from QEMU.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 439a4efe52..d49b74512a 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1628,9 +1628,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg,
     return kvm_update_routing_entry(s, &kroute);
 }
 
-static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
+static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
+                                    EventNotifier *resample, int virq,
                                     bool assign)
 {
+    int fd = event_notifier_get_fd(event);
+    int rfd = resample ? event_notifier_get_fd(resample) : -1;
+
     struct kvm_irqfd irqfd = {
         .fd = fd,
         .gsi = virq,
@@ -1735,7 +1739,9 @@ int kvm_irqchip_add_hv_sint_route(KVMState *s, uint32_t vcpu, uint32_t sint)
     return -ENOSYS;
 }
 
-static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
+static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
+                                    EventNotifier *resample, int virq,
+                                    bool assign)
 {
     abort();
 }
@@ -1749,15 +1755,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
 int kvm_irqchip_add_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
                                        EventNotifier *rn, int virq)
 {
-    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n),
-           rn ? event_notifier_get_fd(rn) : -1, virq, true);
+    return kvm_irqchip_assign_irqfd(s, n, rn, virq, true);
 }
 
 int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
                                           int virq)
 {
-    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq,
-           false);
+    return kvm_irqchip_assign_irqfd(s, n, NULL, virq, false);
 }
 
 int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (2 preceding siblings ...)
  2020-03-17 19:50 ` [PATCH v3 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
@ 2020-03-17 19:50 ` Peter Xu
  2020-03-17 21:06   ` Alex Williamson
  2020-03-17 22:49   ` [PATCH v3.1 " Peter Xu
  2020-03-17 19:50 ` [PATCH v3 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
  2020-03-18  2:56 ` [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx no-reply
  5 siblings, 2 replies; 17+ messages in thread
From: Peter Xu @ 2020-03-17 19:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, peterx, Eric Auger

This is majorly only for X86 because that's the only one that supports
split irqchip for now.

When the irqchip is split, we face a dilemma that KVM irqfd will be
enabled, however the slow irqchip is still running in the userspace.
It means that the resamplefd in the kernel irqfds won't take any
effect and it will miss to ack INTx interrupts on EOIs.

One example is split irqchip with VFIO INTx, which will break if we
use the VFIO INTx fast path.

This patch can potentially supports the VFIO fast path again for INTx,
that the IRQ delivery will still use the fast path, while we don't
need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
callers of vfio_eoi() hook).  However the EOI of the INTx will still
need to be done from the userspace by caching all the resamplefds in
QEMU and kick properly for IOAPIC EOI broadcast.

This is tricky because in this case the userspace ioapic irr &
remote-irr will be bypassed.  However such a change will greatly boost
performance for assigned devices using INTx irqs (TCP_RR boosts 46%
after this patch applied).

When the userspace is responsible for the resamplefd kickup, don't
register it on the kvm_irqfd anymore, because on newer kernels (after
commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
irqchip and resamplefd.  This will make sure that the fast path will
work for all supported kernels.

https://patchwork.kernel.org/patch/10738541/#22609933

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 accel/kvm/kvm-all.c    | 79 ++++++++++++++++++++++++++++++++++++++++--
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 25 +++++++++++--
 include/sysemu/kvm.h   |  4 +++
 4 files changed, 105 insertions(+), 4 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d49b74512a..9a85fd1b8f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -159,9 +159,59 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
 static NotifierList kvm_irqchip_change_notifiers =
     NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
 
+struct KVMResampleFd {
+    int gsi;
+    EventNotifier *resample_event;
+    QLIST_ENTRY(KVMResampleFd) node;
+};
+typedef struct KVMResampleFd KVMResampleFd;
+
+/*
+ * Only used with split irqchip where we need to do the resample fd
+ * kick for the kernel from userspace.
+ */
+static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
+    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
+
 #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
 #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
 
+static inline void kvm_resample_fd_remove(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            QLIST_REMOVE(rfd, node);
+            g_free(rfd);
+            break;
+        }
+    }
+}
+
+static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
+{
+    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
+
+    rfd->gsi = gsi;
+    rfd->resample_event = event;
+
+    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
+}
+
+void kvm_resample_fd_notify(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            event_notifier_set(rfd->resample_event);
+            trace_kvm_resample_fd_notify(gsi);
+            return;
+        }
+    }
+}
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_accel());
@@ -1642,8 +1692,33 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
     };
 
     if (rfd != -1) {
-        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
-        irqfd.resamplefd = rfd;
+        assert(assign);
+        if (kvm_irqchip_is_split()) {
+            /*
+             * When the slow irqchip (e.g. IOAPIC) is in the
+             * userspace, KVM kernel resamplefd will not work because
+             * the EOI of the interrupt will be delivered to userspace
+             * instead, so the KVM kernel resamplefd kick will be
+             * skipped.  The userspace here mimics what the kernel
+             * provides with resamplefd, remember the resamplefd and
+             * kick it when we receive EOI of this IRQ.
+             *
+             * This is hackery because IOAPIC is mostly bypassed
+             * (except EOI broadcasts) when irqfd is used.  However
+             * this can bring much performance back for split irqchip
+             * with INTx IRQs (for VFIO, this gives 93% perf of the
+             * full fast path, which is 46% perf boost comparing to
+             * the INTx slow path).
+             */
+            kvm_resample_fd_insert(virq, resample);
+        } else {
+            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+            irqfd.resamplefd = rfd;
+        }
+    } else if (!assign) {
+        if (kvm_irqchip_is_split()) {
+            kvm_resample_fd_remove(virq);
+        }
     }
 
     if (!kvm_irqfds_enabled()) {
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 4fb6e59d19..a68eb66534 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
+kvm_resample_fd_notify(int gsi) "gsi %d"
 
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 15747fe2c2..81a17cc2b8 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -236,8 +236,29 @@ void ioapic_eoi_broadcast(int vector)
         for (n = 0; n < IOAPIC_NUM_PINS; n++) {
             entry = s->ioredtbl[n];
 
-            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
-                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
+            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
+                continue;
+            }
+
+            /*
+             * When IOAPIC is in the userspace while APIC is still in
+             * the kernel (i.e., split irqchip), we have a trick to
+             * kick the resamplefd logic for registered irqfds from
+             * userspace to deactivate the IRQ.  When that happens, it
+             * means the irq bypassed userspace IOAPIC (so the irr and
+             * remote-irr of the table entry should be bypassed too
+             * even if interrupt come).  Still kick the resamplefds if
+             * they're bound to the IRQ, to make sure to EOI the
+             * interrupt for the hardware correctly.
+             *
+             * Note: We still need to go through the irr & remote-irr
+             * operations below because we don't know whether there're
+             * emulated devices that are using/sharing the same IRQ.
+             */
+            kvm_resample_fd_notify(n);
+
+            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
+                IOAPIC_TRIGGER_LEVEL) {
                 continue;
             }
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 141342de98..583a976f8a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
 int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
 struct ppc_radix_page_info *kvm_get_radix_page_info(void);
 int kvm_get_max_memslots(void);
+
+/* Notify resamplefd for EOI of specific interrupts. */
+void kvm_resample_fd_notify(int gsi);
+
 #endif
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH v3 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip"
  2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (3 preceding siblings ...)
  2020-03-17 19:50 ` [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
@ 2020-03-17 19:50 ` Peter Xu
  2020-03-17 22:58   ` Alex Williamson
  2020-03-18  2:56 ` [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx no-reply
  5 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2020-03-17 19:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, peterx, Eric Auger

With the resamplefd list introduced, we can savely enable VFIO INTx
fast path again with split irqchip so it can still be faster than the
complete slow path.

Reviewed-by: Eric Auger <eric.auger@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/pci.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 09703362df..1c0aa27386 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -124,18 +124,6 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
         return;
     }
 
-    if (kvm_irqchip_is_split()) {
-        /*
-         * VFIO INTx is currently not working with split kernel
-         * irqchip for level triggered interrupts.  Go the slow path
-         * as long as split is enabled so we can be at least
-         * functional (even with poor performance).
-         *
-         * TODO: Remove this after all things fixed up.
-         */
-        return;
-    }
-
     /* Get to a known interrupt state */
     qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
     vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 19:50 ` [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
@ 2020-03-17 21:06   ` Alex Williamson
  2020-03-17 21:41     ` Peter Xu
  2020-03-17 22:49   ` [PATCH v3.1 " Peter Xu
  1 sibling, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 21:06 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 15:50:41 -0400
Peter Xu <peterx@redhat.com> wrote:

> This is majorly only for X86 because that's the only one that supports
> split irqchip for now.
> 
> When the irqchip is split, we face a dilemma that KVM irqfd will be
> enabled, however the slow irqchip is still running in the userspace.
> It means that the resamplefd in the kernel irqfds won't take any
> effect and it will miss to ack INTx interrupts on EOIs.
> 
> One example is split irqchip with VFIO INTx, which will break if we
> use the VFIO INTx fast path.
> 
> This patch can potentially supports the VFIO fast path again for INTx,
> that the IRQ delivery will still use the fast path, while we don't
> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
> callers of vfio_eoi() hook).  However the EOI of the INTx will still
> need to be done from the userspace by caching all the resamplefds in
> QEMU and kick properly for IOAPIC EOI broadcast.
> 
> This is tricky because in this case the userspace ioapic irr &
> remote-irr will be bypassed.  However such a change will greatly boost
> performance for assigned devices using INTx irqs (TCP_RR boosts 46%
> after this patch applied).
> 
> When the userspace is responsible for the resamplefd kickup, don't
> register it on the kvm_irqfd anymore, because on newer kernels (after
> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
> irqchip and resamplefd.  This will make sure that the fast path will
> work for all supported kernels.
> 
> https://patchwork.kernel.org/patch/10738541/#22609933
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c    | 79 ++++++++++++++++++++++++++++++++++++++++--
>  accel/kvm/trace-events |  1 +
>  hw/intc/ioapic.c       | 25 +++++++++++--
>  include/sysemu/kvm.h   |  4 +++
>  4 files changed, 105 insertions(+), 4 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index d49b74512a..9a85fd1b8f 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -159,9 +159,59 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>  static NotifierList kvm_irqchip_change_notifiers =
>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
>  
> +struct KVMResampleFd {
> +    int gsi;
> +    EventNotifier *resample_event;
> +    QLIST_ENTRY(KVMResampleFd) node;
> +};
> +typedef struct KVMResampleFd KVMResampleFd;
> +
> +/*
> + * Only used with split irqchip where we need to do the resample fd
> + * kick for the kernel from userspace.
> + */
> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
> +
>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
>  
> +static inline void kvm_resample_fd_remove(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            QLIST_REMOVE(rfd, node);
> +            g_free(rfd);
> +            break;
> +        }
> +    }
> +}
> +
> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
> +{
> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
> +
> +    rfd->gsi = gsi;
> +    rfd->resample_event = event;
> +
> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
> +}
> +
> +void kvm_resample_fd_notify(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            event_notifier_set(rfd->resample_event);
> +            trace_kvm_resample_fd_notify(gsi);
> +            return;
> +        }
> +    }
> +}
> +
>  int kvm_get_max_memslots(void)
>  {
>      KVMState *s = KVM_STATE(current_accel());
> @@ -1642,8 +1692,33 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>      };
>  
>      if (rfd != -1) {
> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> -        irqfd.resamplefd = rfd;
> +        assert(assign);
> +        if (kvm_irqchip_is_split()) {
> +            /*
> +             * When the slow irqchip (e.g. IOAPIC) is in the
> +             * userspace, KVM kernel resamplefd will not work because
> +             * the EOI of the interrupt will be delivered to userspace
> +             * instead, so the KVM kernel resamplefd kick will be
> +             * skipped.  The userspace here mimics what the kernel
> +             * provides with resamplefd, remember the resamplefd and
> +             * kick it when we receive EOI of this IRQ.
> +             *
> +             * This is hackery because IOAPIC is mostly bypassed
> +             * (except EOI broadcasts) when irqfd is used.  However
> +             * this can bring much performance back for split irqchip
> +             * with INTx IRQs (for VFIO, this gives 93% perf of the
> +             * full fast path, which is 46% perf boost comparing to
> +             * the INTx slow path).
> +             */
> +            kvm_resample_fd_insert(virq, resample);
> +        } else {
> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> +            irqfd.resamplefd = rfd;
> +        }
> +    } else if (!assign) {
> +        if (kvm_irqchip_is_split()) {
> +            kvm_resample_fd_remove(virq);
> +        }
>      }
>  
>      if (!kvm_irqfds_enabled()) {
> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> index 4fb6e59d19..a68eb66534 100644
> --- a/accel/kvm/trace-events
> +++ b/accel/kvm/trace-events
> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> +kvm_resample_fd_notify(int gsi) "gsi %d"
>  
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> index 15747fe2c2..81a17cc2b8 100644
> --- a/hw/intc/ioapic.c
> +++ b/hw/intc/ioapic.c
> @@ -236,8 +236,29 @@ void ioapic_eoi_broadcast(int vector)
>          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
>              entry = s->ioredtbl[n];
>  
> -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> +                continue;
> +            }
> +
> +            /*
> +             * When IOAPIC is in the userspace while APIC is still in
> +             * the kernel (i.e., split irqchip), we have a trick to
> +             * kick the resamplefd logic for registered irqfds from
> +             * userspace to deactivate the IRQ.  When that happens, it
> +             * means the irq bypassed userspace IOAPIC (so the irr and
> +             * remote-irr of the table entry should be bypassed too
> +             * even if interrupt come).  Still kick the resamplefds if
> +             * they're bound to the IRQ, to make sure to EOI the
> +             * interrupt for the hardware correctly.
> +             *
> +             * Note: We still need to go through the irr & remote-irr
> +             * operations below because we don't know whether there're
> +             * emulated devices that are using/sharing the same IRQ.
> +             */
> +            kvm_resample_fd_notify(n);
> +
> +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> +                IOAPIC_TRIGGER_LEVEL) {
>                  continue;
>              }
>  

What's the logic for sending resampler notifies before testing if the
ioapic entry is in level triggered mode?  vfio won't use this for
anything other than level triggered.  Inserting it between these checks
confused me and in my testing wasn't necessary.  Thanks,

Alex

> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 141342de98..583a976f8a 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
>  int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
>  struct ppc_radix_page_info *kvm_get_radix_page_info(void);
>  int kvm_get_max_memslots(void);
> +
> +/* Notify resamplefd for EOI of specific interrupts. */
> +void kvm_resample_fd_notify(int gsi);
> +
>  #endif



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 21:06   ` Alex Williamson
@ 2020-03-17 21:41     ` Peter Xu
  2020-03-17 22:12       ` Alex Williamson
  0 siblings, 1 reply; 17+ messages in thread
From: Peter Xu @ 2020-03-17 21:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, Mar 17, 2020 at 03:06:46PM -0600, Alex Williamson wrote:

[...]

> > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> > index 15747fe2c2..81a17cc2b8 100644
> > --- a/hw/intc/ioapic.c
> > +++ b/hw/intc/ioapic.c
> > @@ -236,8 +236,29 @@ void ioapic_eoi_broadcast(int vector)
> >          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
> >              entry = s->ioredtbl[n];
> >  
> > -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> > -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> > +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> > +                continue;
> > +            }
> > +
> > +            /*
> > +             * When IOAPIC is in the userspace while APIC is still in
> > +             * the kernel (i.e., split irqchip), we have a trick to
> > +             * kick the resamplefd logic for registered irqfds from
> > +             * userspace to deactivate the IRQ.  When that happens, it
> > +             * means the irq bypassed userspace IOAPIC (so the irr and
> > +             * remote-irr of the table entry should be bypassed too
> > +             * even if interrupt come).  Still kick the resamplefds if
> > +             * they're bound to the IRQ, to make sure to EOI the
> > +             * interrupt for the hardware correctly.
> > +             *
> > +             * Note: We still need to go through the irr & remote-irr
> > +             * operations below because we don't know whether there're
> > +             * emulated devices that are using/sharing the same IRQ.
> > +             */
> > +            kvm_resample_fd_notify(n);
> > +
> > +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> > +                IOAPIC_TRIGGER_LEVEL) {
> >                  continue;
> >              }
> >  
> 
> What's the logic for sending resampler notifies before testing if the
> ioapic entry is in level triggered mode?  vfio won't use this for
> anything other than level triggered.  Inserting it between these checks
> confused me and in my testing wasn't necessary.  Thanks,

I put it there to match the kernel implementation, and IIUC Paolo
agreed with that too:

https://patchwork.kernel.org/patch/11407441/#23190969

Since we've discussed a few times here, I think I can talk a bit more
on how I understand this in case I was wrong...

Even if we have the fact that all the existing devices that use this
code should be using level-triggered IRQs, however... *If* there comes
an edge-triggered INTx device and we assign it using vfio-pci, vfio
should also mask the IRQ after it generates (according to
vfio_intx_handler), is that right?  Then we still need to kick the
resamplefd for that does-not-exist device too to make sure it'll work?

Thanks,

> 
> Alex
> 
> > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> > index 141342de98..583a976f8a 100644
> > --- a/include/sysemu/kvm.h
> > +++ b/include/sysemu/kvm.h
> > @@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
> >  int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
> >  struct ppc_radix_page_info *kvm_get_radix_page_info(void);
> >  int kvm_get_max_memslots(void);
> > +
> > +/* Notify resamplefd for EOI of specific interrupts. */
> > +void kvm_resample_fd_notify(int gsi);
> > +
> >  #endif
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 21:41     ` Peter Xu
@ 2020-03-17 22:12       ` Alex Williamson
  2020-03-17 22:41         ` Peter Xu
  0 siblings, 1 reply; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 22:12 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 17:41:08 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Mar 17, 2020 at 03:06:46PM -0600, Alex Williamson wrote:
> 
> [...]
> 
> > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> > > index 15747fe2c2..81a17cc2b8 100644
> > > --- a/hw/intc/ioapic.c
> > > +++ b/hw/intc/ioapic.c
> > > @@ -236,8 +236,29 @@ void ioapic_eoi_broadcast(int vector)
> > >          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
> > >              entry = s->ioredtbl[n];
> > >  
> > > -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> > > -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> > > +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> > > +                continue;
> > > +            }
> > > +
> > > +            /*
> > > +             * When IOAPIC is in the userspace while APIC is still in
> > > +             * the kernel (i.e., split irqchip), we have a trick to
> > > +             * kick the resamplefd logic for registered irqfds from
> > > +             * userspace to deactivate the IRQ.  When that happens, it
> > > +             * means the irq bypassed userspace IOAPIC (so the irr and
> > > +             * remote-irr of the table entry should be bypassed too
> > > +             * even if interrupt come).  Still kick the resamplefds if
> > > +             * they're bound to the IRQ, to make sure to EOI the
> > > +             * interrupt for the hardware correctly.
> > > +             *
> > > +             * Note: We still need to go through the irr & remote-irr
> > > +             * operations below because we don't know whether there're
> > > +             * emulated devices that are using/sharing the same IRQ.
> > > +             */
> > > +            kvm_resample_fd_notify(n);
> > > +
> > > +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> > > +                IOAPIC_TRIGGER_LEVEL) {
> > >                  continue;
> > >              }
> > >    
> > 
> > What's the logic for sending resampler notifies before testing if the
> > ioapic entry is in level triggered mode?  vfio won't use this for
> > anything other than level triggered.  Inserting it between these checks
> > confused me and in my testing wasn't necessary.  Thanks,  
> 
> I put it there to match the kernel implementation, and IIUC Paolo
> agreed with that too:
> 
> https://patchwork.kernel.org/patch/11407441/#23190969
> 
> Since we've discussed a few times here, I think I can talk a bit more
> on how I understand this in case I was wrong...
> 
> Even if we have the fact that all the existing devices that use this
> code should be using level-triggered IRQs, however... *If* there comes
> an edge-triggered INTx device and we assign it using vfio-pci, vfio
> should also mask the IRQ after it generates (according to
> vfio_intx_handler), is that right?  Then we still need to kick the
> resamplefd for that does-not-exist device too to make sure it'll work?

"edge-triggered INTx" is not a thing that exists.  The PCI spec defines
interrupt pins as:

  2.2.6. Interrupt Pins (Optional)

  Interrupts on PCI are optional and defined as "level sensitive,"
  asserted low (negative true), using open drain output drivers.

Masking of interrupts while they're in-service is not done for edge
triggered interrupts, we assume that being a discrete interrupt is a
sufficient rate limiter versus a level triggered interrupt, which is
continuous and can saturate the host.

If it exists before the level check only to match the kernel, maybe a
comment or todo item to check whether it's the optimal approach for
both cases should be in order.  I can't think of any reason why we'd
need it for the sake of edge triggered vfio interrupts in either place.
Thanks,

Alex



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 22:12       ` Alex Williamson
@ 2020-03-17 22:41         ` Peter Xu
  0 siblings, 0 replies; 17+ messages in thread
From: Peter Xu @ 2020-03-17 22:41 UTC (permalink / raw)
  To: Alex Williamson; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, Mar 17, 2020 at 04:12:00PM -0600, Alex Williamson wrote:
> On Tue, 17 Mar 2020 17:41:08 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Mar 17, 2020 at 03:06:46PM -0600, Alex Williamson wrote:
> > 
> > [...]
> > 
> > > > diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> > > > index 15747fe2c2..81a17cc2b8 100644
> > > > --- a/hw/intc/ioapic.c
> > > > +++ b/hw/intc/ioapic.c
> > > > @@ -236,8 +236,29 @@ void ioapic_eoi_broadcast(int vector)
> > > >          for (n = 0; n < IOAPIC_NUM_PINS; n++) {
> > > >              entry = s->ioredtbl[n];
> > > >  
> > > > -            if ((entry & IOAPIC_VECTOR_MASK) != vector ||
> > > > -                ((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) != IOAPIC_TRIGGER_LEVEL) {
> > > > +            if ((entry & IOAPIC_VECTOR_MASK) != vector) {
> > > > +                continue;
> > > > +            }
> > > > +
> > > > +            /*
> > > > +             * When IOAPIC is in the userspace while APIC is still in
> > > > +             * the kernel (i.e., split irqchip), we have a trick to
> > > > +             * kick the resamplefd logic for registered irqfds from
> > > > +             * userspace to deactivate the IRQ.  When that happens, it
> > > > +             * means the irq bypassed userspace IOAPIC (so the irr and
> > > > +             * remote-irr of the table entry should be bypassed too
> > > > +             * even if interrupt come).  Still kick the resamplefds if
> > > > +             * they're bound to the IRQ, to make sure to EOI the
> > > > +             * interrupt for the hardware correctly.
> > > > +             *
> > > > +             * Note: We still need to go through the irr & remote-irr
> > > > +             * operations below because we don't know whether there're
> > > > +             * emulated devices that are using/sharing the same IRQ.
> > > > +             */
> > > > +            kvm_resample_fd_notify(n);
> > > > +
> > > > +            if (((entry >> IOAPIC_LVT_TRIGGER_MODE_SHIFT) & 1) !=
> > > > +                IOAPIC_TRIGGER_LEVEL) {
> > > >                  continue;
> > > >              }
> > > >    
> > > 
> > > What's the logic for sending resampler notifies before testing if the
> > > ioapic entry is in level triggered mode?  vfio won't use this for
> > > anything other than level triggered.  Inserting it between these checks
> > > confused me and in my testing wasn't necessary.  Thanks,  
> > 
> > I put it there to match the kernel implementation, and IIUC Paolo
> > agreed with that too:
> > 
> > https://patchwork.kernel.org/patch/11407441/#23190969
> > 
> > Since we've discussed a few times here, I think I can talk a bit more
> > on how I understand this in case I was wrong...
> > 
> > Even if we have the fact that all the existing devices that use this
> > code should be using level-triggered IRQs, however... *If* there comes
> > an edge-triggered INTx device and we assign it using vfio-pci, vfio
> > should also mask the IRQ after it generates (according to
> > vfio_intx_handler), is that right?  Then we still need to kick the
> > resamplefd for that does-not-exist device too to make sure it'll work?
> 
> "edge-triggered INTx" is not a thing that exists.  The PCI spec defines
> interrupt pins as:
> 
>   2.2.6. Interrupt Pins (Optional)
> 
>   Interrupts on PCI are optional and defined as "level sensitive,"
>   asserted low (negative true), using open drain output drivers.

Ah OK!  I didn't notice it's a spec-wise answer...

> 
> Masking of interrupts while they're in-service is not done for edge
> triggered interrupts, we assume that being a discrete interrupt is a
> sufficient rate limiter versus a level triggered interrupt, which is
> continuous and can saturate the host.
> 
> If it exists before the level check only to match the kernel, maybe a
> comment or todo item to check whether it's the optimal approach for
> both cases should be in order.  I can't think of any reason why we'd
> need it for the sake of edge triggered vfio interrupts in either place.

I guess the KVM implementation of that is still required for the
kernel PIT implementation as Paolo mentioned.  Since this seems to be
confusing and the userspace does not have a real use case for that,
let me repost this patch only so the userspace resamplefd only reacts
to level triggered interrupts.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH v3.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 19:50 ` [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
  2020-03-17 21:06   ` Alex Williamson
@ 2020-03-17 22:49   ` Peter Xu
  2020-03-17 22:58     ` Alex Williamson
  1 sibling, 1 reply; 17+ messages in thread
From: Peter Xu @ 2020-03-17 22:49 UTC (permalink / raw)
  To: qemu-devel
  Cc: Paolo Bonzini, Alex Williamson, Cornelia Huck, Peter Xu, Eric Auger

This is majorly only for X86 because that's the only one that supports
split irqchip for now.

When the irqchip is split, we face a dilemma that KVM irqfd will be
enabled, however the slow irqchip is still running in the userspace.
It means that the resamplefd in the kernel irqfds won't take any
effect and it will miss to ack INTx interrupts on EOIs.

One example is split irqchip with VFIO INTx, which will break if we
use the VFIO INTx fast path.

This patch can potentially supports the VFIO fast path again for INTx,
that the IRQ delivery will still use the fast path, while we don't
need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
callers of vfio_eoi() hook).  However the EOI of the INTx will still
need to be done from the userspace by caching all the resamplefds in
QEMU and kick properly for IOAPIC EOI broadcast.

This is tricky because in this case the userspace ioapic irr &
remote-irr will be bypassed.  However such a change will greatly boost
performance for assigned devices using INTx irqs (TCP_RR boosts 46%
after this patch applied).

When the userspace is responsible for the resamplefd kickup, don't
register it on the kvm_irqfd anymore, because on newer kernels (after
commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
irqchip and resamplefd.  This will make sure that the fast path will
work for all supported kernels.

https://patchwork.kernel.org/patch/10738541/#22609933

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
v3.1 changelog
- only kick resamplefd for level triggered irqs [Alex]
 accel/kvm/kvm-all.c    | 79 ++++++++++++++++++++++++++++++++++++++++--
 accel/kvm/trace-events |  1 +
 hw/intc/ioapic.c       | 17 +++++++++
 include/sysemu/kvm.h   |  4 +++
 4 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index d49b74512a..9a85fd1b8f 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -159,9 +159,59 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
 static NotifierList kvm_irqchip_change_notifiers =
     NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
 
+struct KVMResampleFd {
+    int gsi;
+    EventNotifier *resample_event;
+    QLIST_ENTRY(KVMResampleFd) node;
+};
+typedef struct KVMResampleFd KVMResampleFd;
+
+/*
+ * Only used with split irqchip where we need to do the resample fd
+ * kick for the kernel from userspace.
+ */
+static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
+    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
+
 #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
 #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
 
+static inline void kvm_resample_fd_remove(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            QLIST_REMOVE(rfd, node);
+            g_free(rfd);
+            break;
+        }
+    }
+}
+
+static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
+{
+    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
+
+    rfd->gsi = gsi;
+    rfd->resample_event = event;
+
+    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
+}
+
+void kvm_resample_fd_notify(int gsi)
+{
+    KVMResampleFd *rfd;
+
+    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
+        if (rfd->gsi == gsi) {
+            event_notifier_set(rfd->resample_event);
+            trace_kvm_resample_fd_notify(gsi);
+            return;
+        }
+    }
+}
+
 int kvm_get_max_memslots(void)
 {
     KVMState *s = KVM_STATE(current_accel());
@@ -1642,8 +1692,33 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
     };
 
     if (rfd != -1) {
-        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
-        irqfd.resamplefd = rfd;
+        assert(assign);
+        if (kvm_irqchip_is_split()) {
+            /*
+             * When the slow irqchip (e.g. IOAPIC) is in the
+             * userspace, KVM kernel resamplefd will not work because
+             * the EOI of the interrupt will be delivered to userspace
+             * instead, so the KVM kernel resamplefd kick will be
+             * skipped.  The userspace here mimics what the kernel
+             * provides with resamplefd, remember the resamplefd and
+             * kick it when we receive EOI of this IRQ.
+             *
+             * This is hackery because IOAPIC is mostly bypassed
+             * (except EOI broadcasts) when irqfd is used.  However
+             * this can bring much performance back for split irqchip
+             * with INTx IRQs (for VFIO, this gives 93% perf of the
+             * full fast path, which is 46% perf boost comparing to
+             * the INTx slow path).
+             */
+            kvm_resample_fd_insert(virq, resample);
+        } else {
+            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
+            irqfd.resamplefd = rfd;
+        }
+    } else if (!assign) {
+        if (kvm_irqchip_is_split()) {
+            kvm_resample_fd_remove(virq);
+        }
     }
 
     if (!kvm_irqfds_enabled()) {
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 4fb6e59d19..a68eb66534 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
 kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
+kvm_resample_fd_notify(int gsi) "gsi %d"
 
diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
index 15747fe2c2..2ae96e10be 100644
--- a/hw/intc/ioapic.c
+++ b/hw/intc/ioapic.c
@@ -241,6 +241,23 @@ void ioapic_eoi_broadcast(int vector)
                 continue;
             }
 
+            /*
+             * When IOAPIC is in the userspace while APIC is still in
+             * the kernel (i.e., split irqchip), we have a trick to
+             * kick the resamplefd logic for registered irqfds from
+             * userspace to deactivate the IRQ.  When that happens, it
+             * means the irq bypassed userspace IOAPIC (so the irr and
+             * remote-irr of the table entry should be bypassed too
+             * even if interrupt come).  Still kick the resamplefds if
+             * they're bound to the IRQ, to make sure to EOI the
+             * interrupt for the hardware correctly.
+             *
+             * Note: We still need to go through the irr & remote-irr
+             * operations below because we don't know whether there're
+             * emulated devices that are using/sharing the same IRQ.
+             */
+            kvm_resample_fd_notify(n);
+
             if (!(entry & IOAPIC_LVT_REMOTE_IRR)) {
                 continue;
             }
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 141342de98..583a976f8a 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
 int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
 struct ppc_radix_page_info *kvm_get_radix_page_info(void);
 int kvm_get_max_memslots(void);
+
+/* Notify resamplefd for EOI of specific interrupts. */
+void kvm_resample_fd_notify(int gsi);
+
 #endif
-- 
2.24.1



^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip
  2020-03-17 19:50 ` [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
@ 2020-03-17 22:57   ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 22:57 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 15:50:38 -0400
Peter Xu <peterx@redhat.com> wrote:

> It's currently broken.  Let's use the slow path to at least make it
> functional.
> 
> Tested-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/pci.c | 12 ++++++++++++
>  1 file changed, 12 insertions(+)

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>
 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 5e75a95129..98e0e0c994 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -128,6 +128,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>          return;
>      }
>  
> +    if (kvm_irqchip_is_split()) {
> +        /*
> +         * VFIO INTx is currently not working with split kernel
> +         * irqchip for level triggered interrupts.  Go the slow path
> +         * as long as split is enabled so we can be at least
> +         * functional (even with poor performance).
> +         *
> +         * TODO: Remove this after all things fixed up.
> +         */
> +        return;
> +    }
> +
>      /* Get to a known interrupt state */
>      qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
>      vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds
  2020-03-17 19:50 ` [PATCH v3 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
@ 2020-03-17 22:57   ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 22:57 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 15:50:39 -0400
Peter Xu <peterx@redhat.com> wrote:

> VFIO is currently the only one left that is not using the generic
> function (kvm_irqchip_add_irqfd_notifier_gsi()) to register irqfds.
> Let VFIO use the common framework too.
> 
> Follow up patches will introduce extra features for kvm irqfd, so that
> VFIO can easily leverage that after the switch.
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Reviewed-by: Cornelia Huck <cohuck@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/pci.c | 37 +++++++++++++++----------------------
>  1 file changed, 15 insertions(+), 22 deletions(-)

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>

> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 98e0e0c994..09703362df 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -115,11 +115,7 @@ static void vfio_intx_eoi(VFIODevice *vbasedev)
>  static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>  {
>  #ifdef CONFIG_KVM
> -    struct kvm_irqfd irqfd = {
> -        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
> -        .gsi = vdev->intx.route.irq,
> -        .flags = KVM_IRQFD_FLAG_RESAMPLE,
> -    };
> +    int irq_fd = event_notifier_get_fd(&vdev->intx.interrupt);
>      Error *err = NULL;
>  
>      if (vdev->no_kvm_intx || !kvm_irqfds_enabled() ||
> @@ -141,7 +137,7 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>      }
>  
>      /* Get to a known interrupt state */
> -    qemu_set_fd_handler(irqfd.fd, NULL, NULL, vdev);
> +    qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
>      vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>      vdev->intx.pending = false;
>      pci_irq_deassert(&vdev->pdev);
> @@ -152,17 +148,18 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>          goto fail;
>      }
>  
> -    /* KVM triggers it, VFIO listens for it */
> -    irqfd.resamplefd = event_notifier_get_fd(&vdev->intx.unmask);
> -
> -    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> +    if (kvm_irqchip_add_irqfd_notifier_gsi(kvm_state,
> +                                           &vdev->intx.interrupt,
> +                                           &vdev->intx.unmask,
> +                                           vdev->intx.route.irq)) {
>          error_setg_errno(errp, errno, "failed to setup resample irqfd");
>          goto fail_irqfd;
>      }
>  
>      if (vfio_set_irq_signaling(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX, 0,
>                                 VFIO_IRQ_SET_ACTION_UNMASK,
> -                               irqfd.resamplefd, &err)) {
> +                               event_notifier_get_fd(&vdev->intx.unmask),
> +                               &err)) {
>          error_propagate(errp, err);
>          goto fail_vfio;
>      }
> @@ -177,12 +174,12 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>      return;
>  
>  fail_vfio:
> -    irqfd.flags = KVM_IRQFD_FLAG_DEASSIGN;
> -    kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd);
> +    kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
> +                                          vdev->intx.route.irq);
>  fail_irqfd:
>      event_notifier_cleanup(&vdev->intx.unmask);
>  fail:
> -    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> +    qemu_set_fd_handler(irq_fd, vfio_intx_interrupt, NULL, vdev);
>      vfio_unmask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);
>  #endif
>  }
> @@ -190,12 +187,6 @@ fail:
>  static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>  {
>  #ifdef CONFIG_KVM
> -    struct kvm_irqfd irqfd = {
> -        .fd = event_notifier_get_fd(&vdev->intx.interrupt),
> -        .gsi = vdev->intx.route.irq,
> -        .flags = KVM_IRQFD_FLAG_DEASSIGN,
> -    };
> -
>      if (!vdev->intx.kvm_accel) {
>          return;
>      }
> @@ -209,7 +200,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>      pci_irq_deassert(&vdev->pdev);
>  
>      /* Tell KVM to stop listening for an INTx irqfd */
> -    if (kvm_vm_ioctl(kvm_state, KVM_IRQFD, &irqfd)) {
> +    if (kvm_irqchip_remove_irqfd_notifier_gsi(kvm_state, &vdev->intx.interrupt,
> +                                              vdev->intx.route.irq)) {
>          error_report("vfio: Error: Failed to disable INTx irqfd: %m");
>      }
>  
> @@ -217,7 +209,8 @@ static void vfio_intx_disable_kvm(VFIOPCIDevice *vdev)
>      event_notifier_cleanup(&vdev->intx.unmask);
>  
>      /* QEMU starts listening for interrupt events. */
> -    qemu_set_fd_handler(irqfd.fd, vfio_intx_interrupt, NULL, vdev);
> +    qemu_set_fd_handler(event_notifier_get_fd(&vdev->intx.interrupt),
> +                        vfio_intx_interrupt, NULL, vdev);
>  
>      vdev->intx.kvm_accel = false;
>  



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd
  2020-03-17 19:50 ` [PATCH v3 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
@ 2020-03-17 22:58   ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 22:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 15:50:40 -0400
Peter Xu <peterx@redhat.com> wrote:

> So that kvm_irqchip_assign_irqfd() can have access to the
> EventNotifiers, especially the resample event.  It is needed in follow
> up patch to cache and kick resamplefds from QEMU.
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  accel/kvm/kvm-all.c | 16 ++++++++++------
>  1 file changed, 10 insertions(+), 6 deletions(-)

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>

> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 439a4efe52..d49b74512a 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1628,9 +1628,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg,
>      return kvm_update_routing_entry(s, &kroute);
>  }
>  
> -static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int rfd, int virq,
> +static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
> +                                    EventNotifier *resample, int virq,
>                                      bool assign)
>  {
> +    int fd = event_notifier_get_fd(event);
> +    int rfd = resample ? event_notifier_get_fd(resample) : -1;
> +
>      struct kvm_irqfd irqfd = {
>          .fd = fd,
>          .gsi = virq,
> @@ -1735,7 +1739,9 @@ int kvm_irqchip_add_hv_sint_route(KVMState *s, uint32_t vcpu, uint32_t sint)
>      return -ENOSYS;
>  }
>  
> -static int kvm_irqchip_assign_irqfd(KVMState *s, int fd, int virq, bool assign)
> +static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
> +                                    EventNotifier *resample, int virq,
> +                                    bool assign)
>  {
>      abort();
>  }
> @@ -1749,15 +1755,13 @@ int kvm_irqchip_update_msi_route(KVMState *s, int virq, MSIMessage msg)
>  int kvm_irqchip_add_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
>                                         EventNotifier *rn, int virq)
>  {
> -    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n),
> -           rn ? event_notifier_get_fd(rn) : -1, virq, true);
> +    return kvm_irqchip_assign_irqfd(s, n, rn, virq, true);
>  }
>  
>  int kvm_irqchip_remove_irqfd_notifier_gsi(KVMState *s, EventNotifier *n,
>                                            int virq)
>  {
> -    return kvm_irqchip_assign_irqfd(s, event_notifier_get_fd(n), -1, virq,
> -           false);
> +    return kvm_irqchip_assign_irqfd(s, n, NULL, virq, false);
>  }
>  
>  int kvm_irqchip_add_irqfd_notifier(KVMState *s, EventNotifier *n,



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3.1 4/5] KVM: Kick resamplefd for split kernel irqchip
  2020-03-17 22:49   ` [PATCH v3.1 " Peter Xu
@ 2020-03-17 22:58     ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 22:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 18:49:23 -0400
Peter Xu <peterx@redhat.com> wrote:

> This is majorly only for X86 because that's the only one that supports
> split irqchip for now.
> 
> When the irqchip is split, we face a dilemma that KVM irqfd will be
> enabled, however the slow irqchip is still running in the userspace.
> It means that the resamplefd in the kernel irqfds won't take any
> effect and it will miss to ack INTx interrupts on EOIs.
> 
> One example is split irqchip with VFIO INTx, which will break if we
> use the VFIO INTx fast path.
> 
> This patch can potentially supports the VFIO fast path again for INTx,
> that the IRQ delivery will still use the fast path, while we don't
> need to trap MMIOs in QEMU for the device to emulate the EIOs (see the
> callers of vfio_eoi() hook).  However the EOI of the INTx will still
> need to be done from the userspace by caching all the resamplefds in
> QEMU and kick properly for IOAPIC EOI broadcast.
> 
> This is tricky because in this case the userspace ioapic irr &
> remote-irr will be bypassed.  However such a change will greatly boost
> performance for assigned devices using INTx irqs (TCP_RR boosts 46%
> after this patch applied).
> 
> When the userspace is responsible for the resamplefd kickup, don't
> register it on the kvm_irqfd anymore, because on newer kernels (after
> commit 654f1f13ea56, 5.2+) the KVM_IRQFD will fail if with both split
> irqchip and resamplefd.  This will make sure that the fast path will
> work for all supported kernels.
> 
> https://patchwork.kernel.org/patch/10738541/#22609933
> 
> Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> v3.1 changelog
> - only kick resamplefd for level triggered irqs [Alex]
>  accel/kvm/kvm-all.c    | 79 ++++++++++++++++++++++++++++++++++++++++--
>  accel/kvm/trace-events |  1 +
>  hw/intc/ioapic.c       | 17 +++++++++
>  include/sysemu/kvm.h   |  4 +++
>  4 files changed, 99 insertions(+), 2 deletions(-)

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>

> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index d49b74512a..9a85fd1b8f 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -159,9 +159,59 @@ static const KVMCapabilityInfo kvm_required_capabilites[] = {
>  static NotifierList kvm_irqchip_change_notifiers =
>      NOTIFIER_LIST_INITIALIZER(kvm_irqchip_change_notifiers);
>  
> +struct KVMResampleFd {
> +    int gsi;
> +    EventNotifier *resample_event;
> +    QLIST_ENTRY(KVMResampleFd) node;
> +};
> +typedef struct KVMResampleFd KVMResampleFd;
> +
> +/*
> + * Only used with split irqchip where we need to do the resample fd
> + * kick for the kernel from userspace.
> + */
> +static QLIST_HEAD(, KVMResampleFd) kvm_resample_fd_list =
> +    QLIST_HEAD_INITIALIZER(kvm_resample_fd_list);
> +
>  #define kvm_slots_lock(kml)      qemu_mutex_lock(&(kml)->slots_lock)
>  #define kvm_slots_unlock(kml)    qemu_mutex_unlock(&(kml)->slots_lock)
>  
> +static inline void kvm_resample_fd_remove(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            QLIST_REMOVE(rfd, node);
> +            g_free(rfd);
> +            break;
> +        }
> +    }
> +}
> +
> +static inline void kvm_resample_fd_insert(int gsi, EventNotifier *event)
> +{
> +    KVMResampleFd *rfd = g_new0(KVMResampleFd, 1);
> +
> +    rfd->gsi = gsi;
> +    rfd->resample_event = event;
> +
> +    QLIST_INSERT_HEAD(&kvm_resample_fd_list, rfd, node);
> +}
> +
> +void kvm_resample_fd_notify(int gsi)
> +{
> +    KVMResampleFd *rfd;
> +
> +    QLIST_FOREACH(rfd, &kvm_resample_fd_list, node) {
> +        if (rfd->gsi == gsi) {
> +            event_notifier_set(rfd->resample_event);
> +            trace_kvm_resample_fd_notify(gsi);
> +            return;
> +        }
> +    }
> +}
> +
>  int kvm_get_max_memslots(void)
>  {
>      KVMState *s = KVM_STATE(current_accel());
> @@ -1642,8 +1692,33 @@ static int kvm_irqchip_assign_irqfd(KVMState *s, EventNotifier *event,
>      };
>  
>      if (rfd != -1) {
> -        irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> -        irqfd.resamplefd = rfd;
> +        assert(assign);
> +        if (kvm_irqchip_is_split()) {
> +            /*
> +             * When the slow irqchip (e.g. IOAPIC) is in the
> +             * userspace, KVM kernel resamplefd will not work because
> +             * the EOI of the interrupt will be delivered to userspace
> +             * instead, so the KVM kernel resamplefd kick will be
> +             * skipped.  The userspace here mimics what the kernel
> +             * provides with resamplefd, remember the resamplefd and
> +             * kick it when we receive EOI of this IRQ.
> +             *
> +             * This is hackery because IOAPIC is mostly bypassed
> +             * (except EOI broadcasts) when irqfd is used.  However
> +             * this can bring much performance back for split irqchip
> +             * with INTx IRQs (for VFIO, this gives 93% perf of the
> +             * full fast path, which is 46% perf boost comparing to
> +             * the INTx slow path).
> +             */
> +            kvm_resample_fd_insert(virq, resample);
> +        } else {
> +            irqfd.flags |= KVM_IRQFD_FLAG_RESAMPLE;
> +            irqfd.resamplefd = rfd;
> +        }
> +    } else if (!assign) {
> +        if (kvm_irqchip_is_split()) {
> +            kvm_resample_fd_remove(virq);
> +        }
>      }
>  
>      if (!kvm_irqfds_enabled()) {
> diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
> index 4fb6e59d19..a68eb66534 100644
> --- a/accel/kvm/trace-events
> +++ b/accel/kvm/trace-events
> @@ -16,4 +16,5 @@ kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_
>  kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
>  kvm_set_user_memory(uint32_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
>  kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
> +kvm_resample_fd_notify(int gsi) "gsi %d"
>  
> diff --git a/hw/intc/ioapic.c b/hw/intc/ioapic.c
> index 15747fe2c2..2ae96e10be 100644
> --- a/hw/intc/ioapic.c
> +++ b/hw/intc/ioapic.c
> @@ -241,6 +241,23 @@ void ioapic_eoi_broadcast(int vector)
>                  continue;
>              }
>  
> +            /*
> +             * When IOAPIC is in the userspace while APIC is still in
> +             * the kernel (i.e., split irqchip), we have a trick to
> +             * kick the resamplefd logic for registered irqfds from
> +             * userspace to deactivate the IRQ.  When that happens, it
> +             * means the irq bypassed userspace IOAPIC (so the irr and
> +             * remote-irr of the table entry should be bypassed too
> +             * even if interrupt come).  Still kick the resamplefds if
> +             * they're bound to the IRQ, to make sure to EOI the
> +             * interrupt for the hardware correctly.
> +             *
> +             * Note: We still need to go through the irr & remote-irr
> +             * operations below because we don't know whether there're
> +             * emulated devices that are using/sharing the same IRQ.
> +             */
> +            kvm_resample_fd_notify(n);
> +
>              if (!(entry & IOAPIC_LVT_REMOTE_IRR)) {
>                  continue;
>              }
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 141342de98..583a976f8a 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -555,4 +555,8 @@ int kvm_set_one_reg(CPUState *cs, uint64_t id, void *source);
>  int kvm_get_one_reg(CPUState *cs, uint64_t id, void *target);
>  struct ppc_radix_page_info *kvm_get_radix_page_info(void);
>  int kvm_get_max_memslots(void);
> +
> +/* Notify resamplefd for EOI of specific interrupts. */
> +void kvm_resample_fd_notify(int gsi);
> +
>  #endif



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip"
  2020-03-17 19:50 ` [PATCH v3 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
@ 2020-03-17 22:58   ` Alex Williamson
  0 siblings, 0 replies; 17+ messages in thread
From: Alex Williamson @ 2020-03-17 22:58 UTC (permalink / raw)
  To: Peter Xu; +Cc: Paolo Bonzini, Cornelia Huck, qemu-devel, Eric Auger

On Tue, 17 Mar 2020 15:50:42 -0400
Peter Xu <peterx@redhat.com> wrote:

> With the resamplefd list introduced, we can savely enable VFIO INTx
> fast path again with split irqchip so it can still be faster than the
> complete slow path.
> 
> Reviewed-by: Eric Auger <eric.auger@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/pci.c | 12 ------------
>  1 file changed, 12 deletions(-)

Reviewed-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Alex Williamson <alex.williamson@redhat.com>

> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index 09703362df..1c0aa27386 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -124,18 +124,6 @@ static void vfio_intx_enable_kvm(VFIOPCIDevice *vdev, Error **errp)
>          return;
>      }
>  
> -    if (kvm_irqchip_is_split()) {
> -        /*
> -         * VFIO INTx is currently not working with split kernel
> -         * irqchip for level triggered interrupts.  Go the slow path
> -         * as long as split is enabled so we can be at least
> -         * functional (even with poor performance).
> -         *
> -         * TODO: Remove this after all things fixed up.
> -         */
> -        return;
> -    }
> -
>      /* Get to a known interrupt state */
>      qemu_set_fd_handler(irq_fd, NULL, NULL, vdev);
>      vfio_mask_single_irqindex(&vdev->vbasedev, VFIO_PCI_INTX_IRQ_INDEX);



^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx
  2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
                   ` (4 preceding siblings ...)
  2020-03-17 19:50 ` [PATCH v3 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
@ 2020-03-18  2:56 ` no-reply
  5 siblings, 0 replies; 17+ messages in thread
From: no-reply @ 2020-03-18  2:56 UTC (permalink / raw)
  To: peterx; +Cc: cohuck, qemu-devel, peterx, eric.auger, alex.williamson, pbonzini

Patchew URL: https://patchew.org/QEMU/20200317195042.282977-1-peterx@redhat.com/



Hi,

This series failed the docker-mingw@fedora build test. Please find the testing commands and
their output below. If you have Docker installed, you can probably reproduce it
locally.

=== TEST SCRIPT BEGIN ===
#! /bin/bash
export ARCH=x86_64
make docker-image-fedora V=1 NETWORK=1
time make docker-test-mingw@fedora J=14 NETWORK=1
=== TEST SCRIPT END ===

  LINK    x86_64-softmmu/qemu-system-x86_64w.exe
hw/intc/ioapic.o: In function `ioapic_eoi_broadcast':
/tmp/qemu-test/src/hw/intc/ioapic.c:258: undefined reference to `kvm_resample_fd_notify'
collect2: error: ld returned 1 exit status
make[1]: *** [Makefile:207: qemu-system-x86_64w.exe] Error 1
make: *** [Makefile:527: x86_64-softmmu/all] Error 2
make: *** Waiting for unfinished jobs....
  LINK    aarch64-softmmu/qemu-system-aarch64w.exe
  GEN     aarch64-softmmu/qemu-system-aarch64.exe
---
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['sudo', '-n', 'docker', 'run', '--label', 'com.qemu.instance.uuid=58e111de5d2740baaed092c9da2bbf52', '-u', '1003', '--security-opt', 'seccomp=unconfined', '--rm', '-e', 'TARGET_LIST=', '-e', 'EXTRA_CONFIGURE_OPTS=', '-e', 'V=', '-e', 'J=14', '-e', 'DEBUG=', '-e', 'SHOW_ENV=', '-e', 'CCACHE_DIR=/var/tmp/ccache', '-v', '/home/patchew2/.cache/qemu-docker-ccache:/var/tmp/ccache:z', '-v', '/var/tmp/patchew-tester-tmp-xbwc233p/src/docker-src.2020-03-17-22.51.50.5035:/var/tmp/qemu:z,ro', 'qemu:fedora', '/var/tmp/qemu/run', 'test-mingw']' returned non-zero exit status 2.
filter=--filter=label=com.qemu.instance.uuid=58e111de5d2740baaed092c9da2bbf52
make[1]: *** [docker-run] Error 1
make[1]: Leaving directory `/var/tmp/patchew-tester-tmp-xbwc233p/src'
make: *** [docker-run-test-mingw@fedora] Error 2

real    4m18.007s
user    0m8.397s


The full log is available at
http://patchew.org/logs/20200317195042.282977-1-peterx@redhat.com/testing.docker-mingw@fedora/?type=message.
---
Email generated automatically by Patchew [https://patchew.org/].
Please send your feedback to patchew-devel@redhat.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2020-03-18  2:56 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-03-17 19:50 [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx Peter Xu
2020-03-17 19:50 ` [PATCH v3 1/5] vfio/pci: Disable INTx fast path if using split irqchip Peter Xu
2020-03-17 22:57   ` Alex Williamson
2020-03-17 19:50 ` [PATCH v3 2/5] vfio/pci: Use kvm_irqchip_add_irqfd_notifier_gsi() for irqfds Peter Xu
2020-03-17 22:57   ` Alex Williamson
2020-03-17 19:50 ` [PATCH v3 3/5] KVM: Pass EventNotifier into kvm_irqchip_assign_irqfd Peter Xu
2020-03-17 22:58   ` Alex Williamson
2020-03-17 19:50 ` [PATCH v3 4/5] KVM: Kick resamplefd for split kernel irqchip Peter Xu
2020-03-17 21:06   ` Alex Williamson
2020-03-17 21:41     ` Peter Xu
2020-03-17 22:12       ` Alex Williamson
2020-03-17 22:41         ` Peter Xu
2020-03-17 22:49   ` [PATCH v3.1 " Peter Xu
2020-03-17 22:58     ` Alex Williamson
2020-03-17 19:50 ` [PATCH v3 5/5] Revert "vfio/pci: Disable INTx fast path if using split irqchip" Peter Xu
2020-03-17 22:58   ` Alex Williamson
2020-03-18  2:56 ` [PATCH v3 0/5] vfio/pci: Fix up breakage against split irqchip and INTx no-reply

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.