All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/5] vfio: ioeventfd support
@ 2018-02-07  0:26 ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

For the matching kernel patch, see:

https://lkml.org/lkml/2018/2/6/866

This series enables ioeventfd support and makes use of a proposed vfio
kernel ioeventfd interface for accelerating high frequency writes
through to the device.  In the specific case addressed, the writes are
to a range of MMIO space virtualized in QEMU for NVIDIA GeForce
support, but which also hosts a register which is used to allow the
MSI interrupt for the device to re-trigger.  Applications which
generate a very high interrupt rate on the GPU can see noticeable
overhead as a result of this trap through QEMU.  We added an option
for users to disable these quirks entirely for non-Geforce cards[1]
for optimal performance, but for GeForce users and users that can't
tweak their VM config, this gets us to within 95% of that performance
for an interrupt intensive micro-benchmark (from 83%).  I'd be
interested in more typical benchmark results to understand if there's
an improvement there as well.  Thanks,

Alex

[1] https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg06878.html

---

Alex Williamson (5):
      vfio/quirks: Add common quirk alloc helper
      vfio/quirks: Add generic support for ioveventfds
      vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
      vfio: Update linux header
      vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly


 hw/vfio/pci-quirks.c       |  184 +++++++++++++++++++++++++++++++++++++-------
 hw/vfio/pci.h              |   13 +++
 linux-headers/linux/vfio.h |   24 ++++++
 3 files changed, 192 insertions(+), 29 deletions(-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [RFC PATCH 0/5] vfio: ioeventfd support
@ 2018-02-07  0:26 ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

For the matching kernel patch, see:

https://lkml.org/lkml/2018/2/6/866

This series enables ioeventfd support and makes use of a proposed vfio
kernel ioeventfd interface for accelerating high frequency writes
through to the device.  In the specific case addressed, the writes are
to a range of MMIO space virtualized in QEMU for NVIDIA GeForce
support, but which also hosts a register which is used to allow the
MSI interrupt for the device to re-trigger.  Applications which
generate a very high interrupt rate on the GPU can see noticeable
overhead as a result of this trap through QEMU.  We added an option
for users to disable these quirks entirely for non-Geforce cards[1]
for optimal performance, but for GeForce users and users that can't
tweak their VM config, this gets us to within 95% of that performance
for an interrupt intensive micro-benchmark (from 83%).  I'd be
interested in more typical benchmark results to understand if there's
an improvement there as well.  Thanks,

Alex

[1] https://lists.gnu.org/archive/html/qemu-devel/2018-01/msg06878.html

---

Alex Williamson (5):
      vfio/quirks: Add common quirk alloc helper
      vfio/quirks: Add generic support for ioveventfds
      vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
      vfio: Update linux header
      vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly


 hw/vfio/pci-quirks.c       |  184 +++++++++++++++++++++++++++++++++++++-------
 hw/vfio/pci.h              |   13 +++
 linux-headers/linux/vfio.h |   24 ++++++
 3 files changed, 192 insertions(+), 29 deletions(-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper
  2018-02-07  0:26 ` [Qemu-devel] " Alex Williamson
@ 2018-02-07  0:26   ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

This will later be used to include list initialization

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   48 +++++++++++++++++++++---------------------------
 1 file changed, 21 insertions(+), 27 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e5779a7ad35b..10af23217292 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -275,6 +275,15 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
+static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
+{
+    VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
+    quirk->mem = g_new0(MemoryRegion, nr_mem);
+    quirk->nr_mem = nr_mem;
+
+    return quirk;
+}
+
 static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     VFIOQuirk *quirk;
@@ -288,9 +297,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 1);
-    quirk->nr_mem = 1;
+    quirk = vfio_quirk_alloc(1);
 
     memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_ati_3c3_quirk, vdev,
                           "vfio-ati-3c3-quirk", 1);
@@ -323,9 +330,7 @@ static void vfio_probe_ati_bar4_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
+    quirk = vfio_quirk_alloc(2);
     window = quirk->data = g_malloc0(sizeof(*window) +
                                      sizeof(VFIOConfigWindowMatch));
     window->vdev = vdev;
@@ -371,10 +376,9 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
+    quirk = vfio_quirk_alloc(1);
     mirror = quirk->data = g_malloc0(sizeof(*mirror));
-    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
-    quirk->nr_mem = 1;
+    mirror->mem = quirk->mem;
     mirror->vdev = vdev;
     mirror->offset = 0x4000;
     mirror->bar = nr;
@@ -548,10 +552,8 @@ static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
+    quirk = vfio_quirk_alloc(2);
     quirk->data = data = g_malloc0(sizeof(*data));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
     data->vdev = vdev;
 
     memory_region_init_io(&quirk->mem[0], OBJECT(vdev), &vfio_nvidia_3d4_quirk,
@@ -667,9 +669,7 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 4);
-    quirk->nr_mem = 4;
+    quirk = vfio_quirk_alloc(4);
     bar5 = quirk->data = g_malloc0(sizeof(*bar5) +
                                    (sizeof(VFIOConfigWindowMatch) * 2));
     window = &bar5->window;
@@ -762,10 +762,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
+    quirk = vfio_quirk_alloc(1);
     mirror = quirk->data = g_malloc0(sizeof(*mirror));
-    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
-    quirk->nr_mem = 1;
+    mirror->mem = quirk->mem;
     mirror->vdev = vdev;
     mirror->offset = 0x88000;
     mirror->bar = nr;
@@ -781,10 +780,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
 
     /* The 0x1800 offset mirror only seems to get used by legacy VGA */
     if (vdev->vga) {
-        quirk = g_malloc0(sizeof(*quirk));
+        quirk = vfio_quirk_alloc(1);
         mirror = quirk->data = g_malloc0(sizeof(*mirror));
-        mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
-        quirk->nr_mem = 1;
+        mirror->mem = quirk->mem;
         mirror->vdev = vdev;
         mirror->offset = 0x1800;
         mirror->bar = nr;
@@ -945,9 +943,7 @@ static void vfio_probe_rtl8168_bar2_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
+    quirk = vfio_quirk_alloc(2);
     quirk->data = rtl = g_malloc0(sizeof(*rtl));
     rtl->vdev = vdev;
 
@@ -1507,9 +1503,7 @@ static void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
     }
 
     /* Setup our quirk to munge GTT addresses to the VM allocated buffer */
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
+    quirk = vfio_quirk_alloc(2);
     igd = quirk->data = g_malloc0(sizeof(*igd));
     igd->vdev = vdev;
     igd->index = ~0;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper
@ 2018-02-07  0:26   ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

This will later be used to include list initialization

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   48 +++++++++++++++++++++---------------------------
 1 file changed, 21 insertions(+), 27 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e5779a7ad35b..10af23217292 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -275,6 +275,15 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
     .endianness = DEVICE_LITTLE_ENDIAN,
 };
 
+static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
+{
+    VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
+    quirk->mem = g_new0(MemoryRegion, nr_mem);
+    quirk->nr_mem = nr_mem;
+
+    return quirk;
+}
+
 static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     VFIOQuirk *quirk;
@@ -288,9 +297,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 1);
-    quirk->nr_mem = 1;
+    quirk = vfio_quirk_alloc(1);
 
     memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_ati_3c3_quirk, vdev,
                           "vfio-ati-3c3-quirk", 1);
@@ -323,9 +330,7 @@ static void vfio_probe_ati_bar4_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
+    quirk = vfio_quirk_alloc(2);
     window = quirk->data = g_malloc0(sizeof(*window) +
                                      sizeof(VFIOConfigWindowMatch));
     window->vdev = vdev;
@@ -371,10 +376,9 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
+    quirk = vfio_quirk_alloc(1);
     mirror = quirk->data = g_malloc0(sizeof(*mirror));
-    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
-    quirk->nr_mem = 1;
+    mirror->mem = quirk->mem;
     mirror->vdev = vdev;
     mirror->offset = 0x4000;
     mirror->bar = nr;
@@ -548,10 +552,8 @@ static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
+    quirk = vfio_quirk_alloc(2);
     quirk->data = data = g_malloc0(sizeof(*data));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
     data->vdev = vdev;
 
     memory_region_init_io(&quirk->mem[0], OBJECT(vdev), &vfio_nvidia_3d4_quirk,
@@ -667,9 +669,7 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 4);
-    quirk->nr_mem = 4;
+    quirk = vfio_quirk_alloc(4);
     bar5 = quirk->data = g_malloc0(sizeof(*bar5) +
                                    (sizeof(VFIOConfigWindowMatch) * 2));
     window = &bar5->window;
@@ -762,10 +762,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
+    quirk = vfio_quirk_alloc(1);
     mirror = quirk->data = g_malloc0(sizeof(*mirror));
-    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
-    quirk->nr_mem = 1;
+    mirror->mem = quirk->mem;
     mirror->vdev = vdev;
     mirror->offset = 0x88000;
     mirror->bar = nr;
@@ -781,10 +780,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
 
     /* The 0x1800 offset mirror only seems to get used by legacy VGA */
     if (vdev->vga) {
-        quirk = g_malloc0(sizeof(*quirk));
+        quirk = vfio_quirk_alloc(1);
         mirror = quirk->data = g_malloc0(sizeof(*mirror));
-        mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
-        quirk->nr_mem = 1;
+        mirror->mem = quirk->mem;
         mirror->vdev = vdev;
         mirror->offset = 0x1800;
         mirror->bar = nr;
@@ -945,9 +943,7 @@ static void vfio_probe_rtl8168_bar2_quirk(VFIOPCIDevice *vdev, int nr)
         return;
     }
 
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
+    quirk = vfio_quirk_alloc(2);
     quirk->data = rtl = g_malloc0(sizeof(*rtl));
     rtl->vdev = vdev;
 
@@ -1507,9 +1503,7 @@ static void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
     }
 
     /* Setup our quirk to munge GTT addresses to the VM allocated buffer */
-    quirk = g_malloc0(sizeof(*quirk));
-    quirk->mem = g_new0(MemoryRegion, 2);
-    quirk->nr_mem = 2;
+    quirk = vfio_quirk_alloc(2);
     igd = quirk->data = g_malloc0(sizeof(*igd));
     igd->vdev = vdev;
     igd->index = ~0;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
  2018-02-07  0:26 ` [Qemu-devel] " Alex Williamson
@ 2018-02-07  0:26   ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

We might wish to handle some quirks via ioeventfds, add a list of
ioeventfds to the quirk.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   17 +++++++++++++++++
 hw/vfio/pci.h        |   11 +++++++++++
 2 files changed, 28 insertions(+)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 10af23217292..e4cf4ea2dd9c 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -12,6 +12,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/main-loop.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
@@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
 static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
 {
     VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
+    QLIST_INIT(&quirk->ioeventfds);
     quirk->mem = g_new0(MemoryRegion, nr_mem);
     quirk->nr_mem = nr_mem;
 
     return quirk;
 }
 
+static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
+{
+    QLIST_REMOVE(ioeventfd, next);
+    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
+                              ioeventfd->match_data, ioeventfd->data,
+                              &ioeventfd->e);
+    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
+    event_notifier_cleanup(&ioeventfd->e);
+    g_free(ioeventfd);
+}
+
 static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     VFIOQuirk *quirk;
@@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
     int i;
 
     QLIST_FOREACH(quirk, &bar->quirks, next) {
+        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
+            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
+        }
+
         for (i = 0; i < quirk->nr_mem; i++) {
             memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
         }
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index f4aa13e021fa..146065c2f715 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -24,9 +24,20 @@
 
 struct VFIOPCIDevice;
 
+typedef struct VFIOIOEventFD {
+    QLIST_ENTRY(VFIOIOEventFD) next;
+    MemoryRegion *mr;
+    hwaddr addr;
+    unsigned size;
+    bool match_data;
+    uint64_t data;
+    EventNotifier e;
+} VFIOIOEventFD;
+
 typedef struct VFIOQuirk {
     QLIST_ENTRY(VFIOQuirk) next;
     void *data;
+    QLIST_HEAD(, VFIOIOEventFD) ioeventfds;
     int nr_mem;
     MemoryRegion *mem;
 } VFIOQuirk;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
@ 2018-02-07  0:26   ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

We might wish to handle some quirks via ioeventfds, add a list of
ioeventfds to the quirk.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   17 +++++++++++++++++
 hw/vfio/pci.h        |   11 +++++++++++
 2 files changed, 28 insertions(+)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 10af23217292..e4cf4ea2dd9c 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -12,6 +12,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/main-loop.h"
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
@@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
 static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
 {
     VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
+    QLIST_INIT(&quirk->ioeventfds);
     quirk->mem = g_new0(MemoryRegion, nr_mem);
     quirk->nr_mem = nr_mem;
 
     return quirk;
 }
 
+static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
+{
+    QLIST_REMOVE(ioeventfd, next);
+    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
+                              ioeventfd->match_data, ioeventfd->data,
+                              &ioeventfd->e);
+    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
+    event_notifier_cleanup(&ioeventfd->e);
+    g_free(ioeventfd);
+}
+
 static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     VFIOQuirk *quirk;
@@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
     int i;
 
     QLIST_FOREACH(quirk, &bar->quirks, next) {
+        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
+            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
+        }
+
         for (i = 0; i < quirk->nr_mem; i++) {
             memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
         }
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index f4aa13e021fa..146065c2f715 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -24,9 +24,20 @@
 
 struct VFIOPCIDevice;
 
+typedef struct VFIOIOEventFD {
+    QLIST_ENTRY(VFIOIOEventFD) next;
+    MemoryRegion *mr;
+    hwaddr addr;
+    unsigned size;
+    bool match_data;
+    uint64_t data;
+    EventNotifier e;
+} VFIOIOEventFD;
+
 typedef struct VFIOQuirk {
     QLIST_ENTRY(VFIOQuirk) next;
     void *data;
+    QLIST_HEAD(, VFIOIOEventFD) ioeventfds;
     int nr_mem;
     MemoryRegion *mem;
 } VFIOQuirk;

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
  2018-02-07  0:26 ` [Qemu-devel] " Alex Williamson
@ 2018-02-07  0:26   ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

Record data writes that come through the NVIDIA BAR0 quirk, if we get
enough in a row that we're only passing through, automatically enable
an ioeventfd for it.  The primary target for this is the MSI-ACK
that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
into BAR0 MMIO space.  For an interrupt latency sensitive micro-
benchmark, this takes us from 83% of performance versus disabling the
quirk entirely (which GeForce cannot do), to to almost 90%.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h        |    2 +
 2 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e4cf4ea2dd9c..e739efe601b1 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
     uint32_t offset;
     uint8_t bar;
     MemoryRegion *mem;
+    uint8_t data[];
 } VFIOConfigMirrorQuirk;
 
 static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
@@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
     g_free(ioeventfd);
 }
 
+static void vfio_ioeventfd_handler(void *opaque)
+{
+    VFIOIOEventFD *ioeventfd = opaque;
+
+    if (event_notifier_test_and_clear(&ioeventfd->e)) {
+        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
+                          ioeventfd->data, ioeventfd->size);
+    }
+}
+
+static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
+                                          MemoryRegion *mr, hwaddr addr,
+                                          unsigned size, uint64_t data,
+                                          VFIORegion *region,
+                                          hwaddr region_addr)
+{
+    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
+
+    if (event_notifier_init(&ioeventfd->e, 0)) {
+        g_free(ioeventfd);
+        return NULL;
+    }
+
+    ioeventfd->mr = mr;
+    ioeventfd->addr = addr;
+    ioeventfd->size = size;
+    ioeventfd->match_data = true;
+    ioeventfd->data = data;
+    ioeventfd->region = region;
+    ioeventfd->region_addr = region_addr;
+
+    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
+                        vfio_ioeventfd_handler, NULL, ioeventfd);
+    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
+                              ioeventfd->size, ioeventfd->match_data,
+                              ioeventfd->data, &ioeventfd->e);
+
+    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
+                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
+                vdev->vbasedev.name, region->nr, region_addr, data, size);
+
+    return ioeventfd;
+}
+
 static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     VFIOQuirk *quirk;
@@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
     trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
 }
 
+typedef struct LastDataSet {
+    hwaddr addr;
+    uint64_t data;
+    unsigned size;
+    int count;
+} LastDataSet;
+
 /*
  * Finally, BAR0 itself.  We want to redirect any accesses to either
  * 0x1800 or 0x88000 through the PCI config space access functions.
@@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
     VFIOConfigMirrorQuirk *mirror = opaque;
     VFIOPCIDevice *vdev = mirror->vdev;
     PCIDevice *pdev = &vdev->pdev;
+    LastDataSet *last = (LastDataSet *)&mirror->data;
 
     vfio_generic_quirk_mirror_write(opaque, addr, data, size);
 
@@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
                           addr + mirror->offset, data, size);
         trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
     }
+
+    /*
+     * Automatically add an ioeventfd to handle any repeated write with the
+     * same data and size above the standard PCI config space header.  This is
+     * primarily expected to accelerate the MSI-ACK behavior, such as noted
+     * above.  Current hardware/drivers should trigger an ioeventfd at config
+     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
+     */
+    if (addr > PCI_STD_HEADER_SIZEOF) {
+        if (addr != last->addr || data != last->data || size != last->size) {
+            last->addr = addr;
+            last->data = data;
+            last->size = size;
+            last->count = 1;
+        } else if (++last->count > 10) {
+            VFIOIOEventFD *ioeventfd;
+
+            ioeventfd = vfio_ioeventfd_init(vdev, mirror->mem, addr, size, data,
+                                            &vdev->bars[mirror->bar].region,
+                                            mirror->offset + addr);
+            if (ioeventfd) {
+                VFIOQuirk *quirk;
+
+                QLIST_FOREACH(quirk, &vdev->bars[mirror->bar].quirks, next) {
+                    if (quirk->data == mirror) {
+                        QLIST_INSERT_HEAD(&quirk->ioeventfds, ioeventfd, next);
+                        break;
+                    }
+                }
+            }
+        }
+    }
 }
 
 static const MemoryRegionOps vfio_nvidia_mirror_quirk = {
@@ -776,7 +861,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
     }
 
     quirk = vfio_quirk_alloc(1);
-    mirror = quirk->data = g_malloc0(sizeof(*mirror));
+    mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
     mirror->mem = quirk->mem;
     mirror->vdev = vdev;
     mirror->offset = 0x88000;
@@ -794,7 +879,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
     /* The 0x1800 offset mirror only seems to get used by legacy VGA */
     if (vdev->vga) {
         quirk = vfio_quirk_alloc(1);
-        mirror = quirk->data = g_malloc0(sizeof(*mirror));
+        mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
         mirror->mem = quirk->mem;
         mirror->vdev = vdev;
         mirror->offset = 0x1800;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 146065c2f715..ec53b9935725 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -32,6 +32,8 @@ typedef struct VFIOIOEventFD {
     bool match_data;
     uint64_t data;
     EventNotifier e;
+    VFIORegion *region;
+    hwaddr region_addr;
 } VFIOIOEventFD;
 
 typedef struct VFIOQuirk {

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
@ 2018-02-07  0:26   ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

Record data writes that come through the NVIDIA BAR0 quirk, if we get
enough in a row that we're only passing through, automatically enable
an ioeventfd for it.  The primary target for this is the MSI-ACK
that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
into BAR0 MMIO space.  For an interrupt latency sensitive micro-
benchmark, this takes us from 83% of performance versus disabling the
quirk entirely (which GeForce cannot do), to to almost 90%.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
 hw/vfio/pci.h        |    2 +
 2 files changed, 89 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e4cf4ea2dd9c..e739efe601b1 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
     uint32_t offset;
     uint8_t bar;
     MemoryRegion *mem;
+    uint8_t data[];
 } VFIOConfigMirrorQuirk;
 
 static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
@@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
     g_free(ioeventfd);
 }
 
+static void vfio_ioeventfd_handler(void *opaque)
+{
+    VFIOIOEventFD *ioeventfd = opaque;
+
+    if (event_notifier_test_and_clear(&ioeventfd->e)) {
+        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
+                          ioeventfd->data, ioeventfd->size);
+    }
+}
+
+static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
+                                          MemoryRegion *mr, hwaddr addr,
+                                          unsigned size, uint64_t data,
+                                          VFIORegion *region,
+                                          hwaddr region_addr)
+{
+    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
+
+    if (event_notifier_init(&ioeventfd->e, 0)) {
+        g_free(ioeventfd);
+        return NULL;
+    }
+
+    ioeventfd->mr = mr;
+    ioeventfd->addr = addr;
+    ioeventfd->size = size;
+    ioeventfd->match_data = true;
+    ioeventfd->data = data;
+    ioeventfd->region = region;
+    ioeventfd->region_addr = region_addr;
+
+    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
+                        vfio_ioeventfd_handler, NULL, ioeventfd);
+    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
+                              ioeventfd->size, ioeventfd->match_data,
+                              ioeventfd->data, &ioeventfd->e);
+
+    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
+                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
+                vdev->vbasedev.name, region->nr, region_addr, data, size);
+
+    return ioeventfd;
+}
+
 static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
 {
     VFIOQuirk *quirk;
@@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
     trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
 }
 
+typedef struct LastDataSet {
+    hwaddr addr;
+    uint64_t data;
+    unsigned size;
+    int count;
+} LastDataSet;
+
 /*
  * Finally, BAR0 itself.  We want to redirect any accesses to either
  * 0x1800 or 0x88000 through the PCI config space access functions.
@@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
     VFIOConfigMirrorQuirk *mirror = opaque;
     VFIOPCIDevice *vdev = mirror->vdev;
     PCIDevice *pdev = &vdev->pdev;
+    LastDataSet *last = (LastDataSet *)&mirror->data;
 
     vfio_generic_quirk_mirror_write(opaque, addr, data, size);
 
@@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
                           addr + mirror->offset, data, size);
         trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
     }
+
+    /*
+     * Automatically add an ioeventfd to handle any repeated write with the
+     * same data and size above the standard PCI config space header.  This is
+     * primarily expected to accelerate the MSI-ACK behavior, such as noted
+     * above.  Current hardware/drivers should trigger an ioeventfd at config
+     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
+     */
+    if (addr > PCI_STD_HEADER_SIZEOF) {
+        if (addr != last->addr || data != last->data || size != last->size) {
+            last->addr = addr;
+            last->data = data;
+            last->size = size;
+            last->count = 1;
+        } else if (++last->count > 10) {
+            VFIOIOEventFD *ioeventfd;
+
+            ioeventfd = vfio_ioeventfd_init(vdev, mirror->mem, addr, size, data,
+                                            &vdev->bars[mirror->bar].region,
+                                            mirror->offset + addr);
+            if (ioeventfd) {
+                VFIOQuirk *quirk;
+
+                QLIST_FOREACH(quirk, &vdev->bars[mirror->bar].quirks, next) {
+                    if (quirk->data == mirror) {
+                        QLIST_INSERT_HEAD(&quirk->ioeventfds, ioeventfd, next);
+                        break;
+                    }
+                }
+            }
+        }
+    }
 }
 
 static const MemoryRegionOps vfio_nvidia_mirror_quirk = {
@@ -776,7 +861,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
     }
 
     quirk = vfio_quirk_alloc(1);
-    mirror = quirk->data = g_malloc0(sizeof(*mirror));
+    mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
     mirror->mem = quirk->mem;
     mirror->vdev = vdev;
     mirror->offset = 0x88000;
@@ -794,7 +879,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
     /* The 0x1800 offset mirror only seems to get used by legacy VGA */
     if (vdev->vga) {
         quirk = vfio_quirk_alloc(1);
-        mirror = quirk->data = g_malloc0(sizeof(*mirror));
+        mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
         mirror->mem = quirk->mem;
         mirror->vdev = vdev;
         mirror->offset = 0x1800;
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 146065c2f715..ec53b9935725 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -32,6 +32,8 @@ typedef struct VFIOIOEventFD {
     bool match_data;
     uint64_t data;
     EventNotifier e;
+    VFIORegion *region;
+    hwaddr region_addr;
 } VFIOIOEventFD;
 
 typedef struct VFIOQuirk {

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 4/5] vfio: Update linux header
  2018-02-07  0:26 ` [Qemu-devel] " Alex Williamson
@ 2018-02-07  0:26   ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

Update with proposed ioeventfd API.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 linux-headers/linux/vfio.h |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 4312e961ffd3..0921994daa6d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
 
 #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
 
+/**
+ * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
+ *                              struct vfio_device_ioeventfd)
+ *
+ * Perform a write to the device at the specified device fd offset, with
+ * the specified data and width when the provided eventfd is triggered.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_ioeventfd {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_IOEVENTFD_8		(1 << 0) /* 1-byte write */
+#define VFIO_DEVICE_IOEVENTFD_16	(1 << 1) /* 2-byte write */
+#define VFIO_DEVICE_IOEVENTFD_32	(1 << 2) /* 4-byte write */
+#define VFIO_DEVICE_IOEVENTFD_64	(1 << 3) /* 8-byte write */
+#define VFIO_DEVICE_IOEVENTFD_SIZE_MASK	(0xf)
+	__u64	offset;			/* device fd offset of write */
+	__u64	data;			/* data to be written */
+	__s32	fd;			/* -1 for de-assignment */
+};
+
+#define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 14)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [RFC PATCH 4/5] vfio: Update linux header
@ 2018-02-07  0:26   ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

Update with proposed ioeventfd API.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 linux-headers/linux/vfio.h |   24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/linux-headers/linux/vfio.h b/linux-headers/linux/vfio.h
index 4312e961ffd3..0921994daa6d 100644
--- a/linux-headers/linux/vfio.h
+++ b/linux-headers/linux/vfio.h
@@ -503,6 +503,30 @@ struct vfio_pci_hot_reset {
 
 #define VFIO_DEVICE_PCI_HOT_RESET	_IO(VFIO_TYPE, VFIO_BASE + 13)
 
+/**
+ * VFIO_DEVICE_IOEVENTFD - _IOW(VFIO_TYPE, VFIO_BASE + 14,
+ *                              struct vfio_device_ioeventfd)
+ *
+ * Perform a write to the device at the specified device fd offset, with
+ * the specified data and width when the provided eventfd is triggered.
+ *
+ * Return: 0 on success, -errno on failure.
+ */
+struct vfio_device_ioeventfd {
+	__u32	argsz;
+	__u32	flags;
+#define VFIO_DEVICE_IOEVENTFD_8		(1 << 0) /* 1-byte write */
+#define VFIO_DEVICE_IOEVENTFD_16	(1 << 1) /* 2-byte write */
+#define VFIO_DEVICE_IOEVENTFD_32	(1 << 2) /* 4-byte write */
+#define VFIO_DEVICE_IOEVENTFD_64	(1 << 3) /* 8-byte write */
+#define VFIO_DEVICE_IOEVENTFD_SIZE_MASK	(0xf)
+	__u64	offset;			/* device fd offset of write */
+	__u64	data;			/* data to be written */
+	__s32	fd;			/* -1 for de-assignment */
+};
+
+#define VFIO_DEVICE_IOEVENTFD		_IO(VFIO_TYPE, VFIO_BASE + 14)
+
 /* -------- API for Type1 VFIO IOMMU -------- */
 
 /**

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
  2018-02-07  0:26 ` [Qemu-devel] " Alex Williamson
@ 2018-02-07  0:26   ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

With vfio ioeventfd support, we can program vfio-pci to perform a
specified BAR write when an eventfd is triggered.  This allows the
KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
userspace handling for these events.  On the same micro-benchmark
where the ioeventfd got us to almost 90% of performance versus
disabling the GeForce quirks, this gets us to within 95%.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e739efe601b1..35a4d5197e2d 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -16,6 +16,7 @@
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include <sys/ioctl.h>
 #include "hw/nvram/fw_cfg.h"
 #include "pci.h"
 #include "trace.h"
@@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
     return quirk;
 }
 
-static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
+static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
 {
+    struct vfio_device_ioeventfd vfio_ioeventfd;
+
     QLIST_REMOVE(ioeventfd, next);
+
     memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
                               ioeventfd->match_data, ioeventfd->data,
                               &ioeventfd->e);
+
     qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
+
+    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
+    vfio_ioeventfd.flags = ioeventfd->size;
+    vfio_ioeventfd.data = ioeventfd->data;
+    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
+                            ioeventfd->region_addr;
+    vfio_ioeventfd.fd = -1;
+
+    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
+
     event_notifier_cleanup(&ioeventfd->e);
     g_free(ioeventfd);
 }
@@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
                                           hwaddr region_addr)
 {
     VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
+    struct vfio_device_ioeventfd vfio_ioeventfd;
+    char vfio_enabled = '+';
 
     if (event_notifier_init(&ioeventfd->e, 0)) {
         g_free(ioeventfd);
@@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
     ioeventfd->region = region;
     ioeventfd->region_addr = region_addr;
 
-    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
-                        vfio_ioeventfd_handler, NULL, ioeventfd);
+    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
+    vfio_ioeventfd.flags = ioeventfd->size;
+    vfio_ioeventfd.data = ioeventfd->data;
+    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
+                            ioeventfd->region_addr;
+    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
+
+    if (ioctl(vdev->vbasedev.fd,
+              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
+        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
+                            vfio_ioeventfd_handler, NULL, ioeventfd);
+        vfio_enabled = '-';
+    }
+
     memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
                               ioeventfd->size, ioeventfd->match_data,
                               ioeventfd->data, &ioeventfd->e);
 
     info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
-                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
-                vdev->vbasedev.name, region->nr, region_addr, data, size);
+                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u, vfio%c",
+                vdev->vbasedev.name, region->nr, region_addr, data, size,
+                vfio_enabled);
 
     return ioeventfd;
 }
@@ -1767,7 +1797,7 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
 
     QLIST_FOREACH(quirk, &bar->quirks, next) {
         while (!QLIST_EMPTY(&quirk->ioeventfds)) {
-            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
+            vfio_ioeventfd_exit(vdev, QLIST_FIRST(&quirk->ioeventfds));
         }
 
         for (i = 0; i < quirk->nr_mem; i++) {

^ permalink raw reply	[flat|nested] 35+ messages in thread

* [Qemu-devel] [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
@ 2018-02-07  0:26   ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-07  0:26 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, kvm

With vfio ioeventfd support, we can program vfio-pci to perform a
specified BAR write when an eventfd is triggered.  This allows the
KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
userspace handling for these events.  On the same micro-benchmark
where the ioeventfd got us to almost 90% of performance versus
disabling the GeForce quirks, this gets us to within 95%.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index e739efe601b1..35a4d5197e2d 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -16,6 +16,7 @@
 #include "qemu/range.h"
 #include "qapi/error.h"
 #include "qapi/visitor.h"
+#include <sys/ioctl.h>
 #include "hw/nvram/fw_cfg.h"
 #include "pci.h"
 #include "trace.h"
@@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
     return quirk;
 }
 
-static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
+static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
 {
+    struct vfio_device_ioeventfd vfio_ioeventfd;
+
     QLIST_REMOVE(ioeventfd, next);
+
     memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
                               ioeventfd->match_data, ioeventfd->data,
                               &ioeventfd->e);
+
     qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
+
+    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
+    vfio_ioeventfd.flags = ioeventfd->size;
+    vfio_ioeventfd.data = ioeventfd->data;
+    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
+                            ioeventfd->region_addr;
+    vfio_ioeventfd.fd = -1;
+
+    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
+
     event_notifier_cleanup(&ioeventfd->e);
     g_free(ioeventfd);
 }
@@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
                                           hwaddr region_addr)
 {
     VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
+    struct vfio_device_ioeventfd vfio_ioeventfd;
+    char vfio_enabled = '+';
 
     if (event_notifier_init(&ioeventfd->e, 0)) {
         g_free(ioeventfd);
@@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
     ioeventfd->region = region;
     ioeventfd->region_addr = region_addr;
 
-    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
-                        vfio_ioeventfd_handler, NULL, ioeventfd);
+    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
+    vfio_ioeventfd.flags = ioeventfd->size;
+    vfio_ioeventfd.data = ioeventfd->data;
+    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
+                            ioeventfd->region_addr;
+    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
+
+    if (ioctl(vdev->vbasedev.fd,
+              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
+        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
+                            vfio_ioeventfd_handler, NULL, ioeventfd);
+        vfio_enabled = '-';
+    }
+
     memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
                               ioeventfd->size, ioeventfd->match_data,
                               ioeventfd->data, &ioeventfd->e);
 
     info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
-                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
-                vdev->vbasedev.name, region->nr, region_addr, data, size);
+                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u, vfio%c",
+                vdev->vbasedev.name, region->nr, region_addr, data, size,
+                vfio_enabled);
 
     return ioeventfd;
 }
@@ -1767,7 +1797,7 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
 
     QLIST_FOREACH(quirk, &bar->quirks, next) {
         while (!QLIST_EMPTY(&quirk->ioeventfds)) {
-            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
+            vfio_ioeventfd_exit(vdev, QLIST_FIRST(&quirk->ioeventfds));
         }
 
         for (i = 0; i < quirk->nr_mem; i++) {

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
  2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
@ 2018-02-08 11:10     ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:10 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,

On 07/02/18 01:26, Alex Williamson wrote:
> Record data writes that come through the NVIDIA BAR0 quirk, if we get
> enough in a row that we're only passing through, automatically enable
> an ioeventfd for it.  The primary target for this is the MSI-ACK
> that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
> 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
> into BAR0 MMIO space.  For an interrupt latency sensitive micro-
> benchmark, this takes us from 83% of performance versus disabling the
> quirk entirely (which GeForce cannot do), to to almost 90%.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h        |    2 +
>  2 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e4cf4ea2dd9c..e739efe601b1 100644lg

> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
>      uint32_t offset;
>      uint8_t bar;
>      MemoryRegion *mem;
> +    uint8_t data[];
Do you foresee other usages of data besides the LastDataSet?
>  } VFIOConfigMirrorQuirk;
>  
>  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
> @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>      g_free(ioeventfd);
>  }
>  
add a comment? user handler in case kvm ioeventfd setup failed?
> +static void vfio_ioeventfd_handler(void *opaque)
> +{
> +    VFIOIOEventFD *ioeventfd = opaque;
> +
> +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
> +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
> +                          ioeventfd->data, ioeventfd->size);
> +    }
> +}
> +
> +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> +                                          MemoryRegion *mr, hwaddr addr,
> +                                          unsigned size, uint64_t data,
> +                                          VFIORegion *region,
> +                                          hwaddr region_addr)
> +{
> +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> +
> +    if (event_notifier_init(&ioeventfd->e, 0)) {
> +        g_free(ioeventfd);
> +        return NULL;
> +    }
> +
> +    ioeventfd->mr = mr;
> +    ioeventfd->addr = addr;
> +    ioeventfd->size = size;
> +    ioeventfd->match_data = true;
> +    ioeventfd->data = data;
> +    ioeventfd->region = region;
> +    ioeventfd->region_addr = region_addr;
I found difficult to follow the different addr semantic.
I understand region_add is the offset % bar and addr is the offset %
mirror region. Maybe more explicit names would help (region = bar_region
and region_addr = bar_offset)
> +
> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> +                        vfio_ioeventfd_handler, NULL, ioeventfd);
> +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
> +                              ioeventfd->size, ioeventfd->match_data,
> +                              ioeventfd->data, &ioeventfd->e);
> +
> +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> +                vdev->vbasedev.name, region->nr, region_addr, data, size);
> +
> +    return ioeventfd;
> +}
> +
>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>  {
>      VFIOQuirk *quirk;
> @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
>  }
>  
> +typedef struct LastDataSet {
> +    hwaddr addr;
> +    uint64_t data;
> +    unsigned size;
> +    int count;
> +} LastDataSet;
> +
>  /*
>   * Finally, BAR0 itself.  We want to redirect any accesses to either
>   * 0x1800 or 0x88000 through the PCI config space access functions.
> @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>      VFIOConfigMirrorQuirk *mirror = opaque;
>      VFIOPCIDevice *vdev = mirror->vdev;
>      PCIDevice *pdev = &vdev->pdev;
> +    LastDataSet *last = (LastDataSet *)&mirror->data;
>  
>      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
>  
> @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>                            addr + mirror->offset, data, size);
>          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
>      }
> +
> +    /*
> +     * Automatically add an ioeventfd to handle any repeated write with the
> +     * same data and size above the standard PCI config space header.  This is
> +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
> +     * above.  Current hardware/drivers should trigger an ioeventfd at config
> +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
> +     */
> +    if (addr > PCI_STD_HEADER_SIZEOF) {
> +        if (addr != last->addr || data != last->data || size != last->size) {
> +            last->addr = addr;
> +            last->data = data;
> +            last->size = size;
> +            last->count = 1;
> +        } else if (++last->count > 10) {
So here is the naive question about the "10" choice and also the fact
this needs to be consecutive accesses. I guess you observed this works
but at first sight this is not obvious to me.

Does anyone check potential overlaps between ioeventfd's [addr, addr +
size -1]?
> +            VFIOIOEventFD *ioeventfd;
> +
> +            ioeventfd = vfio_ioeventfd_init(vdev, mirror->mem, addr, size, data,
> +                                            &vdev->bars[mirror->bar].region,
> +                                            mirror->offset + addr);
> +            if (ioeventfd) {
> +                VFIOQuirk *quirk;
> +
> +                QLIST_FOREACH(quirk, &vdev->bars[mirror->bar].quirks, next) {
> +                    if (quirk->data == mirror) {
> +                        QLIST_INSERT_HEAD(&quirk->ioeventfds, ioeventfd, next);
> +                        break;
> +                    }
> +                }
> +            }
> +        }
> +    }

Thanks

Eric
>  }
>  
>  static const MemoryRegionOps vfio_nvidia_mirror_quirk = {
> @@ -776,7 +861,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>      }
>  
>      quirk = vfio_quirk_alloc(1);
> -    mirror = quirk->data = g_malloc0(sizeof(*mirror));
> +    mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
>      mirror->mem = quirk->mem;
>      mirror->vdev = vdev;
>      mirror->offset = 0x88000;
> @@ -794,7 +879,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>      /* The 0x1800 offset mirror only seems to get used by legacy VGA */
>      if (vdev->vga) {
>          quirk = vfio_quirk_alloc(1);
> -        mirror = quirk->data = g_malloc0(sizeof(*mirror));
> +        mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
>          mirror->mem = quirk->mem;
>          mirror->vdev = vdev;
>          mirror->offset = 0x1800;
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 146065c2f715..ec53b9935725 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -32,6 +32,8 @@ typedef struct VFIOIOEventFD {
>      bool match_data;
>      uint64_t data;
>      EventNotifier e;
> +    VFIORegion *region;
> +    hwaddr region_addr;
>  } VFIOIOEventFD;
>  
>  typedef struct VFIOQuirk {
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
@ 2018-02-08 11:10     ` Auger Eric
  0 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:10 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,

On 07/02/18 01:26, Alex Williamson wrote:
> Record data writes that come through the NVIDIA BAR0 quirk, if we get
> enough in a row that we're only passing through, automatically enable
> an ioeventfd for it.  The primary target for this is the MSI-ACK
> that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
> 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
> into BAR0 MMIO space.  For an interrupt latency sensitive micro-
> benchmark, this takes us from 83% of performance versus disabling the
> quirk entirely (which GeForce cannot do), to to almost 90%.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  hw/vfio/pci.h        |    2 +
>  2 files changed, 89 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e4cf4ea2dd9c..e739efe601b1 100644lg

> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
>      uint32_t offset;
>      uint8_t bar;
>      MemoryRegion *mem;
> +    uint8_t data[];
Do you foresee other usages of data besides the LastDataSet?
>  } VFIOConfigMirrorQuirk;
>  
>  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
> @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>      g_free(ioeventfd);
>  }
>  
add a comment? user handler in case kvm ioeventfd setup failed?
> +static void vfio_ioeventfd_handler(void *opaque)
> +{
> +    VFIOIOEventFD *ioeventfd = opaque;
> +
> +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
> +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
> +                          ioeventfd->data, ioeventfd->size);
> +    }
> +}
> +
> +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> +                                          MemoryRegion *mr, hwaddr addr,
> +                                          unsigned size, uint64_t data,
> +                                          VFIORegion *region,
> +                                          hwaddr region_addr)
> +{
> +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> +
> +    if (event_notifier_init(&ioeventfd->e, 0)) {
> +        g_free(ioeventfd);
> +        return NULL;
> +    }
> +
> +    ioeventfd->mr = mr;
> +    ioeventfd->addr = addr;
> +    ioeventfd->size = size;
> +    ioeventfd->match_data = true;
> +    ioeventfd->data = data;
> +    ioeventfd->region = region;
> +    ioeventfd->region_addr = region_addr;
I found difficult to follow the different addr semantic.
I understand region_add is the offset % bar and addr is the offset %
mirror region. Maybe more explicit names would help (region = bar_region
and region_addr = bar_offset)
> +
> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> +                        vfio_ioeventfd_handler, NULL, ioeventfd);
> +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
> +                              ioeventfd->size, ioeventfd->match_data,
> +                              ioeventfd->data, &ioeventfd->e);
> +
> +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> +                vdev->vbasedev.name, region->nr, region_addr, data, size);
> +
> +    return ioeventfd;
> +}
> +
>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>  {
>      VFIOQuirk *quirk;
> @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
>  }
>  
> +typedef struct LastDataSet {
> +    hwaddr addr;
> +    uint64_t data;
> +    unsigned size;
> +    int count;
> +} LastDataSet;
> +
>  /*
>   * Finally, BAR0 itself.  We want to redirect any accesses to either
>   * 0x1800 or 0x88000 through the PCI config space access functions.
> @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>      VFIOConfigMirrorQuirk *mirror = opaque;
>      VFIOPCIDevice *vdev = mirror->vdev;
>      PCIDevice *pdev = &vdev->pdev;
> +    LastDataSet *last = (LastDataSet *)&mirror->data;
>  
>      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
>  
> @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>                            addr + mirror->offset, data, size);
>          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
>      }
> +
> +    /*
> +     * Automatically add an ioeventfd to handle any repeated write with the
> +     * same data and size above the standard PCI config space header.  This is
> +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
> +     * above.  Current hardware/drivers should trigger an ioeventfd at config
> +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
> +     */
> +    if (addr > PCI_STD_HEADER_SIZEOF) {
> +        if (addr != last->addr || data != last->data || size != last->size) {
> +            last->addr = addr;
> +            last->data = data;
> +            last->size = size;
> +            last->count = 1;
> +        } else if (++last->count > 10) {
So here is the naive question about the "10" choice and also the fact
this needs to be consecutive accesses. I guess you observed this works
but at first sight this is not obvious to me.

Does anyone check potential overlaps between ioeventfd's [addr, addr +
size -1]?
> +            VFIOIOEventFD *ioeventfd;
> +
> +            ioeventfd = vfio_ioeventfd_init(vdev, mirror->mem, addr, size, data,
> +                                            &vdev->bars[mirror->bar].region,
> +                                            mirror->offset + addr);
> +            if (ioeventfd) {
> +                VFIOQuirk *quirk;
> +
> +                QLIST_FOREACH(quirk, &vdev->bars[mirror->bar].quirks, next) {
> +                    if (quirk->data == mirror) {
> +                        QLIST_INSERT_HEAD(&quirk->ioeventfds, ioeventfd, next);
> +                        break;
> +                    }
> +                }
> +            }
> +        }
> +    }

Thanks

Eric
>  }
>  
>  static const MemoryRegionOps vfio_nvidia_mirror_quirk = {
> @@ -776,7 +861,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>      }
>  
>      quirk = vfio_quirk_alloc(1);
> -    mirror = quirk->data = g_malloc0(sizeof(*mirror));
> +    mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
>      mirror->mem = quirk->mem;
>      mirror->vdev = vdev;
>      mirror->offset = 0x88000;
> @@ -794,7 +879,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>      /* The 0x1800 offset mirror only seems to get used by legacy VGA */
>      if (vdev->vga) {
>          quirk = vfio_quirk_alloc(1);
> -        mirror = quirk->data = g_malloc0(sizeof(*mirror));
> +        mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
>          mirror->mem = quirk->mem;
>          mirror->vdev = vdev;
>          mirror->offset = 0x1800;
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 146065c2f715..ec53b9935725 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -32,6 +32,8 @@ typedef struct VFIOIOEventFD {
>      bool match_data;
>      uint64_t data;
>      EventNotifier e;
> +    VFIORegion *region;
> +    hwaddr region_addr;
>  } VFIOIOEventFD;
>  
>  typedef struct VFIOQuirk {
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper
  2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
@ 2018-02-08 11:10     ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:10 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,
On 07/02/18 01:26, Alex Williamson wrote:
> This will later be used to include list initialization
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   48 +++++++++++++++++++++---------------------------
>  1 file changed, 21 insertions(+), 27 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e5779a7ad35b..10af23217292 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -275,6 +275,15 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
>      .endianness = DEVICE_LITTLE_ENDIAN,
>  };
>  
> +static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> +{
> +    VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
nit: Peter advised the usage of g_new0 as well for that kind of alloc.
> +    quirk->mem = g_new0(MemoryRegion, nr_mem);
> +    quirk->nr_mem = nr_mem;
> +
> +    return quirk;
> +}
> +
>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>  {
>      VFIOQuirk *quirk;
> @@ -288,9 +297,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 1);
> -    quirk->nr_mem = 1;
> +    quirk = vfio_quirk_alloc(1);
>  
>      memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_ati_3c3_quirk, vdev,
>                            "vfio-ati-3c3-quirk", 1);
> @@ -323,9 +330,7 @@ static void vfio_probe_ati_bar4_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
> +    quirk = vfio_quirk_alloc(2);
>      window = quirk->data = g_malloc0(sizeof(*window) +
>                                       sizeof(VFIOConfigWindowMatch));
>      window->vdev = vdev;
> @@ -371,10 +376,9 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> +    quirk = vfio_quirk_alloc(1);
>      mirror = quirk->data = g_malloc0(sizeof(*mirror));
> -    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
> -    quirk->nr_mem = 1;
> +    mirror->mem = quirk->mem;
>      mirror->vdev = vdev;
>      mirror->offset = 0x4000;
>      mirror->bar = nr;
> @@ -548,10 +552,8 @@ static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> +    quirk = vfio_quirk_alloc(2);
>      quirk->data = data = g_malloc0(sizeof(*data));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
>      data->vdev = vdev;
>  
>      memory_region_init_io(&quirk->mem[0], OBJECT(vdev), &vfio_nvidia_3d4_quirk,
> @@ -667,9 +669,7 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 4);
> -    quirk->nr_mem = 4;
> +    quirk = vfio_quirk_alloc(4);
>      bar5 = quirk->data = g_malloc0(sizeof(*bar5) +
>                                     (sizeof(VFIOConfigWindowMatch) * 2));
>      window = &bar5->window;
> @@ -762,10 +762,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> +    quirk = vfio_quirk_alloc(1);
>      mirror = quirk->data = g_malloc0(sizeof(*mirror));
> -    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
> -    quirk->nr_mem = 1;
> +    mirror->mem = quirk->mem;
>      mirror->vdev = vdev;
>      mirror->offset = 0x88000;
>      mirror->bar = nr;
> @@ -781,10 +780,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>  
>      /* The 0x1800 offset mirror only seems to get used by legacy VGA */
>      if (vdev->vga) {
> -        quirk = g_malloc0(sizeof(*quirk));
> +        quirk = vfio_quirk_alloc(1);
>          mirror = quirk->data = g_malloc0(sizeof(*mirror));
> -        mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
> -        quirk->nr_mem = 1;
> +        mirror->mem = quirk->mem;
>          mirror->vdev = vdev;
>          mirror->offset = 0x1800;
>          mirror->bar = nr;
> @@ -945,9 +943,7 @@ static void vfio_probe_rtl8168_bar2_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
> +    quirk = vfio_quirk_alloc(2);
>      quirk->data = rtl = g_malloc0(sizeof(*rtl));
>      rtl->vdev = vdev;
>  
> @@ -1507,9 +1503,7 @@ static void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
>      }
>  
>      /* Setup our quirk to munge GTT addresses to the VM allocated buffer */
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
> +    quirk = vfio_quirk_alloc(2);
>      igd = quirk->data = g_malloc0(sizeof(*igd));
>      igd->vdev = vdev;
>      igd->index = ~0;
> 
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper
@ 2018-02-08 11:10     ` Auger Eric
  0 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:10 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,
On 07/02/18 01:26, Alex Williamson wrote:
> This will later be used to include list initialization
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   48 +++++++++++++++++++++---------------------------
>  1 file changed, 21 insertions(+), 27 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e5779a7ad35b..10af23217292 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -275,6 +275,15 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
>      .endianness = DEVICE_LITTLE_ENDIAN,
>  };
>  
> +static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> +{
> +    VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
nit: Peter advised the usage of g_new0 as well for that kind of alloc.
> +    quirk->mem = g_new0(MemoryRegion, nr_mem);
> +    quirk->nr_mem = nr_mem;
> +
> +    return quirk;
> +}
> +
>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>  {
>      VFIOQuirk *quirk;
> @@ -288,9 +297,7 @@ static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 1);
> -    quirk->nr_mem = 1;
> +    quirk = vfio_quirk_alloc(1);
>  
>      memory_region_init_io(quirk->mem, OBJECT(vdev), &vfio_ati_3c3_quirk, vdev,
>                            "vfio-ati-3c3-quirk", 1);
> @@ -323,9 +330,7 @@ static void vfio_probe_ati_bar4_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
> +    quirk = vfio_quirk_alloc(2);
>      window = quirk->data = g_malloc0(sizeof(*window) +
>                                       sizeof(VFIOConfigWindowMatch));
>      window->vdev = vdev;
> @@ -371,10 +376,9 @@ static void vfio_probe_ati_bar2_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> +    quirk = vfio_quirk_alloc(1);
>      mirror = quirk->data = g_malloc0(sizeof(*mirror));
> -    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
> -    quirk->nr_mem = 1;
> +    mirror->mem = quirk->mem;
>      mirror->vdev = vdev;
>      mirror->offset = 0x4000;
>      mirror->bar = nr;
> @@ -548,10 +552,8 @@ static void vfio_vga_probe_nvidia_3d0_quirk(VFIOPCIDevice *vdev)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> +    quirk = vfio_quirk_alloc(2);
>      quirk->data = data = g_malloc0(sizeof(*data));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
>      data->vdev = vdev;
>  
>      memory_region_init_io(&quirk->mem[0], OBJECT(vdev), &vfio_nvidia_3d4_quirk,
> @@ -667,9 +669,7 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 4);
> -    quirk->nr_mem = 4;
> +    quirk = vfio_quirk_alloc(4);
>      bar5 = quirk->data = g_malloc0(sizeof(*bar5) +
>                                     (sizeof(VFIOConfigWindowMatch) * 2));
>      window = &bar5->window;
> @@ -762,10 +762,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> +    quirk = vfio_quirk_alloc(1);
>      mirror = quirk->data = g_malloc0(sizeof(*mirror));
> -    mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
> -    quirk->nr_mem = 1;
> +    mirror->mem = quirk->mem;
>      mirror->vdev = vdev;
>      mirror->offset = 0x88000;
>      mirror->bar = nr;
> @@ -781,10 +780,9 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>  
>      /* The 0x1800 offset mirror only seems to get used by legacy VGA */
>      if (vdev->vga) {
> -        quirk = g_malloc0(sizeof(*quirk));
> +        quirk = vfio_quirk_alloc(1);
>          mirror = quirk->data = g_malloc0(sizeof(*mirror));
> -        mirror->mem = quirk->mem = g_new0(MemoryRegion, 1);
> -        quirk->nr_mem = 1;
> +        mirror->mem = quirk->mem;
>          mirror->vdev = vdev;
>          mirror->offset = 0x1800;
>          mirror->bar = nr;
> @@ -945,9 +943,7 @@ static void vfio_probe_rtl8168_bar2_quirk(VFIOPCIDevice *vdev, int nr)
>          return;
>      }
>  
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
> +    quirk = vfio_quirk_alloc(2);
>      quirk->data = rtl = g_malloc0(sizeof(*rtl));
>      rtl->vdev = vdev;
>  
> @@ -1507,9 +1503,7 @@ static void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
>      }
>  
>      /* Setup our quirk to munge GTT addresses to the VM allocated buffer */
> -    quirk = g_malloc0(sizeof(*quirk));
> -    quirk->mem = g_new0(MemoryRegion, 2);
> -    quirk->nr_mem = 2;
> +    quirk = vfio_quirk_alloc(2);
>      igd = quirk->data = g_malloc0(sizeof(*igd));
>      igd->vdev = vdev;
>      igd->index = ~0;
> 
Reviewed-by: Eric Auger <eric.auger@redhat.com>

Thanks

Eric

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
  2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
@ 2018-02-08 11:11     ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:11 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,

On 07/02/18 01:26, Alex Williamson wrote:
> We might wish to handle some quirks via ioeventfds, add a list of
> ioeventfds to the quirk.
The commit title is a bit misleading as we only add the data type and
deletion function.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   17 +++++++++++++++++
>  hw/vfio/pci.h        |   11 +++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 10af23217292..e4cf4ea2dd9c 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -12,6 +12,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/error-report.h"
> +#include "qemu/main-loop.h"
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> @@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
>  static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
>  {
>      VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
> +    QLIST_INIT(&quirk->ioeventfds);
>      quirk->mem = g_new0(MemoryRegion, nr_mem);
>      quirk->nr_mem = nr_mem;
>  
>      return quirk;
>  }
>  
> +static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> +{
> +    QLIST_REMOVE(ioeventfd, next);
> +    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> +                              ioeventfd->match_data, ioeventfd->data,
> +                              &ioeventfd->e);
> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> +    event_notifier_cleanup(&ioeventfd->e);
> +    g_free(ioeventfd);
> +}
> +
>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>  {
>      VFIOQuirk *quirk;
> @@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
>      int i;
>  
>      QLIST_FOREACH(quirk, &bar->quirks, next) {
> +        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
> +            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
> +        }
> +
>          for (i = 0; i < quirk->nr_mem; i++) {
>              memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
>          }
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index f4aa13e021fa..146065c2f715 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -24,9 +24,20 @@
>  
>  struct VFIOPCIDevice;
>  
> +typedef struct VFIOIOEventFD {
> +    QLIST_ENTRY(VFIOIOEventFD) next;
> +    MemoryRegion *mr;
> +    hwaddr addr;
> +    unsigned size;
> +    bool match_data;
Shouldn't you add the match_data field also in the kernel uapi?

Thanks

Eric
> +    uint64_t data;
> +    EventNotifier e;
> +} VFIOIOEventFD;.
> +
>  typedef struct VFIOQuirk {
>      QLIST_ENTRY(VFIOQuirk) next;
>      void *data;
> +    QLIST_HEAD(, VFIOIOEventFD) ioeventfds;
>      int nr_mem;
>      MemoryRegion *mem;
>  } VFIOQuirk;
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
@ 2018-02-08 11:11     ` Auger Eric
  0 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:11 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,

On 07/02/18 01:26, Alex Williamson wrote:
> We might wish to handle some quirks via ioeventfds, add a list of
> ioeventfds to the quirk.
The commit title is a bit misleading as we only add the data type and
deletion function.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   17 +++++++++++++++++
>  hw/vfio/pci.h        |   11 +++++++++++
>  2 files changed, 28 insertions(+)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index 10af23217292..e4cf4ea2dd9c 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -12,6 +12,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/error-report.h"
> +#include "qemu/main-loop.h"
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> @@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
>  static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
>  {
>      VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
> +    QLIST_INIT(&quirk->ioeventfds);
>      quirk->mem = g_new0(MemoryRegion, nr_mem);
>      quirk->nr_mem = nr_mem;
>  
>      return quirk;
>  }
>  
> +static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> +{
> +    QLIST_REMOVE(ioeventfd, next);
> +    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> +                              ioeventfd->match_data, ioeventfd->data,
> +                              &ioeventfd->e);
> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> +    event_notifier_cleanup(&ioeventfd->e);
> +    g_free(ioeventfd);
> +}
> +
>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>  {
>      VFIOQuirk *quirk;
> @@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
>      int i;
>  
>      QLIST_FOREACH(quirk, &bar->quirks, next) {
> +        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
> +            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
> +        }
> +
>          for (i = 0; i < quirk->nr_mem; i++) {
>              memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
>          }
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index f4aa13e021fa..146065c2f715 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -24,9 +24,20 @@
>  
>  struct VFIOPCIDevice;
>  
> +typedef struct VFIOIOEventFD {
> +    QLIST_ENTRY(VFIOIOEventFD) next;
> +    MemoryRegion *mr;
> +    hwaddr addr;
> +    unsigned size;
> +    bool match_data;
Shouldn't you add the match_data field also in the kernel uapi?

Thanks

Eric
> +    uint64_t data;
> +    EventNotifier e;
> +} VFIOIOEventFD;.
> +
>  typedef struct VFIOQuirk {
>      QLIST_ENTRY(VFIOQuirk) next;
>      void *data;
> +    QLIST_HEAD(, VFIOIOEventFD) ioeventfds;
>      int nr_mem;
>      MemoryRegion *mem;
>  } VFIOQuirk;
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
  2018-02-08 11:10     ` [Qemu-devel] " Auger Eric
  (?)
@ 2018-02-08 11:33     ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:33 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,

On 08/02/18 12:10, Auger Eric wrote:
> Hi Alex,
> 
> On 07/02/18 01:26, Alex Williamson wrote:
>> Record data writes that come through the NVIDIA BAR0 quirk, if we get
>> enough in a row that we're only passing through, automatically enable
>> an ioeventfd for it.  The primary target for this is the MSI-ACK
>> that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
>> 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
>> into BAR0 MMIO space.  For an interrupt latency sensitive micro-
>> benchmark, this takes us from 83% of performance versus disabling the
>> quirk entirely (which GeForce cannot do), to to almost 90%.
>>
>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>> ---
>>  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>  hw/vfio/pci.h        |    2 +
>>  2 files changed, 89 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>> index e4cf4ea2dd9c..e739efe601b1 100644lg
> 
>> --- a/hw/vfio/pci-quirks.c
>> +++ b/hw/vfio/pci-quirks.c
>> @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
>>      uint32_t offset;
>>      uint8_t bar;
>>      MemoryRegion *mem;
>> +    uint8_t data[];
> Do you foresee other usages of data besides the LastDataSet?
>>  } VFIOConfigMirrorQuirk;
>>  
>>  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
>> @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>>      g_free(ioeventfd);
>>  }
>>  
> add a comment? user handler in case kvm ioeventfd setup failed?
Forget that. I got confused. At this point you set an ioeventfd which
must be handled on user space. In last patch you plug the kernel vfio
handler through the new iotcl and only in case this fails you use the
userspace handler. Hope I got it right.

Eric


>> +static void vfio_ioeventfd_handler(void *opaque)
>> +{
>> +    VFIOIOEventFD *ioeventfd = opaque;
>> +
>> +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
>> +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
>> +                          ioeventfd->data, ioeventfd->size);
>> +    }
>> +}
>> +
>> +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>> +                                          MemoryRegion *mr, hwaddr addr,
>> +                                          unsigned size, uint64_t data,
>> +                                          VFIORegion *region,
>> +                                          hwaddr region_addr)
>> +{
>> +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
>> +
>> +    if (event_notifier_init(&ioeventfd->e, 0)) {
>> +        g_free(ioeventfd);
>> +        return NULL;
>> +    }
>> +
>> +    ioeventfd->mr = mr;
>> +    ioeventfd->addr = addr;
>> +    ioeventfd->size = size;
>> +    ioeventfd->match_data = true;
>> +    ioeventfd->data = data;
>> +    ioeventfd->region = region;
>> +    ioeventfd->region_addr = region_addr;
> I found difficult to follow the different addr semantic.
> I understand region_add is the offset % bar and addr is the offset %
> mirror region. Maybe more explicit names would help (region = bar_region
> and region_addr = bar_offset)
>> +
>> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
>> +                        vfio_ioeventfd_handler, NULL, ioeventfd);
>> +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
>> +                              ioeventfd->size, ioeventfd->match_data,
>> +                              ioeventfd->data, &ioeventfd->e);
>> +
>> +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
>> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
>> +                vdev->vbasedev.name, region->nr, region_addr, data, size);
>> +
>> +    return ioeventfd;
>> +}
>> +
>>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>>  {
>>      VFIOQuirk *quirk;
>> @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>>      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
>>  }
>>  
>> +typedef struct LastDataSet {
>> +    hwaddr addr;
>> +    uint64_t data;
>> +    unsigned size;
>> +    int count;
>> +} LastDataSet;
>> +
>>  /*
>>   * Finally, BAR0 itself.  We want to redirect any accesses to either
>>   * 0x1800 or 0x88000 through the PCI config space access functions.
>> @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>      VFIOConfigMirrorQuirk *mirror = opaque;
>>      VFIOPCIDevice *vdev = mirror->vdev;
>>      PCIDevice *pdev = &vdev->pdev;
>> +    LastDataSet *last = (LastDataSet *)&mirror->data;
>>  
>>      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
>>  
>> @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>                            addr + mirror->offset, data, size);
>>          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
>>      }
>> +
>> +    /*
>> +     * Automatically add an ioeventfd to handle any repeated write with the
>> +     * same data and size above the standard PCI config space header.  This is
>> +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
>> +     * above.  Current hardware/drivers should trigger an ioeventfd at config
>> +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
>> +     */
>> +    if (addr > PCI_STD_HEADER_SIZEOF) {
>> +        if (addr != last->addr || data != last->data || size != last->size) {
>> +            last->addr = addr;
>> +            last->data = data;
>> +            last->size = size;
>> +            last->count = 1;
>> +        } else if (++last->count > 10) {
> So here is the naive question about the "10" choice and also the fact
> this needs to be consecutive accesses. I guess you observed this works
> but at first sight this is not obvious to me.
> 
> Does anyone check potential overlaps between ioeventfd's [addr, addr +
> size -1]?
>> +            VFIOIOEventFD *ioeventfd;
>> +
>> +            ioeventfd = vfio_ioeventfd_init(vdev, mirror->mem, addr, size, data,
>> +                                            &vdev->bars[mirror->bar].region,
>> +                                            mirror->offset + addr);
>> +            if (ioeventfd) {
>> +                VFIOQuirk *quirk;
>> +
>> +                QLIST_FOREACH(quirk, &vdev->bars[mirror->bar].quirks, next) {
>> +                    if (quirk->data == mirror) {
>> +                        QLIST_INSERT_HEAD(&quirk->ioeventfds, ioeventfd, next);
>> +                        break;
>> +                    }
>> +                }
>> +            }
>> +        }
>> +    }
> 
> Thanks
> 
> Eric
>>  }
>>  
>>  static const MemoryRegionOps vfio_nvidia_mirror_quirk = {
>> @@ -776,7 +861,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>>      }
>>  
>>      quirk = vfio_quirk_alloc(1);
>> -    mirror = quirk->data = g_malloc0(sizeof(*mirror));
>> +    mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
>>      mirror->mem = quirk->mem;
>>      mirror->vdev = vdev;
>>      mirror->offset = 0x88000;
>> @@ -794,7 +879,7 @@ static void vfio_probe_nvidia_bar0_quirk(VFIOPCIDevice *vdev, int nr)
>>      /* The 0x1800 offset mirror only seems to get used by legacy VGA */
>>      if (vdev->vga) {
>>          quirk = vfio_quirk_alloc(1);
>> -        mirror = quirk->data = g_malloc0(sizeof(*mirror));
>> +        mirror = quirk->data = g_malloc0(sizeof(*mirror) + sizeof(LastDataSet));
>>          mirror->mem = quirk->mem;
>>          mirror->vdev = vdev;
>>          mirror->offset = 0x1800;
>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>> index 146065c2f715..ec53b9935725 100644
>> --- a/hw/vfio/pci.h
>> +++ b/hw/vfio/pci.h
>> @@ -32,6 +32,8 @@ typedef struct VFIOIOEventFD {
>>      bool match_data;
>>      uint64_t data;
>>      EventNotifier e;
>> +    VFIORegion *region;
>> +    hwaddr region_addr;
>>  } VFIOIOEventFD;
>>  
>>  typedef struct VFIOQuirk {
>>
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
  2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
@ 2018-02-08 11:42     ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:42 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,
On 07/02/18 01:26, Alex Williamson wrote:
> With vfio ioeventfd support, we can program vfio-pci to perform a
> specified BAR write when an eventfd is triggered.  This allows the
> KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> userspace handling for these events.  On the same micro-benchmark
> where the ioeventfd got us to almost 90% of performance versus
> disabling the GeForce quirks, this gets us to within 95%.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
>  1 file changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e739efe601b1..35a4d5197e2d 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -16,6 +16,7 @@
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include <sys/ioctl.h>
>  #include "hw/nvram/fw_cfg.h"
>  #include "pci.h"
>  #include "trace.h"
> @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
>      return quirk;
>  }
>  
> -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
>  {
> +    struct vfio_device_ioeventfd vfio_ioeventfd;
> +
>      QLIST_REMOVE(ioeventfd, next);
> +
>      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
>                                ioeventfd->match_data, ioeventfd->data,
>                                &ioeventfd->e);
> +
>      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> +
> +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> +    vfio_ioeventfd.flags = ioeventfd->size;
> +    vfio_ioeventfd.data = ioeventfd->data;
> +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> +                            ioeventfd->region_addr;
> +    vfio_ioeventfd.fd = -1;
> +
> +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> +
>      event_notifier_cleanup(&ioeventfd->e);
>      g_free(ioeventfd);
>  }
> @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>                                            hwaddr region_addr)
>  {
>      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> +    struct vfio_device_ioeventfd vfio_ioeventfd;
> +    char vfio_enabled = '+';
>  
>      if (event_notifier_init(&ioeventfd->e, 0)) {
>          g_free(ioeventfd);
> @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>      ioeventfd->region = region;
>      ioeventfd->region_addr = region_addr;
>  
> -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> +    vfio_ioeventfd.flags = ioeventfd->size;
> +    vfio_ioeventfd.data = ioeventfd->data;
> +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> +                            ioeventfd->region_addr;
> +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> +
> +    if (ioctl(vdev->vbasedev.fd,
> +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> +        vfio_enabled = '-';
> +    }
> +
>      memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
>                                ioeventfd->size, ioeventfd->match_data,
>                                ioeventfd->data, &ioeventfd->e);
>  
>      info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> -                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> -                vdev->vbasedev.name, region->nr, region_addr, data, size);
> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u, vfio%c",
> +                vdev->vbasedev.name, region->nr, region_addr, data, size,
> +                vfio_enabled);
Not sure if this message is really helpful for the end-user to
understand what happens. Maybe adding a trace event when everything
happens as it should and an error_report if we failed setting up the
vfio kernel handler, explaining the sub-optimal performance that can result.

Thanks

Eric
>  
>      return ioeventfd;
>  }
> @@ -1767,7 +1797,7 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
>  
>      QLIST_FOREACH(quirk, &bar->quirks, next) {
>          while (!QLIST_EMPTY(&quirk->ioeventfds)) {
> -            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
> +            vfio_ioeventfd_exit(vdev, QLIST_FIRST(&quirk->ioeventfds));
>          }
>  
>          for (i = 0; i < quirk->nr_mem; i++) {
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
@ 2018-02-08 11:42     ` Auger Eric
  0 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 11:42 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: kvm

Hi Alex,
On 07/02/18 01:26, Alex Williamson wrote:
> With vfio ioeventfd support, we can program vfio-pci to perform a
> specified BAR write when an eventfd is triggered.  This allows the
> KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> userspace handling for these events.  On the same micro-benchmark
> where the ioeventfd got us to almost 90% of performance versus
> disabling the GeForce quirks, this gets us to within 95%.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
>  1 file changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e739efe601b1..35a4d5197e2d 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -16,6 +16,7 @@
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include <sys/ioctl.h>
>  #include "hw/nvram/fw_cfg.h"
>  #include "pci.h"
>  #include "trace.h"
> @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
>      return quirk;
>  }
>  
> -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
>  {
> +    struct vfio_device_ioeventfd vfio_ioeventfd;
> +
>      QLIST_REMOVE(ioeventfd, next);
> +
>      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
>                                ioeventfd->match_data, ioeventfd->data,
>                                &ioeventfd->e);
> +
>      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> +
> +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> +    vfio_ioeventfd.flags = ioeventfd->size;
> +    vfio_ioeventfd.data = ioeventfd->data;
> +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> +                            ioeventfd->region_addr;
> +    vfio_ioeventfd.fd = -1;
> +
> +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> +
>      event_notifier_cleanup(&ioeventfd->e);
>      g_free(ioeventfd);
>  }
> @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>                                            hwaddr region_addr)
>  {
>      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> +    struct vfio_device_ioeventfd vfio_ioeventfd;
> +    char vfio_enabled = '+';
>  
>      if (event_notifier_init(&ioeventfd->e, 0)) {
>          g_free(ioeventfd);
> @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>      ioeventfd->region = region;
>      ioeventfd->region_addr = region_addr;
>  
> -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> +    vfio_ioeventfd.flags = ioeventfd->size;
> +    vfio_ioeventfd.data = ioeventfd->data;
> +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> +                            ioeventfd->region_addr;
> +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> +
> +    if (ioctl(vdev->vbasedev.fd,
> +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> +        vfio_enabled = '-';
> +    }
> +
>      memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
>                                ioeventfd->size, ioeventfd->match_data,
>                                ioeventfd->data, &ioeventfd->e);
>  
>      info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> -                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> -                vdev->vbasedev.name, region->nr, region_addr, data, size);
> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u, vfio%c",
> +                vdev->vbasedev.name, region->nr, region_addr, data, size,
> +                vfio_enabled);
Not sure if this message is really helpful for the end-user to
understand what happens. Maybe adding a trace event when everything
happens as it should and an error_report if we failed setting up the
vfio kernel handler, explaining the sub-optimal performance that can result.

Thanks

Eric
>  
>      return ioeventfd;
>  }
> @@ -1767,7 +1797,7 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
>  
>      QLIST_FOREACH(quirk, &bar->quirks, next) {
>          while (!QLIST_EMPTY(&quirk->ioeventfds)) {
> -            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
> +            vfio_ioeventfd_exit(vdev, QLIST_FIRST(&quirk->ioeventfds));
>          }
>  
>          for (i = 0; i < quirk->nr_mem; i++) {
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
  2018-02-08 11:10     ` [Qemu-devel] " Auger Eric
@ 2018-02-08 18:24       ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:24 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:10:02 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 07/02/18 01:26, Alex Williamson wrote:
> > Record data writes that come through the NVIDIA BAR0 quirk, if we get
> > enough in a row that we're only passing through, automatically enable
> > an ioeventfd for it.  The primary target for this is the MSI-ACK
> > that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
> > 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
> > into BAR0 MMIO space.  For an interrupt latency sensitive micro-
> > benchmark, this takes us from 83% of performance versus disabling the
> > quirk entirely (which GeForce cannot do), to to almost 90%.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  hw/vfio/pci.h        |    2 +
> >  2 files changed, 89 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e4cf4ea2dd9c..e739efe601b1 100644lg  
> 
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
> >      uint32_t offset;
> >      uint8_t bar;
> >      MemoryRegion *mem;
> > +    uint8_t data[];  
> Do you foresee other usages of data besides the LastDataSet?

Sure, LastDataSet is just a very crude tracking mechanism, I could
imagine some kind of device specific LRU array.  Any quirk wanting to
add additional analytics on top of this generic quirk could make use of
it.  I didn't want to pollute the generic quirk with LastDataSet since
there are other users, but I also didn't want to complicate the shared
free'ing path we have by simply adding a pointer.  Thus embedding quirk
specific data is what I went with here.

> >  } VFIOConfigMirrorQuirk;
> >  
> >  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
> > @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> >      g_free(ioeventfd);
> >  }
> >    
> add a comment? user handler in case kvm ioeventfd setup failed?

As you correctly updated later, we're only setting up the kvm-user
ioeventfd handler in this patch.  This is the handler when that
succeeds.  Once we add vfio ioeventfd support, this handler will
hopefully never get used, but we still need to support kernels that
won't have vfio ioeventfd support and this alone does still provide a
performance improvement.

> > +static void vfio_ioeventfd_handler(void *opaque)
> > +{
> > +    VFIOIOEventFD *ioeventfd = opaque;
> > +
> > +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
> > +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
> > +                          ioeventfd->data, ioeventfd->size);
> > +    }
> > +}
> > +
> > +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> > +                                          MemoryRegion *mr, hwaddr addr,
> > +                                          unsigned size, uint64_t data,
> > +                                          VFIORegion *region,
> > +                                          hwaddr region_addr)
> > +{
> > +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> > +
> > +    if (event_notifier_init(&ioeventfd->e, 0)) {
> > +        g_free(ioeventfd);
> > +        return NULL;
> > +    }
> > +
> > +    ioeventfd->mr = mr;
> > +    ioeventfd->addr = addr;
> > +    ioeventfd->size = size;
> > +    ioeventfd->match_data = true;
> > +    ioeventfd->data = data;
> > +    ioeventfd->region = region;
> > +    ioeventfd->region_addr = region_addr;  
> I found difficult to follow the different addr semantic.
> I understand region_add is the offset % bar and addr is the offset %
> mirror region. Maybe more explicit names would help (region = bar_region
> and region_addr = bar_offset)

I was trying to avoid PCI specific fields for VFIOIOEventFD, but
basically we're dealing with two different sets of base and offset.
The first is the MemoryRegion of the quirk and the offset relative to
that MemoryRegion.  Those are used by memory_region_add_eventfd() for
establishing the kvm ioeventfd.  The second is the region and
region_addr, which identify the vfio region and offset relative to the
region so that we can actually handle it with vfio_region_write().

I agree though that these are confusing and maybe comments are a
solution as I can't think of better names:

    /*
     * MemoryRegion and relative offset, plus additional ioeventfd setup
     * parameters for configuring and later tearing down KVM ioeventfd.
     */
    ioeventfd->mr = mr;
    ioeventfd->addr = addr;
    ioeventfd->size = size;
    ioeventfd->match_data = true;
    ioeventfd->data = data;
    /*
     * VFIORegion and relative offset for implementing the userspace
     * handler.  data & size fields shared for both uses.
     */
    ioeventfd->region = region;
    ioeventfd->region_addr = region_addr;


> > +
> > +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > +                        vfio_ioeventfd_handler, NULL, ioeventfd);
> > +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
> > +                              ioeventfd->size, ioeventfd->match_data,
> > +                              ioeventfd->data, &ioeventfd->e);
> > +
> > +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> > +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> > +                vdev->vbasedev.name, region->nr, region_addr, data, size);
> > +
> > +    return ioeventfd;
> > +}
> > +
> >  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
> >  {
> >      VFIOQuirk *quirk;
> > @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
> >      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
> >  }
> >  
> > +typedef struct LastDataSet {
> > +    hwaddr addr;
> > +    uint64_t data;
> > +    unsigned size;
> > +    int count;
> > +} LastDataSet;
> > +
> >  /*
> >   * Finally, BAR0 itself.  We want to redirect any accesses to either
> >   * 0x1800 or 0x88000 through the PCI config space access functions.
> > @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
> >      VFIOConfigMirrorQuirk *mirror = opaque;
> >      VFIOPCIDevice *vdev = mirror->vdev;
> >      PCIDevice *pdev = &vdev->pdev;
> > +    LastDataSet *last = (LastDataSet *)&mirror->data;
> >  
> >      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
> >  
> > @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
> >                            addr + mirror->offset, data, size);
> >          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
> >      }
> > +
> > +    /*
> > +     * Automatically add an ioeventfd to handle any repeated write with the
> > +     * same data and size above the standard PCI config space header.  This is
> > +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
> > +     * above.  Current hardware/drivers should trigger an ioeventfd at config
> > +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
> > +     */
> > +    if (addr > PCI_STD_HEADER_SIZEOF) {
> > +        if (addr != last->addr || data != last->data || size != last->size) {
> > +            last->addr = addr;
> > +            last->data = data;
> > +            last->size = size;
> > +            last->count = 1;
> > +        } else if (++last->count > 10) {  
> So here is the naive question about the "10" choice and also the fact
> this needs to be consecutive accesses. I guess you observed this works
> but at first sight this is not obvious to me.

Initially when I started working on this I used the previously known
MSI-ACK behavior, which was a write to the MSI capability ID register,
but I quickly figured out it didn't do anything because my current guest
setup uses the 0x704 offset noted in the comment above.  So I continued
developing with 0x704 hard coded.  But I have no idea if everything
uses 0x704 now or some drivers might still use the capability ID, or
what NVIDIA might switch to next.  So I wanted to make this dynamic and
I figured that shouldn't anything that gets repeated calls through here
with the same data that we're not modifying on the way to hardware
benefit from this acceleration.  Then I started thinking about what
frequency of write should trigger this switch-over, but the time
component of that is hard to deal with, so I found that I could just
drop it.  The MSI-ACK behavior means that we see a pretty continuous
stream of writes to the same offset, same data, so that triggers right
away.  Interestingly, after the MSI-ACK triggers the ioeventfd, we even
add another for a relatively low frequency write.  I wouldn't have
bothered with this one otherwise, but I don't see the harm in leaving
it since we're setup for a datamatch we'll only trigger the ioeventfd
if it remains the same.  So basically to answer your question, it's 10
because it seemed like a reasonable watermark and it works.  If we had
to, we could use some sort of LRU array, but this is simple.

> Does anyone check potential overlaps between ioeventfd's [addr, addr +
> size -1]?

Do we need to?  KVM will only trigger the ioeventfd if we match the
address, data, and size, so AIUI we could have multiple ioeventfds
covering the same area with different data or size and there's nothing
wrong with that.  I wondered whether we need some sort of pruning
algorithm, but once we've enabled kvm-vfio acceleration, QEMU no longer
has any idea which are being used.  Maybe we could arbitrarily decide
some fixed number limit for a device, but currently I think the limit
is based on the user's open file limit.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
@ 2018-02-08 18:24       ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:24 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:10:02 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 07/02/18 01:26, Alex Williamson wrote:
> > Record data writes that come through the NVIDIA BAR0 quirk, if we get
> > enough in a row that we're only passing through, automatically enable
> > an ioeventfd for it.  The primary target for this is the MSI-ACK
> > that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
> > 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
> > into BAR0 MMIO space.  For an interrupt latency sensitive micro-
> > benchmark, this takes us from 83% of performance versus disabling the
> > quirk entirely (which GeForce cannot do), to to almost 90%.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >  hw/vfio/pci.h        |    2 +
> >  2 files changed, 89 insertions(+), 2 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e4cf4ea2dd9c..e739efe601b1 100644lg  
> 
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
> >      uint32_t offset;
> >      uint8_t bar;
> >      MemoryRegion *mem;
> > +    uint8_t data[];  
> Do you foresee other usages of data besides the LastDataSet?

Sure, LastDataSet is just a very crude tracking mechanism, I could
imagine some kind of device specific LRU array.  Any quirk wanting to
add additional analytics on top of this generic quirk could make use of
it.  I didn't want to pollute the generic quirk with LastDataSet since
there are other users, but I also didn't want to complicate the shared
free'ing path we have by simply adding a pointer.  Thus embedding quirk
specific data is what I went with here.

> >  } VFIOConfigMirrorQuirk;
> >  
> >  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
> > @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> >      g_free(ioeventfd);
> >  }
> >    
> add a comment? user handler in case kvm ioeventfd setup failed?

As you correctly updated later, we're only setting up the kvm-user
ioeventfd handler in this patch.  This is the handler when that
succeeds.  Once we add vfio ioeventfd support, this handler will
hopefully never get used, but we still need to support kernels that
won't have vfio ioeventfd support and this alone does still provide a
performance improvement.

> > +static void vfio_ioeventfd_handler(void *opaque)
> > +{
> > +    VFIOIOEventFD *ioeventfd = opaque;
> > +
> > +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
> > +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
> > +                          ioeventfd->data, ioeventfd->size);
> > +    }
> > +}
> > +
> > +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> > +                                          MemoryRegion *mr, hwaddr addr,
> > +                                          unsigned size, uint64_t data,
> > +                                          VFIORegion *region,
> > +                                          hwaddr region_addr)
> > +{
> > +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> > +
> > +    if (event_notifier_init(&ioeventfd->e, 0)) {
> > +        g_free(ioeventfd);
> > +        return NULL;
> > +    }
> > +
> > +    ioeventfd->mr = mr;
> > +    ioeventfd->addr = addr;
> > +    ioeventfd->size = size;
> > +    ioeventfd->match_data = true;
> > +    ioeventfd->data = data;
> > +    ioeventfd->region = region;
> > +    ioeventfd->region_addr = region_addr;  
> I found difficult to follow the different addr semantic.
> I understand region_add is the offset % bar and addr is the offset %
> mirror region. Maybe more explicit names would help (region = bar_region
> and region_addr = bar_offset)

I was trying to avoid PCI specific fields for VFIOIOEventFD, but
basically we're dealing with two different sets of base and offset.
The first is the MemoryRegion of the quirk and the offset relative to
that MemoryRegion.  Those are used by memory_region_add_eventfd() for
establishing the kvm ioeventfd.  The second is the region and
region_addr, which identify the vfio region and offset relative to the
region so that we can actually handle it with vfio_region_write().

I agree though that these are confusing and maybe comments are a
solution as I can't think of better names:

    /*
     * MemoryRegion and relative offset, plus additional ioeventfd setup
     * parameters for configuring and later tearing down KVM ioeventfd.
     */
    ioeventfd->mr = mr;
    ioeventfd->addr = addr;
    ioeventfd->size = size;
    ioeventfd->match_data = true;
    ioeventfd->data = data;
    /*
     * VFIORegion and relative offset for implementing the userspace
     * handler.  data & size fields shared for both uses.
     */
    ioeventfd->region = region;
    ioeventfd->region_addr = region_addr;


> > +
> > +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > +                        vfio_ioeventfd_handler, NULL, ioeventfd);
> > +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
> > +                              ioeventfd->size, ioeventfd->match_data,
> > +                              ioeventfd->data, &ioeventfd->e);
> > +
> > +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> > +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> > +                vdev->vbasedev.name, region->nr, region_addr, data, size);
> > +
> > +    return ioeventfd;
> > +}
> > +
> >  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
> >  {
> >      VFIOQuirk *quirk;
> > @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
> >      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
> >  }
> >  
> > +typedef struct LastDataSet {
> > +    hwaddr addr;
> > +    uint64_t data;
> > +    unsigned size;
> > +    int count;
> > +} LastDataSet;
> > +
> >  /*
> >   * Finally, BAR0 itself.  We want to redirect any accesses to either
> >   * 0x1800 or 0x88000 through the PCI config space access functions.
> > @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
> >      VFIOConfigMirrorQuirk *mirror = opaque;
> >      VFIOPCIDevice *vdev = mirror->vdev;
> >      PCIDevice *pdev = &vdev->pdev;
> > +    LastDataSet *last = (LastDataSet *)&mirror->data;
> >  
> >      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
> >  
> > @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
> >                            addr + mirror->offset, data, size);
> >          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
> >      }
> > +
> > +    /*
> > +     * Automatically add an ioeventfd to handle any repeated write with the
> > +     * same data and size above the standard PCI config space header.  This is
> > +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
> > +     * above.  Current hardware/drivers should trigger an ioeventfd at config
> > +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
> > +     */
> > +    if (addr > PCI_STD_HEADER_SIZEOF) {
> > +        if (addr != last->addr || data != last->data || size != last->size) {
> > +            last->addr = addr;
> > +            last->data = data;
> > +            last->size = size;
> > +            last->count = 1;
> > +        } else if (++last->count > 10) {  
> So here is the naive question about the "10" choice and also the fact
> this needs to be consecutive accesses. I guess you observed this works
> but at first sight this is not obvious to me.

Initially when I started working on this I used the previously known
MSI-ACK behavior, which was a write to the MSI capability ID register,
but I quickly figured out it didn't do anything because my current guest
setup uses the 0x704 offset noted in the comment above.  So I continued
developing with 0x704 hard coded.  But I have no idea if everything
uses 0x704 now or some drivers might still use the capability ID, or
what NVIDIA might switch to next.  So I wanted to make this dynamic and
I figured that shouldn't anything that gets repeated calls through here
with the same data that we're not modifying on the way to hardware
benefit from this acceleration.  Then I started thinking about what
frequency of write should trigger this switch-over, but the time
component of that is hard to deal with, so I found that I could just
drop it.  The MSI-ACK behavior means that we see a pretty continuous
stream of writes to the same offset, same data, so that triggers right
away.  Interestingly, after the MSI-ACK triggers the ioeventfd, we even
add another for a relatively low frequency write.  I wouldn't have
bothered with this one otherwise, but I don't see the harm in leaving
it since we're setup for a datamatch we'll only trigger the ioeventfd
if it remains the same.  So basically to answer your question, it's 10
because it seemed like a reasonable watermark and it works.  If we had
to, we could use some sort of LRU array, but this is simple.

> Does anyone check potential overlaps between ioeventfd's [addr, addr +
> size -1]?

Do we need to?  KVM will only trigger the ioeventfd if we match the
address, data, and size, so AIUI we could have multiple ioeventfds
covering the same area with different data or size and there's nothing
wrong with that.  I wondered whether we need some sort of pruning
algorithm, but once we've enabled kvm-vfio acceleration, QEMU no longer
has any idea which are being used.  Maybe we could arbitrarily decide
some fixed number limit for a device, but currently I think the limit
is based on the user's open file limit.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper
  2018-02-08 11:10     ` [Qemu-devel] " Auger Eric
@ 2018-02-08 18:28       ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:28 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:10:35 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> On 07/02/18 01:26, Alex Williamson wrote:
> > This will later be used to include list initialization
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   48 +++++++++++++++++++++---------------------------
> >  1 file changed, 21 insertions(+), 27 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e5779a7ad35b..10af23217292 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -275,6 +275,15 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
> >      .endianness = DEVICE_LITTLE_ENDIAN,
> >  };
> >  
> > +static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> > +{
> > +    VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));  
> nit: Peter advised the usage of g_new0 as well for that kind of alloc.

Not my favorite interface, but good point.  Updated and R-b added.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper
@ 2018-02-08 18:28       ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:28 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:10:35 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> On 07/02/18 01:26, Alex Williamson wrote:
> > This will later be used to include list initialization
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   48 +++++++++++++++++++++---------------------------
> >  1 file changed, 21 insertions(+), 27 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e5779a7ad35b..10af23217292 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -275,6 +275,15 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
> >      .endianness = DEVICE_LITTLE_ENDIAN,
> >  };
> >  
> > +static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> > +{
> > +    VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));  
> nit: Peter advised the usage of g_new0 as well for that kind of alloc.

Not my favorite interface, but good point.  Updated and R-b added.
Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
  2018-02-08 11:11     ` [Qemu-devel] " Auger Eric
@ 2018-02-08 18:33       ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:33 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:11:57 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 07/02/18 01:26, Alex Williamson wrote:
> > We might wish to handle some quirks via ioeventfds, add a list of
> > ioeventfds to the quirk.  
> The commit title is a bit misleading as we only add the data type and
> deletion function.

Unfortunately that's as much as I can add here or else QEMU will fail
to build because of unused functions :-\  So I pulled as much
infrastructure as I could here but had to leave some meat for when it
gets called.  Suggestions welcome for title/commit updates.

> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   17 +++++++++++++++++
> >  hw/vfio/pci.h        |   11 +++++++++++
> >  2 files changed, 28 insertions(+)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index 10af23217292..e4cf4ea2dd9c 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -12,6 +12,7 @@
> >  
> >  #include "qemu/osdep.h"
> >  #include "qemu/error-report.h"
> > +#include "qemu/main-loop.h"
> >  #include "qemu/range.h"
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> > @@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
> >  static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> >  {
> >      VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
> > +    QLIST_INIT(&quirk->ioeventfds);
> >      quirk->mem = g_new0(MemoryRegion, nr_mem);
> >      quirk->nr_mem = nr_mem;
> >  
> >      return quirk;
> >  }
> >  
> > +static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> > +{
> > +    QLIST_REMOVE(ioeventfd, next);
> > +    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> > +                              ioeventfd->match_data, ioeventfd->data,
> > +                              &ioeventfd->e);
> > +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> > +    event_notifier_cleanup(&ioeventfd->e);
> > +    g_free(ioeventfd);
> > +}
> > +
> >  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
> >  {
> >      VFIOQuirk *quirk;
> > @@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
> >      int i;
> >  
> >      QLIST_FOREACH(quirk, &bar->quirks, next) {
> > +        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
> > +            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
> > +        }
> > +
> >          for (i = 0; i < quirk->nr_mem; i++) {
> >              memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
> >          }
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index f4aa13e021fa..146065c2f715 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -24,9 +24,20 @@
> >  
> >  struct VFIOPCIDevice;
> >  
> > +typedef struct VFIOIOEventFD {
> > +    QLIST_ENTRY(VFIOIOEventFD) next;
> > +    MemoryRegion *mr;
> > +    hwaddr addr;
> > +    unsigned size;
> > +    bool match_data;  
> Shouldn't you add the match_data field also in the kernel uapi?

In the vfio uapi?  vfio isn't matching anything and has no ability to
match anything, it's just triggering a pre-programmed write via
eventfd.  Maybe I'm not understanding the comment.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
@ 2018-02-08 18:33       ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:33 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:11:57 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> 
> On 07/02/18 01:26, Alex Williamson wrote:
> > We might wish to handle some quirks via ioeventfds, add a list of
> > ioeventfds to the quirk.  
> The commit title is a bit misleading as we only add the data type and
> deletion function.

Unfortunately that's as much as I can add here or else QEMU will fail
to build because of unused functions :-\  So I pulled as much
infrastructure as I could here but had to leave some meat for when it
gets called.  Suggestions welcome for title/commit updates.

> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   17 +++++++++++++++++
> >  hw/vfio/pci.h        |   11 +++++++++++
> >  2 files changed, 28 insertions(+)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index 10af23217292..e4cf4ea2dd9c 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -12,6 +12,7 @@
> >  
> >  #include "qemu/osdep.h"
> >  #include "qemu/error-report.h"
> > +#include "qemu/main-loop.h"
> >  #include "qemu/range.h"
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> > @@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
> >  static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> >  {
> >      VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
> > +    QLIST_INIT(&quirk->ioeventfds);
> >      quirk->mem = g_new0(MemoryRegion, nr_mem);
> >      quirk->nr_mem = nr_mem;
> >  
> >      return quirk;
> >  }
> >  
> > +static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> > +{
> > +    QLIST_REMOVE(ioeventfd, next);
> > +    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> > +                              ioeventfd->match_data, ioeventfd->data,
> > +                              &ioeventfd->e);
> > +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> > +    event_notifier_cleanup(&ioeventfd->e);
> > +    g_free(ioeventfd);
> > +}
> > +
> >  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
> >  {
> >      VFIOQuirk *quirk;
> > @@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
> >      int i;
> >  
> >      QLIST_FOREACH(quirk, &bar->quirks, next) {
> > +        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
> > +            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
> > +        }
> > +
> >          for (i = 0; i < quirk->nr_mem; i++) {
> >              memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
> >          }
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index f4aa13e021fa..146065c2f715 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -24,9 +24,20 @@
> >  
> >  struct VFIOPCIDevice;
> >  
> > +typedef struct VFIOIOEventFD {
> > +    QLIST_ENTRY(VFIOIOEventFD) next;
> > +    MemoryRegion *mr;
> > +    hwaddr addr;
> > +    unsigned size;
> > +    bool match_data;  
> Shouldn't you add the match_data field also in the kernel uapi?

In the vfio uapi?  vfio isn't matching anything and has no ability to
match anything, it's just triggering a pre-programmed write via
eventfd.  Maybe I'm not understanding the comment.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
  2018-02-08 11:42     ` [Qemu-devel] " Auger Eric
@ 2018-02-08 18:41       ` Alex Williamson
  -1 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:41 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:42:15 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> On 07/02/18 01:26, Alex Williamson wrote:
> > With vfio ioeventfd support, we can program vfio-pci to perform a
> > specified BAR write when an eventfd is triggered.  This allows the
> > KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> > userspace handling for these events.  On the same micro-benchmark
> > where the ioeventfd got us to almost 90% of performance versus
> > disabling the GeForce quirks, this gets us to within 95%.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 36 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e739efe601b1..35a4d5197e2d 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -16,6 +16,7 @@
> >  #include "qemu/range.h"
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> > +#include <sys/ioctl.h>
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "pci.h"
> >  #include "trace.h"
> > @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> >      return quirk;
> >  }
> >  
> > -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> > +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
> >  {
> > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > +
> >      QLIST_REMOVE(ioeventfd, next);
> > +
> >      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> >                                ioeventfd->match_data, ioeventfd->data,
> >                                &ioeventfd->e);
> > +
> >      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> > +
> > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > +    vfio_ioeventfd.flags = ioeventfd->size;
> > +    vfio_ioeventfd.data = ioeventfd->data;
> > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > +                            ioeventfd->region_addr;
> > +    vfio_ioeventfd.fd = -1;
> > +
> > +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> > +
> >      event_notifier_cleanup(&ioeventfd->e);
> >      g_free(ioeventfd);
> >  }
> > @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> >                                            hwaddr region_addr)
> >  {
> >      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > +    char vfio_enabled = '+';
> >  
> >      if (event_notifier_init(&ioeventfd->e, 0)) {
> >          g_free(ioeventfd);
> > @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> >      ioeventfd->region = region;
> >      ioeventfd->region_addr = region_addr;
> >  
> > -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > +    vfio_ioeventfd.flags = ioeventfd->size;
> > +    vfio_ioeventfd.data = ioeventfd->data;
> > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > +                            ioeventfd->region_addr;
> > +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> > +
> > +    if (ioctl(vdev->vbasedev.fd,
> > +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> > +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> > +        vfio_enabled = '-';
> > +    }
> > +
> >      memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
> >                                ioeventfd->size, ioeventfd->match_data,
> >                                ioeventfd->data, &ioeventfd->e);
> >  
> >      info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> > -                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> > -                vdev->vbasedev.name, region->nr, region_addr, data, size);
> > +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u, vfio%c",
> > +                vdev->vbasedev.name, region->nr, region_addr, data, size,
> > +                vfio_enabled);  
> Not sure if this message is really helpful for the end-user to
> understand what happens. Maybe adding a trace event when everything
> happens as it should and an error_report if we failed setting up the
> vfio kernel handler, explaining the sub-optimal performance that can result.

For right now, I think it is useful.  Maybe when we get a few kernels
beyond when the vfio support is introduced and we know how different
devices are behaving and what ioeventfds get added, it might make sense
to switch to a trace interface.  I don't think we can legitimately
trigger an error_report for a feature which is just an accelerator and
isn't even in upstream kernels yet (though arguably it would be
upstream by the time this gets into QEMU).  For now it let's me ask
users to try it and they don't need to learn how to use tracing to send
reports that can be easily verified that both the QEMU and kernel bits
are in place and functional.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
@ 2018-02-08 18:41       ` Alex Williamson
  0 siblings, 0 replies; 35+ messages in thread
From: Alex Williamson @ 2018-02-08 18:41 UTC (permalink / raw)
  To: Auger Eric; +Cc: qemu-devel, kvm

On Thu, 8 Feb 2018 12:42:15 +0100
Auger Eric <eric.auger@redhat.com> wrote:

> Hi Alex,
> On 07/02/18 01:26, Alex Williamson wrote:
> > With vfio ioeventfd support, we can program vfio-pci to perform a
> > specified BAR write when an eventfd is triggered.  This allows the
> > KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> > userspace handling for these events.  On the same micro-benchmark
> > where the ioeventfd got us to almost 90% of performance versus
> > disabling the GeForce quirks, this gets us to within 95%.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 36 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e739efe601b1..35a4d5197e2d 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -16,6 +16,7 @@
> >  #include "qemu/range.h"
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> > +#include <sys/ioctl.h>
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "pci.h"
> >  #include "trace.h"
> > @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> >      return quirk;
> >  }
> >  
> > -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> > +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
> >  {
> > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > +
> >      QLIST_REMOVE(ioeventfd, next);
> > +
> >      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> >                                ioeventfd->match_data, ioeventfd->data,
> >                                &ioeventfd->e);
> > +
> >      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> > +
> > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > +    vfio_ioeventfd.flags = ioeventfd->size;
> > +    vfio_ioeventfd.data = ioeventfd->data;
> > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > +                            ioeventfd->region_addr;
> > +    vfio_ioeventfd.fd = -1;
> > +
> > +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> > +
> >      event_notifier_cleanup(&ioeventfd->e);
> >      g_free(ioeventfd);
> >  }
> > @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> >                                            hwaddr region_addr)
> >  {
> >      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > +    char vfio_enabled = '+';
> >  
> >      if (event_notifier_init(&ioeventfd->e, 0)) {
> >          g_free(ioeventfd);
> > @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> >      ioeventfd->region = region;
> >      ioeventfd->region_addr = region_addr;
> >  
> > -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > +    vfio_ioeventfd.flags = ioeventfd->size;
> > +    vfio_ioeventfd.data = ioeventfd->data;
> > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > +                            ioeventfd->region_addr;
> > +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> > +
> > +    if (ioctl(vdev->vbasedev.fd,
> > +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> > +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> > +        vfio_enabled = '-';
> > +    }
> > +
> >      memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
> >                                ioeventfd->size, ioeventfd->match_data,
> >                                ioeventfd->data, &ioeventfd->e);
> >  
> >      info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
> > -                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
> > -                vdev->vbasedev.name, region->nr, region_addr, data, size);
> > +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u, vfio%c",
> > +                vdev->vbasedev.name, region->nr, region_addr, data, size,
> > +                vfio_enabled);  
> Not sure if this message is really helpful for the end-user to
> understand what happens. Maybe adding a trace event when everything
> happens as it should and an error_report if we failed setting up the
> vfio kernel handler, explaining the sub-optimal performance that can result.

For right now, I think it is useful.  Maybe when we get a few kernels
beyond when the vfio support is introduced and we know how different
devices are behaving and what ioeventfds get added, it might make sense
to switch to a trace interface.  I don't think we can legitimately
trigger an error_report for a feature which is just an accelerator and
isn't even in upstream kernels yet (though arguably it would be
upstream by the time this gets into QEMU).  For now it let's me ask
users to try it and they don't need to learn how to use tracing to send
reports that can be easily verified that both the QEMU and kernel bits
are in place and functional.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds
  2018-02-08 18:33       ` [Qemu-devel] " Alex Williamson
  (?)
@ 2018-02-08 20:37       ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 20:37 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

Hi Alex,

On 08/02/18 19:33, Alex Williamson wrote:
> On Thu, 8 Feb 2018 12:11:57 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 07/02/18 01:26, Alex Williamson wrote:
>>> We might wish to handle some quirks via ioeventfds, add a list of
>>> ioeventfds to the quirk.  
>> The commit title is a bit misleading as we only add the data type and
>> deletion function.
> 
> Unfortunately that's as much as I can add here or else QEMU will fail
> to build because of unused functions :-\  So I pulled as much
> infrastructure as I could here but had to leave some meat for when it
> gets called.  Suggestions welcome for title/commit updates.

Understood. Something like "vfio/quirks: Add a VFIOIOEventFD list in
quirks"?
> 
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/vfio/pci-quirks.c |   17 +++++++++++++++++
>>>  hw/vfio/pci.h        |   11 +++++++++++
>>>  2 files changed, 28 insertions(+)
>>>
>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>> index 10af23217292..e4cf4ea2dd9c 100644
>>> --- a/hw/vfio/pci-quirks.c
>>> +++ b/hw/vfio/pci-quirks.c
>>> @@ -12,6 +12,7 @@
>>>  
>>>  #include "qemu/osdep.h"
>>>  #include "qemu/error-report.h"
>>> +#include "qemu/main-loop.h"
>>>  #include "qemu/range.h"
>>>  #include "qapi/error.h"
>>>  #include "qapi/visitor.h"
>>> @@ -278,12 +279,24 @@ static const MemoryRegionOps vfio_ati_3c3_quirk = {
>>>  static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
>>>  {
>>>      VFIOQuirk *quirk = g_malloc0(sizeof(*quirk));
>>> +    QLIST_INIT(&quirk->ioeventfds);
>>>      quirk->mem = g_new0(MemoryRegion, nr_mem);
>>>      quirk->nr_mem = nr_mem;
>>>  
>>>      return quirk;
>>>  }
>>>  
>>> +static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>>> +{
>>> +    QLIST_REMOVE(ioeventfd, next);
>>> +    memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
>>> +                              ioeventfd->match_data, ioeventfd->data,
>>> +                              &ioeventfd->e);
>>> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
>>> +    event_notifier_cleanup(&ioeventfd->e);
>>> +    g_free(ioeventfd);
>>> +}
>>> +
>>>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>>>  {
>>>      VFIOQuirk *quirk;
>>> @@ -1668,6 +1681,10 @@ void vfio_bar_quirk_exit(VFIOPCIDevice *vdev, int nr)
>>>      int i;
>>>  
>>>      QLIST_FOREACH(quirk, &bar->quirks, next) {
>>> +        while (!QLIST_EMPTY(&quirk->ioeventfds)) {
>>> +            vfio_ioeventfd_exit(QLIST_FIRST(&quirk->ioeventfds));
>>> +        }
>>> +
>>>          for (i = 0; i < quirk->nr_mem; i++) {
>>>              memory_region_del_subregion(bar->region.mem, &quirk->mem[i]);
>>>          }
>>> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
>>> index f4aa13e021fa..146065c2f715 100644
>>> --- a/hw/vfio/pci.h
>>> +++ b/hw/vfio/pci.h
>>> @@ -24,9 +24,20 @@
>>>  
>>>  struct VFIOPCIDevice;
>>>  
>>> +typedef struct VFIOIOEventFD {
>>> +    QLIST_ENTRY(VFIOIOEventFD) next;
>>> +    MemoryRegion *mr;
>>> +    hwaddr addr;
>>> +    unsigned size;
>>> +    bool match_data;  
>> Shouldn't you add the match_data field also in the kernel uapi?
> 
> In the vfio uapi?  vfio isn't matching anything and has no ability to
> match anything, it's just triggering a pre-programmed write via
> eventfd.  Maybe I'm not understanding the comment.  Thanks,

Yes that not sensible. We have a KVM ioeventfd that triggers an fd only
if the address/data matches and this VFIO "ioeventfd" listens to this fd
and writes the fixed address/data. At the beginning I was thinking about
having the VFIO handler taking in charge any value written at this
address but there is no way to retrieve the value besides doing a fixed
assignment from userspace. Sorry for the noise.

Thanks

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
  2018-02-08 18:24       ` [Qemu-devel] " Alex Williamson
@ 2018-02-08 20:52         ` Auger Eric
  -1 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 20:52 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

Hi Alex,

On 08/02/18 19:24, Alex Williamson wrote:
> On Thu, 8 Feb 2018 12:10:02 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 07/02/18 01:26, Alex Williamson wrote:
>>> Record data writes that come through the NVIDIA BAR0 quirk, if we get
>>> enough in a row that we're only passing through, automatically enable
>>> an ioeventfd for it.  The primary target for this is the MSI-ACK
>>> that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
>>> 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
>>> into BAR0 MMIO space.  For an interrupt latency sensitive micro-
>>> benchmark, this takes us from 83% of performance versus disabling the
>>> quirk entirely (which GeForce cannot do), to to almost 90%.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>>  hw/vfio/pci.h        |    2 +
>>>  2 files changed, 89 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>> index e4cf4ea2dd9c..e739efe601b1 100644lg  
>>
>>> --- a/hw/vfio/pci-quirks.c
>>> +++ b/hw/vfio/pci-quirks.c
>>> @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
>>>      uint32_t offset;
>>>      uint8_t bar;
>>>      MemoryRegion *mem;
>>> +    uint8_t data[];  
>> Do you foresee other usages of data besides the LastDataSet?
> 
> Sure, LastDataSet is just a very crude tracking mechanism, I could
> imagine some kind of device specific LRU array.  Any quirk wanting to
> add additional analytics on top of this generic quirk could make use of
> it.  I didn't want to pollute the generic quirk with LastDataSet since
> there are other users, but I also didn't want to complicate the shared
> free'ing path we have by simply adding a pointer.  Thus embedding quirk
> specific data is what I went with here.

OK
> 
>>>  } VFIOConfigMirrorQuirk;
>>>  
>>>  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
>>> @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>>>      g_free(ioeventfd);
>>>  }
>>>    
>> add a comment? user handler in case kvm ioeventfd setup failed?
> 
> As you correctly updated later, we're only setting up the kvm-user
> ioeventfd handler in this patch.  This is the handler when that
> succeeds.  Once we add vfio ioeventfd support, this handler will
> hopefully never get used, but we still need to support kernels that
> won't have vfio ioeventfd support and this alone does still provide a
> performance improvement.
> 
>>> +static void vfio_ioeventfd_handler(void *opaque)
>>> +{
>>> +    VFIOIOEventFD *ioeventfd = opaque;
>>> +
>>> +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
>>> +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
>>> +                          ioeventfd->data, ioeventfd->size);
>>> +    }
>>> +}
>>> +
>>> +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>>> +                                          MemoryRegion *mr, hwaddr addr,
>>> +                                          unsigned size, uint64_t data,
>>> +                                          VFIORegion *region,
>>> +                                          hwaddr region_addr)
>>> +{
>>> +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
>>> +
>>> +    if (event_notifier_init(&ioeventfd->e, 0)) {
>>> +        g_free(ioeventfd);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    ioeventfd->mr = mr;
>>> +    ioeventfd->addr = addr;
>>> +    ioeventfd->size = size;
>>> +    ioeventfd->match_data = true;
>>> +    ioeventfd->data = data;
>>> +    ioeventfd->region = region;
>>> +    ioeventfd->region_addr = region_addr;  
>> I found difficult to follow the different addr semantic.
>> I understand region_add is the offset % bar and addr is the offset %
>> mirror region. Maybe more explicit names would help (region = bar_region
>> and region_addr = bar_offset)
> 
> I was trying to avoid PCI specific fields for VFIOIOEventFD, but
> basically we're dealing with two different sets of base and offset.
> The first is the MemoryRegion of the quirk and the offset relative to
> that MemoryRegion.  Those are used by memory_region_add_eventfd() for
> establishing the kvm ioeventfd.  The second is the region and
> region_addr, which identify the vfio region and offset relative to the
> region so that we can actually handle it with vfio_region_write().
> 
> I agree though that these are confusing and maybe comments are a
> solution as I can't think of better names:
> 
>     /*
>      * MemoryRegion and relative offset, plus additional ioeventfd setup
>      * parameters for configuring and later tearing down KVM ioeventfd.
>      */
>     ioeventfd->mr = mr;
>     ioeventfd->addr = addr;
>     ioeventfd->size = size;
>     ioeventfd->match_data = true;
>     ioeventfd->data = data;
>     /*
>      * VFIORegion and relative offset for implementing the userspace
>      * handler.  data & size fields shared for both uses.
>      */
>     ioeventfd->region = region;
>     ioeventfd->region_addr = region_addr;
> 
> 
>>> +
>>> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
>>> +                        vfio_ioeventfd_handler, NULL, ioeventfd);
>>> +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
>>> +                              ioeventfd->size, ioeventfd->match_data,
>>> +                              ioeventfd->data, &ioeventfd->e);
>>> +
>>> +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
>>> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
>>> +                vdev->vbasedev.name, region->nr, region_addr, data, size);
>>> +
>>> +    return ioeventfd;
>>> +}
>>> +
>>>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>>>  {
>>>      VFIOQuirk *quirk;
>>> @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>>>      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
>>>  }
>>>  
>>> +typedef struct LastDataSet {
>>> +    hwaddr addr;
>>> +    uint64_t data;
>>> +    unsigned size;
>>> +    int count;
>>> +} LastDataSet;
>>> +
>>>  /*
>>>   * Finally, BAR0 itself.  We want to redirect any accesses to either
>>>   * 0x1800 or 0x88000 through the PCI config space access functions.
>>> @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>>      VFIOConfigMirrorQuirk *mirror = opaque;
>>>      VFIOPCIDevice *vdev = mirror->vdev;
>>>      PCIDevice *pdev = &vdev->pdev;
>>> +    LastDataSet *last = (LastDataSet *)&mirror->data;
>>>  
>>>      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
>>>  
>>> @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>>                            addr + mirror->offset, data, size);
>>>          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
>>>      }
>>> +
>>> +    /*
>>> +     * Automatically add an ioeventfd to handle any repeated write with the
>>> +     * same data and size above the standard PCI config space header.  This is
>>> +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
>>> +     * above.  Current hardware/drivers should trigger an ioeventfd at config
>>> +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
>>> +     */
>>> +    if (addr > PCI_STD_HEADER_SIZEOF) {
>>> +        if (addr != last->addr || data != last->data || size != last->size) {
>>> +            last->addr = addr;
>>> +            last->data = data;
>>> +            last->size = size;
>>> +            last->count = 1;
>>> +        } else if (++last->count > 10) {  
>> So here is the naive question about the "10" choice and also the fact
>> this needs to be consecutive accesses. I guess you observed this works
>> but at first sight this is not obvious to me.
> 
> Initially when I started working on this I used the previously known
> MSI-ACK behavior, which was a write to the MSI capability ID register,
> but I quickly figured out it didn't do anything because my current guest
> setup uses the 0x704 offset noted in the comment above.  So I continued
> developing with 0x704 hard coded.  But I have no idea if everything
> uses 0x704 now or some drivers might still use the capability ID, or
> what NVIDIA might switch to next.  So I wanted to make this dynamic and
> I figured that shouldn't anything that gets repeated calls through here
> with the same data that we're not modifying on the way to hardware
> benefit from this acceleration.  Then I started thinking about what
> frequency of write should trigger this switch-over, but the time
> component of that is hard to deal with, so I found that I could just
> drop it.  The MSI-ACK behavior means that we see a pretty continuous
> stream of writes to the same offset, same data, so that triggers right
> away.  Interestingly, after the MSI-ACK triggers the ioeventfd, we even
> add another for a relatively low frequency write.  I wouldn't have
> bothered with this one otherwise, but I don't see the harm in leaving
> it since we're setup for a datamatch we'll only trigger the ioeventfd
> if it remains the same.  So basically to answer your question, it's 10
> because it seemed like a reasonable watermark and it works.  If we had
> to, we could use some sort of LRU array, but this is simple.

OK
> 
>> Does anyone check potential overlaps between ioeventfd's [addr, addr +
>> size -1]?
> 
> Do we need to?  KVM will only trigger the ioeventfd if we match the
> address, data, and size, so AIUI we could have multiple ioeventfds
> covering the same area with different data or size and there's nothing
> wrong with that.  I wondered whether we need some sort of pruning
> algorithm, but once we've enabled kvm-vfio acceleration, QEMU no longer
> has any idea which are being used.  Maybe we could arbitrarily decide
> some fixed number limit for a device, but currently I think the limit
> is based on the user's open file limit.  Thanks,

OK. I was just thinking about a device potentially writing the same area
with different width/offset patterns. But as you said the only drawback
is to register several KVM ioeventfds and their fellow VFIO kernel
listeners for the same area.

Thanks

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks
@ 2018-02-08 20:52         ` Auger Eric
  0 siblings, 0 replies; 35+ messages in thread
From: Auger Eric @ 2018-02-08 20:52 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

Hi Alex,

On 08/02/18 19:24, Alex Williamson wrote:
> On Thu, 8 Feb 2018 12:10:02 +0100
> Auger Eric <eric.auger@redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 07/02/18 01:26, Alex Williamson wrote:
>>> Record data writes that come through the NVIDIA BAR0 quirk, if we get
>>> enough in a row that we're only passing through, automatically enable
>>> an ioeventfd for it.  The primary target for this is the MSI-ACK
>>> that NVIDIA uses to allow the MSI interrupt to re-trigger, which is a
>>> 4-byte write, data value 0x0 to offset 0x704 into the quirk, 0x88704
>>> into BAR0 MMIO space.  For an interrupt latency sensitive micro-
>>> benchmark, this takes us from 83% of performance versus disabling the
>>> quirk entirely (which GeForce cannot do), to to almost 90%.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/vfio/pci-quirks.c |   89 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>>  hw/vfio/pci.h        |    2 +
>>>  2 files changed, 89 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
>>> index e4cf4ea2dd9c..e739efe601b1 100644lg  
>>
>>> --- a/hw/vfio/pci-quirks.c
>>> +++ b/hw/vfio/pci-quirks.c
>>> @@ -203,6 +203,7 @@ typedef struct VFIOConfigMirrorQuirk {
>>>      uint32_t offset;
>>>      uint8_t bar;
>>>      MemoryRegion *mem;
>>> +    uint8_t data[];  
>> Do you foresee other usages of data besides the LastDataSet?
> 
> Sure, LastDataSet is just a very crude tracking mechanism, I could
> imagine some kind of device specific LRU array.  Any quirk wanting to
> add additional analytics on top of this generic quirk could make use of
> it.  I didn't want to pollute the generic quirk with LastDataSet since
> there are other users, but I also didn't want to complicate the shared
> free'ing path we have by simply adding a pointer.  Thus embedding quirk
> specific data is what I went with here.

OK
> 
>>>  } VFIOConfigMirrorQuirk;
>>>  
>>>  static uint64_t vfio_generic_quirk_mirror_read(void *opaque,
>>> @@ -297,6 +298,50 @@ static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
>>>      g_free(ioeventfd);
>>>  }
>>>    
>> add a comment? user handler in case kvm ioeventfd setup failed?
> 
> As you correctly updated later, we're only setting up the kvm-user
> ioeventfd handler in this patch.  This is the handler when that
> succeeds.  Once we add vfio ioeventfd support, this handler will
> hopefully never get used, but we still need to support kernels that
> won't have vfio ioeventfd support and this alone does still provide a
> performance improvement.
> 
>>> +static void vfio_ioeventfd_handler(void *opaque)
>>> +{
>>> +    VFIOIOEventFD *ioeventfd = opaque;
>>> +
>>> +    if (event_notifier_test_and_clear(&ioeventfd->e)) {
>>> +        vfio_region_write(ioeventfd->region, ioeventfd->region_addr,
>>> +                          ioeventfd->data, ioeventfd->size);
>>> +    }
>>> +}
>>> +
>>> +static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>>> +                                          MemoryRegion *mr, hwaddr addr,
>>> +                                          unsigned size, uint64_t data,
>>> +                                          VFIORegion *region,
>>> +                                          hwaddr region_addr)
>>> +{
>>> +    VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
>>> +
>>> +    if (event_notifier_init(&ioeventfd->e, 0)) {
>>> +        g_free(ioeventfd);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    ioeventfd->mr = mr;
>>> +    ioeventfd->addr = addr;
>>> +    ioeventfd->size = size;
>>> +    ioeventfd->match_data = true;
>>> +    ioeventfd->data = data;
>>> +    ioeventfd->region = region;
>>> +    ioeventfd->region_addr = region_addr;  
>> I found difficult to follow the different addr semantic.
>> I understand region_add is the offset % bar and addr is the offset %
>> mirror region. Maybe more explicit names would help (region = bar_region
>> and region_addr = bar_offset)
> 
> I was trying to avoid PCI specific fields for VFIOIOEventFD, but
> basically we're dealing with two different sets of base and offset.
> The first is the MemoryRegion of the quirk and the offset relative to
> that MemoryRegion.  Those are used by memory_region_add_eventfd() for
> establishing the kvm ioeventfd.  The second is the region and
> region_addr, which identify the vfio region and offset relative to the
> region so that we can actually handle it with vfio_region_write().
> 
> I agree though that these are confusing and maybe comments are a
> solution as I can't think of better names:
> 
>     /*
>      * MemoryRegion and relative offset, plus additional ioeventfd setup
>      * parameters for configuring and later tearing down KVM ioeventfd.
>      */
>     ioeventfd->mr = mr;
>     ioeventfd->addr = addr;
>     ioeventfd->size = size;
>     ioeventfd->match_data = true;
>     ioeventfd->data = data;
>     /*
>      * VFIORegion and relative offset for implementing the userspace
>      * handler.  data & size fields shared for both uses.
>      */
>     ioeventfd->region = region;
>     ioeventfd->region_addr = region_addr;
> 
> 
>>> +
>>> +    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
>>> +                        vfio_ioeventfd_handler, NULL, ioeventfd);
>>> +    memory_region_add_eventfd(ioeventfd->mr, ioeventfd->addr,
>>> +                              ioeventfd->size, ioeventfd->match_data,
>>> +                              ioeventfd->data, &ioeventfd->e);
>>> +
>>> +    info_report("Enabled automatic ioeventfd acceleration for %s region %d, "
>>> +                "offset 0x%"HWADDR_PRIx", data 0x%"PRIx64", size %u",
>>> +                vdev->vbasedev.name, region->nr, region_addr, data, size);
>>> +
>>> +    return ioeventfd;
>>> +}
>>> +
>>>  static void vfio_vga_probe_ati_3c3_quirk(VFIOPCIDevice *vdev)
>>>  {
>>>      VFIOQuirk *quirk;
>>> @@ -732,6 +777,13 @@ static void vfio_probe_nvidia_bar5_quirk(VFIOPCIDevice *vdev, int nr)
>>>      trace_vfio_quirk_nvidia_bar5_probe(vdev->vbasedev.name);
>>>  }
>>>  
>>> +typedef struct LastDataSet {
>>> +    hwaddr addr;
>>> +    uint64_t data;
>>> +    unsigned size;
>>> +    int count;
>>> +} LastDataSet;
>>> +
>>>  /*
>>>   * Finally, BAR0 itself.  We want to redirect any accesses to either
>>>   * 0x1800 or 0x88000 through the PCI config space access functions.
>>> @@ -742,6 +794,7 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>>      VFIOConfigMirrorQuirk *mirror = opaque;
>>>      VFIOPCIDevice *vdev = mirror->vdev;
>>>      PCIDevice *pdev = &vdev->pdev;
>>> +    LastDataSet *last = (LastDataSet *)&mirror->data;
>>>  
>>>      vfio_generic_quirk_mirror_write(opaque, addr, data, size);
>>>  
>>> @@ -756,6 +809,38 @@ static void vfio_nvidia_quirk_mirror_write(void *opaque, hwaddr addr,
>>>                            addr + mirror->offset, data, size);
>>>          trace_vfio_quirk_nvidia_bar0_msi_ack(vdev->vbasedev.name);
>>>      }
>>> +
>>> +    /*
>>> +     * Automatically add an ioeventfd to handle any repeated write with the
>>> +     * same data and size above the standard PCI config space header.  This is
>>> +     * primarily expected to accelerate the MSI-ACK behavior, such as noted
>>> +     * above.  Current hardware/drivers should trigger an ioeventfd at config
>>> +     * offset 0x704 (region offset 0x88704), with data 0x0, size 4.
>>> +     */
>>> +    if (addr > PCI_STD_HEADER_SIZEOF) {
>>> +        if (addr != last->addr || data != last->data || size != last->size) {
>>> +            last->addr = addr;
>>> +            last->data = data;
>>> +            last->size = size;
>>> +            last->count = 1;
>>> +        } else if (++last->count > 10) {  
>> So here is the naive question about the "10" choice and also the fact
>> this needs to be consecutive accesses. I guess you observed this works
>> but at first sight this is not obvious to me.
> 
> Initially when I started working on this I used the previously known
> MSI-ACK behavior, which was a write to the MSI capability ID register,
> but I quickly figured out it didn't do anything because my current guest
> setup uses the 0x704 offset noted in the comment above.  So I continued
> developing with 0x704 hard coded.  But I have no idea if everything
> uses 0x704 now or some drivers might still use the capability ID, or
> what NVIDIA might switch to next.  So I wanted to make this dynamic and
> I figured that shouldn't anything that gets repeated calls through here
> with the same data that we're not modifying on the way to hardware
> benefit from this acceleration.  Then I started thinking about what
> frequency of write should trigger this switch-over, but the time
> component of that is hard to deal with, so I found that I could just
> drop it.  The MSI-ACK behavior means that we see a pretty continuous
> stream of writes to the same offset, same data, so that triggers right
> away.  Interestingly, after the MSI-ACK triggers the ioeventfd, we even
> add another for a relatively low frequency write.  I wouldn't have
> bothered with this one otherwise, but I don't see the harm in leaving
> it since we're setup for a datamatch we'll only trigger the ioeventfd
> if it remains the same.  So basically to answer your question, it's 10
> because it seemed like a reasonable watermark and it works.  If we had
> to, we could use some sort of LRU array, but this is simple.

OK
> 
>> Does anyone check potential overlaps between ioeventfd's [addr, addr +
>> size -1]?
> 
> Do we need to?  KVM will only trigger the ioeventfd if we match the
> address, data, and size, so AIUI we could have multiple ioeventfds
> covering the same area with different data or size and there's nothing
> wrong with that.  I wondered whether we need some sort of pruning
> algorithm, but once we've enabled kvm-vfio acceleration, QEMU no longer
> has any idea which are being used.  Maybe we could arbitrarily decide
> some fixed number limit for a device, but currently I think the limit
> is based on the user's open file limit.  Thanks,

OK. I was just thinking about a device potentially writing the same area
with different width/offset patterns. But as you said the only drawback
is to register several KVM ioeventfds and their fellow VFIO kernel
listeners for the same area.

Thanks

Eric
> 
> Alex
> 

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
  2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
  (?)
  (?)
@ 2018-02-09  7:11   ` Peter Xu
  2018-02-09 22:09     ` Alex Williamson
  -1 siblings, 1 reply; 35+ messages in thread
From: Peter Xu @ 2018-02-09  7:11 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Tue, Feb 06, 2018 at 05:26:46PM -0700, Alex Williamson wrote:
> With vfio ioeventfd support, we can program vfio-pci to perform a
> specified BAR write when an eventfd is triggered.  This allows the
> KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> userspace handling for these events.  On the same micro-benchmark
> where the ioeventfd got us to almost 90% of performance versus
> disabling the GeForce quirks, this gets us to within 95%.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
>  1 file changed, 36 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> index e739efe601b1..35a4d5197e2d 100644
> --- a/hw/vfio/pci-quirks.c
> +++ b/hw/vfio/pci-quirks.c
> @@ -16,6 +16,7 @@
>  #include "qemu/range.h"
>  #include "qapi/error.h"
>  #include "qapi/visitor.h"
> +#include <sys/ioctl.h>
>  #include "hw/nvram/fw_cfg.h"
>  #include "pci.h"
>  #include "trace.h"
> @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
>      return quirk;
>  }
>  
> -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
>  {
> +    struct vfio_device_ioeventfd vfio_ioeventfd;
> +
>      QLIST_REMOVE(ioeventfd, next);
> +
>      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
>                                ioeventfd->match_data, ioeventfd->data,
>                                &ioeventfd->e);
> +
>      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> +
> +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> +    vfio_ioeventfd.flags = ioeventfd->size;
> +    vfio_ioeventfd.data = ioeventfd->data;
> +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> +                            ioeventfd->region_addr;
> +    vfio_ioeventfd.fd = -1;
> +
> +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> +
>      event_notifier_cleanup(&ioeventfd->e);
>      g_free(ioeventfd);
>  }
> @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>                                            hwaddr region_addr)
>  {
>      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> +    struct vfio_device_ioeventfd vfio_ioeventfd;
> +    char vfio_enabled = '+';
>  
>      if (event_notifier_init(&ioeventfd->e, 0)) {
>          g_free(ioeventfd);
> @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
>      ioeventfd->region = region;
>      ioeventfd->region_addr = region_addr;
>  
> -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> +    vfio_ioeventfd.flags = ioeventfd->size;
> +    vfio_ioeventfd.data = ioeventfd->data;
> +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> +                            ioeventfd->region_addr;
> +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> +
> +    if (ioctl(vdev->vbasedev.fd,
> +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> +        vfio_enabled = '-';

Would the performance be even slower if a new QEMU runs on a old
kernel due to these ioeventfds (MMIO -> eventfd -> same MMIO again)?
If so, shall we only enable this ioeventfd enhancement only if we
detected that the kernel supports this new feature (assuming this
feature bit won't change after VM starts)?

Then, could we avoid the slow path at all (qemu_set_fd_handler)?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
  2018-02-09  7:11   ` Peter Xu
@ 2018-02-09 22:09     ` Alex Williamson
  2018-02-11  2:38       ` Peter Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Alex Williamson @ 2018-02-09 22:09 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, kvm

On Fri, 9 Feb 2018 15:11:45 +0800
Peter Xu <peterx@redhat.com> wrote:

> On Tue, Feb 06, 2018 at 05:26:46PM -0700, Alex Williamson wrote:
> > With vfio ioeventfd support, we can program vfio-pci to perform a
> > specified BAR write when an eventfd is triggered.  This allows the
> > KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> > userspace handling for these events.  On the same micro-benchmark
> > where the ioeventfd got us to almost 90% of performance versus
> > disabling the GeForce quirks, this gets us to within 95%.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 36 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > index e739efe601b1..35a4d5197e2d 100644
> > --- a/hw/vfio/pci-quirks.c
> > +++ b/hw/vfio/pci-quirks.c
> > @@ -16,6 +16,7 @@
> >  #include "qemu/range.h"
> >  #include "qapi/error.h"
> >  #include "qapi/visitor.h"
> > +#include <sys/ioctl.h>
> >  #include "hw/nvram/fw_cfg.h"
> >  #include "pci.h"
> >  #include "trace.h"
> > @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> >      return quirk;
> >  }
> >  
> > -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> > +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
> >  {
> > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > +
> >      QLIST_REMOVE(ioeventfd, next);
> > +
> >      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> >                                ioeventfd->match_data, ioeventfd->data,
> >                                &ioeventfd->e);
> > +
> >      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> > +
> > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > +    vfio_ioeventfd.flags = ioeventfd->size;
> > +    vfio_ioeventfd.data = ioeventfd->data;
> > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > +                            ioeventfd->region_addr;
> > +    vfio_ioeventfd.fd = -1;
> > +
> > +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> > +
> >      event_notifier_cleanup(&ioeventfd->e);
> >      g_free(ioeventfd);
> >  }
> > @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> >                                            hwaddr region_addr)
> >  {
> >      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > +    char vfio_enabled = '+';
> >  
> >      if (event_notifier_init(&ioeventfd->e, 0)) {
> >          g_free(ioeventfd);
> > @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> >      ioeventfd->region = region;
> >      ioeventfd->region_addr = region_addr;
> >  
> > -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > +    vfio_ioeventfd.flags = ioeventfd->size;
> > +    vfio_ioeventfd.data = ioeventfd->data;
> > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > +                            ioeventfd->region_addr;
> > +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> > +
> > +    if (ioctl(vdev->vbasedev.fd,
> > +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> > +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> > +        vfio_enabled = '-';  
> 
> Would the performance be even slower if a new QEMU runs on a old
> kernel due to these ioeventfds (MMIO -> eventfd -> same MMIO again)?
> If so, shall we only enable this ioeventfd enhancement only if we
> detected that the kernel supports this new feature (assuming this
> feature bit won't change after VM starts)?

No, it's actually still a significant improvement to enable the KVM
ioeventfd even if we can't enable vfio.  My testing shows that the KVM
ioeventfd alone accounts for slightly more than half of the total
improvement, so I don't see any reason to restrict this to depending on
both ends being available.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [Qemu-devel] [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly
  2018-02-09 22:09     ` Alex Williamson
@ 2018-02-11  2:38       ` Peter Xu
  0 siblings, 0 replies; 35+ messages in thread
From: Peter Xu @ 2018-02-11  2:38 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, kvm

On Fri, Feb 09, 2018 at 03:09:33PM -0700, Alex Williamson wrote:
> On Fri, 9 Feb 2018 15:11:45 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Tue, Feb 06, 2018 at 05:26:46PM -0700, Alex Williamson wrote:
> > > With vfio ioeventfd support, we can program vfio-pci to perform a
> > > specified BAR write when an eventfd is triggered.  This allows the
> > > KVM ioeventfd to be wired directly to vfio-pci, entirely avoiding
> > > userspace handling for these events.  On the same micro-benchmark
> > > where the ioeventfd got us to almost 90% of performance versus
> > > disabling the GeForce quirks, this gets us to within 95%.
> > > 
> > > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > > ---
> > >  hw/vfio/pci-quirks.c |   42 ++++++++++++++++++++++++++++++++++++------
> > >  1 file changed, 36 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
> > > index e739efe601b1..35a4d5197e2d 100644
> > > --- a/hw/vfio/pci-quirks.c
> > > +++ b/hw/vfio/pci-quirks.c
> > > @@ -16,6 +16,7 @@
> > >  #include "qemu/range.h"
> > >  #include "qapi/error.h"
> > >  #include "qapi/visitor.h"
> > > +#include <sys/ioctl.h>
> > >  #include "hw/nvram/fw_cfg.h"
> > >  #include "pci.h"
> > >  #include "trace.h"
> > > @@ -287,13 +288,27 @@ static VFIOQuirk *vfio_quirk_alloc(int nr_mem)
> > >      return quirk;
> > >  }
> > >  
> > > -static void vfio_ioeventfd_exit(VFIOIOEventFD *ioeventfd)
> > > +static void vfio_ioeventfd_exit(VFIOPCIDevice *vdev, VFIOIOEventFD *ioeventfd)
> > >  {
> > > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > > +
> > >      QLIST_REMOVE(ioeventfd, next);
> > > +
> > >      memory_region_del_eventfd(ioeventfd->mr, ioeventfd->addr, ioeventfd->size,
> > >                                ioeventfd->match_data, ioeventfd->data,
> > >                                &ioeventfd->e);
> > > +
> > >      qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e), NULL, NULL, NULL);
> > > +
> > > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > > +    vfio_ioeventfd.flags = ioeventfd->size;
> > > +    vfio_ioeventfd.data = ioeventfd->data;
> > > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > > +                            ioeventfd->region_addr;
> > > +    vfio_ioeventfd.fd = -1;
> > > +
> > > +    ioctl(vdev->vbasedev.fd, VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd);
> > > +
> > >      event_notifier_cleanup(&ioeventfd->e);
> > >      g_free(ioeventfd);
> > >  }
> > > @@ -315,6 +330,8 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> > >                                            hwaddr region_addr)
> > >  {
> > >      VFIOIOEventFD *ioeventfd = g_malloc0(sizeof(*ioeventfd));
> > > +    struct vfio_device_ioeventfd vfio_ioeventfd;
> > > +    char vfio_enabled = '+';
> > >  
> > >      if (event_notifier_init(&ioeventfd->e, 0)) {
> > >          g_free(ioeventfd);
> > > @@ -329,15 +346,28 @@ static VFIOIOEventFD *vfio_ioeventfd_init(VFIOPCIDevice *vdev,
> > >      ioeventfd->region = region;
> > >      ioeventfd->region_addr = region_addr;
> > >  
> > > -    qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > > -                        vfio_ioeventfd_handler, NULL, ioeventfd);
> > > +    vfio_ioeventfd.argsz = sizeof(vfio_ioeventfd);
> > > +    vfio_ioeventfd.flags = ioeventfd->size;
> > > +    vfio_ioeventfd.data = ioeventfd->data;
> > > +    vfio_ioeventfd.offset = ioeventfd->region->fd_offset +
> > > +                            ioeventfd->region_addr;
> > > +    vfio_ioeventfd.fd = event_notifier_get_fd(&ioeventfd->e);
> > > +
> > > +    if (ioctl(vdev->vbasedev.fd,
> > > +              VFIO_DEVICE_IOEVENTFD, &vfio_ioeventfd) != 0) {
> > > +        qemu_set_fd_handler(event_notifier_get_fd(&ioeventfd->e),
> > > +                            vfio_ioeventfd_handler, NULL, ioeventfd);
> > > +        vfio_enabled = '-';  
> > 
> > Would the performance be even slower if a new QEMU runs on a old
> > kernel due to these ioeventfds (MMIO -> eventfd -> same MMIO again)?
> > If so, shall we only enable this ioeventfd enhancement only if we
> > detected that the kernel supports this new feature (assuming this
> > feature bit won't change after VM starts)?
> 
> No, it's actually still a significant improvement to enable the KVM
> ioeventfd even if we can't enable vfio.  My testing shows that the KVM
> ioeventfd alone accounts for slightly more than half of the total
> improvement, so I don't see any reason to restrict this to depending on
> both ends being available.  Thanks,

The numbers (83%->90%->95%) were mentioned in different patches but I
didn't really catch all of them.  Sorry.

And obviously the userspace code path is different, which I missed
too.  And it makes sense that ioeventfd should always be faster.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2018-02-11  2:38 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-07  0:26 [RFC PATCH 0/5] vfio: ioeventfd support Alex Williamson
2018-02-07  0:26 ` [Qemu-devel] " Alex Williamson
2018-02-07  0:26 ` [RFC PATCH 1/5] vfio/quirks: Add common quirk alloc helper Alex Williamson
2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
2018-02-08 11:10   ` Auger Eric
2018-02-08 11:10     ` [Qemu-devel] " Auger Eric
2018-02-08 18:28     ` Alex Williamson
2018-02-08 18:28       ` [Qemu-devel] " Alex Williamson
2018-02-07  0:26 ` [RFC PATCH 2/5] vfio/quirks: Add generic support for ioveventfds Alex Williamson
2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
2018-02-08 11:11   ` Auger Eric
2018-02-08 11:11     ` [Qemu-devel] " Auger Eric
2018-02-08 18:33     ` Alex Williamson
2018-02-08 18:33       ` [Qemu-devel] " Alex Williamson
2018-02-08 20:37       ` Auger Eric
2018-02-07  0:26 ` [RFC PATCH 3/5] vfio/quirks: Automatic ioeventfd enabling for NVIDIA BAR0 quirks Alex Williamson
2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
2018-02-08 11:10   ` Auger Eric
2018-02-08 11:10     ` [Qemu-devel] " Auger Eric
2018-02-08 11:33     ` Auger Eric
2018-02-08 18:24     ` Alex Williamson
2018-02-08 18:24       ` [Qemu-devel] " Alex Williamson
2018-02-08 20:52       ` Auger Eric
2018-02-08 20:52         ` [Qemu-devel] " Auger Eric
2018-02-07  0:26 ` [RFC PATCH 4/5] vfio: Update linux header Alex Williamson
2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
2018-02-07  0:26 ` [RFC PATCH 5/5] vfio/quirks: Enable ioeventfd quirks to be handled by vfio directly Alex Williamson
2018-02-07  0:26   ` [Qemu-devel] " Alex Williamson
2018-02-08 11:42   ` Auger Eric
2018-02-08 11:42     ` [Qemu-devel] " Auger Eric
2018-02-08 18:41     ` Alex Williamson
2018-02-08 18:41       ` [Qemu-devel] " Alex Williamson
2018-02-09  7:11   ` Peter Xu
2018-02-09 22:09     ` Alex Williamson
2018-02-11  2:38       ` Peter Xu

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.