All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation
@ 2017-12-18  5:02 Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 1/5] vfio/pci: Fixup VFIOMSIXInfo comment Alex Williamson
                   ` (5 more replies)
  0 siblings, 6 replies; 20+ messages in thread
From: Alex Williamson @ 2017-12-18  5:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, eric.auger

Currently the kernel vfio-pci interface disallows users to mmap over
the MSI-X MMIO areas within a vfio region, however there's a proposal
to enable mmap of this area to better support SPAPR.  The kernel
change benefits those systems because they use a 64K system page size,
such that disallowing direct mapping of the MSI-X MMIO space also
potentially disallows direct access to other device registers within
that page.  Additionally, those systems make use of hypercalls rather
than VM access to the MSI-X MMIO space for programming interrupts and
can therefore disable traps into QEMU for MSI-X MMIO emulation.

Other platforms, like ARM64, can also use 64K pages and therefore
may also experience performance issues on devices where performance
critical device registers are within the same page as MSI-X MMIO.
These systems can also take advantage of the kernel allowing mmap of
the device MSI-X MMIO space, but in order to avoid traps to this
page, we also need to move (relocate) MSI-X MMIO to avoid any chance
of interference.  This series adds the option 'x-msix-relocation='
to the vfio-pci device, which accepts values of 'off' (default),
'auto', and 'bar0' through 'bar5'.  The default is expected to have
full device and guest OS compatibility as it leaves the MSI-X MMIO
space at the native offset of the device.  The 'auto' option will
automatically relocate MSI-X MMIO space using an algorithm that
prefers the least additional MMIO space for the device, by either
adding a new BAR or extending existing BARs.  Finally, specifying a
specific BAR allows the user to choose where to add MSI-X MMIO.

I've made this new option experimental here, because we don't know
that device drivers across all guests will be compatible with this
change.  It's possible that some drivers might hard code addresses.
Any Linux driver following standard programming should be compatible
with this change.  There are also devices which don't need this
modification and enabling it by default would only serve to increase
the MMIO requirements of the device.  An example of this is the
Intel 82576 PF NIC, where BAR3 is sized at 16K and hosts the MSI-X
vector table at offset 0 and the PBA at offset 8K.  The datasheet
indicates this BAR is exclusively for MSI-X, therefore it's pointless
to relocate it.  Perhaps we'll eventually develop vendor or device
checks where it makes sense to enable this automatically when running
with large system page sizes.

x86 is of course largely immune to this problem since the system page
size is always 4K, which falls within the design recommendations in
the PCI spec for alignment of MSI-X registers from other registers,
but as this is only a recommendation, there may exist devices for
which this change could also be useful on 4K hosts.

Testing for this series can be done on any kernel, but one allowing
mmap of the MSI-X MMIO space is required to evaluate any performance
difference.  This requires Alexey's patch here:
http://www.spinics.net/lists/kvm/msg160605.html
Which depends on my patch:
https://lkml.org/lkml/2017/12/12/1083
Also, disabling hw/vfio/pci.c:vfio_pci_fixup_msix_region() is
necessary until we have some funtional proposals for QEMU making
use of the new capability exported by the kernel.  I welcome feedback
and/or test reports.  Thanks,

Alex

---

Alex Williamson (5):
      vfio/pci: Fixup VFIOMSIXInfo comment
      vfio/pci: Add base BAR MemoryRegion
      vfio/pci: Emulate BARs
      qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR
      vfio/pci: Allow relocating MSI-X MMIO


 hw/core/qdev-properties.c    |   11 +++
 hw/vfio/pci.c                |  176 ++++++++++++++++++++++++++++++++++++++----
 hw/vfio/pci.h                |    6 +
 hw/vfio/trace-events         |    2 
 include/hw/qdev-properties.h |    4 +
 qapi/common.json             |   26 ++++++
 6 files changed, 207 insertions(+), 18 deletions(-)

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC/RFT PATCH 1/5] vfio/pci: Fixup VFIOMSIXInfo comment
  2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
@ 2017-12-18  5:02 ` Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion Alex Williamson
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2017-12-18  5:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, eric.auger

The fields were removed in the referenced commit, but the comment
still mentions them.

Fixes: 2fb9636ebf24 ("vfio-pci: Remove unused fields from VFIOMSIXInfo")
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index a8fb3b34222c..3d753222ca4c 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -86,7 +86,7 @@ enum {
     VFIO_INT_MSIX = 3,
 };
 
-/* Cache of MSI-X setup plus extra mmap and memory region for split BAR map */
+/* Cache of MSI-X setup */
 typedef struct VFIOMSIXInfo {
     uint8_t table_bar;
     uint8_t pba_bar;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion
  2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 1/5] vfio/pci: Fixup VFIOMSIXInfo comment Alex Williamson
@ 2017-12-18  5:02 ` Alex Williamson
  2017-12-19  3:44   ` Alexey Kardashevskiy
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 3/5] vfio/pci: Emulate BARs Alex Williamson
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2017-12-18  5:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, eric.auger

Add one more layer to our stack of MemoryRegions, this base region
allows us to register BARs independently of the vfio region or to
extend the size of BARs which do map to a region.  This will be
useful when we want hypervisor defined BARs or sections of BARs,
for purposes such as relocating MSI-X emulation.  We therefore call
msix_init() based on this new base MemoryRegion, while the quirks,
which only modify regions still operate on those sub-MemoryRegions.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c |   74 ++++++++++++++++++++++++++++++++++++++++++++-------------
 hw/vfio/pci.h |    3 ++
 2 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c977ee327f94..8f46fdd1d391 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1440,9 +1440,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
     vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
                                     sizeof(unsigned long));
     ret = msix_init(&vdev->pdev, vdev->msix->entries,
-                    vdev->bars[vdev->msix->table_bar].region.mem,
+                    vdev->bars[vdev->msix->table_bar].mr,
                     vdev->msix->table_bar, vdev->msix->table_offset,
-                    vdev->bars[vdev->msix->pba_bar].region.mem,
+                    vdev->bars[vdev->msix->pba_bar].mr,
                     vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
                     &err);
     if (ret < 0) {
@@ -1482,8 +1482,8 @@ static void vfio_teardown_msi(VFIOPCIDevice *vdev)
 
     if (vdev->msix) {
         msix_uninit(&vdev->pdev,
-                    vdev->bars[vdev->msix->table_bar].region.mem,
-                    vdev->bars[vdev->msix->pba_bar].region.mem);
+                    vdev->bars[vdev->msix->table_bar].mr,
+                    vdev->bars[vdev->msix->pba_bar].mr);
         g_free(vdev->msix->pending);
     }
 }
@@ -1500,12 +1500,11 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
     }
 }
 
-static void vfio_bar_setup(VFIOPCIDevice *vdev, int nr)
+static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
 {
     VFIOBAR *bar = &vdev->bars[nr];
 
     uint32_t pci_bar;
-    uint8_t type;
     int ret;
 
     /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
@@ -1524,23 +1523,52 @@ static void vfio_bar_setup(VFIOPCIDevice *vdev, int nr)
     pci_bar = le32_to_cpu(pci_bar);
     bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
     bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
-    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
-                                    ~PCI_BASE_ADDRESS_MEM_MASK);
+    bar->type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
+                                         ~PCI_BASE_ADDRESS_MEM_MASK);
+    bar->size = bar->region.size;
+}
 
-    if (vfio_region_mmap(&bar->region)) {
-        error_report("Failed to mmap %s BAR %d. Performance may be slow",
-                     vdev->vbasedev.name, nr);
+static void vfio_bars_prepare(VFIOPCIDevice *vdev)
+{
+    int i;
+
+    for (i = 0; i < PCI_ROM_SLOT; i++) {
+        vfio_bar_prepare(vdev, i);
     }
+}
 
-    pci_register_bar(&vdev->pdev, nr, type, bar->region.mem);
+static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
+{
+    VFIOBAR *bar = &vdev->bars[nr];
+    char *name;
+
+    if (!bar->size) {
+        return;
+    }
+
+    bar->mr = g_new0(MemoryRegion, 1);
+    name = g_strdup_printf("%s base BAR %d", vdev->vbasedev.name, nr);
+    memory_region_init_io(bar->mr, OBJECT(vdev), NULL, NULL, name, bar->size);
+    g_free(name);
+
+    if (bar->region.size) {
+        memory_region_add_subregion(bar->mr, 0, bar->region.mem);
+
+        if (vfio_region_mmap(&bar->region)) {
+            error_report("Failed to mmap %s BAR %d. Performance may be slow",
+                         vdev->vbasedev.name, nr);
+        }
+    }
+
+    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
 }
 
-static void vfio_bars_setup(VFIOPCIDevice *vdev)
+static void vfio_bars_register(VFIOPCIDevice *vdev)
 {
     int i;
 
     for (i = 0; i < PCI_ROM_SLOT; i++) {
-        vfio_bar_setup(vdev, i);
+        vfio_bar_register(vdev, i);
     }
 }
 
@@ -1549,8 +1577,13 @@ static void vfio_bars_exit(VFIOPCIDevice *vdev)
     int i;
 
     for (i = 0; i < PCI_ROM_SLOT; i++) {
+        VFIOBAR *bar = &vdev->bars[i];
+
         vfio_bar_quirk_exit(vdev, i);
-        vfio_region_exit(&vdev->bars[i].region);
+        vfio_region_exit(&bar->region);
+        if (bar->region.size) {
+            memory_region_del_subregion(bar->mr, bar->region.mem);
+        }
     }
 
     if (vdev->vga) {
@@ -1564,8 +1597,14 @@ static void vfio_bars_finalize(VFIOPCIDevice *vdev)
     int i;
 
     for (i = 0; i < PCI_ROM_SLOT; i++) {
+        VFIOBAR *bar = &vdev->bars[i];
+
         vfio_bar_quirk_finalize(vdev, i);
-        vfio_region_finalize(&vdev->bars[i].region);
+        vfio_region_finalize(&bar->region);
+        if (bar->size) {
+           object_unparent(OBJECT(bar->mr));
+           g_free(bar->mr);
+        }
     }
 
     if (vdev->vga) {
@@ -2810,7 +2849,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
         goto error;
     }
 
-    vfio_bars_setup(vdev);
+    vfio_bars_prepare(vdev);
+    vfio_bars_register(vdev);
 
     ret = vfio_add_capabilities(vdev, errp);
     if (ret) {
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 3d753222ca4c..dcdb1a806769 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -33,6 +33,9 @@ typedef struct VFIOQuirk {
 
 typedef struct VFIOBAR {
     VFIORegion region;
+    MemoryRegion *mr;
+    size_t size;
+    uint8_t type;
     bool ioport;
     bool mem64;
     QLIST_HEAD(, VFIOQuirk) quirks;

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC/RFT PATCH 3/5] vfio/pci: Emulate BARs
  2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 1/5] vfio/pci: Fixup VFIOMSIXInfo comment Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion Alex Williamson
@ 2017-12-18  5:02 ` Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 4/5] qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR Alex Williamson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2017-12-18  5:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, eric.auger

The kernel provides similar emulation of PCI BAR register access to
QEMU, so up until now we've used that for things like BAR sizing and
storing the BAR address.  However, if we intend to resize BARs or add
BARs that don't exist on the physical device, we need to switch to the
pure QEMU emulation of the BAR.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8f46fdd1d391..c383b842da20 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2773,6 +2773,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     /* QEMU can choose to expose the ROM or not */
     memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
+    /* QEMU can also add or extend BARs */
+    memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
 
     /*
      * The PCI spec reserves vendor ID 0xffff as an invalid value.  The

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC/RFT PATCH 4/5] qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR
  2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
                   ` (2 preceding siblings ...)
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 3/5] vfio/pci: Emulate BARs Alex Williamson
@ 2017-12-18  5:02 ` Alex Williamson
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO Alex Williamson
  2017-12-18  5:34 ` [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation no-reply
  5 siblings, 0 replies; 20+ messages in thread
From: Alex Williamson @ 2017-12-18  5:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, eric.auger

Add an option which allows the user to specify a PCI BAR number,
including an 'off' and 'auto' selection.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/core/qdev-properties.c    |   11 +++++++++++
 include/hw/qdev-properties.h |    4 ++++
 qapi/common.json             |   26 ++++++++++++++++++++++++++
 3 files changed, 41 insertions(+)

diff --git a/hw/core/qdev-properties.c b/hw/core/qdev-properties.c
index 1dc80fcea2af..e33184e5a342 100644
--- a/hw/core/qdev-properties.c
+++ b/hw/core/qdev-properties.c
@@ -1256,3 +1256,14 @@ const PropertyInfo qdev_prop_link = {
     .name = "link",
     .create = create_link_property,
 };
+
+/* --- OffAutoPCIBAR off/auto/bar0/bar1/bar2/bar3/bar4/bar5 --- */
+
+const PropertyInfo qdev_prop_off_auto_pcibar = {
+    .name = "OffAutoPCIBAR",
+    .description = "off/auto/bar0/bar1/bar2/bar3/bar4/bar5",
+    .enum_table = &OffAutoPCIBAR_lookup,
+    .get = get_enum,
+    .set = set_enum,
+    .set_default_value = set_default_value_enum,
+};
diff --git a/include/hw/qdev-properties.h b/include/hw/qdev-properties.h
index e2321f1cc1ec..9ca2af5e3625 100644
--- a/include/hw/qdev-properties.h
+++ b/include/hw/qdev-properties.h
@@ -32,6 +32,7 @@ extern const PropertyInfo qdev_prop_blocksize;
 extern const PropertyInfo qdev_prop_pci_host_devaddr;
 extern const PropertyInfo qdev_prop_arraylen;
 extern const PropertyInfo qdev_prop_link;
+extern const PropertyInfo qdev_prop_off_auto_pcibar;
 
 #define DEFINE_PROP(_name, _state, _field, _prop, _type) { \
         .name      = (_name),                                    \
@@ -212,6 +213,9 @@ extern const PropertyInfo qdev_prop_link;
     DEFINE_PROP(_n, _s, _f, qdev_prop_pci_host_devaddr, PCIHostDeviceAddress)
 #define DEFINE_PROP_MEMORY_REGION(_n, _s, _f)             \
     DEFINE_PROP(_n, _s, _f, qdev_prop_ptr, MemoryRegion *)
+#define DEFINE_PROP_OFF_AUTO_PCIBAR(_n, _s, _f, _d) \
+    DEFINE_PROP_SIGNED(_n, _s, _f, _d, qdev_prop_off_auto_pcibar, \
+                        OffAutoPCIBAR)
 
 #define DEFINE_PROP_END_OF_LIST()               \
     {}
diff --git a/qapi/common.json b/qapi/common.json
index 6eb01821ef59..d9b14dd429f3 100644
--- a/qapi/common.json
+++ b/qapi/common.json
@@ -100,3 +100,29 @@
 { 'alternate': 'StrOrNull',
   'data': { 's': 'str',
             'n': 'null' } }
+
+##
+# @OffAutoPCIBAR:
+#
+# An enumeration of options for specifying a PCI BAR
+#
+# @off: The specified feature is disabled
+#
+# @auto: The PCI BAR for the feature is automatically selected
+#
+# @bar0: PCI BAR0 is used for the feature
+#
+# @bar1: PCI BAR1 is used for the feature
+#
+# @bar2: PCI BAR2 is used for the feature
+#
+# @bar3: PCI BAR3 is used for the feature
+#
+# @bar4: PCI BAR4 is used for the feature
+#
+# @bar5: PCI BAR5 is used for the feature
+#
+# Since: 2.12
+##
+{ 'enum': 'OffAutoPCIBAR',
+  'data': [ 'off', 'auto', 'bar0', 'bar1', 'bar2', 'bar3', 'bar4', 'bar5' ] }

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
                   ` (3 preceding siblings ...)
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 4/5] qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR Alex Williamson
@ 2017-12-18  5:02 ` Alex Williamson
  2017-12-18  9:04   ` Alexey Kardashevskiy
  2017-12-19  3:07   ` Alexey Kardashevskiy
  2017-12-18  5:34 ` [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation no-reply
  5 siblings, 2 replies; 20+ messages in thread
From: Alex Williamson @ 2017-12-18  5:02 UTC (permalink / raw)
  To: qemu-devel; +Cc: alex.williamson, eric.auger

With recently proposed kernel side vfio-pci changes, the MSI-X vector
table area can be mmap'd from userspace, allowing direct access to
non-MSI-X registers within the host page size of this area.  However,
we only get that direct access if QEMU isn't also emulating MSI-X
within that same page.  For x86/64 host, the system page size is 4K
and the PCI spec recommends a minimum of 4K to 8K alignment to
separate MSI-X from non-MSI-X registers, therefore only devices which
don't honor this recommendation would see any improvement from this
option.  The real targets for this feature are hosts where the page
size exceeds the PCI spec recommended alignment, such as ARM64 systems
with 64K pages.

This new x-msix-relocation option accepts the following options:

  off: Disable MSI-X relocation, use native device config (default)
  auto: Automaically relocate MSI-X MMIO to another BAR or offset
       based on minimum additional MMIO requirement
  bar0..bar5: Specify the target BAR, which will either be extended
       if the BAR exists or added if the BAR slot is available.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/pci.h        |    1 
 hw/vfio/trace-events |    2 +
 3 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index c383b842da20..b4426abf297a 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
     }
 }
 
+static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
+{
+    int target_bar = -1;
+    size_t msix_sz;
+
+    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
+        return;
+    }
+
+    /* The actual minimum size of MSI-X structures */
+    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
+              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
+    /* Round up to host pages, we don't want to share a page */
+    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
+    /* PCI BARs must be a power of 2 */
+    msix_sz = pow2ceil(msix_sz);
+
+    /* Auto: pick the BAR that incurs the least additional MMIO space */
+    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
+        int i;
+        size_t best = UINT64_MAX;
+
+        for (i = 0; i < PCI_ROM_SLOT; i++) {
+            size_t size;
+
+            if (vdev->bars[i].ioport) {
+                continue;
+            }
+
+            /* MSI-X MMIO must reside within first 32bit offset of BAR */
+            if (vdev->bars[i].size > (UINT32_MAX / 2))
+                continue;
+
+            /*
+             * Must be pow2, so larger of double existing or double msix_sz,
+             * or if BAR unimplemented, msix_sz
+             */
+            size = MAX(vdev->bars[i].size * 2,
+                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
+
+            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
+
+            if (size < best) {
+                best = size;
+                target_bar = i;
+            }
+
+            if (vdev->bars[i].mem64) {
+              i++;
+            }
+        }
+    } else {
+        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
+    }
+
+    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
+        (!vdev->bars[target_bar].size &&
+         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
+        return; /* Go BOOM?  Plumb Error */
+    }
+
+    /*
+     * If adding a new BAR, test if we can make it 64bit.  We make it
+     * prefetchable since QEMU MSI-X emulation has no read side effects
+     * and doing so makes mapping more flexible.
+     */
+    if (!vdev->bars[target_bar].size) {
+        if (target_bar < (PCI_ROM_SLOT - 1) &&
+            !vdev->bars[target_bar + 1].size) {
+            vdev->bars[target_bar].mem64 = true;
+            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
+        }
+        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
+        vdev->bars[target_bar].size = msix_sz;
+        vdev->msix->table_offset = 0;
+    } else {
+        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
+                                          msix_sz * 2);
+        /*
+         * Due to above size calc, MSI-X always starts halfway into the BAR,
+         * which will always be a separate host page.
+         */
+        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
+    }
+
+    vdev->msix->table_bar = target_bar;
+    vdev->msix->pba_bar = target_bar;
+    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
+    vdev->msix->pba_offset = vdev->msix->table_offset +
+                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
+
+    trace_vfio_msix_relo(vdev->vbasedev.name,
+                         vdev->msix->table_bar, vdev->msix->table_offset);
+}
+
 /*
  * We don't have any control over how pci_add_capability() inserts
  * capabilities into the chain.  In order to setup MSI-X we need a
@@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
     vdev->msix = msix;
 
     vfio_pci_fixup_msix_region(vdev);
+
+    vfio_pci_relocate_msix(vdev);
 }
 
 static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
@@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
 
     vfio_pci_size_rom(vdev);
 
+    vfio_bars_prepare(vdev);
+
     vfio_msix_early_setup(vdev, &err);
     if (err) {
         error_propagate(errp, err);
         goto error;
     }
 
-    vfio_bars_prepare(vdev);
     vfio_bars_register(vdev);
 
     ret = vfio_add_capabilities(vdev, errp);
@@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
     DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
                                    nv_gpudirect_clique,
                                    qdev_prop_nv_gpudirect_clique, uint8_t),
+    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
+                                OFF_AUTOPCIBAR_OFF),
     /*
      * TODO - support passed fds... is this necessary?
      * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index dcdb1a806769..588381f201b4 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
                                 (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
     int32_t bootindex;
     uint32_t igd_gms;
+    OffAutoPCIBAR msix_relo;
     uint8_t pm_cap;
     uint8_t nv_gpudirect_clique;
     bool pci_aer;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index fae096c0724f..437ccdd29053 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
 vfio_msix_pba_enable(const char *name) " (%s)"
 vfio_msix_disable(const char *name) " (%s)"
 vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
+vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
+vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
 vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
 vfio_msi_disable(const char *name) " (%s)"
 vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation
  2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
                   ` (4 preceding siblings ...)
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO Alex Williamson
@ 2017-12-18  5:34 ` no-reply
  5 siblings, 0 replies; 20+ messages in thread
From: no-reply @ 2017-12-18  5:34 UTC (permalink / raw)
  To: alex.williamson; +Cc: famz, qemu-devel, eric.auger

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20171218040852.13478.19208.stgit@gimli.home
Subject: [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
Switched to a new branch 'test'
3a11d5cf58 vfio/pci: Allow relocating MSI-X MMIO
2b0d6c16be qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR
e4576d74be vfio/pci: Emulate BARs
51a4ffc4de vfio/pci: Add base BAR MemoryRegion
b98cf31ab0 vfio/pci: Fixup VFIOMSIXInfo comment

=== OUTPUT BEGIN ===
Checking PATCH 1/5: vfio/pci: Fixup VFIOMSIXInfo comment...
Checking PATCH 2/5: vfio/pci: Add base BAR MemoryRegion...
ERROR: suspect code indent for conditional statements (8, 11)
#143: FILE: hw/vfio/pci.c:1604:
+        if (bar->size) {
+           object_unparent(OBJECT(bar->mr));

total: 1 errors, 0 warnings, 141 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 3/5: vfio/pci: Emulate BARs...
Checking PATCH 4/5: qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR...
Checking PATCH 5/5: vfio/pci: Allow relocating MSI-X MMIO...
ERROR: braces {} are necessary for all arms of this statement
#67: FILE: hw/vfio/pci.c:1385:
+            if (vdev->bars[i].size > (UINT32_MAX / 2))
[...]

ERROR: suspect code indent for conditional statements (12, 14)
#84: FILE: hw/vfio/pci.c:1402:
+            if (vdev->bars[i].mem64) {
+              i++;

total: 2 errors, 0 warnings, 147 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

=== OUTPUT END ===

Test command exited with code: 1


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@freelists.org

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO Alex Williamson
@ 2017-12-18  9:04   ` Alexey Kardashevskiy
  2017-12-18 13:28     ` Alex Williamson
  2017-12-19  3:07   ` Alexey Kardashevskiy
  1 sibling, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-18  9:04 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: eric.auger

On 18/12/17 16:02, Alex Williamson wrote:
> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> table area can be mmap'd from userspace, allowing direct access to
> non-MSI-X registers within the host page size of this area.  However,
> we only get that direct access if QEMU isn't also emulating MSI-X
> within that same page.  For x86/64 host, the system page size is 4K
> and the PCI spec recommends a minimum of 4K to 8K alignment to
> separate MSI-X from non-MSI-X registers, therefore only devices which
> don't honor this recommendation would see any improvement from this
> option.  The real targets for this feature are hosts where the page
> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> with 64K pages.
> 
> This new x-msix-relocation option accepts the following options:
> 
>   off: Disable MSI-X relocation, use native device config (default)
>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>        based on minimum additional MMIO requirement
>   bar0..bar5: Specify the target BAR, which will either be extended
>        if the BAR exists or added if the BAR slot is available.


While I am digesting the patchset, here are some test results.

This is the device:

00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
PCI-Express Fusion-MPT SAS-3 (rev 02)
Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=1 offset=0000e000
        PBA: BAR=1 offset=0000f000


Test #1: x-msix-relocation = "off":

FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
@000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

Ok, works.


Test #2: x-msix-relocation = "auto":

FlatView #2
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000200080000000-00002000800005ff (prio 0, i/o): msix-table
  0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
@0000000000000600
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


The guest fails probing because the first 64bit BAR is broken.

lspci:

Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]

Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
        Vector table: BAR=0 offset=00000000
        PBA: BAR=0 offset=00000600



Test #3: x-msix-relocation = "bar1"


FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000010000-00002100000105ff (prio 0, i/o): msix-table
  0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
@0000000000010600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]

Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
appear as "ramd" in flatview, should it have appeared?

This is "mtree":

memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
      0000210000010000-00002100000105ff (prio 0, i/o): msix-table
      0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
3 mmaps[0]




Test #4: x-msix-relocation = "bar5"

The same net result as test #3: it works but BAR1 is not mapped:


Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
Region 5: Memory at 200080000000 (32-bit, prefetchable) [size=64K]

Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=5 offset=00000000
        PBA: BAR=5 offset=00000600

FlatView #0
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000200080000000-00002000800005ff (prio 0, i/o): msix-table
  0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
@0000000000000600
  0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000000080000000-000000008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
      0000000080000000-00000000800005ff (prio 0, i/o): msix-table
      0000000080000600-000000008000060f (prio 0, i/o): msix-pba [disabled]
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
3 mmaps[0]



and there is also one minor comment below.


> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h        |    1 
>  hw/vfio/trace-events |    2 +
>  3 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c383b842da20..b4426abf297a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> +{
> +    int target_bar = -1;
> +    size_t msix_sz;
> +
> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> +        return;
> +    }
> +
> +    /* The actual minimum size of MSI-X structures */
> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> +    /* Round up to host pages, we don't want to share a page */
> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> +    /* PCI BARs must be a power of 2 */
> +    msix_sz = pow2ceil(msix_sz);
> +
> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> +        int i;
> +        size_t best = UINT64_MAX;
> +
> +        for (i = 0; i < PCI_ROM_SLOT; i++) {
> +            size_t size;
> +
> +            if (vdev->bars[i].ioport) {
> +                continue;
> +            }
> +
> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> +                continue;
> +
> +            /*
> +             * Must be pow2, so larger of double existing or double msix_sz,
> +             * or if BAR unimplemented, msix_sz
> +             */
> +            size = MAX(vdev->bars[i].size * 2,
> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> +
> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> +
> +            if (size < best) {
> +                best = size;
> +                target_bar = i;
> +            }
> +
> +            if (vdev->bars[i].mem64) {
> +              i++;
> +            }
> +        }
> +    } else {
> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> +    }
> +
> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> +        (!vdev->bars[target_bar].size &&
> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> +        return; /* Go BOOM?  Plumb Error */
> +    }
> +
> +    /*
> +     * If adding a new BAR, test if we can make it 64bit.  We make it
> +     * prefetchable since QEMU MSI-X emulation has no read side effects
> +     * and doing so makes mapping more flexible.
> +     */
> +    if (!vdev->bars[target_bar].size) {
> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> +            !vdev->bars[target_bar + 1].size) {
> +            vdev->bars[target_bar].mem64 = true;
> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        }
> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +        vdev->bars[target_bar].size = msix_sz;
> +        vdev->msix->table_offset = 0;
> +    } else {
> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> +                                          msix_sz * 2);
> +        /*
> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> +         * which will always be a separate host page.
> +         */
> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> +    }
> +
> +    vdev->msix->table_bar = target_bar;
> +    vdev->msix->pba_bar = target_bar;
> +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> +    vdev->msix->pba_offset = vdev->msix->table_offset +
> +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> +
> +    trace_vfio_msix_relo(vdev->vbasedev.name,
> +                         vdev->msix->table_bar, vdev->msix->table_offset);
> +}
> +
>  /*
>   * We don't have any control over how pci_add_capability() inserts
>   * capabilities into the chain.  In order to setup MSI-X we need a
> @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vdev->msix = msix;
>  
>      vfio_pci_fixup_msix_region(vdev);
> +
> +    vfio_pci_relocate_msix(vdev);
>  }
>  
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>      vfio_pci_size_rom(vdev);
>  
> +    vfio_bars_prepare(vdev);
> +
>      vfio_msix_early_setup(vdev, &err);
>      if (err) {
>          error_propagate(errp, err);
>          goto error;
>      }
>  
> -    vfio_bars_prepare(vdev);


This could be in 2/5.


>      vfio_bars_register(vdev);
>  
>      ret = vfio_add_capabilities(vdev, errp);
> @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>                                     nv_gpudirect_clique,
>                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> +                                OFF_AUTOPCIBAR_OFF),
>      /*
>       * TODO - support passed fds... is this necessary?
>       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index dcdb1a806769..588381f201b4 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
>                                  (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
>      bool pci_aer;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fae096c0724f..437ccdd29053 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
>  vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"
>  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18  9:04   ` Alexey Kardashevskiy
@ 2017-12-18 13:28     ` Alex Williamson
  2017-12-18 13:55       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2017-12-18 13:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, eric.auger

On Mon, 18 Dec 2017 20:04:23 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 18/12/17 16:02, Alex Williamson wrote:
> > With recently proposed kernel side vfio-pci changes, the MSI-X vector
> > table area can be mmap'd from userspace, allowing direct access to
> > non-MSI-X registers within the host page size of this area.  However,
> > we only get that direct access if QEMU isn't also emulating MSI-X
> > within that same page.  For x86/64 host, the system page size is 4K
> > and the PCI spec recommends a minimum of 4K to 8K alignment to
> > separate MSI-X from non-MSI-X registers, therefore only devices which
> > don't honor this recommendation would see any improvement from this
> > option.  The real targets for this feature are hosts where the page
> > size exceeds the PCI spec recommended alignment, such as ARM64 systems
> > with 64K pages.
> > 
> > This new x-msix-relocation option accepts the following options:
> > 
> >   off: Disable MSI-X relocation, use native device config (default)
> >   auto: Automaically relocate MSI-X MMIO to another BAR or offset
> >        based on minimum additional MMIO requirement
> >   bar0..bar5: Specify the target BAR, which will either be extended
> >        if the BAR exists or added if the BAR slot is available.  
> 
> 
> While I am digesting the patchset, here are some test results.

Thanks for testing!

> This is the device:
> 
> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> PCI-Express Fusion-MPT SAS-3 (rev 02)

BAR1:

> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]

BAR3:

> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> 
> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>         Vector table: BAR=1 offset=0000e000
>         PBA: BAR=1 offset=0000f000
> 
> 
> Test #1: x-msix-relocation = "off":
> 
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> Ok, works.
> 
> 
> Test #2: x-msix-relocation = "auto":
> 
> FlatView #2
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
> @0000000000000600
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> The guest fails probing because the first 64bit BAR is broken.
> 
> lspci:
> 
> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> 
> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>         Vector table: BAR=0 offset=00000000
>         PBA: BAR=0 offset=00000600

Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
64bit since BAR1 is implemented.  I don't see anything fundamentally
different between this and the working BAR5 test below.

> Test #3: x-msix-relocation = "bar1"
> 
> 
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>   0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
> @0000000000010600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
> appear as "ramd" in flatview, should it have appeared?
> 
> This is "mtree":
> 
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>       0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>       0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]

Did you disable vfio_pci_fixup_msix_region() as noted in 0/5?  This
series doesn't do anything about consuming the new MSI-X mappable flag
that you introduced in the kernel, so vfio_pci_fixup_msix_region() will
continue to exclude mmap'ing the 64K page overlapping the actual BAR.

> Test #4: x-msix-relocation = "bar5"
> 
> The same net result as test #3: it works but BAR1 is not mapped:
> 
> 
> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> Region 5: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> 
> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>         Vector table: BAR=5 offset=00000000
>         PBA: BAR=5 offset=00000600
> 
> FlatView #0
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
> @0000000000000600
>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000000080000000-000000008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 5
>       0000000080000000-00000000800005ff (prio 0, i/o): msix-table
>       0000000080000600-000000008000060f (prio 0, i/o): msix-pba [disabled]
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]

As above, you won't get the mmap without disabling the implicit page
exclusion.  The real question for this case is why does it work while
'auto' came up with a nearly identical layout, swapping BAR5 for BAR0
and it did not work.  The placement of the BARs is even the same.

> and there is also one minor comment below.
> 
> 
> > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >  
> >      vfio_pci_size_rom(vdev);
> >  
> > +    vfio_bars_prepare(vdev);
> > +
> >      vfio_msix_early_setup(vdev, &err);
> >      if (err) {
> >          error_propagate(errp, err);
> >          goto error;
> >      }
> >  
> > -    vfio_bars_prepare(vdev);  
> 
> 
> This could be in 2/5.

It could, but 2/5 was attempting to add the base BAR MemoryRegion and
split vfio_bars_setup() into vfio_bars_prepare() and
vfio_bars_register() without otherwise changing the ordering.  It's
only when we want to modify BARs between prepare and register that we
need to make this change, thus it's done here.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18 13:28     ` Alex Williamson
@ 2017-12-18 13:55       ` Alexey Kardashevskiy
  2017-12-18 14:28         ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-18 13:55 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, eric.auger

On 19/12/17 00:28, Alex Williamson wrote:
> On Mon, 18 Dec 2017 20:04:23 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 18/12/17 16:02, Alex Williamson wrote:
>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>> table area can be mmap'd from userspace, allowing direct access to
>>> non-MSI-X registers within the host page size of this area.  However,
>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>> within that same page.  For x86/64 host, the system page size is 4K
>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>> don't honor this recommendation would see any improvement from this
>>> option.  The real targets for this feature are hosts where the page
>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>> with 64K pages.
>>>
>>> This new x-msix-relocation option accepts the following options:
>>>
>>>   off: Disable MSI-X relocation, use native device config (default)
>>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>>>        based on minimum additional MMIO requirement
>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>        if the BAR exists or added if the BAR slot is available.  
>>
>>
>> While I am digesting the patchset, here are some test results.
> 
> Thanks for testing!
> 
>> This is the device:
>>
>> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
>> PCI-Express Fusion-MPT SAS-3 (rev 02)
> 
> BAR1:
> 
>> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> 
> BAR3:
> 
>> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>
>> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>>         Vector table: BAR=1 offset=0000e000
>>         PBA: BAR=1 offset=0000f000
>>
>>
>> Test #1: x-msix-relocation = "off":
>>
>> FlatView #1
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>> @000000000000e600
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>> Ok, works.
>>
>>
>> Test #2: x-msix-relocation = "auto":
>>
>> FlatView #2
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
>> @0000000000000600
>>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>>
>> The guest fails probing because the first 64bit BAR is broken.
>>
>> lspci:
>>
>> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>
>> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>>         Vector table: BAR=0 offset=00000000
>>         PBA: BAR=0 offset=00000600
> 
> Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
> 64bit since BAR1 is implemented.  I don't see anything fundamentally
> different between this and the working BAR5 test below.


BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
BAR, no?


> 
>> Test #3: x-msix-relocation = "bar1"
>>
>>
>> FlatView #1
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>>   0000210000010600-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>> @0000000000010600
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>> Ok, works. BAR1 became 128K. However no part of BAR1 was mapped, i.e.
>> appear as "ramd" in flatview, should it have appeared?
>>
>> This is "mtree":
>>
>> memory-region: pci@800000020000000.mmio
>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>     0000210000000000-000021000001ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>       0000210000010000-00002100000105ff (prio 0, i/o): msix-table
>>       0000210000010600-000021000001060f (prio 0, i/o): msix-pba [disabled]
>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
>> 3 mmaps[0]
> 
> Did you disable vfio_pci_fixup_msix_region() as noted in 0/5?  This
> series doesn't do anything about consuming the new MSI-X mappable flag
> that you introduced in the kernel, so vfio_pci_fixup_msix_region() will
> continue to exclude mmap'ing the 64K page overlapping the actual BAR.


Ah, my bad, I've read this but when I got to testing - forgot. Sorry for
the noise, tests 3 and 4 mmap as expected with fixup disabled.



-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18 13:55       ` Alexey Kardashevskiy
@ 2017-12-18 14:28         ` Alex Williamson
  2017-12-19  1:22           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2017-12-18 14:28 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, eric.auger

On Tue, 19 Dec 2017 00:55:32 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/12/17 00:28, Alex Williamson wrote:
> > On Mon, 18 Dec 2017 20:04:23 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 18/12/17 16:02, Alex Williamson wrote:  
> >>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> >>> table area can be mmap'd from userspace, allowing direct access to
> >>> non-MSI-X registers within the host page size of this area.  However,
> >>> we only get that direct access if QEMU isn't also emulating MSI-X
> >>> within that same page.  For x86/64 host, the system page size is 4K
> >>> and the PCI spec recommends a minimum of 4K to 8K alignment to
> >>> separate MSI-X from non-MSI-X registers, therefore only devices which
> >>> don't honor this recommendation would see any improvement from this
> >>> option.  The real targets for this feature are hosts where the page
> >>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> >>> with 64K pages.
> >>>
> >>> This new x-msix-relocation option accepts the following options:
> >>>
> >>>   off: Disable MSI-X relocation, use native device config (default)
> >>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
> >>>        based on minimum additional MMIO requirement
> >>>   bar0..bar5: Specify the target BAR, which will either be extended
> >>>        if the BAR exists or added if the BAR slot is available.    
> >>
> >>
> >> While I am digesting the patchset, here are some test results.  
> > 
> > Thanks for testing!
> >   
> >> This is the device:
> >>
> >> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
> >> PCI-Express Fusion-MPT SAS-3 (rev 02)  
> > 
> > BAR1:
> >   
> >> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]  
> > 
> > BAR3:
> >   
> >> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> >>
> >> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
> >>         Vector table: BAR=1 offset=0000e000
> >>         PBA: BAR=1 offset=0000f000
> >>
> >>
> >> Test #1: x-msix-relocation = "off":
> >>
> >> FlatView #1
> >>  AS "memory", root: system
> >>  AS "cpu-memory", root: system
> >>  Root memory region: system
> >>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
> >>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
> >>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> >> @000000000000e600
> >>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>
> >> Ok, works.
> >>
> >>
> >> Test #2: x-msix-relocation = "auto":
> >>
> >> FlatView #2
> >>  AS "memory", root: system
> >>  AS "cpu-memory", root: system
> >>  Root memory region: system
> >>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
> >>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
> >>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
> >> @0000000000000600
> >>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> >>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> >>
> >>
> >> The guest fails probing because the first 64bit BAR is broken.
> >>
> >> lspci:
> >>
> >> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
> >> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
> >> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
> >>
> >> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
> >>         Vector table: BAR=0 offset=00000000
> >>         PBA: BAR=0 offset=00000600  
> > 
> > Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
> > 64bit since BAR1 is implemented.  I don't see anything fundamentally
> > different between this and the working BAR5 test below.  
> 
> 
> BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
> BAR, no?

AIUI, if BAR1 is 64bit, it consumes 0x14-0x17 for the lower 32bis and
0x18-1b for the upper 32bits, ie. it consumes BAR1 + BAR2.  Likewise
the 64bit BAR3 also consumes BAR4.  See for instance the 82576
datasheet:

https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf

9.4.11.2 shows the BAR configuration in 64bit mode, 64bit BAR0 consumes
BAR0 (lower) + BAR1 (upper), 64bit BAR2 consumes BAR2 (lower) + BAR3
(upper), and the MSI-X BAR becomes 64bit at BAR4, consuming BAR4
(lower) + BAR5 (upper).  lspci would show this as Region 0, 2, 4.  The
layout of your SAS card does seem poorly thought out that they've
essentially precluded a 3rd 64bit BAR by starting with BAR1, but
perhaps it's for compatibility with an equally poorly designed 32bit
version of the device.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18 14:28         ` Alex Williamson
@ 2017-12-19  1:22           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-19  1:22 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, eric.auger

On 19/12/17 01:28, Alex Williamson wrote:
> On Tue, 19 Dec 2017 00:55:32 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/12/17 00:28, Alex Williamson wrote:
>>> On Mon, 18 Dec 2017 20:04:23 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 18/12/17 16:02, Alex Williamson wrote:  
>>>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>>>> table area can be mmap'd from userspace, allowing direct access to
>>>>> non-MSI-X registers within the host page size of this area.  However,
>>>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>>>> within that same page.  For x86/64 host, the system page size is 4K
>>>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>>>> don't honor this recommendation would see any improvement from this
>>>>> option.  The real targets for this feature are hosts where the page
>>>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>>>> with 64K pages.
>>>>>
>>>>> This new x-msix-relocation option accepts the following options:
>>>>>
>>>>>   off: Disable MSI-X relocation, use native device config (default)
>>>>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>>>>>        based on minimum additional MMIO requirement
>>>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>>>        if the BAR exists or added if the BAR slot is available.    
>>>>
>>>>
>>>> While I am digesting the patchset, here are some test results.  
>>>
>>> Thanks for testing!
>>>   
>>>> This is the device:
>>>>
>>>> 00:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS3008
>>>> PCI-Express Fusion-MPT SAS-3 (rev 02)  
>>>
>>> BAR1:
>>>   
>>>> Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]  
>>>
>>> BAR3:
>>>   
>>>> Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>>>
>>>> Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
>>>>         Vector table: BAR=1 offset=0000e000
>>>>         PBA: BAR=1 offset=0000f000
>>>>
>>>>
>>>> Test #1: x-msix-relocation = "off":
>>>>
>>>> FlatView #1
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>>>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>>>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>>> @000000000000e600
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>> Ok, works.
>>>>
>>>>
>>>> Test #2: x-msix-relocation = "auto":
>>>>
>>>> FlatView #2
>>>>  AS "memory", root: system
>>>>  AS "cpu-memory", root: system
>>>>  Root memory region: system
>>>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>>>   0000200080000000-00002000800005ff (prio 0, i/o): msix-table
>>>>   0000200080000600-000020008000ffff (prio 1, i/o): 0001:03:00.0 base BAR 0
>>>> @0000000000000600
>>>>   0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>>>
>>>>
>>>> The guest fails probing because the first 64bit BAR is broken.
>>>>
>>>> lspci:
>>>>
>>>> Region 0: Memory at 200080000000 (32-bit, prefetchable) [size=64K]
>>>> Region 1: Memory at 210000000000 (64-bit, non-prefetchable) [size=64K]
>>>> Region 3: Memory at 210000040000 (64-bit, non-prefetchable) [size=256K]
>>>>
>>>> Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
>>>>         Vector table: BAR=0 offset=00000000
>>>>         PBA: BAR=0 offset=00000600  
>>>
>>> Why do you suppose it's broken?  The added BAR0 is 32bit, it cannot be
>>> 64bit since BAR1 is implemented.  I don't see anything fundamentally
>>> different between this and the working BAR5 test below.  
>>
>>
>> BAR1 (0x14..0x17) uses BAR0 (0x10..0x13) as upper 32bits when it is 64bit
>> BAR, no?
> 
> AIUI, if BAR1 is 64bit, it consumes 0x14-0x17 for the lower 32bis and
> 0x18-1b for the upper 32bits, ie. it consumes BAR1 + BAR2.  Likewise
> the 64bit BAR3 also consumes BAR4.  See for instance the 82576
> datasheet:
> 
> https://www.intel.com/content/dam/www/public/us/en/documents/datasheets/82576eb-gigabit-ethernet-controller-datasheet.pdf
> 
> 9.4.11.2 shows the BAR configuration in 64bit mode, 64bit BAR0 consumes
> BAR0 (lower) + BAR1 (upper), 64bit BAR2 consumes BAR2 (lower) + BAR3
> (upper), and the MSI-X BAR becomes 64bit at BAR4, consuming BAR4
> (lower) + BAR5 (upper).  lspci would show this as Region 0, 2, 4.  The
> layout of your SAS card does seem poorly thought out that they've
> essentially precluded a 3rd 64bit BAR by starting with BAR1, but
> perhaps it's for compatibility with an equally poorly designed 32bit
> version of the device.  Thanks,


Ah, makes sense, I just never saw 64bit BARs starting from an odd offset.
My card is weird^Wunusual then:


aik@stratton2:~$ lspci -vbxs 0001:03:00.0

0001:03:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic
SAS3008 PCI-Express Fusion-MPT SAS-3 (rev
02)
        Subsystem: Super Micro Computer Inc SAS3008 PCI-Express Fusion-MPT
SAS-3
        Flags: bus master, fast devsel, latency 0
        I/O ports at <unassigned> [disabled]
        Memory at 80140000 (64-bit, non-prefetchable)
        Memory at 80100000 (64-bit, non-prefetchable)
        Capabilities: <access denied>
        Kernel driver in use: vfio-pci
        Kernel modules: mpt3sas
00: 00 10 97 00 46 05 10 00 02 00 07 01 00 00 00 00
10: 01 00 00 00 04 00 14 80 00 00 00 00 04 00 10 80
20: 00 00 00 00 00 00 00 00 00 00 00 00 d9 15 08 08
30: 00 00 00 00 50 00 00 00 00 00 00 00 00 01 00 00



The mpt3sas driver is funny too - it fails probing with MSIX in bar0 but
succeeds with bar5.

Region 1: Memory at 210000000000 (64-bit, non-prefetchable)
Region 3: Memory at 210000040000 (64-bit, non-prefetchable)
Region 5: Memory at 80000000 (32-bit, prefetchable)
Capabilities: [c0] MSI-X: Enable+ Count=96 Masked-
        Vector table: BAR=5 offset=00000000
        PBA: BAR=5 offset=00000600


vs.

Region 0: Memory at 80000000 (32-bit, prefetchable)
Region 1: Memory at 210000000000 (64-bit, non-prefetchable)
Region 3: Memory at 210000040000 (64-bit, non-prefetchable)
Capabilities: [c0] MSI-X: Enable- Count=96 Masked-
        Vector table: BAR=0 offset=00000000
        PBA: BAR=0 offset=00000600


Here is why:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/scsi/mpt3sas/mpt3sas_base.c?h=v4.15-rc4#n2608

It is looking for a first MMIO BAR and assumes it is the one which
implements the basic registers including doorbell. I am not so sure this is
that unusual.



-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO Alex Williamson
  2017-12-18  9:04   ` Alexey Kardashevskiy
@ 2017-12-19  3:07   ` Alexey Kardashevskiy
  2017-12-19  3:40     ` Alex Williamson
  1 sibling, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-19  3:07 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: eric.auger

On 18/12/17 16:02, Alex Williamson wrote:
> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> table area can be mmap'd from userspace, allowing direct access to
> non-MSI-X registers within the host page size of this area.  However,
> we only get that direct access if QEMU isn't also emulating MSI-X
> within that same page.  For x86/64 host, the system page size is 4K
> and the PCI spec recommends a minimum of 4K to 8K alignment to
> separate MSI-X from non-MSI-X registers, therefore only devices which
> don't honor this recommendation would see any improvement from this
> option.  The real targets for this feature are hosts where the page
> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> with 64K pages.
> 
> This new x-msix-relocation option accepts the following options:
> 
>   off: Disable MSI-X relocation, use native device config (default)
>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>        based on minimum additional MMIO requirement
>   bar0..bar5: Specify the target BAR, which will either be extended
>        if the BAR exists or added if the BAR slot is available.
> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/vfio/pci.h        |    1 
>  hw/vfio/trace-events |    2 +
>  3 files changed, 104 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c383b842da20..b4426abf297a 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>      }
>  }
>  
> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> +{
> +    int target_bar = -1;
> +    size_t msix_sz;
> +
> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> +        return;
> +    }
> +
> +    /* The actual minimum size of MSI-X structures */
> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> +    /* Round up to host pages, we don't want to share a page */
> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> +    /* PCI BARs must be a power of 2 */
> +    msix_sz = pow2ceil(msix_sz);
> +
> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> +        int i;
> +        size_t best = UINT64_MAX;
> +
> +        for (i = 0; i < PCI_ROM_SLOT; i++) {


I belieive that going from the other end is safer approach for "auto",
especially after discovering how mpt3sas works. Or you could add
"autoreverse" switch...




> +            size_t size;
> +
> +            if (vdev->bars[i].ioport) {
> +                continue;
> +            }
> +
> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> +                continue;
> +
> +            /*
> +             * Must be pow2, so larger of double existing or double msix_sz,
> +             * or if BAR unimplemented, msix_sz
> +             */
> +            size = MAX(vdev->bars[i].size * 2,
> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> +
> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> +
> +            if (size < best) {
> +                best = size;
> +                target_bar = i;
> +            }
> +
> +            if (vdev->bars[i].mem64) {
> +              i++;
> +            }
> +        }
> +    } else {
> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> +    }
> +
> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> +        (!vdev->bars[target_bar].size &&
> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> +        return; /* Go BOOM?  Plumb Error */
> +    }


This "if" only seems to make sense for the non-auto branch...


> +
> +    /*
> +     * If adding a new BAR, test if we can make it 64bit.  We make it
> +     * prefetchable since QEMU MSI-X emulation has no read side effects
> +     * and doing so makes mapping more flexible.
> +     */
> +    if (!vdev->bars[target_bar].size) {
> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> +            !vdev->bars[target_bar + 1].size) {
> +            vdev->bars[target_bar].mem64 = true;
> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> +        }
> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> +        vdev->bars[target_bar].size = msix_sz;
> +        vdev->msix->table_offset = 0;
> +    } else {
> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> +                                          msix_sz * 2);
> +        /*
> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> +         * which will always be a separate host page.
> +         */
> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> +    }
> +
> +    vdev->msix->table_bar = target_bar;
> +    vdev->msix->pba_bar = target_bar;


Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out
was not necessary at the time but I missed that it is called before
vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks,




> +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> +    vdev->msix->pba_offset = vdev->msix->table_offset +
> +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> +
> +    trace_vfio_msix_relo(vdev->vbasedev.name,
> +                         vdev->msix->table_bar, vdev->msix->table_offset);
> +}
> +
>  /*
>   * We don't have any control over how pci_add_capability() inserts
>   * capabilities into the chain.  In order to setup MSI-X we need a
> @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
>      vdev->msix = msix;
>  
>      vfio_pci_fixup_msix_region(vdev);
> +
> +    vfio_pci_relocate_msix(vdev);
>  }
>  
>  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>  
>      vfio_pci_size_rom(vdev);
>  
> +    vfio_bars_prepare(vdev);
> +
>      vfio_msix_early_setup(vdev, &err);
>      if (err) {
>          error_propagate(errp, err);
>          goto error;
>      }
>  
> -    vfio_bars_prepare(vdev);
>      vfio_bars_register(vdev);
>  
>      ret = vfio_add_capabilities(vdev, errp);
> @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
>      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
>                                     nv_gpudirect_clique,
>                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> +                                OFF_AUTOPCIBAR_OFF),
>      /*
>       * TODO - support passed fds... is this necessary?
>       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index dcdb1a806769..588381f201b4 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
>                                  (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
>      int32_t bootindex;
>      uint32_t igd_gms;
> +    OffAutoPCIBAR msix_relo;
>      uint8_t pm_cap;
>      uint8_t nv_gpudirect_clique;
>      bool pci_aer;
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index fae096c0724f..437ccdd29053 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
>  vfio_msix_pba_enable(const char *name) " (%s)"
>  vfio_msix_disable(const char *name) " (%s)"
>  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
>  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
>  vfio_msi_disable(const char *name) " (%s)"
>  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-19  3:07   ` Alexey Kardashevskiy
@ 2017-12-19  3:40     ` Alex Williamson
  2017-12-19  6:02       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2017-12-19  3:40 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, eric.auger

On Tue, 19 Dec 2017 14:07:13 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 18/12/17 16:02, Alex Williamson wrote:
> > With recently proposed kernel side vfio-pci changes, the MSI-X vector
> > table area can be mmap'd from userspace, allowing direct access to
> > non-MSI-X registers within the host page size of this area.  However,
> > we only get that direct access if QEMU isn't also emulating MSI-X
> > within that same page.  For x86/64 host, the system page size is 4K
> > and the PCI spec recommends a minimum of 4K to 8K alignment to
> > separate MSI-X from non-MSI-X registers, therefore only devices which
> > don't honor this recommendation would see any improvement from this
> > option.  The real targets for this feature are hosts where the page
> > size exceeds the PCI spec recommended alignment, such as ARM64 systems
> > with 64K pages.
> > 
> > This new x-msix-relocation option accepts the following options:
> > 
> >   off: Disable MSI-X relocation, use native device config (default)
> >   auto: Automaically relocate MSI-X MMIO to another BAR or offset
> >        based on minimum additional MMIO requirement
> >   bar0..bar5: Specify the target BAR, which will either be extended
> >        if the BAR exists or added if the BAR slot is available.
> > 
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > ---
> >  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/vfio/pci.h        |    1 
> >  hw/vfio/trace-events |    2 +
> >  3 files changed, 104 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> > index c383b842da20..b4426abf297a 100644
> > --- a/hw/vfio/pci.c
> > +++ b/hw/vfio/pci.c
> > @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >      }
> >  }
> >  
> > +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> > +{
> > +    int target_bar = -1;
> > +    size_t msix_sz;
> > +
> > +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> > +        return;
> > +    }
> > +
> > +    /* The actual minimum size of MSI-X structures */
> > +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> > +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> > +    /* Round up to host pages, we don't want to share a page */
> > +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> > +    /* PCI BARs must be a power of 2 */
> > +    msix_sz = pow2ceil(msix_sz);
> > +
> > +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> > +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> > +        int i;
> > +        size_t best = UINT64_MAX;
> > +
> > +        for (i = 0; i < PCI_ROM_SLOT; i++) {  
> 
> 
> I belieive that going from the other end is safer approach for "auto",
> especially after discovering how mpt3sas works. Or you could add
> "autoreverse" switch...

Or is extending the smallest BAR really a safer option?  I wonder how
many drivers go through and fill fixed sized arrays with BAR info,
expecting only the device implemented number of BARs.  Maybe they
wouldn't notice if the BAR was simply bigger than expected.  On the
other hand there are probably drivers dumb enough to index registers
from the end for the BAR as well.  I don't think there exists an
auto algorithm that will fit every device, but a higher hit rate than
we have so far would be nice.  We could also implement MemoryRegionOps
for the base BAR with some error reporting if it gets called.  That
might make the problem more obvious than unassigned_mem_ops silently
eating those accesses.

> > +            size_t size;
> > +
> > +            if (vdev->bars[i].ioport) {
> > +                continue;
> > +            }
> > +
> > +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> > +            if (vdev->bars[i].size > (UINT32_MAX / 2))

Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.

NB, the existing test here needs a bit of work too, 32bit BARs max out
at 2G not 4G, so maybe we need separate tests here.  >1G for 32bit
BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
32bit BARs to 64bit?  It's all virtual addresses anyway, right.  We're
in real trouble if were extending BARs where this is an issue though. 

> > +                continue;
> > +
> > +            /*
> > +             * Must be pow2, so larger of double existing or double msix_sz,
> > +             * or if BAR unimplemented, msix_sz
> > +             */
> > +            size = MAX(vdev->bars[i].size * 2,
> > +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
> > +
> > +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
> > +
> > +            if (size < best) {
> > +                best = size;
> > +                target_bar = i;
> > +            }
> > +
> > +            if (vdev->bars[i].mem64) {
> > +              i++;
> > +            }
> > +        }
> > +    } else {
> > +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
> > +    }
> > +
> > +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
> > +        (!vdev->bars[target_bar].size &&
> > +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
> > +        return; /* Go BOOM?  Plumb Error */
> > +    }  
> 
> 
> This "if" only seems to make sense for the non-auto branch...

Most of it, yes, but it's still possible for a device to exist where
the auto loop would come up empty.  Imagine if each BAR was
sufficiently large that we couldn't extend it and still give the MSI-X
MMIO areas a 32-bit offset within the BAR.  Exceptionally unlikely, it
doesn't hurt to test all the corner cases.  I also missed the case of
testing that the BAR isn't too large already here.
 
> > +
> > +    /*
> > +     * If adding a new BAR, test if we can make it 64bit.  We make it
> > +     * prefetchable since QEMU MSI-X emulation has no read side effects
> > +     * and doing so makes mapping more flexible.
> > +     */
> > +    if (!vdev->bars[target_bar].size) {
> > +        if (target_bar < (PCI_ROM_SLOT - 1) &&
> > +            !vdev->bars[target_bar + 1].size) {
> > +            vdev->bars[target_bar].mem64 = true;
> > +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
> > +        }
> > +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
> > +        vdev->bars[target_bar].size = msix_sz;
> > +        vdev->msix->table_offset = 0;
> > +    } else {
> > +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
> > +                                          msix_sz * 2);
> > +        /*
> > +         * Due to above size calc, MSI-X always starts halfway into the BAR,
> > +         * which will always be a separate host page.
> > +         */
> > +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
> > +    }
> > +
> > +    vdev->msix->table_bar = target_bar;
> > +    vdev->msix->pba_bar = target_bar;  
> 
> 
> Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out
> was not necessary at the time but I missed that it is called before
> vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks,

For a kernel that allows mapping the MSI-X region, yes, but if you ran
that on an older kernel I think QEMU would break when it can't mmap the
entire region.  We can't only support new kernels.  Thanks,

Alex

> > +    /* Requires 8-byte alignment, but PCI_MSIX_ENTRY_SIZE guarantees that */
> > +    vdev->msix->pba_offset = vdev->msix->table_offset +
> > +                                  (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE);
> > +
> > +    trace_vfio_msix_relo(vdev->vbasedev.name,
> > +                         vdev->msix->table_bar, vdev->msix->table_offset);
> > +}
> > +
> >  /*
> >   * We don't have any control over how pci_add_capability() inserts
> >   * capabilities into the chain.  In order to setup MSI-X we need a
> > @@ -1430,6 +1525,8 @@ static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
> >      vdev->msix = msix;
> >  
> >      vfio_pci_fixup_msix_region(vdev);
> > +
> > +    vfio_pci_relocate_msix(vdev);
> >  }
> >  
> >  static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
> > @@ -2845,13 +2942,14 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
> >  
> >      vfio_pci_size_rom(vdev);
> >  
> > +    vfio_bars_prepare(vdev);
> > +
> >      vfio_msix_early_setup(vdev, &err);
> >      if (err) {
> >          error_propagate(errp, err);
> >          goto error;
> >      }
> >  
> > -    vfio_bars_prepare(vdev);
> >      vfio_bars_register(vdev);
> >  
> >      ret = vfio_add_capabilities(vdev, errp);
> > @@ -3041,6 +3139,8 @@ static Property vfio_pci_dev_properties[] = {
> >      DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
> >                                     nv_gpudirect_clique,
> >                                     qdev_prop_nv_gpudirect_clique, uint8_t),
> > +    DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
> > +                                OFF_AUTOPCIBAR_OFF),
> >      /*
> >       * TODO - support passed fds... is this necessary?
> >       * DEFINE_PROP_STRING("vfiofd", VFIOPCIDevice, vfiofd_name),
> > diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> > index dcdb1a806769..588381f201b4 100644
> > --- a/hw/vfio/pci.h
> > +++ b/hw/vfio/pci.h
> > @@ -135,6 +135,7 @@ typedef struct VFIOPCIDevice {
> >                                  (1 << VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT)
> >      int32_t bootindex;
> >      uint32_t igd_gms;
> > +    OffAutoPCIBAR msix_relo;
> >      uint8_t pm_cap;
> >      uint8_t nv_gpudirect_clique;
> >      bool pci_aer;
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index fae096c0724f..437ccdd29053 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -16,6 +16,8 @@ vfio_msix_pba_disable(const char *name) " (%s)"
> >  vfio_msix_pba_enable(const char *name) " (%s)"
> >  vfio_msix_disable(const char *name) " (%s)"
> >  vfio_msix_fixup(const char *name, int bar, uint64_t start, uint64_t end) " (%s) MSI-X region %d mmap fixup [0x%"PRIx64" - 0x%"PRIx64"]"
> > +vfio_msix_relo_cost(const char *name, int bar, uint64_t cost) " (%s) BAR %d cost 0x%"PRIx64""
> > +vfio_msix_relo(const char *name, int bar, uint64_t offset) " (%s) BAR %d offset 0x%"PRIx64""
> >  vfio_msi_enable(const char *name, int nr_vectors) " (%s) Enabled %d MSI vectors"
> >  vfio_msi_disable(const char *name) " (%s)"
> >  vfio_pci_load_rom(const char *name, unsigned long size, unsigned long offset, unsigned long flags) "Device %s ROM:\n  size: 0x%lx, offset: 0x%lx, flags: 0x%lx"
> > 
> >   
> 
> 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion
  2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion Alex Williamson
@ 2017-12-19  3:44   ` Alexey Kardashevskiy
  2017-12-19  3:56     ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-19  3:44 UTC (permalink / raw)
  To: Alex Williamson, qemu-devel; +Cc: eric.auger

On 18/12/17 16:02, Alex Williamson wrote:
> Add one more layer to our stack of MemoryRegions, this base region
> allows us to register BARs independently of the vfio region or to
> extend the size of BARs which do map to a region.  This will be
> useful when we want hypervisor defined BARs or sections of BARs,
> for purposes such as relocating MSI-X emulation.  We therefore call
> msix_init() based on this new base MemoryRegion, while the quirks,
> which only modify regions still operate on those sub-MemoryRegions.


Looks ok, one worry though - the default config produces this:


memory-region: pci@800000020000000.mmio
  0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
    0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
      0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
      000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
      000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
    0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
      0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
        0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
3 mmaps[0]


Where "BAR 1" and "msix-table" overlap. It resolves correctly:


FlatView #1
 AS "memory", root: system
 AS "cpu-memory", root: system
 Root memory region: system
  0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
  0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
  000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
  000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
@000000000000e600
  0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]


but this looks like an accident - should not we raise the msix-table
priority or lower the BAR1 priority (to -1)?



> 
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> ---
>  hw/vfio/pci.c |   74 ++++++++++++++++++++++++++++++++++++++++++++-------------
>  hw/vfio/pci.h |    3 ++
>  2 files changed, 60 insertions(+), 17 deletions(-)
> 
> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> index c977ee327f94..8f46fdd1d391 100644
> --- a/hw/vfio/pci.c
> +++ b/hw/vfio/pci.c
> @@ -1440,9 +1440,9 @@ static int vfio_msix_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
>      vdev->msix->pending = g_malloc0(BITS_TO_LONGS(vdev->msix->entries) *
>                                      sizeof(unsigned long));
>      ret = msix_init(&vdev->pdev, vdev->msix->entries,
> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> +                    vdev->bars[vdev->msix->table_bar].mr,
>                      vdev->msix->table_bar, vdev->msix->table_offset,
> -                    vdev->bars[vdev->msix->pba_bar].region.mem,
> +                    vdev->bars[vdev->msix->pba_bar].mr,
>                      vdev->msix->pba_bar, vdev->msix->pba_offset, pos,
>                      &err);
>      if (ret < 0) {
> @@ -1482,8 +1482,8 @@ static void vfio_teardown_msi(VFIOPCIDevice *vdev)
>  
>      if (vdev->msix) {
>          msix_uninit(&vdev->pdev,
> -                    vdev->bars[vdev->msix->table_bar].region.mem,
> -                    vdev->bars[vdev->msix->pba_bar].region.mem);
> +                    vdev->bars[vdev->msix->table_bar].mr,
> +                    vdev->bars[vdev->msix->pba_bar].mr);
>          g_free(vdev->msix->pending);
>      }
>  }
> @@ -1500,12 +1500,11 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
>      }
>  }
>  
> -static void vfio_bar_setup(VFIOPCIDevice *vdev, int nr)
> +static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
>  {
>      VFIOBAR *bar = &vdev->bars[nr];
>  
>      uint32_t pci_bar;
> -    uint8_t type;
>      int ret;
>  
>      /* Skip both unimplemented BARs and the upper half of 64bit BARS. */
> @@ -1524,23 +1523,52 @@ static void vfio_bar_setup(VFIOPCIDevice *vdev, int nr)
>      pci_bar = le32_to_cpu(pci_bar);
>      bar->ioport = (pci_bar & PCI_BASE_ADDRESS_SPACE_IO);
>      bar->mem64 = bar->ioport ? 0 : (pci_bar & PCI_BASE_ADDRESS_MEM_TYPE_64);
> -    type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
> -                                    ~PCI_BASE_ADDRESS_MEM_MASK);
> +    bar->type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
> +                                         ~PCI_BASE_ADDRESS_MEM_MASK);
> +    bar->size = bar->region.size;
> +}
>  
> -    if (vfio_region_mmap(&bar->region)) {
> -        error_report("Failed to mmap %s BAR %d. Performance may be slow",
> -                     vdev->vbasedev.name, nr);
> +static void vfio_bars_prepare(VFIOPCIDevice *vdev)
> +{
> +    int i;
> +
> +    for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        vfio_bar_prepare(vdev, i);
>      }
> +}
>  
> -    pci_register_bar(&vdev->pdev, nr, type, bar->region.mem);
> +static void vfio_bar_register(VFIOPCIDevice *vdev, int nr)
> +{
> +    VFIOBAR *bar = &vdev->bars[nr];
> +    char *name;
> +
> +    if (!bar->size) {
> +        return;
> +    }
> +
> +    bar->mr = g_new0(MemoryRegion, 1);
> +    name = g_strdup_printf("%s base BAR %d", vdev->vbasedev.name, nr);
> +    memory_region_init_io(bar->mr, OBJECT(vdev), NULL, NULL, name, bar->size);
> +    g_free(name);
> +
> +    if (bar->region.size) {
> +        memory_region_add_subregion(bar->mr, 0, bar->region.mem);
> +
> +        if (vfio_region_mmap(&bar->region)) {
> +            error_report("Failed to mmap %s BAR %d. Performance may be slow",
> +                         vdev->vbasedev.name, nr);
> +        }
> +    }
> +
> +    pci_register_bar(&vdev->pdev, nr, bar->type, bar->mr);
>  }
>  
> -static void vfio_bars_setup(VFIOPCIDevice *vdev)
> +static void vfio_bars_register(VFIOPCIDevice *vdev)
>  {
>      int i;
>  
>      for (i = 0; i < PCI_ROM_SLOT; i++) {
> -        vfio_bar_setup(vdev, i);
> +        vfio_bar_register(vdev, i);
>      }
>  }
>  
> @@ -1549,8 +1577,13 @@ static void vfio_bars_exit(VFIOPCIDevice *vdev)
>      int i;
>  
>      for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        VFIOBAR *bar = &vdev->bars[i];
> +
>          vfio_bar_quirk_exit(vdev, i);
> -        vfio_region_exit(&vdev->bars[i].region);
> +        vfio_region_exit(&bar->region);
> +        if (bar->region.size) {
> +            memory_region_del_subregion(bar->mr, bar->region.mem);
> +        }
>      }
>  
>      if (vdev->vga) {
> @@ -1564,8 +1597,14 @@ static void vfio_bars_finalize(VFIOPCIDevice *vdev)
>      int i;
>  
>      for (i = 0; i < PCI_ROM_SLOT; i++) {
> +        VFIOBAR *bar = &vdev->bars[i];
> +
>          vfio_bar_quirk_finalize(vdev, i);
> -        vfio_region_finalize(&vdev->bars[i].region);
> +        vfio_region_finalize(&bar->region);
> +        if (bar->size) {
> +           object_unparent(OBJECT(bar->mr));
> +           g_free(bar->mr);
> +        }
>      }
>  
>      if (vdev->vga) {
> @@ -2810,7 +2849,8 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
>          goto error;
>      }
>  
> -    vfio_bars_setup(vdev);
> +    vfio_bars_prepare(vdev);
> +    vfio_bars_register(vdev);
>  
>      ret = vfio_add_capabilities(vdev, errp);
>      if (ret) {
> diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
> index 3d753222ca4c..dcdb1a806769 100644
> --- a/hw/vfio/pci.h
> +++ b/hw/vfio/pci.h
> @@ -33,6 +33,9 @@ typedef struct VFIOQuirk {
>  
>  typedef struct VFIOBAR {
>      VFIORegion region;
> +    MemoryRegion *mr;
> +    size_t size;
> +    uint8_t type;
>      bool ioport;
>      bool mem64;
>      QLIST_HEAD(, VFIOQuirk) quirks;
> 
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion
  2017-12-19  3:44   ` Alexey Kardashevskiy
@ 2017-12-19  3:56     ` Alex Williamson
  2017-12-19  4:15       ` Alexey Kardashevskiy
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2017-12-19  3:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, eric.auger

On Tue, 19 Dec 2017 14:44:51 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 18/12/17 16:02, Alex Williamson wrote:
> > Add one more layer to our stack of MemoryRegions, this base region
> > allows us to register BARs independently of the vfio region or to
> > extend the size of BARs which do map to a region.  This will be
> > useful when we want hypervisor defined BARs or sections of BARs,
> > for purposes such as relocating MSI-X emulation.  We therefore call
> > msix_init() based on this new base MemoryRegion, while the quirks,
> > which only modify regions still operate on those sub-MemoryRegions.  
> 
> 
> Looks ok, one worry though - the default config produces this:
> 
> 
> memory-region: pci@800000020000000.mmio
>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
> 3 mmaps[0]
> 
> 
> Where "BAR 1" and "msix-table" overlap. It resolves correctly:
> 
> 
> FlatView #1
>  AS "memory", root: system
>  AS "cpu-memory", root: system
>  Root memory region: system
>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
> @000000000000e600
>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
> 
> 
> but this looks like an accident - should not we raise the msix-table
> priority or lower the BAR1 priority (to -1)?

I was worried about this too, but the way I read the code is that the
last registered subregion with the same priority wins.  Therefore so
long as msix_init() is called after we layer anything else onto the base
BAR MemoryRegion that might interfere with it, everything should work
correctly.  It certainly might make sense for msix_init() to add
subregions with a higher priority, but then you immediately get into
the road block of what priority should it use, which is a bit of a rat
hole since we can make it work as-is with proper ordering.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion
  2017-12-19  3:56     ` Alex Williamson
@ 2017-12-19  4:15       ` Alexey Kardashevskiy
  0 siblings, 0 replies; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-19  4:15 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, eric.auger

On 19/12/17 14:56, Alex Williamson wrote:
> On Tue, 19 Dec 2017 14:44:51 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 18/12/17 16:02, Alex Williamson wrote:
>>> Add one more layer to our stack of MemoryRegions, this base region
>>> allows us to register BARs independently of the vfio region or to
>>> extend the size of BARs which do map to a region.  This will be
>>> useful when we want hypervisor defined BARs or sections of BARs,
>>> for purposes such as relocating MSI-X emulation.  We therefore call
>>> msix_init() based on this new base MemoryRegion, while the quirks,
>>> which only modify regions still operate on those sub-MemoryRegions.  
>>
>>
>> Looks ok, one worry though - the default config produces this:
>>
>>
>> memory-region: pci@800000020000000.mmio
>>   0000000000000000-ffffffffffffffff (prio 0, i/o): pci@800000020000000.mmio
>>     0000210000000000-000021000000ffff (prio 1, i/o): 0001:03:00.0 base BAR 1
>>       0000210000000000-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>>       000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>       000021000000f000-000021000000f00f (prio 0, i/o): msix-pba [disabled]
>>     0000210000040000-000021000007ffff (prio 1, i/o): 0001:03:00.0 base BAR 3
>>       0000210000040000-000021000007ffff (prio 0, i/o): 0001:03:00.0 BAR 3
>>         0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR
>> 3 mmaps[0]
>>
>>
>> Where "BAR 1" and "msix-table" overlap. It resolves correctly:
>>
>>
>> FlatView #1
>>  AS "memory", root: system
>>  AS "cpu-memory", root: system
>>  Root memory region: system
>>   0000000000000000-000000007fffffff (prio 0, ram): ppc_spapr.ram
>>   0000210000000000-000021000000dfff (prio 0, i/o): 0001:03:00.0 BAR 1
>>   000021000000e000-000021000000e5ff (prio 0, i/o): msix-table
>>   000021000000e600-000021000000ffff (prio 0, i/o): 0001:03:00.0 BAR 1
>> @000000000000e600
>>   0000210000040000-000021000007ffff (prio 0, ramd): 0001:03:00.0 BAR 3 mmaps[0]
>>
>>
>> but this looks like an accident - should not we raise the msix-table
>> priority or lower the BAR1 priority (to -1)?
> 
> I was worried about this too, but the way I read the code is that the
> last registered subregion with the same priority wins.  Therefore so
> long as msix_init() is called after we layer anything else onto the base
> BAR MemoryRegion that might interfere with it, everything should work
> correctly.  It certainly might make sense for msix_init() to add
> subregions with a higher priority, but then you immediately get into
> the road block of what priority should it use, which is a bit of a rat
> hole since we can make it work as-is with proper ordering.

The order is not obvious from any variant of "info mtree" (I may post a
patch with a "-u" switch to disable sorting).

And docs/devel/memory.txt allows using negative priorities so the "BAR 1"
region could have -1 to make things predictable and more self-documented imho.



>  Thanks,
> 
> Alex
> 


-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-19  3:40     ` Alex Williamson
@ 2017-12-19  6:02       ` Alexey Kardashevskiy
  2017-12-19  6:56         ` Alex Williamson
  0 siblings, 1 reply; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-19  6:02 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, eric.auger

On 19/12/17 14:40, Alex Williamson wrote:
> On Tue, 19 Dec 2017 14:07:13 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 18/12/17 16:02, Alex Williamson wrote:
>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>> table area can be mmap'd from userspace, allowing direct access to
>>> non-MSI-X registers within the host page size of this area.  However,
>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>> within that same page.  For x86/64 host, the system page size is 4K
>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>> don't honor this recommendation would see any improvement from this
>>> option.  The real targets for this feature are hosts where the page
>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>> with 64K pages.
>>>
>>> This new x-msix-relocation option accepts the following options:
>>>
>>>   off: Disable MSI-X relocation, use native device config (default)
>>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>>>        based on minimum additional MMIO requirement
>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>        if the BAR exists or added if the BAR slot is available.
>>>
>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>> ---
>>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>  hw/vfio/pci.h        |    1 
>>>  hw/vfio/trace-events |    2 +
>>>  3 files changed, 104 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>> index c383b842da20..b4426abf297a 100644
>>> --- a/hw/vfio/pci.c
>>> +++ b/hw/vfio/pci.c
>>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>      }
>>>  }
>>>  
>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
>>> +{
>>> +    int target_bar = -1;
>>> +    size_t msix_sz;
>>> +
>>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>> +        return;
>>> +    }
>>> +
>>> +    /* The actual minimum size of MSI-X structures */
>>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>> +    /* Round up to host pages, we don't want to share a page */
>>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>> +    /* PCI BARs must be a power of 2 */
>>> +    msix_sz = pow2ceil(msix_sz);
>>> +
>>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
>>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>> +        int i;
>>> +        size_t best = UINT64_MAX;
>>> +
>>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {  
>>
>>
>> I belieive that going from the other end is safer approach for "auto",
>> especially after discovering how mpt3sas works. Or you could add
>> "autoreverse" switch...
> 
> Or is extending the smallest BAR really a safer option?  I wonder how
> many drivers go through and fill fixed sized arrays with BAR info,
> expecting only the device implemented number of BARs.  Maybe they
> wouldn't notice if the BAR was simply bigger than expected.  On the
> other hand there are probably drivers dumb enough to index registers
> from the end for the BAR as well.  I don't think there exists an
> auto algorithm that will fit every device, but a higher hit rate than
> we have so far would be nice.

Everything is possible :(

I do not know if there are many users for this relocation though. So far
only one device has the problem (in 5 years or so) and it is fixed by
moving msix to bar5, I'd suggest start with this for now.

In general, I think we still need a way to simply disable that msix_table
region anyway if we find a device driver which uses all BARs, does not
tolerate changes to the default set of BARs, etc.


>  We could also implement MemoryRegionOps
> for the base BAR with some error reporting if it gets called.  That
> might make the problem more obvious than unassigned_mem_ops silently
> eating those accesses.

Makes sense.


> 
>>> +            size_t size;
>>> +
>>> +            if (vdev->bars[i].ioport) {
>>> +                continue;
>>> +            }
>>> +
>>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
>>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))
> 
> Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
> 
> NB, the existing test here needs a bit of work too, 32bit BARs max out
> at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
> BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
> 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
> in real trouble if were extending BARs where this is an issue though. 

until you get a driver like this :)

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782


> 
>>> +                continue;
>>> +
>>> +            /*
>>> +             * Must be pow2, so larger of double existing or double msix_sz,
>>> +             * or if BAR unimplemented, msix_sz
>>> +             */
>>> +            size = MAX(vdev->bars[i].size * 2,
>>> +                       vdev->bars[i].size ? msix_sz * 2 : msix_sz);
>>> +
>>> +            trace_vfio_msix_relo_cost(vdev->vbasedev.name, i, size);
>>> +
>>> +            if (size < best) {
>>> +                best = size;
>>> +                target_bar = i;
>>> +            }
>>> +
>>> +            if (vdev->bars[i].mem64) {
>>> +              i++;
>>> +            }
>>> +        }
>>> +    } else {
>>> +        target_bar = (int)(vdev->msix_relo - OFF_AUTOPCIBAR_BAR0);
>>> +    }
>>> +
>>> +    if (target_bar < 0 || vdev->bars[target_bar].ioport ||
>>> +        (!vdev->bars[target_bar].size &&
>>> +         target_bar > 0 && vdev->bars[target_bar - 1].mem64)) {
>>> +        return; /* Go BOOM?  Plumb Error */
>>> +    }  
>>
>>
>> This "if" only seems to make sense for the non-auto branch...
> 
> Most of it, yes, but it's still possible for a device to exist where
> the auto loop would come up empty.  Imagine if each BAR was
> sufficiently large that we couldn't extend it and still give the MSI-X
> MMIO areas a 32-bit offset within the BAR.  Exceptionally unlikely, it
> doesn't hurt to test all the corner cases.  I also missed the case of
> testing that the BAR isn't too large already here.

Fair enough.


>  
>>> +
>>> +    /*
>>> +     * If adding a new BAR, test if we can make it 64bit.  We make it
>>> +     * prefetchable since QEMU MSI-X emulation has no read side effects
>>> +     * and doing so makes mapping more flexible.
>>> +     */
>>> +    if (!vdev->bars[target_bar].size) {
>>> +        if (target_bar < (PCI_ROM_SLOT - 1) &&
>>> +            !vdev->bars[target_bar + 1].size) {
>>> +            vdev->bars[target_bar].mem64 = true;
>>> +            vdev->bars[target_bar].type = PCI_BASE_ADDRESS_MEM_TYPE_64;
>>> +        }
>>> +        vdev->bars[target_bar].type |= PCI_BASE_ADDRESS_MEM_PREFETCH;
>>> +        vdev->bars[target_bar].size = msix_sz;
>>> +        vdev->msix->table_offset = 0;
>>> +    } else {
>>> +        vdev->bars[target_bar].size = MAX(vdev->bars[target_bar].size * 2,
>>> +                                          msix_sz * 2);
>>> +        /*
>>> +         * Due to above size calc, MSI-X always starts halfway into the BAR,
>>> +         * which will always be a separate host page.
>>> +         */
>>> +        vdev->msix->table_offset = vdev->bars[target_bar].size / 2;
>>> +    }
>>> +
>>> +    vdev->msix->table_bar = target_bar;
>>> +    vdev->msix->pba_bar = target_bar;  
>>
>>
>> Ah, here is how I got confused that commenting vfio_pci_fixup_msix_region() out
>> was not necessary at the time but I missed that it is called before
>> vfio_pci_relocate_msix(), when simply swapped - BARs get mapped. Ok, thanks,
> 
> For a kernel that allows mapping the MSI-X region, yes, but if you ran
> that on an older kernel I think QEMU would break when it can't mmap the
> entire region.  We can't only support new kernels.  Thanks,


Sure, I am not suggesting changing this.



-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-19  6:02       ` Alexey Kardashevskiy
@ 2017-12-19  6:56         ` Alex Williamson
  2017-12-19  8:28           ` Alexey Kardashevskiy
  0 siblings, 1 reply; 20+ messages in thread
From: Alex Williamson @ 2017-12-19  6:56 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: qemu-devel, eric.auger

On Tue, 19 Dec 2017 17:02:59 +1100
Alexey Kardashevskiy <aik@ozlabs.ru> wrote:

> On 19/12/17 14:40, Alex Williamson wrote:
> > On Tue, 19 Dec 2017 14:07:13 +1100
> > Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> >   
> >> On 18/12/17 16:02, Alex Williamson wrote:  
> >>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
> >>> table area can be mmap'd from userspace, allowing direct access to
> >>> non-MSI-X registers within the host page size of this area.  However,
> >>> we only get that direct access if QEMU isn't also emulating MSI-X
> >>> within that same page.  For x86/64 host, the system page size is 4K
> >>> and the PCI spec recommends a minimum of 4K to 8K alignment to
> >>> separate MSI-X from non-MSI-X registers, therefore only devices which
> >>> don't honor this recommendation would see any improvement from this
> >>> option.  The real targets for this feature are hosts where the page
> >>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
> >>> with 64K pages.
> >>>
> >>> This new x-msix-relocation option accepts the following options:
> >>>
> >>>   off: Disable MSI-X relocation, use native device config (default)
> >>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
> >>>        based on minimum additional MMIO requirement
> >>>   bar0..bar5: Specify the target BAR, which will either be extended
> >>>        if the BAR exists or added if the BAR slot is available.
> >>>
> >>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> >>> ---
> >>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >>>  hw/vfio/pci.h        |    1 
> >>>  hw/vfio/trace-events |    2 +
> >>>  3 files changed, 104 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
> >>> index c383b842da20..b4426abf297a 100644
> >>> --- a/hw/vfio/pci.c
> >>> +++ b/hw/vfio/pci.c
> >>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
> >>>      }
> >>>  }
> >>>  
> >>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
> >>> +{
> >>> +    int target_bar = -1;
> >>> +    size_t msix_sz;
> >>> +
> >>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    /* The actual minimum size of MSI-X structures */
> >>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
> >>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
> >>> +    /* Round up to host pages, we don't want to share a page */
> >>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
> >>> +    /* PCI BARs must be a power of 2 */
> >>> +    msix_sz = pow2ceil(msix_sz);
> >>> +
> >>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
> >>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
> >>> +        int i;
> >>> +        size_t best = UINT64_MAX;
> >>> +
> >>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {    
> >>
> >>
> >> I belieive that going from the other end is safer approach for "auto",
> >> especially after discovering how mpt3sas works. Or you could add
> >> "autoreverse" switch...  
> > 
> > Or is extending the smallest BAR really a safer option?  I wonder how
> > many drivers go through and fill fixed sized arrays with BAR info,
> > expecting only the device implemented number of BARs.  Maybe they
> > wouldn't notice if the BAR was simply bigger than expected.  On the
> > other hand there are probably drivers dumb enough to index registers
> > from the end for the BAR as well.  I don't think there exists an
> > auto algorithm that will fit every device, but a higher hit rate than
> > we have so far would be nice.  
> 
> Everything is possible :(
> 
> I do not know if there are many users for this relocation though. So far
> only one device has the problem (in 5 years or so) and it is fixed by
> moving msix to bar5, I'd suggest start with this for now.

Interesting, I would have thought it to be more common.

> In general, I think we still need a way to simply disable that msix_table
> region anyway if we find a device driver which uses all BARs, does not
> tolerate changes to the default set of BARs, etc.

Only SPAPR can do that.  In fact, I'm somewhat surprised by your
interest in this series as I positioned it as a way for other
platforms, which require interaction with MSI-X MMIO space for
programming interrupts.
 
> >  We could also implement MemoryRegionOps
> > for the base BAR with some error reporting if it gets called.  That
> > might make the problem more obvious than unassigned_mem_ops silently
> > eating those accesses.  
> 
> Makes sense.
> 
> 
> >   
> >>> +            size_t size;
> >>> +
> >>> +            if (vdev->bars[i].ioport) {
> >>> +                continue;
> >>> +            }
> >>> +
> >>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
> >>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))  
> > 
> > Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
> > 
> > NB, the existing test here needs a bit of work too, 32bit BARs max out
> > at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
> > BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
> > 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
> > in real trouble if were extending BARs where this is an issue though.   
> 
> until you get a driver like this :)
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782

Right, a diametric opposite of the SAS driver, verifying all the
attributes it can of specific BARs rather than assuming the first BAR
it finds must be the one to use.  Is it even worthwhile to try to have
any automatic selection?  I suppose this driver is another point
towards a reverse search rather than extended BAR.  Thanks,

Alex

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO
  2017-12-19  6:56         ` Alex Williamson
@ 2017-12-19  8:28           ` Alexey Kardashevskiy
  0 siblings, 0 replies; 20+ messages in thread
From: Alexey Kardashevskiy @ 2017-12-19  8:28 UTC (permalink / raw)
  To: Alex Williamson; +Cc: qemu-devel, eric.auger

On 19/12/17 17:56, Alex Williamson wrote:
> On Tue, 19 Dec 2017 17:02:59 +1100
> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
> 
>> On 19/12/17 14:40, Alex Williamson wrote:
>>> On Tue, 19 Dec 2017 14:07:13 +1100
>>> Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>>   
>>>> On 18/12/17 16:02, Alex Williamson wrote:  
>>>>> With recently proposed kernel side vfio-pci changes, the MSI-X vector
>>>>> table area can be mmap'd from userspace, allowing direct access to
>>>>> non-MSI-X registers within the host page size of this area.  However,
>>>>> we only get that direct access if QEMU isn't also emulating MSI-X
>>>>> within that same page.  For x86/64 host, the system page size is 4K
>>>>> and the PCI spec recommends a minimum of 4K to 8K alignment to
>>>>> separate MSI-X from non-MSI-X registers, therefore only devices which
>>>>> don't honor this recommendation would see any improvement from this
>>>>> option.  The real targets for this feature are hosts where the page
>>>>> size exceeds the PCI spec recommended alignment, such as ARM64 systems
>>>>> with 64K pages.
>>>>>
>>>>> This new x-msix-relocation option accepts the following options:
>>>>>
>>>>>   off: Disable MSI-X relocation, use native device config (default)
>>>>>   auto: Automaically relocate MSI-X MMIO to another BAR or offset
>>>>>        based on minimum additional MMIO requirement
>>>>>   bar0..bar5: Specify the target BAR, which will either be extended
>>>>>        if the BAR exists or added if the BAR slot is available.
>>>>>
>>>>> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
>>>>> ---
>>>>>  hw/vfio/pci.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>  hw/vfio/pci.h        |    1 
>>>>>  hw/vfio/trace-events |    2 +
>>>>>  3 files changed, 104 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
>>>>> index c383b842da20..b4426abf297a 100644
>>>>> --- a/hw/vfio/pci.c
>>>>> +++ b/hw/vfio/pci.c
>>>>> @@ -1352,6 +1352,101 @@ static void vfio_pci_fixup_msix_region(VFIOPCIDevice *vdev)
>>>>>      }
>>>>>  }
>>>>>  
>>>>> +static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev)
>>>>> +{
>>>>> +    int target_bar = -1;
>>>>> +    size_t msix_sz;
>>>>> +
>>>>> +    if (!vdev->msix || vdev->msix_relo == OFF_AUTOPCIBAR_OFF) {
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    /* The actual minimum size of MSI-X structures */
>>>>> +    msix_sz = (vdev->msix->entries * PCI_MSIX_ENTRY_SIZE) +
>>>>> +              (QEMU_ALIGN_UP(vdev->msix->entries, 64) / 8);
>>>>> +    /* Round up to host pages, we don't want to share a page */
>>>>> +    msix_sz = REAL_HOST_PAGE_ALIGN(msix_sz);
>>>>> +    /* PCI BARs must be a power of 2 */
>>>>> +    msix_sz = pow2ceil(msix_sz);
>>>>> +
>>>>> +    /* Auto: pick the BAR that incurs the least additional MMIO space */
>>>>> +    if (vdev->msix_relo == OFF_AUTOPCIBAR_AUTO) {
>>>>> +        int i;
>>>>> +        size_t best = UINT64_MAX;
>>>>> +
>>>>> +        for (i = 0; i < PCI_ROM_SLOT; i++) {    
>>>>
>>>>
>>>> I belieive that going from the other end is safer approach for "auto",
>>>> especially after discovering how mpt3sas works. Or you could add
>>>> "autoreverse" switch...  
>>>
>>> Or is extending the smallest BAR really a safer option?  I wonder how
>>> many drivers go through and fill fixed sized arrays with BAR info,
>>> expecting only the device implemented number of BARs.  Maybe they
>>> wouldn't notice if the BAR was simply bigger than expected.  On the
>>> other hand there are probably drivers dumb enough to index registers
>>> from the end for the BAR as well.  I don't think there exists an
>>> auto algorithm that will fit every device, but a higher hit rate than
>>> we have so far would be nice.  
>>
>> Everything is possible :(
>>
>> I do not know if there are many users for this relocation though. So far
>> only one device has the problem (in 5 years or so) and it is fixed by
>> moving msix to bar5, I'd suggest start with this for now.
> 
> Interesting, I would have thought it to be more common.

Just to clarify - one device with performance issue because of msix
emulation, non-64k-aligned msix data is not that unusual.


> 
>> In general, I think we still need a way to simply disable that msix_table
>> region anyway if we find a device driver which uses all BARs, does not
>> tolerate changes to the default set of BARs, etc.
> 
> Only SPAPR can do that.  In fact, I'm somewhat surprised by your
> interest in this series as I positioned it as a way for other
> platforms, which require interaction with MSI-X MMIO space for
> programming interrupts.

Well, it moves the guest-visible msix section away from the BAR causing
performance issues so I figured it might work for SPAPR eventually :)


>>>  We could also implement MemoryRegionOps
>>> for the base BAR with some error reporting if it gets called.  That
>>> might make the problem more obvious than unassigned_mem_ops silently
>>> eating those accesses.  
>>
>> Makes sense.
>>
>>
>>>   
>>>>> +            size_t size;
>>>>> +
>>>>> +            if (vdev->bars[i].ioport) {
>>>>> +                continue;
>>>>> +            }
>>>>> +
>>>>> +            /* MSI-X MMIO must reside within first 32bit offset of BAR */
>>>>> +            if (vdev->bars[i].size > (UINT32_MAX / 2))  
>>>
>>> Adding a '|| !vdev->bars[i].size' here would make auto only extend BARs.
>>>
>>> NB, the existing test here needs a bit of work too, 32bit BARs max out
>>> at 2G not 4G, so maybe we need separate tests here. >1G for 32bit
>>> BARs, >2G for 64bit BARs.  Hmm, do we have the option of promoting
>>> 32bit BARs to 64bit? It's all virtual addresses anyway, right.  We're
>>> in real trouble if were extending BARs where this is an issue though.   
>>
>> until you get a driver like this :)
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers/rapidio/devices/tsi721.c?h=v4.15-rc4#n2782
> 
> Right, a diametric opposite of the SAS driver, verifying all the
> attributes it can of specific BARs rather than assuming the first BAR
> it finds must be the one to use.  Is it even worthwhile to try to have
> any automatic selection?  I suppose this driver is another point
> towards a reverse search rather than extended BAR.  Thanks,

Well, guessing like this may fail occasionally and simply allowing MSIX
mapping won't fail on SPAPR, I do not really know if it is going to be very
useful anywhere else than just SPAPR.

And I guess if we go the automatic selection path, than extending a BAR
does not have much benefit over using the last BAR because it seems quite
unlikely that a device 1) does not have any BARs unused and 2) none of BARs
is MSIX-only but if this is a case, I am not sure what guess would be safer.

I looked nearby, for example:

001e:80:00.2 Ethernet controller: Broadcom Corporation NetXtreme BCM5719
Gigabit Ethernet PCIe (rev 01)
Region 0: Memory at 3fc2c0250000 (64-bit, prefetchable) [size=64K]
Region 2: Memory at 3fc2c0240000 (64-bit, prefetchable) [size=64K]
Region 4: Memory at 3fc2c0230000 (64-bit, prefetchable) [size=64K]
Capabilities: [a0] MSI-X: Enable- Count=17 Masked-
        Vector table: BAR=4 offset=00000000
        PBA: BAR=4 offset=00000120

It is fully packed and it *seems* that BAR4 is MSIX only but who knows why
it is 64K - can be anything...


This one looks more convincing but still no guarantee:

0001:09:00.0 USB controller: Texas Instruments TUSB73x0 SuperSpeed USB 3.0
xHCI Host Controller (rev 02)
Region 0: Memory at 3fe080800000 (64-bit, non-prefetchable) [size=64K]
Region 2: Memory at 3fe080810000 (64-bit, non-prefetchable) [size=8K]
Capabilities: [c0] MSI-X: Enable+ Count=8 Masked-
        Vector table: BAR=2 offset=00000000
        PBA: BAR=2 offset=00001000



A funny thing - my thinkpad x1 does not have a single msix-capable device,
many are MSI and "Express (v2) Endpoint, MSI 00". Hmmm. Xeon and POWER8
boxes do have MSIX.


-- 
Alexey

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2017-12-19  8:28 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-18  5:02 [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation Alex Williamson
2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 1/5] vfio/pci: Fixup VFIOMSIXInfo comment Alex Williamson
2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 2/5] vfio/pci: Add base BAR MemoryRegion Alex Williamson
2017-12-19  3:44   ` Alexey Kardashevskiy
2017-12-19  3:56     ` Alex Williamson
2017-12-19  4:15       ` Alexey Kardashevskiy
2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 3/5] vfio/pci: Emulate BARs Alex Williamson
2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 4/5] qapi: Create DEFINE_PROP_OFF_AUTO_PCIBAR Alex Williamson
2017-12-18  5:02 ` [Qemu-devel] [RFC/RFT PATCH 5/5] vfio/pci: Allow relocating MSI-X MMIO Alex Williamson
2017-12-18  9:04   ` Alexey Kardashevskiy
2017-12-18 13:28     ` Alex Williamson
2017-12-18 13:55       ` Alexey Kardashevskiy
2017-12-18 14:28         ` Alex Williamson
2017-12-19  1:22           ` Alexey Kardashevskiy
2017-12-19  3:07   ` Alexey Kardashevskiy
2017-12-19  3:40     ` Alex Williamson
2017-12-19  6:02       ` Alexey Kardashevskiy
2017-12-19  6:56         ` Alex Williamson
2017-12-19  8:28           ` Alexey Kardashevskiy
2017-12-18  5:34 ` [Qemu-devel] [RFC/RFT PATCH 0/5] vfio/pci: MSI-X MMIO relocation no-reply

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.