All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31
@ 2016-10-31 17:37 Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 1/5] memory: Replace skip_dump flag with "ram_device" Alex Williamson
                   ` (5 more replies)
  0 siblings, 6 replies; 7+ messages in thread
From: Alex Williamson @ 2016-10-31 17:37 UTC (permalink / raw)
  To: qemu-devel

The following changes since commit 0bb1137930f51a89fb1bfeb0c46aa68af0395167:

  Merge remote-tracking branch 'remotes/cohuck/tags/s390x-20161031' into staging (2016-10-31 14:48:47 +0000)

are available in the git repository at:


  git://github.com/awilliam/qemu-vfio.git tags/vfio-updates-20161031.0

for you to fetch changes up to 95251725e335af2b885e2ab33dd29c86f8084663:

  vfio: Add support for mmapping sub-page MMIO BARs (2016-10-31 09:53:04 -0600)

----------------------------------------------------------------
VFIO updates 2016-10-31

 - Replace skip_dump with ram_device to denote device memory and mark
   as non-direct to avoid memcpy to MMIO - fixes RTL (Alex Williamson)
 - Skip zero-length sparse mmaps - avoids unnecessary warning
   (Alex Williamson)
 - Clear BARs on reset so guest doesn't assume programming on return
   from S3 (Ido Yariv)
 - Enable sub-page MMIO mmaps - performance improvement for devices
   with smaller BARs, iff both host and guest map them to full,
   aligned pages (Yongji Xie)

----------------------------------------------------------------
Alex Williamson (3):
      memory: Replace skip_dump flag with "ram_device"
      memory: Don't use memcpy for ram_device regions
      vfio: Handle zero-length sparse mmap ranges

Ido Yariv (1):
      vfio/pci: fix out-of-sync BAR information on reset

Yongji Xie (1):
      vfio: Add support for mmapping sub-page MMIO BARs

 hw/vfio/common.c      | 48 +++++++++++++++++--------------
 hw/vfio/pci.c         | 79 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/vfio/spapr.c       |  2 +-
 include/exec/memory.h | 47 ++++++++++++++++++++----------
 memory.c              | 80 ++++++++++++++++++++++++++++++++++++++++++++++++---
 memory_mapping.c      |  2 +-
 trace-events          |  2 ++
 7 files changed, 218 insertions(+), 42 deletions(-)

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PULL 1/5] memory: Replace skip_dump flag with "ram_device"
  2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
@ 2016-10-31 17:37 ` Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 2/5] memory: Don't use memcpy for ram_device regions Alex Williamson
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2016-10-31 17:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini

Setting skip_dump on a MemoryRegion allows us to modify one specific
code path, but the restriction we're trying to address encompasses
more than that.  If we have a RAM MemoryRegion backed by a physical
device, it not only restricts our ability to dump that region, but
also affects how we should manipulate it.  Here we recognize that
MemoryRegions do not change to sometimes allow dumps and other times
not, so we replace setting the skip_dump flag with a new initializer
so that we know exactly the type of region to which we're applying
this behavior.

Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 hw/vfio/common.c      |    9 ++++-----
 hw/vfio/spapr.c       |    2 +-
 include/exec/memory.h |   41 ++++++++++++++++++++++++++++-------------
 memory.c              |   13 +++++++++----
 memory_mapping.c      |    2 +-
 5 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 9505fb3..c764cb3 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -724,12 +724,11 @@ int vfio_region_mmap(VFIORegion *region)
 
         name = g_strdup_printf("%s mmaps[%d]",
                                memory_region_name(region->mem), i);
-        memory_region_init_ram_ptr(&region->mmaps[i].mem,
-                                   memory_region_owner(region->mem),
-                                   name, region->mmaps[i].size,
-                                   region->mmaps[i].mmap);
+        memory_region_init_ram_device_ptr(&region->mmaps[i].mem,
+                                          memory_region_owner(region->mem),
+                                          name, region->mmaps[i].size,
+                                          region->mmaps[i].mmap);
         g_free(name);
-        memory_region_set_skip_dump(&region->mmaps[i].mem);
         memory_region_add_subregion(region->mem, region->mmaps[i].offset,
                                     &region->mmaps[i].mem);
 
diff --git a/hw/vfio/spapr.c b/hw/vfio/spapr.c
index 7443d34..4409bcc 100644
--- a/hw/vfio/spapr.c
+++ b/hw/vfio/spapr.c
@@ -25,7 +25,7 @@ static bool vfio_prereg_listener_skipped_section(MemoryRegionSection *section)
     }
 
     return !memory_region_is_ram(section->mr) ||
-            memory_region_is_skip_dump(section->mr);
+            memory_region_is_ram_device(section->mr);
 }
 
 static void *vfio_prereg_gpa_to_vaddr(MemoryRegionSection *section, hwaddr gpa)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 79ccaab..a75b8c3 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -209,7 +209,7 @@ struct MemoryRegion {
     void (*destructor)(MemoryRegion *mr);
     uint64_t align;
     bool terminates;
-    bool skip_dump;
+    bool ram_device;
     bool enabled;
     bool warning_printed; /* For reservations */
     uint8_t vga_logging_count;
@@ -449,6 +449,30 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
                                 void *ptr);
 
 /**
+ * memory_region_init_ram_device_ptr:  Initialize RAM device memory region from
+ *                                     a user-provided pointer.
+ *
+ * A RAM device represents a mapping to a physical device, such as to a PCI
+ * MMIO BAR of an vfio-pci assigned device.  The memory region may be mapped
+ * into the VM address space and access to the region will modify memory
+ * directly.  However, the memory region should not be included in a memory
+ * dump (device may not be enabled/mapped at the time of the dump), and
+ * operations incompatible with manipulating MMIO should be avoided.  Replaces
+ * skip_dump flag.
+ *
+ * @mr: the #MemoryRegion to be initialized.
+ * @owner: the object that tracks the region's reference count
+ * @name: the name of the region.
+ * @size: size of the region.
+ * @ptr: memory to be mapped; must contain at least @size bytes.
+ */
+void memory_region_init_ram_device_ptr(MemoryRegion *mr,
+                                       struct Object *owner,
+                                       const char *name,
+                                       uint64_t size,
+                                       void *ptr);
+
+/**
  * memory_region_init_alias: Initialize a memory region that aliases all or a
  *                           part of another memory region.
  *
@@ -574,22 +598,13 @@ static inline bool memory_region_is_ram(MemoryRegion *mr)
 }
 
 /**
- * memory_region_is_skip_dump: check whether a memory region should not be
- *                             dumped
- *
- * Returns %true is a memory region should not be dumped(e.g. VFIO BAR MMAP).
+ * memory_region_is_ram_device: check whether a memory region is a ram device
  *
- * @mr: the memory region being queried
- */
-bool memory_region_is_skip_dump(MemoryRegion *mr);
-
-/**
- * memory_region_set_skip_dump: Set skip_dump flag, dump will ignore this memory
- *                              region
+ * Returns %true is a memory region is a device backed ram region
  *
  * @mr: the memory region being queried
  */
-void memory_region_set_skip_dump(MemoryRegion *mr);
+bool memory_region_is_ram_device(MemoryRegion *mr);
 
 /**
  * memory_region_is_romd: check whether a memory region is in ROMD mode
diff --git a/memory.c b/memory.c
index edbc701..7ffcff1 100644
--- a/memory.c
+++ b/memory.c
@@ -1355,9 +1355,14 @@ void memory_region_init_ram_ptr(MemoryRegion *mr,
     mr->ram_block = qemu_ram_alloc_from_ptr(size, ptr, mr, &error_fatal);
 }
 
-void memory_region_set_skip_dump(MemoryRegion *mr)
+void memory_region_init_ram_device_ptr(MemoryRegion *mr,
+                                       Object *owner,
+                                       const char *name,
+                                       uint64_t size,
+                                       void *ptr)
 {
-    mr->skip_dump = true;
+    memory_region_init_ram_ptr(mr, owner, name, size, ptr);
+    mr->ram_device = true;
 }
 
 void memory_region_init_alias(MemoryRegion *mr,
@@ -1491,9 +1496,9 @@ const char *memory_region_name(const MemoryRegion *mr)
     return mr->name;
 }
 
-bool memory_region_is_skip_dump(MemoryRegion *mr)
+bool memory_region_is_ram_device(MemoryRegion *mr)
 {
-    return mr->skip_dump;
+    return mr->ram_device;
 }
 
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
diff --git a/memory_mapping.c b/memory_mapping.c
index e3e0d95..6a39d71 100644
--- a/memory_mapping.c
+++ b/memory_mapping.c
@@ -206,7 +206,7 @@ static void guest_phys_blocks_region_add(MemoryListener *listener,
 
     /* we only care about RAM */
     if (!memory_region_is_ram(section->mr) ||
-        memory_region_is_skip_dump(section->mr)) {
+        memory_region_is_ram_device(section->mr)) {
         return;
     }
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PULL 2/5] memory: Don't use memcpy for ram_device regions
  2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 1/5] memory: Replace skip_dump flag with "ram_device" Alex Williamson
@ 2016-10-31 17:37 ` Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 3/5] vfio: Handle zero-length sparse mmap ranges Alex Williamson
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2016-10-31 17:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Paolo Bonzini, Thorsten Kohfeldt

With a vfio assigned device we lay down a base MemoryRegion registered
as an IO region, giving us read & write accessors.  If the region
supports mmap, we lay down a higher priority sub-region MemoryRegion
on top of the base layer initialized as a RAM device pointer to the
mmap.  Finally, if we have any quirks for the device (ie. address
ranges that need additional virtualization support), we put another IO
sub-region on top of the mmap MemoryRegion.  When this is flattened,
we now potentially have sub-page mmap MemoryRegions exposed which
cannot be directly mapped through KVM.

This is as expected, but a subtle detail of this is that we end up
with two different access mechanisms through QEMU.  If we disable the
mmap MemoryRegion, we make use of the IO MemoryRegion and service
accesses using pread and pwrite to the vfio device file descriptor.
If the mmap MemoryRegion is enabled and results in one of these
sub-page gaps, QEMU handles the access as RAM, using memcpy to the
mmap.  Using either pread/pwrite or the mmap directly should be
correct, but using memcpy causes us problems.  I expect that not only
does memcpy not necessarily honor the original width and alignment in
performing a copy, but it potentially also uses processor instructions
not intended for MMIO spaces.  It turns out that this has been a
problem for Realtek NIC assignment, which has such a quirk that
creates a sub-page mmap MemoryRegion access.

To resolve this, we disable memory_access_is_direct() for ram_device
regions since QEMU assumes that it can use memcpy for those regions.
Instead we access through MemoryRegionOps, which replaces the memcpy
with simple de-references of standard sizes to the host memory.

With this patch we attempt to provide unrestricted access to the RAM
device, allowing byte through qword access as well as unaligned
access.  The assumption here is that accesses initiated by the VM are
driven by a device specific driver, which knows the device
capabilities.  If unaligned accesses are not supported by the device,
we don't want them to work in a VM by performing multiple aligned
accesses to compose the unaligned access.  A down-side of this
philosophy is that the xp command from the monitor attempts to use
the largest available access weidth, unaware of the underlying
device.  Using memcpy had this same restriction, but at least now an
operator can dump individual registers, even if blocks of device
memory may result in access widths beyond the capabilities of a
given device (RTL NICs only support up to dword).

Reported-by: Thorsten Kohfeldt <thorsten.kohfeldt@gmx.de>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Acked-by: Paolo Bonzini <pbonzini@redhat.com>
---
 include/exec/memory.h |    6 +++-
 memory.c              |   67 +++++++++++++++++++++++++++++++++++++++++++++++++
 trace-events          |    2 +
 3 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index a75b8c3..9728a2f 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1480,9 +1480,11 @@ void *qemu_map_ram_ptr(RAMBlock *ram_block, ram_addr_t addr);
 static inline bool memory_access_is_direct(MemoryRegion *mr, bool is_write)
 {
     if (is_write) {
-        return memory_region_is_ram(mr) && !mr->readonly;
+        return memory_region_is_ram(mr) &&
+               !mr->readonly && !memory_region_is_ram_device(mr);
     } else {
-        return memory_region_is_ram(mr) || memory_region_is_romd(mr);
+        return (memory_region_is_ram(mr) && !memory_region_is_ram_device(mr)) ||
+               memory_region_is_romd(mr);
     }
 }
 
diff --git a/memory.c b/memory.c
index 7ffcff1..33110e9 100644
--- a/memory.c
+++ b/memory.c
@@ -1128,6 +1128,71 @@ const MemoryRegionOps unassigned_mem_ops = {
     .endianness = DEVICE_NATIVE_ENDIAN,
 };
 
+static uint64_t memory_region_ram_device_read(void *opaque,
+                                              hwaddr addr, unsigned size)
+{
+    MemoryRegion *mr = opaque;
+    uint64_t data = (uint64_t)~0;
+
+    switch (size) {
+    case 1:
+        data = *(uint8_t *)(mr->ram_block->host + addr);
+        break;
+    case 2:
+        data = *(uint16_t *)(mr->ram_block->host + addr);
+        break;
+    case 4:
+        data = *(uint32_t *)(mr->ram_block->host + addr);
+        break;
+    case 8:
+        data = *(uint64_t *)(mr->ram_block->host + addr);
+        break;
+    }
+
+    trace_memory_region_ram_device_read(get_cpu_index(), mr, addr, data, size);
+
+    return data;
+}
+
+static void memory_region_ram_device_write(void *opaque, hwaddr addr,
+                                           uint64_t data, unsigned size)
+{
+    MemoryRegion *mr = opaque;
+
+    trace_memory_region_ram_device_write(get_cpu_index(), mr, addr, data, size);
+
+    switch (size) {
+    case 1:
+        *(uint8_t *)(mr->ram_block->host + addr) = (uint8_t)data;
+        break;
+    case 2:
+        *(uint16_t *)(mr->ram_block->host + addr) = (uint16_t)data;
+        break;
+    case 4:
+        *(uint32_t *)(mr->ram_block->host + addr) = (uint32_t)data;
+        break;
+    case 8:
+        *(uint64_t *)(mr->ram_block->host + addr) = data;
+        break;
+    }
+}
+
+static const MemoryRegionOps ram_device_mem_ops = {
+    .read = memory_region_ram_device_read,
+    .write = memory_region_ram_device_write,
+    .endianness = DEVICE_NATIVE_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+        .unaligned = true,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+        .unaligned = true,
+    },
+};
+
 bool memory_region_access_valid(MemoryRegion *mr,
                                 hwaddr addr,
                                 unsigned size,
@@ -1363,6 +1428,8 @@ void memory_region_init_ram_device_ptr(MemoryRegion *mr,
 {
     memory_region_init_ram_ptr(mr, owner, name, size, ptr);
     mr->ram_device = true;
+    mr->ops = &ram_device_mem_ops;
+    mr->opaque = mr;
 }
 
 void memory_region_init_alias(MemoryRegion *mr,
diff --git a/trace-events b/trace-events
index 8ecded5..f74e1d3 100644
--- a/trace-events
+++ b/trace-events
@@ -121,6 +121,8 @@ memory_region_subpage_read(int cpu_index, void *mr, uint64_t offset, uint64_t va
 memory_region_subpage_write(int cpu_index, void *mr, uint64_t offset, uint64_t value, unsigned size) "cpu %d mr %p offset %#"PRIx64" value %#"PRIx64" size %u"
 memory_region_tb_read(int cpu_index, uint64_t addr, uint64_t value, unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
 memory_region_tb_write(int cpu_index, uint64_t addr, uint64_t value, unsigned size) "cpu %d addr %#"PRIx64" value %#"PRIx64" size %u"
+memory_region_ram_device_read(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" size %u"
+memory_region_ram_device_write(int cpu_index, void *mr, uint64_t addr, uint64_t value, unsigned size) "cpu %d mr %p addr %#"PRIx64" value %#"PRIx64" size %u"
 
 ### Guest events, keep at bottom
 

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PULL 3/5] vfio: Handle zero-length sparse mmap ranges
  2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 1/5] memory: Replace skip_dump flag with "ram_device" Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 2/5] memory: Don't use memcpy for ram_device regions Alex Williamson
@ 2016-10-31 17:37 ` Alex Williamson
  2016-10-31 17:37 ` [Qemu-devel] [PULL 4/5] vfio/pci: fix out-of-sync BAR information on reset Alex Williamson
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2016-10-31 17:37 UTC (permalink / raw)
  To: qemu-devel

As reported in the link below, user has a PCI device with a 4KB BAR
which contains the MSI-X table.  This seems to hit a corner case in
the kernel where the region reports being mmap capable, but the sparse
mmap information reports a zero sized range.  It's not entirely clear
that the kernel is incorrect in doing this, but regardless, we need
to handle it.  To do this, fill our mmap array only with non-zero
sized sparse mmap entries and add an error return from the function
so we can tell the difference between nr_mmaps being zero based on
sparse mmap info vs lack of sparse mmap info.

NB, this doesn't actually change the behavior of the device, it only
removes the scary "Failed to mmap ... Performance may be slow" error
message.  We cannot currently create an mmap over the MSI-X table.

Link: http://lists.nongnu.org/archive/html/qemu-discuss/2016-10/msg00009.html
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/common.c |   36 ++++++++++++++++++++++--------------
 1 file changed, 22 insertions(+), 14 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c764cb3..f528309 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -610,16 +610,16 @@ vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id)
     return NULL;
 }
 
-static void vfio_setup_region_sparse_mmaps(VFIORegion *region,
-                                           struct vfio_region_info *info)
+static int vfio_setup_region_sparse_mmaps(VFIORegion *region,
+                                          struct vfio_region_info *info)
 {
     struct vfio_info_cap_header *hdr;
     struct vfio_region_info_cap_sparse_mmap *sparse;
-    int i;
+    int i, j;
 
     hdr = vfio_get_region_info_cap(info, VFIO_REGION_INFO_CAP_SPARSE_MMAP);
     if (!hdr) {
-        return;
+        return -ENODEV;
     }
 
     sparse = container_of(hdr, struct vfio_region_info_cap_sparse_mmap, header);
@@ -627,16 +627,24 @@ static void vfio_setup_region_sparse_mmaps(VFIORegion *region,
     trace_vfio_region_sparse_mmap_header(region->vbasedev->name,
                                          region->nr, sparse->nr_areas);
 
-    region->nr_mmaps = sparse->nr_areas;
-    region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
+    region->mmaps = g_new0(VFIOMmap, sparse->nr_areas);
 
-    for (i = 0; i < region->nr_mmaps; i++) {
-        region->mmaps[i].offset = sparse->areas[i].offset;
-        region->mmaps[i].size = sparse->areas[i].size;
-        trace_vfio_region_sparse_mmap_entry(i, region->mmaps[i].offset,
-                                            region->mmaps[i].offset +
-                                            region->mmaps[i].size);
+    for (i = 0, j = 0; i < sparse->nr_areas; i++) {
+        trace_vfio_region_sparse_mmap_entry(i, sparse->areas[i].offset,
+                                            sparse->areas[i].offset +
+                                            sparse->areas[i].size);
+
+        if (sparse->areas[i].size) {
+            region->mmaps[j].offset = sparse->areas[i].offset;
+            region->mmaps[j].size = sparse->areas[i].size;
+            j++;
+        }
     }
+
+    region->nr_mmaps = j;
+    region->mmaps = g_realloc(region->mmaps, j * sizeof(VFIOMmap));
+
+    return 0;
 }
 
 int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
@@ -665,9 +673,9 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
             region->flags & VFIO_REGION_INFO_FLAG_MMAP &&
             !(region->size & ~qemu_real_host_page_mask)) {
 
-            vfio_setup_region_sparse_mmaps(region, info);
+            ret = vfio_setup_region_sparse_mmaps(region, info);
 
-            if (!region->nr_mmaps) {
+            if (ret) {
                 region->nr_mmaps = 1;
                 region->mmaps = g_new0(VFIOMmap, region->nr_mmaps);
                 region->mmaps[0].offset = 0;

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PULL 4/5] vfio/pci: fix out-of-sync BAR information on reset
  2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
                   ` (2 preceding siblings ...)
  2016-10-31 17:37 ` [Qemu-devel] [PULL 3/5] vfio: Handle zero-length sparse mmap ranges Alex Williamson
@ 2016-10-31 17:37 ` Alex Williamson
  2016-10-31 17:38 ` [Qemu-devel] [PULL 5/5] vfio: Add support for mmapping sub-page MMIO BARs Alex Williamson
  2016-10-31 19:05 ` [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Peter Maydell
  5 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2016-10-31 17:37 UTC (permalink / raw)
  To: qemu-devel; +Cc: Ido Yariv

From: Ido Yariv <ido@wizery.com>

When a PCI device is reset, pci_do_device_reset resets all BAR addresses
in the relevant PCIDevice's config buffer.

The VFIO configuration space stays untouched, so the guest OS may choose
to skip restoring the BAR addresses as they would seem intact. The PCI
device may be left non-operational.
One example of such a scenario is when the guest exits S3.

Fix this by resetting the BAR addresses in the VFIO configuration space
as well.

Signed-off-by: Ido Yariv <ido@wizery.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/pci.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 65d30fd..b399742 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1922,11 +1922,23 @@ static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
 static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
 {
     Error *err = NULL;
+    int nr;
 
     vfio_intx_enable(vdev, &err);
     if (err) {
         error_reportf_err(err, ERR_PREFIX, vdev->vbasedev.name);
     }
+
+    for (nr = 0; nr < PCI_NUM_REGIONS - 1; ++nr) {
+        off_t addr = vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr);
+        uint32_t val = 0;
+        uint32_t len = sizeof(val);
+
+        if (pwrite(vdev->vbasedev.fd, &val, len, addr) != len) {
+            error_report("%s(%s) reset bar %d failed: %m", __func__,
+                         vdev->vbasedev.name, nr);
+        }
+    }
 }
 
 static bool vfio_pci_host_match(PCIHostDeviceAddress *addr, const char *name)

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* [Qemu-devel] [PULL 5/5] vfio: Add support for mmapping sub-page MMIO BARs
  2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
                   ` (3 preceding siblings ...)
  2016-10-31 17:37 ` [Qemu-devel] [PULL 4/5] vfio/pci: fix out-of-sync BAR information on reset Alex Williamson
@ 2016-10-31 17:38 ` Alex Williamson
  2016-10-31 19:05 ` [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Peter Maydell
  5 siblings, 0 replies; 7+ messages in thread
From: Alex Williamson @ 2016-10-31 17:38 UTC (permalink / raw)
  To: qemu-devel; +Cc: Yongji Xie

From: Yongji Xie <xyjxie@linux.vnet.ibm.com>

Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
to mmap sub-page BARs. This is the corresponding QEMU patch.
With those patches applied, we could passthrough sub-page BARs
to guest, which can help to improve IO performance for some devices.

In this patch, we expand MemoryRegions of these sub-page
MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
with a valid size. The expanding size will be recovered when
the base address of sub-page BAR is changed and not page aligned
any more in guest. And we also set the priority of these BARs'
memory regions to zero in case of overlap with BARs which share
the same page with sub-page BARs in guest.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
---
 hw/vfio/common.c |    3 +-
 hw/vfio/pci.c    |   67 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 68 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index f528309..801578b 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -670,8 +670,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                               region, name, region->size);
 
         if (!vbasedev->no_mmap &&
-            region->flags & VFIO_REGION_INFO_FLAG_MMAP &&
-            !(region->size & ~qemu_real_host_page_mask)) {
+            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
 
             ret = vfio_setup_region_sparse_mmaps(region, info);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index b399742..d7dbe0e 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1071,6 +1071,55 @@ static const MemoryRegionOps vfio_vga_ops = {
 };
 
 /*
+ * Expand memory region of sub-page(size < PAGE_SIZE) MMIO BAR to page
+ * size if the BAR is in an exclusive page in host so that we could map
+ * this BAR to guest. But this sub-page BAR may not occupy an exclusive
+ * page in guest. So we should set the priority of the expanded memory
+ * region to zero in case of overlap with BARs which share the same page
+ * with the sub-page BAR in guest. Besides, we should also recover the
+ * size of this sub-page BAR when its base address is changed in guest
+ * and not page aligned any more.
+ */
+static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
+{
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    VFIORegion *region = &vdev->bars[bar].region;
+    MemoryRegion *mmap_mr, *mr;
+    PCIIORegion *r;
+    pcibus_t bar_addr;
+    uint64_t size = region->size;
+
+    /* Make sure that the whole region is allowed to be mmapped */
+    if (region->nr_mmaps != 1 || !region->mmaps[0].mmap ||
+        region->mmaps[0].size != region->size) {
+        return;
+    }
+
+    r = &pdev->io_regions[bar];
+    bar_addr = r->addr;
+    mr = region->mem;
+    mmap_mr = &region->mmaps[0].mem;
+
+    /* If BAR is mapped and page aligned, update to fill PAGE_SIZE */
+    if (bar_addr != PCI_BAR_UNMAPPED &&
+        !(bar_addr & ~qemu_real_host_page_mask)) {
+        size = qemu_real_host_page_size;
+    }
+
+    memory_region_transaction_begin();
+
+    memory_region_set_size(mr, size);
+    memory_region_set_size(mmap_mr, size);
+    if (size != region->size && memory_region_is_mapped(mr)) {
+        memory_region_del_subregion(r->address_space, mr);
+        memory_region_add_subregion_overlap(r->address_space,
+                                            bar_addr, mr, 0);
+    }
+
+    memory_region_transaction_commit();
+}
+
+/*
  * PCI config space
  */
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
@@ -1153,6 +1202,24 @@ void vfio_pci_write_config(PCIDevice *pdev,
         } else if (was_enabled && !is_enabled) {
             vfio_msix_disable(vdev);
         }
+    } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
+        range_covers_byte(addr, len, PCI_COMMAND)) {
+        pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+        int bar;
+
+        for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+            old_addr[bar] = pdev->io_regions[bar].addr;
+        }
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+            if (old_addr[bar] != pdev->io_regions[bar].addr &&
+                pdev->io_regions[bar].size > 0 &&
+                pdev->io_regions[bar].size < qemu_real_host_page_size) {
+                vfio_sub_page_bar_update_mapping(pdev, bar);
+            }
+        }
     } else {
         /* Write everything to QEMU to keep emulated bits correct */
         pci_default_write_config(pdev, addr, val, len);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31
  2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
                   ` (4 preceding siblings ...)
  2016-10-31 17:38 ` [Qemu-devel] [PULL 5/5] vfio: Add support for mmapping sub-page MMIO BARs Alex Williamson
@ 2016-10-31 19:05 ` Peter Maydell
  5 siblings, 0 replies; 7+ messages in thread
From: Peter Maydell @ 2016-10-31 19:05 UTC (permalink / raw)
  To: Alex Williamson; +Cc: QEMU Developers

On 31 October 2016 at 17:37, Alex Williamson <alex.williamson@redhat.com> wrote:
> The following changes since commit 0bb1137930f51a89fb1bfeb0c46aa68af0395167:
>
>   Merge remote-tracking branch 'remotes/cohuck/tags/s390x-20161031' into staging (2016-10-31 14:48:47 +0000)
>
> are available in the git repository at:
>
>
>   git://github.com/awilliam/qemu-vfio.git tags/vfio-updates-20161031.0
>
> for you to fetch changes up to 95251725e335af2b885e2ab33dd29c86f8084663:
>
>   vfio: Add support for mmapping sub-page MMIO BARs (2016-10-31 09:53:04 -0600)
>
> ----------------------------------------------------------------
> VFIO updates 2016-10-31
>
>  - Replace skip_dump with ram_device to denote device memory and mark
>    as non-direct to avoid memcpy to MMIO - fixes RTL (Alex Williamson)
>  - Skip zero-length sparse mmaps - avoids unnecessary warning
>    (Alex Williamson)
>  - Clear BARs on reset so guest doesn't assume programming on return
>    from S3 (Ido Yariv)
>  - Enable sub-page MMIO mmaps - performance improvement for devices
>    with smaller BARs, iff both host and guest map them to full,
>    aligned pages (Yongji Xie)
>

Applied, thanks.

-- PMM

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2016-10-31 19:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-10-31 17:37 [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Alex Williamson
2016-10-31 17:37 ` [Qemu-devel] [PULL 1/5] memory: Replace skip_dump flag with "ram_device" Alex Williamson
2016-10-31 17:37 ` [Qemu-devel] [PULL 2/5] memory: Don't use memcpy for ram_device regions Alex Williamson
2016-10-31 17:37 ` [Qemu-devel] [PULL 3/5] vfio: Handle zero-length sparse mmap ranges Alex Williamson
2016-10-31 17:37 ` [Qemu-devel] [PULL 4/5] vfio/pci: fix out-of-sync BAR information on reset Alex Williamson
2016-10-31 17:38 ` [Qemu-devel] [PULL 5/5] vfio: Add support for mmapping sub-page MMIO BARs Alex Williamson
2016-10-31 19:05 ` [Qemu-devel] [PULL 0/5] VFIO updates 2016-10-31 Peter Maydell

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.