* Re: [Qemu-devel] [PATCH v2] vfio: Add support for mmapping sub-page MMIO BARs
@ 2016-09-30  4:00 Thorsten Kohfeldt
  2016-09-30 15:07 ` Alex Williamson
  0 siblings, 1 reply; 10+ messages in thread
From: Thorsten Kohfeldt @ 2016-09-30  4:00 UTC (permalink / raw)
  To: xyjxie, Alex Williamson, Paolo Bonzini; +Cc: qemu-devel


(Re-posting, as my mail client somehow attached my previous post to the wrong thread.
I do not mean to spam y'all, but maybe my previous mail got lost in your filters ...
... as I have not yet seen any answer to my questions/remarks.
)

 > On 2016/9/14 6:55, Alex Williamson wrote:
 >
 > [cc +Paolo]
 >
 >> On Thu, 11 Aug 2016 19:05:57 +0800
 >> Yongji Xie <address@hidden> wrote:
 >>
 >> Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
 >> sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
 >> to mmap sub-page BARs. This is the corresponding QEMU patch.

Immediate questions first:
It seems that the mentioned commit will be part of kernel 4.8?
But as far as I can judge, this change should also cooperate with
older/existing kernels (which would then just have QEMU behave as before)?

(For my actual point of interest related to this patch, please see further down.)

 >> With those patches applied, we could passthrough sub-page BARs
 >> to guest, which can help to improve IO performance for some devices.
 >>
 >> In this patch, we expand MemoryRegions of these sub-page
 >> MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
 >> the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
 >> with a valid size. The expanding size will be recovered when
 >> the base address of sub-page BAR is changed and not page aligned
 >> any more in guest. And we also set the priority of these BARs'
 >> memory regions to zero in case of overlap with BARs which share
 >> the same page with sub-page BARs in guest.
 >>
 >> Signed-off-by: Yongji Xie <address@hidden>
 >> ---
 >> hw/vfio/common.c |    3 +--
 >> hw/vfio/pci.c    |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 >> 2 files changed, 77 insertions(+), 2 deletions(-)
 >
 > Hi Yongji,
...
 >> +    mr = region->mem;
 >> +    mmap_mr = &region->mmaps[0].mem;
 >> +    memory_region_transaction_begin();
 >> +    if (memory_region_size(mr) < qemu_real_host_page_size) {
 >> +        if (!(bar_addr & ~qemu_real_host_page_mask) &&
 >> +            memory_region_is_mapped(mr) && region->mmaps[0].mmap) {
 >> +            /* Expand memory region to page size and set priority */
 >> +            memory_region_del_subregion(r->address_space, mr);
 >> +            memory_region_set_size(mr, qemu_real_host_page_size);
 >> +            memory_region_set_size(mmap_mr, qemu_real_host_page_size);
 >> +            memory_region_add_subregion_overlap(r->address_space,
 >> +                                                bar_addr, mr, 0);
 >> +       }
 >> +    } else {
 >> +        /* This case would happen when guest rescan one PCI device */
 >> +        if (bar_addr & ~qemu_real_host_page_mask) {
 >> +            /* Recover the size of memory region */
 >> +            memory_region_set_size(mr, r->size);
 >> +            memory_region_set_size(mmap_mr, r->size);
 >> +        } else if (memory_region_is_mapped(mr)) {
 >> +            /* Set the priority of memory region to zero */
 >> +            memory_region_del_subregion(r->address_space, mr);
 >> +            memory_region_add_subregion_overlap(r->address_space,
 >> +                                                bar_addr, mr, 0);
 >> +        }
 >> +    }
 >> +    memory_region_transaction_commit();
 >
 > Paolo, as the reigning memory API expert, do you see any issues with
 > this?  Expanding the size of a container MemoryRegion and the contained
 > mmap'd subregion and manipulating priorities so that we don't interfere
 > with other MemoryRegions.

Since the following QEMU commit, the function memory_region_add_subregion_overlap()
has actually had a misleading name:
http://git.qemu.org/?p=qemu.git;a=blobdiff;f=memory.c;h=ac5236b51587ee397edd177502fc20ce159f2235;hp=9daac5ea2d9a9c83533880a812760683f6e09765;hb=b61359781958759317ee6fd1a45b59be0b7dbbe1;hpb=ab0a99560857302b60053c245d1231acbd976cd4

The sole thing that memory_region_add_subregion_overlap() now does
differently from memory_region_add_subregion() is setting the region's
priority to a value of the caller's choice.
The _default_ priority as chosen by memory_region_add_subregion() _is_ 0.

So, explicitly choosing memory_region_add_subregion_overlap(... , 0) does
nothing new.
Or does it?
Actually, _yes_, because I see Alex is willing to discuss the choice
of memory region priorities related to VFIO and mmap.
Why do I "invade" this thread?
I would like you to think twice about selecting proper priorities
for _any_ mmap-related region (i.e. also the aligned case), and here is why:
(I will also make a statement related to region expansion for alignment.)

First of all, I recently suggested a patch which can visualise what I
write about below:
https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg01315.html
(I would appreciate it if somebody would review it and thus get it merged.)

As a general remark, the sub-page mmap case does not only occur when
a 'small' BAR is encountered; it also occurs when a fully mmapped
page is split by a 'small' VFIO quirk.
Hi Alex, here we go again about the RTL8168 and its MSIX quirk.
(Below I also relate to/conclude for Yongji's patch.)
For certain RTL8168 models, the mentioned quirk cuts a full-page BAR
into 3 pieces: 0..quirkaddr-1, the quirk itself, and quirk+qsize..pagesize-1.
What I found is that both "mmap fractions" behave in a _buggy_ way.
(An attempt at an analysis follows below.)
Here the first piece could be covered by Yongji's patch, but the third
piece certainly not, as the kernel patch is limited to aligned small pieces.
And that only starting with kernel 4.8.

This is what we also need to solve, and where the above priority choice
comes into play (but there is more to consider):
Right after memory_region_transaction_commit(), a new flat view is created.
As documented in docs/memory.txt, priorities _are_ important for the final
'stuff' which is exposed in the flat view in a certain address range.
_But_ priorities are not the only important property.
So priorities need to fit into a larger picture.
(Here I refer again to my suggested mtree info <mapinfo> patch.)

Complexity of the situation, besides watching your priorities:
As far as I understand, the mmap functionality is two-phased: there is the
mmap system call, which Yongji's patch is tuning around, and then the
setup of the actual DMA, which happens for sections of the flat view
in hw/vfio/common.c:vfio_dma_map() via an ioctl on the container
file descriptor.
vfio_dma_map() is called from hw/vfio/common.c:vfio_iommu_map_notify()
_and_ hw/vfio/common.c:vfio_listener_region_add() (for mmap flat ranges).
There be dragons -
vfio_dma_map() is _skipped_ for ranges which are _not_ fully page aligned
(there may be exceptions, but the buggy behaviour is triggered by a
skip case currently occurring for the quirk split pieces mentioned above).
This would also happen for non-page-extended, page-aligned sub-page BARs,
I suppose.
Why is this skip-and-forget 'dangerous'?
Well, on the one hand the memory.txt rules specify which mmap region is
actually exposed in the flat range and thus fed into the skip-happy
function vfio_listener_region_add() (I do not understand enough about
vfio_iommu_map_notify(), so I disregard that one here for now).
On the other hand, those regions that are DMA-skipped do not seem to
safely fall back to the 'slow' region-ops/memory handlers.
This is certainly also due to the rules specified in memory.txt -
whether those rules are implemented correctly but not correctly employed
through well-adjusted subregion stacking, or whether the stacking rules
themselves are implemented incorrectly.
At the end of last year I was able to 'fix' the RTL8168 MSIX quirk problem
by adding an additional subregion (one page in size, referring to the
page-sized quirk-split BAR), which re-exposed the required region-ops
for the slow path.
Back then Alex called that a hack, but still:
I assume that sub-page BARs, which are the subject of Yongji's patch, will
also be subject to the correct-stacking problem, and thus just picking
a seemingly well-educated priority for a page-extended BAR region
might 'endanger' the extended page fraction's slow-path/region-ops.

Conclusion
_This_ patch probably needs an additional page-sized region added
right at the correct position of the subregion 'stack' in order to
be able to 'create-and-forget', i.e. to be prepared for all different
additional subregion add-ons, potentially by yet unknown quirks.

And while we're at it, we can also fix the RTL8168 MSIX quirk using
the same precautionary add-a-helper-region approach.

I have had a few patch snippets in store since the start of this year
which roughly address the DMA-skip problem (they still need refinement).
Is anybody interested in starting to discuss those?

In order to test/verify/visualise, I again suggest
https://lists.gnu.org/archive/html/qemu-devel/2016-09/msg04808.html
(I would really appreciate a review, refinement and merge.)

And maybe we should even add a refactoring patch for the purpose of
renaming memory_region_add_subregion_overlap() to
memory_region_add_subregion_stacked() or
memory_region_add_subregion_prioritised() or similar.

Regards, Thorsten

^ permalink raw reply	[flat|nested] 10+ messages in thread
* [Qemu-devel] [PATCH] hmp: Improve 'info mtree' with optional parm for mapinfo
@ 2016-09-07  0:48 Thorsten Kohfeldt
  2016-09-15  9:52 ` Paolo Bonzini
  0 siblings, 1 reply; 10+ messages in thread
From: Thorsten Kohfeldt @ 2016-09-07  0:48 UTC (permalink / raw)
  To: pbonzini, armbru, lcapitulino, alex.williamson, peter.maydell; +Cc: qemu-devel

From: Thorsten Kohfeldt <mailto__qemu-devel@nongnu.org>
Date: Wed, 31 Aug 2016 22:43:14 +0200
Subject: [PATCH] hmp: Improve 'info mtree' with optional parm for mapinfo

Motivation
When 'tuning' 'quirks' for VFIO-imported devices, it is not easy to
directly grasp the implications of the prioritisation algorithms in place
for the 'layered mapping' of memory regions.
Even though there are rules (documented in docs/memory.txt), once in a
while one might question the correctness of the actual implementation of
the rules.
In particular, I believe I have uncovered a divergence of (sub-)region
prioritisation/order/visibility in a corner case of importing a device
(which requires a 'quirk') with mmap enabled vs. mmap disabled.
This modification provides a means of visualising the ACTUAL
mapping/visibility/occlusion of subregions within regions, whereas the
current 'info mtree' command only lists the tree of regions (all of them,
visible and invisible ones).
It is primarily intended to support an easy presentation of my
cause, but I strongly believe this modification also has general-purpose
advantages.

Functional implementation
A new OPTIONAL integer parameter, mapinfo-width, is added to the
monitor/hmp 'info mtree' command.
Effect:
When given and between 5 and 257, each mtree output line is
prefixed with <mapinfo-width> columns of 'visibility samples', depicting
what QEMU's memory region prioritisation algorithms effectively make
visible/accessible at that sample address at the time of inquiry.
NOTE that
the sampling algorithm virtually cuts the memory region into (width - 1)
'slices' and computes (width) samples at the edges of those virtual
slices. Thus, it is probably a good idea to always request (2^n + 1)
samples.
ALSO NOTE that
the memory regions are NOT actually accessed at those 'samples'; ONLY a
region PRIORITISATION EVALUATION is performed for the sample addresses.
You can set up a default using the environment variable
QEMU_INFO_MTREE_MAPINFO_WIDTH (it must be in the QEMU instance's
environment; when unset, the default is 0, i.e. off).
Giving a negative value for mapinfo-width results in using that default,
while values above 257 are reduced to 257, and values from 0 to 4 switch
the sampling off.

Technical implementation
3 functions are added to memory.c:
sane_mtree_info_mapinfo_width() is used to sanitise the new parameter,
i.e. to provide a default and adjust it towards 'usability'. It is
called by the existing function mtree_info().
mtree_print_mr_v_samples() is called for each mtree memory region (mr)
output line in order to print the 'mapinfo' prefix. The call is
performed by the existing function mtree_print_mr() with parameter
'const MemoryRegion *mr', thus promising that the object under
investigation is not modified.
mtree_mr_sample_reftype_marker() is used to traverse an mr subtree for a
given hardware address in order to basically find the first matching
enabled region and return its type. It is called by
mtree_print_mr_v_samples() for <sanitised mapinfo-width> 'sample'
addresses. As an mr tree is traversed, limited recursion is involved.
Additionally, for the existing functions memory_region_to_address_space(),
memory_region_size(), and memory_region_find_rcu(), the parameter
'MemoryRegion *mr' had to be constrained to 'const MemoryRegion *mr' in
order to satisfy the compiler while emphasising the
'object is not modified' promise by insisting on passing *mr as const into
mtree_print_mr_v_samples(). This did not entail changes in the bodies of
the mentioned functions.
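For illustration, a hypothetical monitor invocation with the suggested
2^n + 1 column count (the value 17 is just an example):

```
(qemu) info mtree 17
```

With the patch applied, mtree_info() would first print the marker legend
(/, +, @, ~, s, a, c, o) and then prefix each region line with 17
visibility-sample characters.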

Signed-off-by: Thorsten Kohfeldt <mailto__qemu-devel@nongnu.org>
---
  hmp-commands-info.hx  |   9 +-
  include/exec/memory.h |   7 +-
  memory.c              | 246 ++++++++++++++++++++++++++++++++++++++++++++++++--
  monitor.c             |   4 +-
  4 files changed, 250 insertions(+), 16 deletions(-)

diff --git a/hmp-commands-info.hx b/hmp-commands-info.hx
index 74446c6..1593238 100644
--- a/hmp-commands-info.hx
+++ b/hmp-commands-info.hx
@@ -264,16 +264,17 @@ ETEXI

      {
          .name       = "mtree",
-        .args_type  = "",
-        .params     = "",
-        .help       = "show memory tree",
+        .args_type  = "mapinfo-width:l?",
+        .params     = "[mapinfo-width]",
+        .help       = "show memory tree "
+                      "(mapinfo-width: depict memory subregion mappings with leading characters)",
          .mhandler.cmd = hmp_info_mtree,
      },

  STEXI
  @item info mtree
  @findex mtree
-Show memory tree.
+Show memory tree optionally depicting subregion mappings.
  ETEXI

      {
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 3e4d416..751483c 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -536,7 +536,7 @@ struct Object *memory_region_owner(MemoryRegion *mr);
   *
   * @mr: the memory region being queried.
   */
-uint64_t memory_region_size(MemoryRegion *mr);
+uint64_t memory_region_size(const MemoryRegion *mr);

  /**
   * memory_region_is_ram: check whether a memory region is random access
@@ -1202,7 +1202,10 @@ void memory_global_dirty_log_start(void);
   */
  void memory_global_dirty_log_stop(void);

-void mtree_info(fprintf_function mon_printf, void *f);
+/**
+ * mtree_info: see hmp-commands-info.hx item info mtree
+ */
+void mtree_info(fprintf_function mon_printf, void *f, const int mapinfo_width);

  /**
   * memory_region_dispatch_read: perform a read directly to the specified
diff --git a/memory.c b/memory.c
index 0eb6895..99161bd 100644
--- a/memory.c
+++ b/memory.c
@@ -595,7 +595,7 @@ static MemTxResult access_with_adjusted_size(hwaddr addr,
      return r;
  }

-static AddressSpace *memory_region_to_address_space(MemoryRegion *mr)
+static AddressSpace *memory_region_to_address_space(const MemoryRegion *mr)
  {
      AddressSpace *as;

@@ -1477,7 +1477,7 @@ void memory_region_unref(MemoryRegion *mr)
      }
  }

-uint64_t memory_region_size(MemoryRegion *mr)
+uint64_t memory_region_size(const MemoryRegion *mr)
  {
      if (int128_eq(mr->size, int128_2_64())) {
          return UINT64_MAX;
@@ -2062,11 +2062,11 @@ bool memory_region_is_mapped(MemoryRegion *mr)
  /* Same as memory_region_find, but it does not add a reference to the
   * returned region.  It must be called from an RCU critical section.
   */
-static MemoryRegionSection memory_region_find_rcu(MemoryRegion *mr,
+static MemoryRegionSection memory_region_find_rcu(const MemoryRegion *mr,
                                                    hwaddr addr, uint64_t size)
  {
      MemoryRegionSection ret = { .mr = NULL };
-    MemoryRegion *root;
+    const MemoryRegion *root;
      AddressSpace *as;
      AddrRange range;
      FlatView *view;
@@ -2324,10 +2324,172 @@ struct MemoryRegionList {

  typedef QTAILQ_HEAD(queue, MemoryRegionList) MemoryRegionListHead;

+static char mtree_mr_sample_reftype_marker(const MemoryRegion *mr,
+                                           const MemoryRegion *mv,
+                                           const MemoryRegion *mc,
+                                           const hwaddr offset,
+                                           const int max_chainlen)
+{
+    const MemoryRegion *al = mc->alias;
+    const MemoryRegion *submr;
+    char marker, hint = 0;
+
+    if (int128_ge(int128_make64(offset), mc->size)) {
+        return 0;
+    }
+    if (max_chainlen < 0) {
+        /* this is most probably a complex alias loop situation */
+        return 'E'; /* max. link chain length exceeded */
+    }
+
+    if (al) {
+        if (al == mv) {
+            if (mr->enabled) {
+                if (mc == mr) {
+                    return '@'; /* indirect link found */
+                } else {
+                    return 'a'; /* 2nd degree related alias */
+                }
+            } else {
+                return '~';
+            }
+        }
+        if (al == mr) {
+            return 'L'; /* alias loop */
+        }
+        if (al == mc) {
+            return 'X'; /* alias self reference */
+        }
+        if (al->alias == mc) {
+            return 'M'; /* mutual alias */
+        }
+        if ((al->enabled) &&
+            (int128_lt(int128_add(int128_make64(offset),
+                                  int128_make64(mc->alias_offset)),
+                       al->size))) {
+            marker = mtree_mr_sample_reftype_marker(mr, mv, al,
+                                                    offset + mc->alias_offset,
+                                                    max_chainlen - 1);
+            if (marker && marker != 'E') {
+                return marker;
+            }
+            hint |= marker; /* propagate 'E' dominantly over 0 */
+        }
+    }
+
+    QTAILQ_FOREACH(submr, &mc->subregions, subregions_link) {
+        if (submr == mv) {
+            if (mr->enabled) {
+                if (mc == mr) {
+                    return 's'; /* occluded by subregion */
+                } else {
+                    return 'c'; /* 2nd degree related child */
+                }
+            } else {
+                return '.';
+            }
+        }
+        if ((submr->enabled) && (offset >= submr->addr)) {
+            marker = mtree_mr_sample_reftype_marker(mr, mv, submr,
+                                                    offset - submr->addr,
+                                                    max_chainlen - 1);
+            if (marker && marker != 'E') {
+                return marker;
+            }
+            hint |= marker; /* propagate 'E' dominantly over 0 */
+        }
+    }
+
+    return hint; /* either 0 or 'E' */
+}
+
+/* Depict memory region subrange structure,
+ * i.e. occlusion by submaps respectively visibility of submaps
+ * as supposedly resulting from region priorisation rules.
+ */
+static void mtree_print_mr_v_samples(fprintf_function mon_printf,
+                                     void *f,
+                                     const MemoryRegion *mr,
+                                     const unsigned int columns)
+{
+    const MemoryRegion *mv;
+    unsigned int i, j;
+    hwaddr size, offset;
+    bool covered = false;
+
+    /* prevent uncovered corner cases and excessive sampling effort */
+    if ((columns < 2) || (columns > 1000)) {
+        if (columns) {
+            mon_printf(f, "[%s: not supporting %d column(s)]",
+                       __func__, columns);
+        }
+        return;
+    }
+
+    /* convert/constrain mr->size to hwaddr bit size */
+    size = memory_region_size(mr);
+    if (!size) {
+        /* size is 0 for example with 'dummy' regions in VFIO */
+        mon_printf(f, "%*s", columns, "");
+        return;
+    }
+
+    i = columns;
+    j = size / (i - 1); /* step ('slice') width */
+    if (j < 1) {
+        j = 1;
+    }
+    offset = 0; /* sample position within region */
+    while (i--) {
+        if (offset >= (size - 1)) {
+            /* region might have less bytes than columns were requested */
+            if (covered) {
+                mon_printf(f, " "); /* no more additional samples */
+                continue;
+            }
+            covered = true;
+            offset = size - 1;
+        }
+        rcu_read_lock();
+        mv = memory_region_find_rcu(mr, offset, 1).mr;
+        rcu_read_unlock();
+        if (mv == mr) {
+            if (mr->enabled) {
+                mon_printf(f, "+"); /* mr is directly visible at the sample addr */
+            } else {
+                mon_printf(f, "-");
+            }
+        } else {
+            /* mr is not visible, but mv is visible unless NULL */
+            if (!mv) {
+                mon_printf(f, "/"); /* nothing mapped at the sample addr */
+            } else {
+                /* mr is occluded by mv which supposedly is
+                 * either linked via an alias/subregion construct
+                 *     or a higher prioritized subregion of the same parent
+                 */
+                char marker;
+
+                /* current memory region hierarchy depth is close to 5,
+                 * so a maximum search depth of 20 should be sufficient;
+                 * i.e. a link chain length longer than 20 is considered a loop
+                 */
+                marker = mtree_mr_sample_reftype_marker(mr, mv, mr, offset, 20);
+                if (!marker) {
+                    marker = 'o'; /* occlusion by sibling or unrelated region */
+                }
+                mon_printf(f, "%c", marker);
+            }
+        }
+        offset += j;
+    }
+}
+
  static void mtree_print_mr(fprintf_function mon_printf, void *f,
                             const MemoryRegion *mr, unsigned int level,
                             hwaddr base,
-                           MemoryRegionListHead *alias_print_queue)
+                           MemoryRegionListHead *alias_print_queue,
+                           const unsigned int mr_visibility_samples)
  {
      MemoryRegionList *new_ml, *ml, *next_ml;
      MemoryRegionListHead submr_print_queue;
@@ -2338,6 +2500,12 @@ static void mtree_print_mr(fprintf_function mon_printf, void *f,
          return;
      }

+    if (mr_visibility_samples) {
+        /* on the very left depict region occlusion/visibility */
+        mtree_print_mr_v_samples(mon_printf, f,
+                                 mr, mr_visibility_samples);
+    }
+
      for (i = 0; i < level; i++) {
          mon_printf(f, "  ");
      }
@@ -2415,7 +2583,7 @@ static void mtree_print_mr(fprintf_function mon_printf, void *f,

      QTAILQ_FOREACH(ml, &submr_print_queue, queue) {
          mtree_print_mr(mon_printf, f, ml->mr, level + 1, base + mr->addr,
-                       alias_print_queue);
+                       alias_print_queue, mr_visibility_samples);
      }

      QTAILQ_FOREACH_SAFE(ml, &submr_print_queue, queue, next_ml) {
@@ -2423,24 +2591,84 @@ static void mtree_print_mr(fprintf_function mon_printf, void *f,
      }
  }

-void mtree_info(fprintf_function mon_printf, void *f)
+static int sane_mtree_info_mapinfo_width(const int requested_width)
+{
+    static int default_width = -1;
+
+    int width = requested_width;
+
+    /* (on first call) establish a default for the number of region samples */
+    if (default_width < 0) {
+        char *str = getenv("QEMU_INFO_MTREE_MAPINFO_WIDTH");
+
+        if (str) {
+            default_width = atoi(str);
+        } else {
+            default_width = 0;
+        }
+        if (default_width < 0) {
+            /* prevent repeating getenv - fallback to sampling over 8 'slices' */
+            default_width = 9;
+        }
+    }
+
+    /* use the default when the requested width is negative */
+    if (width < 0) {
+        width = default_width;
+    }
+
+    /* and finally adapt to 'usability' constraints */
+    if (width < 5) {
+        /* sampling over very few 'slices' creates too much
+         * corner case complexity and is also not much of use
+         */
+        width = 0; /* disable the sample display completely */
+    }
+    if (width > 0x101) {
+        /* limit the output prefix width to some (un-)reasonable value:
+         * max 2^8+1 samples subdividing a region into 2^8 'slices'
+         */
+        width = 0x101;
+    }
+
+    return width;
+}
+
+void mtree_info(fprintf_function mon_printf, void *f, const int mapinfo_width)
  {
      MemoryRegionListHead ml_head;
      MemoryRegionList *ml, *ml2;
      AddressSpace *as;
+    int mr_visibility_samples = sane_mtree_info_mapinfo_width(mapinfo_width);
+
+    if (mr_visibility_samples) {
+        /* print a symbolisation cross reference */
+        mon_printf(f, "\n");
+        mon_printf(f, "/: nothing mapped at sample address\n");
+        mon_printf(f, "+: region directly mapped at sample\n");
+        mon_printf(f, "@: alias region mapped at sample\n");
+        mon_printf(f, "~: alias region mappable but disabled at sample\n");
+        mon_printf(f, "s: region occluded by subregion at sample\n");
+        mon_printf(f, "a: region occluded by an aliased subregion at sample\n");
+        mon_printf(f, "c: region occluded by a child (subregion) of an alias at sample\n");
+        mon_printf(f, "o: region occluded by some other region at sample\n");
+        mon_printf(f, "\n");
+    }

      QTAILQ_INIT(&ml_head);

      QTAILQ_FOREACH(as, &address_spaces, address_spaces_link) {
          mon_printf(f, "address-space: %s\n", as->name);
-        mtree_print_mr(mon_printf, f, as->root, 1, 0, &ml_head);
+        mtree_print_mr(mon_printf, f, as->root, 1, 0, &ml_head,
+                       mr_visibility_samples);
          mon_printf(f, "\n");
      }

      /* print aliased regions */
      QTAILQ_FOREACH(ml, &ml_head, queue) {
          mon_printf(f, "memory-region: %s\n", memory_region_name(ml->mr));
-        mtree_print_mr(mon_printf, f, ml->mr, 1, 0, &ml_head);
+        mtree_print_mr(mon_printf, f, ml->mr, 1, 0, &ml_head,
+                       mr_visibility_samples);
          mon_printf(f, "\n");
      }

diff --git a/monitor.c b/monitor.c
index 5c00373..2f55d12 100644
--- a/monitor.c
+++ b/monitor.c
@@ -1527,7 +1527,9 @@ static void hmp_boot_set(Monitor *mon, const QDict *qdict)

  static void hmp_info_mtree(Monitor *mon, const QDict *qdict)
  {
-    mtree_info((fprintf_function)monitor_printf, mon);
+    int mapinfo_width = qdict_get_try_int(qdict, "mapinfo-width", -1);
+
+    mtree_info((fprintf_function)monitor_printf, mon, mapinfo_width);
  }

  static void hmp_info_numa(Monitor *mon, const QDict *qdict)
-- 
2.5.5

* [Qemu-devel] [PATCH v2] vfio: Add support for mmapping sub-page MMIO BARs
@ 2016-08-11 11:05 Yongji Xie
  2016-08-11 11:09 ` no-reply
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Yongji Xie @ 2016-08-11 11:05 UTC (permalink / raw)
  To: alex.williamson; +Cc: qemu-devel, kvm, aik, zhong, gwshan

Now the kernel commit 05f0c03fbac1 ("vfio-pci: Allow to mmap
sub-page MMIO BARs if the mmio page is exclusive") allows VFIO
to mmap sub-page BARs. This is the corresponding QEMU patch.
With those patches applied, we could passthrough sub-page BARs
to guest, which can help to improve IO performance for some devices.

In this patch, we expand MemoryRegions of these sub-page
MMIO BARs to PAGE_SIZE in vfio_pci_write_config(), so that
the BARs could be passed to KVM ioctl KVM_SET_USER_MEMORY_REGION
with a valid size. The expanding size will be recovered when
the base address of sub-page BAR is changed and not page aligned
any more in guest. And we also set the priority of these BARs'
memory regions to zero in case of overlap with BARs which share
the same page with sub-page BARs in guest.

Signed-off-by: Yongji Xie <xyjxie@linux.vnet.ibm.com>
---
 hw/vfio/common.c |    3 +--
 hw/vfio/pci.c    |   76 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index b313e7c..1a70307 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -661,8 +661,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
                               region, name, region->size);
 
         if (!vbasedev->no_mmap &&
-            region->flags & VFIO_REGION_INFO_FLAG_MMAP &&
-            !(region->size & ~qemu_real_host_page_mask)) {
+            region->flags & VFIO_REGION_INFO_FLAG_MMAP) {
 
             vfio_setup_region_sparse_mmaps(region, info);
 
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7bfa17c..7035617 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -1057,6 +1057,65 @@ static const MemoryRegionOps vfio_vga_ops = {
 };
 
 /*
+ * Expand memory region of sub-page(size < PAGE_SIZE) MMIO BAR to page
+ * size if the BAR is in an exclusive page in host so that we could map
+ * this BAR to guest. But this sub-page BAR may not occupy an exclusive
+ * page in guest. So we should set the priority of the expanded memory
+ * region to zero in case of overlap with BARs which share the same page
+ * with the sub-page BAR in guest. Besides, we should also recover the
+ * size of this sub-page BAR when its base address is changed in guest
+ * and not page aligned any more.
+ */
+static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
+{
+    VFIOPCIDevice *vdev = DO_UPCAST(VFIOPCIDevice, pdev, pdev);
+    VFIORegion *region = &vdev->bars[bar].region;
+    MemoryRegion *mmap_mr, *mr;
+    PCIIORegion *r;
+    pcibus_t bar_addr;
+
+    /* Make sure that the whole region is allowed to be mmapped */
+    if (!(region->nr_mmaps == 1 &&
+        region->mmaps[0].size == region->size)) {
+        return;
+    }
+
+    r = &pdev->io_regions[bar];
+    bar_addr = r->addr;
+    if (bar_addr == PCI_BAR_UNMAPPED) {
+        return;
+    }
+
+    mr = region->mem;
+    mmap_mr = &region->mmaps[0].mem;
+    memory_region_transaction_begin();
+    if (memory_region_size(mr) < qemu_real_host_page_size) {
+        if (!(bar_addr & ~qemu_real_host_page_mask) &&
+            memory_region_is_mapped(mr) && region->mmaps[0].mmap) {
+            /* Expand memory region to page size and set priority */
+            memory_region_del_subregion(r->address_space, mr);
+            memory_region_set_size(mr, qemu_real_host_page_size);
+            memory_region_set_size(mmap_mr, qemu_real_host_page_size);
+            memory_region_add_subregion_overlap(r->address_space,
+                                                bar_addr, mr, 0);
+	}
+    } else {
+        /* This case would happen when guest rescan one PCI device */
+        if (bar_addr & ~qemu_real_host_page_mask) {
+            /* Recover the size of memory region */
+            memory_region_set_size(mr, r->size);
+            memory_region_set_size(mmap_mr, r->size);
+        } else if (memory_region_is_mapped(mr)) {
+            /* Set the priority of memory region to zero */
+            memory_region_del_subregion(r->address_space, mr);
+            memory_region_add_subregion_overlap(r->address_space,
+                                                bar_addr, mr, 0);
+        }
+    }
+    memory_region_transaction_commit();
+}
+
+/*
  * PCI config space
  */
 uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
@@ -1139,6 +1198,23 @@ void vfio_pci_write_config(PCIDevice *pdev,
         } else if (was_enabled && !is_enabled) {
             vfio_msix_disable(vdev);
         }
+    } else if (ranges_overlap(addr, len, PCI_BASE_ADDRESS_0, 24) ||
+        range_covers_byte(addr, len, PCI_COMMAND)) {
+        pcibus_t old_addr[PCI_NUM_REGIONS - 1];
+        int bar;
+
+        for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+            old_addr[bar] = pdev->io_regions[bar].addr;
+        }
+
+        pci_default_write_config(pdev, addr, val, len);
+
+        for (bar = 0; bar < PCI_ROM_SLOT; bar++) {
+            if (old_addr[bar] != pdev->io_regions[bar].addr &&
+                pdev->io_regions[bar].size > 0 &&
+                pdev->io_regions[bar].size < qemu_real_host_page_size)
+                vfio_sub_page_bar_update_mapping(pdev, bar);
+        }
     } else {
         /* Write everything to QEMU to keep emulated bits correct */
         pci_default_write_config(pdev, addr, val, len);
-- 
1.7.9.5



end of thread, other threads:[~2016-10-04  6:57 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-09-30  4:00 [Qemu-devel] [PATCH v2] vfio: Add support for mmapping sub-page MMIO BARs Thorsten Kohfeldt
2016-09-30 15:07 ` Alex Williamson
  -- strict thread matches above, loose matches on Subject: below --
2016-09-07  0:48 [Qemu-devel] [PATCH] hmp: Improve 'info mtree' with optional parm for mapinfo Thorsten Kohfeldt
2016-09-15  9:52 ` Paolo Bonzini
2016-09-20  0:16   ` Thorsten Kohfeldt
2016-09-20  0:51     ` Laszlo Ersek
2016-09-22 19:43       ` [Qemu-devel] [PATCH v2] vfio: Add support for mmapping sub-page MMIO BARs Thorsten Kohfeldt
2016-08-11 11:05 Yongji Xie
2016-08-11 11:09 ` no-reply
2016-09-05 10:25 ` Yongji Xie
2016-09-13 22:55 ` Alex Williamson
2016-09-14  5:04   ` Yongji Xie
2016-09-30 16:05   ` Paolo Bonzini
2016-10-04  6:57     ` Yongji Xie
