* [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()
@ 2017-06-02 11:50 Peter Xu
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate Peter Xu
                   ` (3 more replies)
  0 siblings, 4 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-02 11:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Maxime Coquelin, peterx, Jason Wang

With the patch applied:

  [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
  (already in Paolo's pull request but not yet merged)

Now we can have valid address masks. However, it is still not ideal, since
the mask may not be aligned to guest page sizes. One example is when huge
pages are used in the guest (please see the commit message in patch 1 for
details), but it applies to normal pages too. So we not only need a valid
address mask, we should also make sure it is a page mask (for x86, either
4K, 2M, or 1G).
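
For reference, the corresponding masks look like this (a sketch assuming
the standard x86 page sizes, in the low-bits form used by IOMMUTLBEntry's
addr_mask):

  4K page:  addr_mask = 0xfff
  2M page:  addr_mask = 0x1fffff
  1G page:  addr_mask = 0x3fffffff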

Patches 1-2 fix the problem. Tested with both the kernel net driver and
testpmd, on both 4K and 2M pages, to make sure the page mask is correct.

Patch 3 is cherry-picked from the PT series; after the fix in patches 1-2
we definitely want it now. Here is the simplest TCP streaming test using
vhost DMAR and iommu=pt in the guest:

  without patch 3:    12.0Gbps
  with patch 3:       33.5Gbps

Please review, thanks.

Peter Xu (3):
  exec: add page_mask for address_space_do_translate
  exec: simplify address_space_get_iotlb_entry
  vhost: iommu: cache static mapping if there is

 exec.c                 | 73 +++++++++++++++++++++++++++++++++-----------------
 hw/virtio/trace-events |  4 +++
 hw/virtio/vhost.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 119 insertions(+), 24 deletions(-)

-- 
2.7.4


* [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate
  2017-06-02 11:50 [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Peter Xu
@ 2017-06-02 11:50 ` Peter Xu
  2017-06-02 16:45   ` Michael S. Tsirkin
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry Peter Xu
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-02 11:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Maxime Coquelin, peterx, Jason Wang

The function was originally used for address_space_translate(), where what
we care about most is the (xlat, plen) range. However, for IOTLB requests we
don't really care about "plen" but rather the size of the page that "xlat"
is located on, and plen cannot carry that information.

A simple example to show why "plen" is not good for IOTLB translations:

E.g., with huge pages, it is possible that the guest mapped a 1G huge page
for the device using this GPA range:

  0x100000000 - 0x13fffffff

Now let's say we want to translate an IOVA that finally maps to GPA
0x13ffffe00 (which is located on this 1G huge page). Here we'll get:

  (xlat, plen) = (0x13ffffe00, 0x200)

So the IOTLB would only cover a very small range, since from "plen" (which
is 0x200 bytes) we cannot tell the size of the page.

Actually we do know that this is a huge page - we just throw the
information away in address_space_do_translate().
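
To illustrate with the same numbers (a sketch, not code from the patch):
once the IOMMU's addr_mask is preserved, the IOTLB entry can cover the
whole page instead of only plen bytes:

  addr_mask       = 0x3fffffff;                /* 1G page mask from the IOMMU     */
  translated_addr = 0x13ffffe00 & ~addr_mask;  /* = 0x100000000, page aligned     */
  span            = addr_mask + 1;             /* 1G covered, vs. 0x200 from plen */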

This patch introduces an optional "page_mask" parameter to capture that
page mask info. It also makes "plen" optional, and adds comments for the
whole function.

No functional change yet.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c | 46 ++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 40 insertions(+), 6 deletions(-)

diff --git a/exec.c b/exec.c
index 8fc0e78..63a3ff0 100644
--- a/exec.c
+++ b/exec.c
@@ -465,21 +465,45 @@ address_space_translate_internal(AddressSpaceDispatch *d, hwaddr addr, hwaddr *x
     return section;
 }
 
-/* Called from RCU critical section */
+/**
+ * address_space_do_translate - translate an address in AddressSpace
+ *
+ * @as: the address space that we want to translate on
+ * @addr: the address to be translated in above address space
+ * @xlat: the translated address offset within memory region. It
+ *        cannot be @NULL.
+ * @plen_out: valid read/write length of the translated address. It
+ *            can be @NULL when we don't care about it.
+ * @page_mask_out: page mask for the translated address. This
+ *            should only be meaningful for IOMMU translated
+ *            addresses, since there may be huge pages that this bit
+ *            would tell. It can be @NULL if we don't care about it.
+ * @is_write: whether the translation operation is for write
+ * @is_mmio: whether this can be MMIO, set true if it can
+ *
+ * This function is called from RCU critical section
+ */
 static MemoryRegionSection address_space_do_translate(AddressSpace *as,
                                                       hwaddr addr,
                                                       hwaddr *xlat,
-                                                      hwaddr *plen,
+                                                      hwaddr *plen_out,
+                                                      hwaddr *page_mask_out,
                                                       bool is_write,
                                                       bool is_mmio)
 {
     IOMMUTLBEntry iotlb;
     MemoryRegionSection *section;
     MemoryRegion *mr;
+    hwaddr page_mask = TARGET_PAGE_MASK;
+    hwaddr plen = (hwaddr)(-1);
+
+    if (plen_out) {
+        plen = *plen_out;
+    }
 
     for (;;) {
         AddressSpaceDispatch *d = atomic_rcu_read(&as->dispatch);
-        section = address_space_translate_internal(d, addr, &addr, plen, is_mmio);
+        section = address_space_translate_internal(d, addr, &addr, &plen, is_mmio);
         mr = section->mr;
 
         if (!mr->iommu_ops) {
@@ -490,7 +514,8 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
                                          IOMMU_WO : IOMMU_RO);
         addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
                 | (addr & iotlb.addr_mask));
-        *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);
+        page_mask = iotlb.addr_mask;
+        plen = MIN(plen, (addr | iotlb.addr_mask) - addr + 1);
         if (!(iotlb.perm & (1 << is_write))) {
             goto translate_fail;
         }
@@ -500,6 +525,14 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
 
     *xlat = addr;
 
+    if (page_mask_out) {
+        *page_mask_out = page_mask;
+    }
+
+    if (plen_out) {
+        *plen_out = plen;
+    }
+
     return *section;
 
 translate_fail:
@@ -518,7 +551,7 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
 
     /* This can never be MMIO. */
     section = address_space_do_translate(as, addr, &xlat, &plen,
-                                         is_write, false);
+                                         NULL, is_write, false);
 
     /* Illegal translation */
     if (section.mr == &io_mem_unassigned) {
@@ -560,7 +593,8 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
     MemoryRegionSection section;
 
     /* This can be MMIO, so setup MMIO bit. */
-    section = address_space_do_translate(as, addr, xlat, plen, is_write, true);
+    section = address_space_do_translate(as, addr, xlat, plen,
+                                         NULL, is_write, true);
     mr = section.mr;
 
     if (xen_enabled() && memory_access_is_direct(mr, is_write)) {
-- 
2.7.4


* [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-02 11:50 [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Peter Xu
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate Peter Xu
@ 2017-06-02 11:50 ` Peter Xu
  2017-06-02 16:49   ` Michael S. Tsirkin
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is Peter Xu
  2017-06-02 14:51 ` [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Michael S. Tsirkin
  3 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-02 11:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Maxime Coquelin, peterx, Jason Wang

This patch lets address_space_get_iotlb_entry() use the newly introduced
page_mask parameter of address_space_do_translate(). Then we can be sure
the IOTLB is aligned to the page mask, and huge pages are now nicely
supported, which has not been the case since a764040 was introduced.

Fixes: a764040 ("exec: abstract address_space_do_translate()")
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 exec.c | 29 ++++++++++-------------------
 1 file changed, 10 insertions(+), 19 deletions(-)

diff --git a/exec.c b/exec.c
index 63a3ff0..1f86253 100644
--- a/exec.c
+++ b/exec.c
@@ -544,14 +544,14 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
                                             bool is_write)
 {
     MemoryRegionSection section;
-    hwaddr xlat, plen;
+    hwaddr xlat, page_mask;
 
-    /* Try to get maximum page mask during translation. */
-    plen = (hwaddr)-1;
-
-    /* This can never be MMIO. */
-    section = address_space_do_translate(as, addr, &xlat, &plen,
-                                         NULL, is_write, false);
+    /*
+     * This can never be MMIO, and we don't really care about plen,
+     * but page mask.
+     */
+    section = address_space_do_translate(as, addr, &xlat, NULL,
+                                         &page_mask, is_write, false);
 
     /* Illegal translation */
     if (section.mr == &io_mem_unassigned) {
@@ -562,20 +562,11 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
     xlat += section.offset_within_address_space -
         section.offset_within_region;
 
-    if (plen == (hwaddr)-1) {
-        /* If not specified during translation, use default mask */
-        plen = TARGET_PAGE_MASK;
-    } else {
-        /* Make it a valid page mask */
-        assert(plen);
-        plen = pow2floor(plen) - 1;
-    }
-
     return (IOMMUTLBEntry) {
         .target_as = section.address_space,
-        .iova = addr & ~plen,
-        .translated_addr = xlat & ~plen,
-        .addr_mask = plen,
+        .iova = addr & ~page_mask,
+        .translated_addr = xlat & ~page_mask,
+        .addr_mask = page_mask,
         /* IOTLBs are for DMAs, and DMA only allows on RAMs. */
         .perm = IOMMU_RW,
     };
-- 
2.7.4


* [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is
  2017-06-02 11:50 [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Peter Xu
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate Peter Xu
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry Peter Xu
@ 2017-06-02 11:50 ` Peter Xu
  2017-06-02 15:45   ` Michael S. Tsirkin
  2017-06-02 16:51   ` Michael S. Tsirkin
  2017-06-02 14:51 ` [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Michael S. Tsirkin
  3 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-02 11:50 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Maxime Coquelin, peterx, Jason Wang

This patch pre-heats the vhost IOTLB cache when passthrough mode is enabled.

Sometimes, even if the user specified iommu_platform for vhost devices, the
IOMMU might still be disabled. One case is passthrough mode in the VT-d
implementation. We can detect this by observing iommu_list: if it's empty,
IOMMU translation is disabled, and we can pre-heat the translation (it will
be a static mapping then) by first invalidating all IOTLB entries, then
caching the existing memory ranges into the vhost backend IOTLB using a 1:1
mapping.

Reviewed-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/virtio/trace-events |  4 +++
 hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+)

diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 1f7a7c1..54dcbb3 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
 virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
 virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
 virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
+
+# hw/virtio/vhost.c
+vhost_iommu_commit(void) ""
+vhost_iommu_static_preheat(void) ""
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index 03a46a7..d03d720 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -27,6 +27,7 @@
 #include "hw/virtio/virtio-access.h"
 #include "migration/blocker.h"
 #include "sysemu/dma.h"
+#include "trace.h"
 
 /* enabled until disconnected backend stabilizes */
 #define _VHOST_DEBUG 1
@@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
     }
 }
 
+static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
+{
+    return !QLIST_EMPTY(&dev->iommu_list);
+}
+
 static void vhost_iommu_region_add(MemoryListener *listener,
                                    MemoryRegionSection *section)
 {
@@ -782,6 +788,65 @@ static void vhost_iommu_region_del(MemoryListener *listener,
     }
 }
 
+static void vhost_iommu_commit(MemoryListener *listener)
+{
+    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
+                                         iommu_listener);
+    struct vhost_memory_region *r;
+    int i;
+
+    trace_vhost_iommu_commit();
+
+    if (!vhost_iommu_mr_enabled(dev)) {
+        /*
+        * This means iommu_platform is enabled, however iommu memory
+        * region is disabled, e.g., when device passthrough is setup.
+        * Then, no translation is needed any more.
+        *
+        * Let's first invalidate the whole IOTLB, then pre-heat the
+        * static mapping by looping over vhost memory ranges.
+        */
+
+        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
+                                                          UINT64_MAX)) {
+            error_report("%s: flush existing IOTLB failed", __func__);
+            return;
+        }
+
+        /*
+         * Current VHOST_IOTLB_INVALIDATE API has a small defect that
+         * the invalidation for (start=0, size=UINT64_MAX) cannot
+         * really invalidate an cached range of (start=UINT64_MAX-1,
+         * size=1). We send this 2nd invalidation to workaround this.
+         * But, frankly speaking for QEMU we don't have a problem with
+         * this since we will never have a vhost cache with range
+         * (start=UINT64_MAX-1, size=1) - if you see
+         * address_space_get_iotlb_entry() all IOTLBs are page
+         * aligned.
+         */
+        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, UINT64_MAX,
+                                                          1)) {
+            error_report("%s: flush existing IOTLB failed", __func__);
+            return;
+        }
+
+        for (i = 0; i < dev->mem->nregions; i++) {
+            r = &dev->mem->regions[i];
+            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
+            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
+                                                          r->guest_phys_addr,
+                                                          r->userspace_addr,
+                                                          r->memory_size,
+                                                          IOMMU_RW)) {
+                error_report("%s: pre-heat static mapping failed", __func__);
+                return;
+            }
+        }
+
+        trace_vhost_iommu_static_preheat();
+    }
+}
+
 static void vhost_region_nop(MemoryListener *listener,
                              MemoryRegionSection *section)
 {
@@ -1298,6 +1363,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
     hdev->iommu_listener = (MemoryListener) {
         .region_add = vhost_iommu_region_add,
         .region_del = vhost_iommu_region_del,
+        .commit = vhost_iommu_commit,
     };
 
     if (hdev->migration_blocker == NULL) {
-- 
2.7.4


* Re: [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()
  2017-06-02 11:50 [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Peter Xu
                   ` (2 preceding siblings ...)
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is Peter Xu
@ 2017-06-02 14:51 ` Michael S. Tsirkin
  2017-06-05  3:20   ` Peter Xu
  3 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-02 14:51 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 07:50:51PM +0800, Peter Xu wrote:
> With the patch applied:
> 
>   [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
>   (already in Paolo's pull request but not yet merged)
> 
> Now we can have valid address masks. However it is still not ideal,
> considering that the mask may not be aligned to guest page sizes. One
> example would be when huge page is used in guest (please see commit
> message in patch 1 for details). It applies to normal pages too. So we
> not only need a valid address mask, we should make sure it is page
> mask (for x86, it should be either 4K/2M/1G pages).

Why should we? To get better performance, right?

> Patch 1+2 fixes the problem. Tested with both kernel net driver or
> testpmd, on either 4K/2M pages, to make sure the page mask is correct.
> 
> Patch 3 is cherry picked from PT series, after fixing from 1+2, we'll
> definitely want patch 3 now. Here's the simplest TCP streaming test
> using vhost dmar and iommu=pt in guest:
> 
>   without patch 3:    12.0Gbps

And what happens without patches 1-2?

>   with patch 3:       33.5Gbps

This is the part I don't get. Patches 1-2 will return a bigger region to
callers. The result should be better performance - instead it seems to
slow down vhost for some reason and we need tricks to get
performance back. What's going on?

> Please review, thanks.
> 
> Peter Xu (3):
>   exec: add page_mask for address_space_do_translate
>   exec: simplify address_space_get_iotlb_entry
>   vhost: iommu: cache static mapping if there is
> 
>  exec.c                 | 73 +++++++++++++++++++++++++++++++++-----------------
>  hw/virtio/trace-events |  4 +++
>  hw/virtio/vhost.c      | 66 +++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 119 insertions(+), 24 deletions(-)
> 
> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is Peter Xu
@ 2017-06-02 15:45   ` Michael S. Tsirkin
  2017-06-05  3:15     ` Peter Xu
  2017-06-02 16:51   ` Michael S. Tsirkin
  1 sibling, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-02 15:45 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> 
> Sometimes, even if user specified iommu_platform for vhost devices,
> IOMMU might still be disabled. One case is passthrough mode in VT-d
> implementation. We can detect this by observing iommu_list. If it's
> empty, it means IOMMU translation is disabled, then we can actually
> pre-heat the translation (it'll be static mapping then) by first
> invalidating all IOTLB, then cache existing memory ranges into vhost
> backend iotlb using 1:1 mapping.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

This is still a hack I think. What if there's an invalidation?
I think the right thing is to send updates only when requested,
but send the largest mapping including the iova, not from iova until the
end of the page. Thoughts?

> ---
>  hw/virtio/trace-events |  4 +++
>  hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 1f7a7c1..54dcbb3 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> +
> +# hw/virtio/vhost.c
> +vhost_iommu_commit(void) ""
> +vhost_iommu_static_preheat(void) ""
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 03a46a7..d03d720 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -27,6 +27,7 @@
>  #include "hw/virtio/virtio-access.h"
>  #include "migration/blocker.h"
>  #include "sysemu/dma.h"
> +#include "trace.h"
>  
>  /* enabled until disconnected backend stabilizes */
>  #define _VHOST_DEBUG 1
> @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      }
>  }
>  
> +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> +{
> +    return !QLIST_EMPTY(&dev->iommu_list);
> +}
> +
>  static void vhost_iommu_region_add(MemoryListener *listener,
>                                     MemoryRegionSection *section)
>  {
> @@ -782,6 +788,65 @@ static void vhost_iommu_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static void vhost_iommu_commit(MemoryListener *listener)
> +{
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         iommu_listener);
> +    struct vhost_memory_region *r;
> +    int i;
> +
> +    trace_vhost_iommu_commit();
> +
> +    if (!vhost_iommu_mr_enabled(dev)) {
> +        /*
> +        * This means iommu_platform is enabled, however iommu memory
> +        * region is disabled, e.g., when device passthrough is setup.
> +        * Then, no translation is needed any more.
> +        *
> +        * Let's first invalidate the whole IOTLB, then pre-heat the
> +        * static mapping by looping over vhost memory ranges.
> +        */
> +
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> +                                                          UINT64_MAX)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        /*
> +         * Current VHOST_IOTLB_INVALIDATE API has a small defect that
> +         * the invalidation for (start=0, size=UINT64_MAX) cannot
> +         * really invalidate an cached range of (start=UINT64_MAX-1,
> +         * size=1). We send this 2nd invalidation to workaround this.
> +         * But, frankly speaking for QEMU we don't have a problem with
> +         * this since we will never have a vhost cache with range
> +         * (start=UINT64_MAX-1, size=1) - if you see
> +         * address_space_get_iotlb_entry() all IOTLBs are page
> +         * aligned.
> +         */
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, UINT64_MAX,
> +                                                          1)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        for (i = 0; i < dev->mem->nregions; i++) {
> +            r = &dev->mem->regions[i];
> +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> +                                                          r->guest_phys_addr,
> +                                                          r->userspace_addr,
> +                                                          r->memory_size,
> +                                                          IOMMU_RW)) {
> +                error_report("%s: pre-heat static mapping failed", __func__);
> +                return;
> +            }
> +        }
> +
> +        trace_vhost_iommu_static_preheat();
> +    }
> +}
> +
>  static void vhost_region_nop(MemoryListener *listener,
>                               MemoryRegionSection *section)
>  {
> @@ -1298,6 +1363,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>      hdev->iommu_listener = (MemoryListener) {
>          .region_add = vhost_iommu_region_add,
>          .region_del = vhost_iommu_region_del,
> +        .commit = vhost_iommu_commit,
>      };
>  
>      if (hdev->migration_blocker == NULL) {
> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate Peter Xu
@ 2017-06-02 16:45   ` Michael S. Tsirkin
  2017-06-05  2:52     ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-02 16:45 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 07:50:52PM +0800, Peter Xu wrote:
> The function is originally used for address_space_translate() and what
> we care about most is (xlat, plen) range. However for iotlb requests, we
> don't really care about "plen", but the size of the page that "xlat" is
> located on. While, plen cannot really contain this information.
> 
> A simple example to show why "plen" is not good for IOTLB translations:
> 
> E.g., for huge pages, it is possible that guest mapped 1G huge page on
> device side that used this GPA range:
> 
>   0x100000000 - 0x13fffffff
> 
> Then let's say we want to translate one IOVA that finally mapped to GPA
> 0x13ffffe00 (which is located on this 1G huge page). Then here we'll
> get:
> 
>   (xlat, plen) = (0x13fffe00, 0x200)
> 
> So the IOTLB would be only covering a very small range since from
> "plen" (which is 0x200 bytes) we cannot tell the size of the page.
> 
> Actually we can really know that this is a huge page - we just throw the
> information away in address_space_do_translate().
> 
> This patch introduced "page_mask" optional parameter to capture that
> page mask info. Also, I made "plen" an optional parameter as well, with
> some comments for the whole function.
> 
> No functional change yet.
> 
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  exec.c | 46 ++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 40 insertions(+), 6 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 8fc0e78..63a3ff0 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -465,21 +465,45 @@ address_space_translate_internal(AddressSpaceDispatch *d, hwaddr addr, hwaddr *x
>      return section;
>  }
>  
> -/* Called from RCU critical section */
> +/**
> + * address_space_do_translate - translate an address in AddressSpace
> + *
> + * @as: the address space that we want to translate on
> + * @addr: the address to be translated in above address space
> + * @xlat: the translated address offset within memory region. It
> + *        cannot be @NULL.
> + * @plen_out: valid read/write length of the translated address. It
> + *            can be @NULL when we don't care about it.
> + * @page_mask_out: page mask for the translated address. This
> + *            should only be meaningful for IOMMU translated
> + *            addresses, since there may be huge pages that this bit
> + *            would tell. It can be @NULL if we don't care about it.

Why do we need plen or mask at all? It seems MemoryRegionSection
has address and length already. So if you want to find out
distance to section end, do section.size - xlat and you are done.


> + * @is_write: whether the translation operation is for write
> + * @is_mmio: whether this can be MMIO, set true if it can
> + *
> + * This function is called from RCU critical section
> + */
>  static MemoryRegionSection address_space_do_translate(AddressSpace *as,
>                                                        hwaddr addr,
>                                                        hwaddr *xlat,
> -                                                      hwaddr *plen,
> +                                                      hwaddr *plen_out,
> +                                                      hwaddr *page_mask_out,
>                                                        bool is_write,
>                                                        bool is_mmio)
>  {
>      IOMMUTLBEntry iotlb;
>      MemoryRegionSection *section;
>      MemoryRegion *mr;
> +    hwaddr page_mask = TARGET_PAGE_MASK;
> +    hwaddr plen = (hwaddr)(-1);
> +
> +    if (plen_out) {
> +        plen = *plen_out;
> +    }
>  
>      for (;;) {
>          AddressSpaceDispatch *d = atomic_rcu_read(&as->dispatch);
> -        section = address_space_translate_internal(d, addr, &addr, plen, is_mmio);
> +        section = address_space_translate_internal(d, addr, &addr, &plen, is_mmio);
>          mr = section->mr;
>  
>          if (!mr->iommu_ops) {
> @@ -490,7 +514,8 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
>                                           IOMMU_WO : IOMMU_RO);
>          addr = ((iotlb.translated_addr & ~iotlb.addr_mask)
>                  | (addr & iotlb.addr_mask));
> -        *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);
> +        page_mask = iotlb.addr_mask;
> +        plen = MIN(plen, (addr | iotlb.addr_mask) - addr + 1);
>          if (!(iotlb.perm & (1 << is_write))) {
>              goto translate_fail;
>          }
> @@ -500,6 +525,14 @@ static MemoryRegionSection address_space_do_translate(AddressSpace *as,
>  
>      *xlat = addr;
>  
> +    if (page_mask_out) {
> +        *page_mask_out = page_mask;
> +    }
> +
> +    if (plen_out) {
> +        *plen_out = plen;
> +    }
> +
>      return *section;
>  
>  translate_fail:
> @@ -518,7 +551,7 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>  
>      /* This can never be MMIO. */
>      section = address_space_do_translate(as, addr, &xlat, &plen,
> -                                         is_write, false);
> +                                         NULL, is_write, false);
>  
>      /* Illegal translation */
>      if (section.mr == &io_mem_unassigned) {
> @@ -560,7 +593,8 @@ MemoryRegion *address_space_translate(AddressSpace *as, hwaddr addr,
>      MemoryRegionSection section;
>  
>      /* This can be MMIO, so setup MMIO bit. */
> -    section = address_space_do_translate(as, addr, xlat, plen, is_write, true);
> +    section = address_space_do_translate(as, addr, xlat, plen,
> +                                         NULL, is_write, true);
>      mr = section.mr;
>  
>      if (xen_enabled() && memory_access_is_direct(mr, is_write)) {
> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry Peter Xu
@ 2017-06-02 16:49   ` Michael S. Tsirkin
  2017-06-05  3:07     ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-02 16:49 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 07:50:53PM +0800, Peter Xu wrote:
> This patch let address_space_get_iotlb_entry() to use the newly
> introduced page_mask parameter in address_space_do_translate(). Then we
> will be sure the IOTLB can be aligned to page mask, also we should
> nicely support huge pages now when introducing a764040.
> 
> Fixes: a764040 ("exec: abstract address_space_do_translate()")
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  exec.c | 29 ++++++++++-------------------
>  1 file changed, 10 insertions(+), 19 deletions(-)
> 
> diff --git a/exec.c b/exec.c
> index 63a3ff0..1f86253 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -544,14 +544,14 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>                                              bool is_write)
>  {
>      MemoryRegionSection section;
> -    hwaddr xlat, plen;
> +    hwaddr xlat, page_mask;
>  
> -    /* Try to get maximum page mask during translation. */
> -    plen = (hwaddr)-1;
> -
> -    /* This can never be MMIO. */
> -    section = address_space_do_translate(as, addr, &xlat, &plen,
> -                                         NULL, is_write, false);
> +    /*
> +     * This can never be MMIO, and we don't really care about plen,
> +     * but page mask.
> +     */
> +    section = address_space_do_translate(as, addr, &xlat, NULL,
> +                                         &page_mask, is_write, false);
>  
>      /* Illegal translation */
>      if (section.mr == &io_mem_unassigned) {


Can we just use section.size - xlat here?

> @@ -562,20 +562,11 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
>      xlat += section.offset_within_address_space -
>          section.offset_within_region;
>  
> -    if (plen == (hwaddr)-1) {
> -        /* If not specified during translation, use default mask */
> -        plen = TARGET_PAGE_MASK;
> -    } else {
> -        /* Make it a valid page mask */
> -        assert(plen);
> -        plen = pow2floor(plen) - 1;
> -    }
> -
>      return (IOMMUTLBEntry) {
>          .target_as = section.address_space,
> -        .iova = addr & ~plen,
> -        .translated_addr = xlat & ~plen,
> -        .addr_mask = plen,
> +        .iova = addr & ~page_mask,
> +        .translated_addr = xlat & ~page_mask,
> +        .addr_mask = page_mask,
>          /* IOTLBs are for DMAs, and DMA only allows on RAMs. */

BTW this comment is pretty confusing. What does it mean?

>          .perm = IOMMU_RW,
>      };

Looks like we should change IOMMUTLBEntry to pass size and not mask -
then we could simply pass info from section as is. iova would be
addr - xlat.

> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is
  2017-06-02 11:50 ` [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is Peter Xu
  2017-06-02 15:45   ` Michael S. Tsirkin
@ 2017-06-02 16:51   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-02 16:51 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> 
> Sometimes, even if user specified iommu_platform for vhost devices,
> IOMMU might still be disabled. One case is passthrough mode in VT-d
> implementation. We can detect this by observing iommu_list. If it's
> empty, it means IOMMU translation is disabled, then we can actually
> pre-heat the translation (it'll be static mapping then) by first
> invalidating all IOTLB, then cache existing memory ranges into vhost
> backend iotlb using 1:1 mapping.
> 
> Reviewed-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>

Can't say I like this; it does not help the more important use case of
dpdk, which has a small static mapping that can't be detected by QEMU.
vhost should just heat up whatever's actually used. Why isn't this enough?

> ---
>  hw/virtio/trace-events |  4 +++
>  hw/virtio/vhost.c      | 66 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+)
> 
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 1f7a7c1..54dcbb3 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -24,3 +24,7 @@ virtio_balloon_handle_output(const char *name, uint64_t gpa) "section name: %s g
>  virtio_balloon_get_config(uint32_t num_pages, uint32_t actual) "num_pages: %d actual: %d"
>  virtio_balloon_set_config(uint32_t actual, uint32_t oldactual) "actual: %d oldactual: %d"
>  virtio_balloon_to_target(uint64_t target, uint32_t num_pages) "balloon target: %"PRIx64" num_pages: %d"
> +
> +# hw/virtio/vhost.c
> +vhost_iommu_commit(void) ""
> +vhost_iommu_static_preheat(void) ""
> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
> index 03a46a7..d03d720 100644
> --- a/hw/virtio/vhost.c
> +++ b/hw/virtio/vhost.c
> @@ -27,6 +27,7 @@
>  #include "hw/virtio/virtio-access.h"
>  #include "migration/blocker.h"
>  #include "sysemu/dma.h"
> +#include "trace.h"
>  
>  /* enabled until disconnected backend stabilizes */
>  #define _VHOST_DEBUG 1
> @@ -730,6 +731,11 @@ static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>      }
>  }
>  
> +static bool vhost_iommu_mr_enabled(struct vhost_dev *dev)
> +{
> +    return !QLIST_EMPTY(&dev->iommu_list);
> +}
> +
>  static void vhost_iommu_region_add(MemoryListener *listener,
>                                     MemoryRegionSection *section)
>  {
> @@ -782,6 +788,65 @@ static void vhost_iommu_region_del(MemoryListener *listener,
>      }
>  }
>  
> +static void vhost_iommu_commit(MemoryListener *listener)
> +{
> +    struct vhost_dev *dev = container_of(listener, struct vhost_dev,
> +                                         iommu_listener);
> +    struct vhost_memory_region *r;
> +    int i;
> +
> +    trace_vhost_iommu_commit();
> +
> +    if (!vhost_iommu_mr_enabled(dev)) {
> +        /*
> +        * This means iommu_platform is enabled, however iommu memory
> +        * region is disabled, e.g., when device passthrough is setup.
> +        * Then, no translation is needed any more.
> +        *
> +        * Let's first invalidate the whole IOTLB, then pre-heat the
> +        * static mapping by looping over vhost memory ranges.
> +        */
> +
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, 0,
> +                                                          UINT64_MAX)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        /*
> +         * Current VHOST_IOTLB_INVALIDATE API has a small defect that
> +         * the invalidation for (start=0, size=UINT64_MAX) cannot
> +         * really invalidate an cached range of (start=UINT64_MAX-1,
> +         * size=1). We send this 2nd invalidation to workaround this.
> +         * But, frankly speaking for QEMU we don't have a problem with
> +         * this since we will never have a vhost cache with range
> +         * (start=UINT64_MAX-1, size=1) - if you see
> +         * address_space_get_iotlb_entry() all IOTLBs are page
> +         * aligned.
> +         */
> +        if (dev->vhost_ops->vhost_invalidate_device_iotlb(dev, UINT64_MAX,
> +                                                          1)) {
> +            error_report("%s: flush existing IOTLB failed", __func__);
> +            return;
> +        }
> +
> +        for (i = 0; i < dev->mem->nregions; i++) {
> +            r = &dev->mem->regions[i];
> +            /* Vhost regions are writable RAM, so IOMMU_RW suites. */
> +            if (dev->vhost_ops->vhost_update_device_iotlb(dev,
> +                                                          r->guest_phys_addr,
> +                                                          r->userspace_addr,
> +                                                          r->memory_size,
> +                                                          IOMMU_RW)) {
> +                error_report("%s: pre-heat static mapping failed", __func__);
> +                return;
> +            }
> +        }
> +
> +        trace_vhost_iommu_static_preheat();
> +    }
> +}
> +
>  static void vhost_region_nop(MemoryListener *listener,
>                               MemoryRegionSection *section)
>  {
> @@ -1298,6 +1363,7 @@ int vhost_dev_init(struct vhost_dev *hdev, void *opaque,
>      hdev->iommu_listener = (MemoryListener) {
>          .region_add = vhost_iommu_region_add,
>          .region_del = vhost_iommu_region_del,
> +        .commit = vhost_iommu_commit,
>      };
>  
>      if (hdev->migration_blocker == NULL) {
> -- 
> 2.7.4


* Re: [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate
  2017-06-02 16:45   ` Michael S. Tsirkin
@ 2017-06-05  2:52     ` Peter Xu
  0 siblings, 0 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-05  2:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 07:45:05PM +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 02, 2017 at 07:50:52PM +0800, Peter Xu wrote:
> > The function is originally used for address_space_translate() and what
> > we care about most is (xlat, plen) range. However for iotlb requests, we
> > don't really care about "plen", but the size of the page that "xlat" is
> > located on. While, plen cannot really contain this information.
> > 
> > A simple example to show why "plen" is not good for IOTLB translations:
> > 
> > E.g., for huge pages, it is possible that guest mapped 1G huge page on
> > device side that used this GPA range:
> > 
> >   0x100000000 - 0x13fffffff
> > 
> > Then let's say we want to translate one IOVA that finally mapped to GPA
> > 0x13ffffe00 (which is located on this 1G huge page). Then here we'll
> > get:
> > 
> >   (xlat, plen) = (0x13fffe00, 0x200)
> > 
> > So the IOTLB would be only covering a very small range since from
> > "plen" (which is 0x200 bytes) we cannot tell the size of the page.
> > 
> > Actually we can really know that this is a huge page - we just throw the
> > information away in address_space_do_translate().
> > 
> > This patch introduced "page_mask" optional parameter to capture that
> > page mask info. Also, I made "plen" an optional parameter as well, with
> > some comments for the whole function.
> > 
> > No functional change yet.
> > 
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  exec.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> >  1 file changed, 40 insertions(+), 6 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 8fc0e78..63a3ff0 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -465,21 +465,45 @@ address_space_translate_internal(AddressSpaceDispatch *d, hwaddr addr, hwaddr *x
> >      return section;
> >  }
> >  
> > -/* Called from RCU critical section */
> > +/**
> > + * address_space_do_translate - translate an address in AddressSpace
> > + *
> > + * @as: the address space that we want to translate on
> > + * @addr: the address to be translated in above address space
> > + * @xlat: the translated address offset within memory region. It
> > + *        cannot be @NULL.
> > + * @plen_out: valid read/write length of the translated address. It
> > + *            can be @NULL when we don't care about it.
> > + * @page_mask_out: page mask for the translated address. This
> > + *            should only be meaningful for IOMMU translated
> > + *            addresses, since there may be huge pages that this bit
> > + *            would tell. It can be @NULL if we don't care about it.
> 
> Why do we need plen or mask at all? It seems MemoryRegionSection
> has address and length already. So if you want to find out
> distance to section end, do section.size - xlat and you are done.

Hi, Michael,

When you say:

  section.size - xlat

Do you really mean this?

  section.offset_within_address_space + section.size - xlat

Otherwise it does not make much sense to me.

Anyway, I am not sure whether it would be okay to remove plen...

In address_space_do_translate(), the logic is basically:

1. do internal translation (basically to find the section info from
   current address space)
2. do IOMMU translation if the MR is IOMMU typed
3. goto 1.

Along the way (1 -> 2 -> 3 -> 1 -> ...), until the translation finishes (I
don't really know whether we'll have cases of nested IOMMU translation, but
anyway we have a loop there, so assume it can be executed many times), plen
can keep shrinking, either by this in address_space_translate_internal():

    *plen = int128_get64(int128_min(diff, int128_make64(*plen)));

Or this in address_space_do_translate():

    *plen = MIN(*plen, (addr | iotlb.addr_mask) - addr + 1);

And I don't know whether using only the final section.size to decide plen
would be enough.

Also, for the page_mask information - I am not sure MemoryRegionSection can
express that info. Again, huge pages are one example: MemoryRegionSection
doesn't really contain huge page information, while
MemoryRegionIOMMUOps.translate() does (via the addr_mask field).

(I see that you would like the IOTLB to use an arbitrary length rather
 than page masks. Maybe we can first decide which would be the best
 interface for the IOTLB. I'll reply in that context later.)

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-02 16:49   ` Michael S. Tsirkin
@ 2017-06-05  3:07     ` Peter Xu
  2017-06-06 14:34       ` Paolo Bonzini
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-05  3:07 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang, David Gibson

On Fri, Jun 02, 2017 at 07:49:58PM +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 02, 2017 at 07:50:53PM +0800, Peter Xu wrote:
> > This patch let address_space_get_iotlb_entry() to use the newly
> > introduced page_mask parameter in address_space_do_translate(). Then we
> > will be sure the IOTLB can be aligned to page mask, also we should
> > nicely support huge pages now when introducing a764040.
> > 
> > Fixes: a764040 ("exec: abstract address_space_do_translate()")
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  exec.c | 29 ++++++++++-------------------
> >  1 file changed, 10 insertions(+), 19 deletions(-)
> > 
> > diff --git a/exec.c b/exec.c
> > index 63a3ff0..1f86253 100644
> > --- a/exec.c
> > +++ b/exec.c
> > @@ -544,14 +544,14 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
> >                                              bool is_write)
> >  {
> >      MemoryRegionSection section;
> > -    hwaddr xlat, plen;
> > +    hwaddr xlat, page_mask;
> >  
> > -    /* Try to get maximum page mask during translation. */
> > -    plen = (hwaddr)-1;
> > -
> > -    /* This can never be MMIO. */
> > -    section = address_space_do_translate(as, addr, &xlat, &plen,
> > -                                         NULL, is_write, false);
> > +    /*
> > +     * This can never be MMIO, and we don't really care about plen,
> > +     * but page mask.
> > +     */
> > +    section = address_space_do_translate(as, addr, &xlat, NULL,
> > +                                         &page_mask, is_write, false);
> >  
> >      /* Illegal translation */
> >      if (section.mr == &io_mem_unassigned) {
> 
> 
> Can we just use section.size - xlat here?

I replied in the other thread with what I think... so I will skip it
here.

> 
> > @@ -562,20 +562,11 @@ IOMMUTLBEntry address_space_get_iotlb_entry(AddressSpace *as, hwaddr addr,
> >      xlat += section.offset_within_address_space -
> >          section.offset_within_region;
> >  
> > -    if (plen == (hwaddr)-1) {
> > -        /* If not specified during translation, use default mask */
> > -        plen = TARGET_PAGE_MASK;
> > -    } else {
> > -        /* Make it a valid page mask */
> > -        assert(plen);
> > -        plen = pow2floor(plen) - 1;
> > -    }
> > -
> >      return (IOMMUTLBEntry) {
> >          .target_as = section.address_space,
> > -        .iova = addr & ~plen,
> > -        .translated_addr = xlat & ~plen,
> > -        .addr_mask = plen,
> > +        .iova = addr & ~page_mask,
> > +        .translated_addr = xlat & ~page_mask,
> > +        .addr_mask = page_mask,
> >          /* IOTLBs are for DMAs, and DMA only allows on RAMs. */
> 
> BTW this comment is pretty confusing. What does it mean?

This function, address_space_get_iotlb_entry(), is for a device to get an
IOTLB entry when it requests page translations for DMA, and DMA should only
be allowed to RAM, right? So we need IOMMU_RW permission here.

Maybe I should add one more check above on the returned MR - it should be
RAM typed as well. But I don't really know whether that would be too
strict: if the guest sets up the IOMMU page table so that an IOVA points to
a non-RAM region, I think that should still be legal from the hypervisor's
POV (I see it as a guest OS bug though)?

> 
> >          .perm = IOMMU_RW,
> >      };
> 
> Looks like we should change IOMMUTLBEntry to pass size and not mask -
> then we could simply pass info from section as is. iova would be
> addr - xlat.

I am not sure that would be a good interface for the IOTLB. AFAIU, at least
for VT-d, the IOMMU translation is page aligned as defined by the spec, so
it makes sense that (again, at least for VT-d) we'd better just use
page_mask/addr_mask here.

That's also my understanding of IOMMUs in general - I assume they always
translate with page masks (never arbitrary lengths), though the page size
can differ from platform to platform; that's why the IOTLB interface here
uses addr_mask, so that it works for all platforms. I don't know whether
I'm 100% correct here though.
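
For instance, with a 2M VT-d mapping the entry built by patch 2 would come
out page aligned no matter where the DMA lands (a sketch with made-up
addresses, variable names as in patch 2):

  addr      = 0x12345600;              /* IOVA used by the device      */
  page_mask = 0x1fffff;                /* addr_mask from the IOMMU, 2M */
  iova            = addr & ~page_mask; /* = 0x12200000                 */
  translated_addr = xlat & ~page_mask; /* GPA of the whole 2M page     */
  addr_mask       = page_mask;         /* entry covers the full 2M     */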

Maybe David/Paolo/... would comment as well?

(CC David)

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is
  2017-06-02 15:45   ` Michael S. Tsirkin
@ 2017-06-05  3:15     ` Peter Xu
  2017-06-05  4:07       ` Jason Wang
  2017-06-05 15:05       ` Michael S. Tsirkin
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-05  3:15 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 06:45:05PM +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
> > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> > 
> > Sometimes, even if user specified iommu_platform for vhost devices,
> > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > implementation. We can detect this by observing iommu_list. If it's
> > empty, it means IOMMU translation is disabled, then we can actually
> > pre-heat the translation (it'll be static mapping then) by first
> > invalidating all IOTLB, then cache existing memory ranges into vhost
> > backend iotlb using 1:1 mapping.
> > 
> > Reviewed-by: Jason Wang <jasowang@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> 
> This is still a hack I think. What if there's an invalidation?
> I think the right thing is to send updates only when requested,
> but sent the largest mapping including the iova, not from iova until end
> of page. Thoughts?

Indeed it's kind of a hack, but it does not hurt anything and will
definitely boost performance in most cases...

Yes, "send the largest mapping including the iova" is okay, but the first
IO on a region would still be delayed, so IMHO it's not the best solution
either. I think the best solution should be (for sure) that vhost knows
it's PT and just skips the translation completely. I'm just not sure
whether there's a simple/good way to do this.

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()
  2017-06-02 14:51 ` [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Michael S. Tsirkin
@ 2017-06-05  3:20   ` Peter Xu
  2017-06-06 15:29     ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-05  3:20 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Fri, Jun 02, 2017 at 05:51:07PM +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 02, 2017 at 07:50:51PM +0800, Peter Xu wrote:
> > With the patch applied:
> > 
> >   [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
> >   (already in Paolo's pull request but not yet merged)
> > 
> > Now we can have valid address masks. However it is still not ideal,
> > considering that the mask may not be aligned to guest page sizes. One
> > example would be when huge page is used in guest (please see commit
> > message in patch 1 for details). It applies to normal pages too. So we
> > not only need a valid address mask, we should make sure it is page
> > mask (for x86, it should be either 4K/2M/1G pages).
> 
> Why should we? To get better performance, right?

IMHO one point is performance, the other is how we should define the IOTLB
interface. My opinion is that valid page masks are the better choice.

> 
> > Patch 1+2 fixes the problem. Tested with both kernel net driver or
> > testpmd, on either 4K/2M pages, to make sure the page mask is correct.
> > 
> > Patch 3 is cherry picked from PT series, after fixing from 1+2, we'll
> > definitely want patch 3 now. Here's the simplest TCP streaming test
> > using vhost dmar and iommu=pt in guest:
> > 
> >   without patch 3:    12.0Gbps
> 
> And what happens without patches 1-2?

Without 1-2, performance is good. But I think it is hacky to have such
a good result (I explained why the performance is good in the VT-d PT
support thread with some logs)...

> 
> >   with patch 3:       33.5Gbps
> 
> This is the part I don't get. Patches 1-2 will return a bigger region to
> callers. The result should be better performance - instead it seems to
> slow down vhost for some reason and we need tricks to get
> performance back. What's going on?

Yes. The problem is that without patches 1-2 I think the code lacks
correctness. With correctness we lose performance, so I picked up patch 3
as well.

Again, I think the first thing we need to settle is the best definition for
the IOTLB interface (addr_mask or an arbitrary length).

Thanks,

-- 
Peter Xu


* Re: [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is
  2017-06-05  3:15     ` Peter Xu
@ 2017-06-05  4:07       ` Jason Wang
  2017-06-05 15:05       ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Jason Wang @ 2017-06-05  4:07 UTC (permalink / raw)
  To: Peter Xu, Michael S. Tsirkin; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin



On 2017-06-05 11:15, Peter Xu wrote:
> On Fri, Jun 02, 2017 at 06:45:05PM +0300, Michael S. Tsirkin wrote:
>> On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
>>> This patch pre-heat vhost iotlb cache when passthrough mode enabled.
>>>
>>> Sometimes, even if user specified iommu_platform for vhost devices,
>>> IOMMU might still be disabled. One case is passthrough mode in VT-d
>>> implementation. We can detect this by observing iommu_list. If it's
>>> empty, it means IOMMU translation is disabled, then we can actually
>>> pre-heat the translation (it'll be static mapping then) by first
>>> invalidating all IOTLB, then cache existing memory ranges into vhost
>>> backend iotlb using 1:1 mapping.
>>>
>>> Reviewed-by: Jason Wang <jasowang@redhat.com>
>>> Signed-off-by: Peter Xu <peterx@redhat.com>
>> This is still a hack I think. What if there's an invalidation?
>> I think the right thing is to send updates only when requested,
>> but sent the largest mapping including the iova, not from iova until end
>> of page. Thoughts?
> Indeed it's kind of a hack, but it does not hurt anything but will
> definitely boost performance in most cases...
>
> Yes "sent the largest mapping including the iova" is okay, but the
> first IO on one region would be delayed as well, so IMHO it's not the
> best solution as well. I think the best solution should be (for sure)
> that vhost knows it's PT, then it just skips the translation
> completely. I just don't sure whether there's simple/good way to do
> this.
>
> Thanks,

We can disable the device IOTLB completely in this case. But it looks
like there's a minor kernel bug that prevents us from doing this. Let
me post a fix and let's see then.

Thanks

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is
  2017-06-05  3:15     ` Peter Xu
  2017-06-05  4:07       ` Jason Wang
@ 2017-06-05 15:05       ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-05 15:05 UTC (permalink / raw)
  To: Peter Xu; +Cc: qemu-devel, Paolo Bonzini, Maxime Coquelin, Jason Wang

On Mon, Jun 05, 2017 at 11:15:11AM +0800, Peter Xu wrote:
> On Fri, Jun 02, 2017 at 06:45:05PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 02, 2017 at 07:50:54PM +0800, Peter Xu wrote:
> > > This patch pre-heat vhost iotlb cache when passthrough mode enabled.
> > > 
> > > Sometimes, even if user specified iommu_platform for vhost devices,
> > > IOMMU might still be disabled. One case is passthrough mode in VT-d
> > > implementation. We can detect this by observing iommu_list. If it's
> > > empty, it means IOMMU translation is disabled, then we can actually
> > > pre-heat the translation (it'll be static mapping then) by first
> > > invalidating all IOTLB, then cache existing memory ranges into vhost
> > > backend iotlb using 1:1 mapping.
> > > 
> > > Reviewed-by: Jason Wang <jasowang@redhat.com>
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > 
> > This is still a hack I think. What if there's an invalidation?
> > I think the right thing is to send updates only when requested,
> > but sent the largest mapping including the iova, not from iova until end
> > of page. Thoughts?
> 
> Indeed it's kind of a hack, but it does not hurt anything but will
> definitely boost performance in most cases...
> 
> Yes "sent the largest mapping including the iova" is okay, but the
> first IO on one region would be delayed as well, so IMHO it's not the
> best solution as well. I think the best solution should be (for sure)
> that vhost knows it's PT, then it just skips the translation
> completely. I just don't sure whether there's simple/good way to do
> this.
> 
> Thanks,

If you send the whole 64 bit area, then the backend can detect this
easily.
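
Purely as an illustration of that idea (hypothetical, and not what
patch 3 currently does): the "whole 64 bit area" update would collapse
into a single identity entry, e.g.

  /* one 1:1 entry spanning the entire IOVA space */
  IOMMUTLBEntry entry = {
      .iova            = 0,
      .translated_addr = 0,
      .addr_mask       = ~(hwaddr)0,
      .perm            = IOMMU_RW,
  };

and a backend that sees addr_mask == ~0 could skip per-access
translation entirely, which seems to be the detection suggested here.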


> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-05  3:07     ` Peter Xu
@ 2017-06-06 14:34       ` Paolo Bonzini
  2017-06-06 23:47         ` David Gibson
  0 siblings, 1 reply; 34+ messages in thread
From: Paolo Bonzini @ 2017-06-06 14:34 UTC (permalink / raw)
  To: Peter Xu, Michael S. Tsirkin
  Cc: qemu-devel, Maxime Coquelin, Jason Wang, David Gibson



On 05/06/2017 05:07, Peter Xu wrote:
> I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> least for VT-d, the IOMMU translation is page aligned which is defined
> by spec, so it makes sense that (again at least for VT-d) here we'd
> better just use page_mask/addr_mask.
> 
> That's also how I know about IOMMU in general - I assume it do the
> translations always with page masks (never arbitary length), though
> page size can differ from platfrom to platform, that's why here the
> IOTLB interface used addr_mask, then it works for all platforms. I
> don't know whether I'm 100% correct here though.
> 
> Maybe David/Paolo/... would comment as well?

I would ask David.  There are PowerPC MMUs that allow fast lookup of
arbitrarily-sized windows (not necessarily power of two), so maybe the
IOMMUs can do the same.

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry()
  2017-06-05  3:20   ` Peter Xu
@ 2017-06-06 15:29     ` Michael S. Tsirkin
  0 siblings, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-06 15:29 UTC (permalink / raw)
  To: Peter Xu; +Cc: Jason Wang, Paolo Bonzini, Maxime Coquelin, qemu-devel

On Mon, Jun 05, 2017 at 11:20:13AM +0800, Peter Xu wrote:
> On Fri, Jun 02, 2017 at 05:51:07PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 02, 2017 at 07:50:51PM +0800, Peter Xu wrote:
> > > With the patch applied:
> > > 
> > >   [PATCH v3] exec: fix address_space_get_iotlb_entry page mask
> > >   (already in Paolo's pull request but not yet merged)
> > > 
> > > Now we can have valid address masks. However it is still not ideal,
> > > considering that the mask may not be aligned to guest page sizes. One
> > > example would be when huge page is used in guest (please see commit
> > > message in patch 1 for details). It applies to normal pages too. So we
> > > not only need a valid address mask, we should make sure it is page
> > > mask (for x86, it should be either 4K/2M/1G pages).
> > 
> > Why should we? To get better performance, right?
> 
> IMHO one point is for performance, the other point is on how we should
> define the IOTLB interface. My opinion is that it is better valid
> masks.
> 
> > 
> > > Patch 1+2 fixes the problem. Tested with both kernel net driver or
> > > testpmd, on either 4K/2M pages, to make sure the page mask is correct.
> > > 
> > > Patch 3 is cherry picked from PT series, after fixing from 1+2, we'll
> > > definitely want patch 3 now. Here's the simplest TCP streaming test
> > > using vhost dmar and iommu=pt in guest:
> > > 
> > >   without patch 3:    12.0Gbps
> > 
> > And what happens without patches 1-2?
> 
> Without 1-2, performance is good. But I think it is hacky to have such
> a good result (I explained why the performance is good in the VT-d PT
> support thread with some logs)...
> 
> > 
> > >   with patch 3:       33.5Gbps
> > 
> > This is the part I don't get. Patches 1-2 will return a bigger region to
> > callers. The result should be better performance - instead it seems to
> > slow down vhost for some reason and we need tricks to get
> > performance back. What's going on?
> 
> Yes. The problem is that if without patch 1/2 I think the codes lacks
> correctness. With correctness, we lost performance, then I picked
> patch 3 as well.
> 
> Again, I think the first thing we need to settle is what should be the
> best definition for IOTLB (addr_mask or arbitary length).
> 
> Thanks,

If arbitrary length means we don't require prefaulting hacks,
I'm for using arbitrary length.


> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-06 14:34       ` Paolo Bonzini
@ 2017-06-06 23:47         ` David Gibson
  2017-06-07  3:44           ` Peter Xu
  2017-06-07 13:01           ` Paolo Bonzini
  0 siblings, 2 replies; 34+ messages in thread
From: David Gibson @ 2017-06-06 23:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Xu, Michael S. Tsirkin, qemu-devel, Maxime Coquelin, Jason Wang

[-- Attachment #1: Type: text/plain, Size: 1461 bytes --]

On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> 
> 
> On 05/06/2017 05:07, Peter Xu wrote:
> > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > least for VT-d, the IOMMU translation is page aligned which is defined
> > by spec, so it makes sense that (again at least for VT-d) here we'd
> > better just use page_mask/addr_mask.
> > 
> > That's also how I know about IOMMU in general - I assume it do the
> > translations always with page masks (never arbitary length), though
> > page size can differ from platfrom to platform, that's why here the
> > IOTLB interface used addr_mask, then it works for all platforms. I
> > don't know whether I'm 100% correct here though.
> > 
> > Maybe David/Paolo/... would comment as well?
> 
> I would ask David.  There are PowerPC MMUs that allow fast lookup of
> arbitrarily-sized windows (not necessarily power of two),

Uh.. I'm not sure what you mean here.  You might be thinking of the
BATs which really old (32-bit) PowerPC MMUs had - those allow
arbitrarily large block translations, but they do have to be a power of
two.

> so maybe the
> IOMMUs can do the same.

The only Power IOMMU I know about uses a fixed, power-of-two page size
per DMA window.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-06 23:47         ` David Gibson
@ 2017-06-07  3:44           ` Peter Xu
  2017-06-07 13:07             ` Michael S. Tsirkin
  2017-06-07 13:01           ` Paolo Bonzini
  1 sibling, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-07  3:44 UTC (permalink / raw)
  To: David Gibson
  Cc: Paolo Bonzini, Michael S. Tsirkin, qemu-devel, Maxime Coquelin,
	Jason Wang

On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:
> On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> > 
> > 
> > On 05/06/2017 05:07, Peter Xu wrote:
> > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > better just use page_mask/addr_mask.
> > > 
> > > That's also how I know about IOMMU in general - I assume it do the
> > > translations always with page masks (never arbitary length), though
> > > page size can differ from platfrom to platform, that's why here the
> > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > don't know whether I'm 100% correct here though.
> > > 
> > > Maybe David/Paolo/... would comment as well?
> > 
> > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > arbitrarily-sized windows (not necessarily power of two),
> 
> Uh.. I'm not sure what you mean here.  You might be thinking of the
> BATs which really old (32-bit) PowerPC MMUs had - those allow
> arbitrary large block translations, but they do have to be a power of
> two.
> 
> > so maybe the
> > IOMMUs can do the same.
> 
> The only Power IOMMU I know about uses a fixed, power-of-two page size
> per DMA window.

If so, I would still be inclined to keep using masks for QEMU IOTLB.
Then, my first two patches should still stand.

I am just afraid that not using masks will diverge the emulation from
real hardware and bring trouble one day.

As for the vhost IOTLB interface, it does not need to be strictly
aligned to the QEMU IOMMU IOTLB definition, and that's how it works now
(the current vhost iotlb allows arbitrary lengths, and I think that's
good). So imho we don't really need to worry about performance - after
all, we can do everything customized for vhost, just like what patch 3
did (yeah, it can be better...).

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-06 23:47         ` David Gibson
  2017-06-07  3:44           ` Peter Xu
@ 2017-06-07 13:01           ` Paolo Bonzini
  1 sibling, 0 replies; 34+ messages in thread
From: Paolo Bonzini @ 2017-06-07 13:01 UTC (permalink / raw)
  To: David Gibson
  Cc: Peter Xu, Michael S. Tsirkin, qemu-devel, Maxime Coquelin, Jason Wang



On 07/06/2017 01:47, David Gibson wrote:
>> I would ask David.  There are PowerPC MMUs that allow fast lookup of
>> arbitrarily-sized windows (not necessarily power of two),
>
> Uh.. I'm not sure what you mean here.  You might be thinking of the
> BATs which really old (32-bit) PowerPC MMUs had - those allow
> arbitrary large block translations, but they do have to be a power of
> two.

Oh, that is what I was thinking about.  Thanks for figuring it out! :)

Paolo

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-07  3:44           ` Peter Xu
@ 2017-06-07 13:07             ` Michael S. Tsirkin
  2017-06-08  6:11               ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-07 13:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin, Jason Wang

On Wed, Jun 07, 2017 at 11:44:43AM +0800, Peter Xu wrote:
> On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:
> > On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> > > 
> > > 
> > > On 05/06/2017 05:07, Peter Xu wrote:
> > > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > > better just use page_mask/addr_mask.
> > > > 
> > > > That's also how I know about IOMMU in general - I assume it do the
> > > > translations always with page masks (never arbitary length), though
> > > > page size can differ from platfrom to platform, that's why here the
> > > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > > don't know whether I'm 100% correct here though.
> > > > 
> > > > Maybe David/Paolo/... would comment as well?
> > > 
> > > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > > arbitrarily-sized windows (not necessarily power of two),
> > 
> > Uh.. I'm not sure what you mean here.  You might be thinking of the
> > BATs which really old (32-bit) PowerPC MMUs had - those allow
> > arbitrary large block translations, but they do have to be a power of
> > two.
> > 
> > > so maybe the
> > > IOMMUs can do the same.
> > 
> > The only Power IOMMU I know about uses a fixed, power-of-two page size
> > per DMA window.
> 
> If so, I would still be inclined to keep using masks for QEMU IOTLB.
> Then, my first two patches should still stand.
> 
> I am just afraid that not using masks will diverge the emulation from
> real hardware and brings trouble one day.
> 
> For vhost IOTLB interface, it does not need to be strictly aligned to
> QEMU IOMMU IOTLB definition, and that's how it's working now (current
> vhost iotlb allows arbitary length, and I think it's good). So imho we
> don't really need to worry about the performance - after all, we can
> do everything customized for vhost, just like what patch 3 did (yeah,
> it can be better...).
> 
> Thanks,

Pre-faulting is also something that does not happen on real hardware.
And it's about security, so it's a bigger issue.

If I had to choose between that and using non-power-of-2 in
the API, I'd go for non-power-of-2. Let backends that can only
support power of 2 split it up into multiple transactions.
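
A minimal, self-contained sketch of that splitting (plain C, not code
from this series; the function name is made up):

  #include <stdint.h>
  #include <inttypes.h>
  #include <stdio.h>

  /* Split [iova, iova + len) into naturally aligned, power-of-two sized
   * chunks - what a power-of-two-only backend would have to do with an
   * arbitrary-length update. */
  static void emit_pow2_chunks(uint64_t iova, uint64_t len)
  {
      while (len) {
          /* largest size allowed by the current alignment ... */
          uint64_t align = iova ? (iova & -iova) : UINT64_MAX;
          /* ... and by the remaining length */
          uint64_t chunk = UINT64_C(1) << (63 - __builtin_clzll(len));
          if (chunk > align) {
              chunk = align;
          }
          printf("entry: iova=0x%" PRIx64 " size=0x%" PRIx64 "\n", iova, chunk);
          iova += chunk;
          len  -= chunk;
      }
  }

For example, emit_pow2_chunks(0x1000, 0x3000) emits two entries
(0x1000+0x1000 and 0x2000+0x2000), so a non-power-of-2 API would cost
such a backend only a bounded number of extra transactions per update.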



> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-07 13:07             ` Michael S. Tsirkin
@ 2017-06-08  6:11               ` Peter Xu
  2017-06-08 18:59                 ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-08  6:11 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin, Jason Wang

On Wed, Jun 07, 2017 at 04:07:20PM +0300, Michael S. Tsirkin wrote:
> On Wed, Jun 07, 2017 at 11:44:43AM +0800, Peter Xu wrote:
> > On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:
> > > On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> > > > 
> > > > 
> > > > On 05/06/2017 05:07, Peter Xu wrote:
> > > > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > > > better just use page_mask/addr_mask.
> > > > > 
> > > > > That's also how I know about IOMMU in general - I assume it do the
> > > > > translations always with page masks (never arbitary length), though
> > > > > page size can differ from platfrom to platform, that's why here the
> > > > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > > > don't know whether I'm 100% correct here though.
> > > > > 
> > > > > Maybe David/Paolo/... would comment as well?
> > > > 
> > > > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > > > arbitrarily-sized windows (not necessarily power of two),
> > > 
> > > Uh.. I'm not sure what you mean here.  You might be thinking of the
> > > BATs which really old (32-bit) PowerPC MMUs had - those allow
> > > arbitrary large block translations, but they do have to be a power of
> > > two.
> > > 
> > > > so maybe the
> > > > IOMMUs can do the same.
> > > 
> > > The only Power IOMMU I know about uses a fixed, power-of-two page size
> > > per DMA window.
> > 
> > If so, I would still be inclined to keep using masks for QEMU IOTLB.
> > Then, my first two patches should still stand.
> > 
> > I am just afraid that not using masks will diverge the emulation from
> > real hardware and brings trouble one day.
> > 
> > For vhost IOTLB interface, it does not need to be strictly aligned to
> > QEMU IOMMU IOTLB definition, and that's how it's working now (current
> > vhost iotlb allows arbitary length, and I think it's good). So imho we
> > don't really need to worry about the performance - after all, we can
> > do everything customized for vhost, just like what patch 3 did (yeah,
> > it can be better...).
> > 
> > Thanks,
> 
> Pre-faults is also something that does not happen on real hardware.
> And it's about security so a bigger issue.
> 
> If I had to choose between that and using non-power-of-2 in
> the API, I'd go for non-power-of-2. Let backends that can only
> support power of 2 split it up to multiple transactions.

The problem is that when I was fixing the problem that vhost had with
PT (a764040, "exec: abstract address_space_do_translate()"), I did
break the IOTLB translation a bit (it used to use page masks). IMHO we
need to fix that first for correctness (patches 1/2).

For patch 3, if we can have Jason's patch to allow dynamic
iommu_platform switching, that would be best; then I can rewrite
patch 3 with the switching logic rather than caching anything. But
IMHO that can be separated from patches 1/2 if you like.

Or do you have a better suggestion on how we should fix it?

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-08  6:11               ` Peter Xu
@ 2017-06-08 18:59                 ` Michael S. Tsirkin
  2017-06-09  1:58                   ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-08 18:59 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin, Jason Wang

On Thu, Jun 08, 2017 at 02:11:50PM +0800, Peter Xu wrote:
> On Wed, Jun 07, 2017 at 04:07:20PM +0300, Michael S. Tsirkin wrote:
> > On Wed, Jun 07, 2017 at 11:44:43AM +0800, Peter Xu wrote:
> > > On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:
> > > > On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> > > > > 
> > > > > 
> > > > > On 05/06/2017 05:07, Peter Xu wrote:
> > > > > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > > > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > > > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > > > > better just use page_mask/addr_mask.
> > > > > > 
> > > > > > That's also how I know about IOMMU in general - I assume it do the
> > > > > > translations always with page masks (never arbitary length), though
> > > > > > page size can differ from platfrom to platform, that's why here the
> > > > > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > > > > don't know whether I'm 100% correct here though.
> > > > > > 
> > > > > > Maybe David/Paolo/... would comment as well?
> > > > > 
> > > > > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > > > > arbitrarily-sized windows (not necessarily power of two),
> > > > 
> > > > Uh.. I'm not sure what you mean here.  You might be thinking of the
> > > > BATs which really old (32-bit) PowerPC MMUs had - those allow
> > > > arbitrary large block translations, but they do have to be a power of
> > > > two.
> > > > 
> > > > > so maybe the
> > > > > IOMMUs can do the same.
> > > > 
> > > > The only Power IOMMU I know about uses a fixed, power-of-two page size
> > > > per DMA window.
> > > 
> > > If so, I would still be inclined to keep using masks for QEMU IOTLB.
> > > Then, my first two patches should still stand.
> > > 
> > > I am just afraid that not using masks will diverge the emulation from
> > > real hardware and brings trouble one day.
> > > 
> > > For vhost IOTLB interface, it does not need to be strictly aligned to
> > > QEMU IOMMU IOTLB definition, and that's how it's working now (current
> > > vhost iotlb allows arbitary length, and I think it's good). So imho we
> > > don't really need to worry about the performance - after all, we can
> > > do everything customized for vhost, just like what patch 3 did (yeah,
> > > it can be better...).
> > > 
> > > Thanks,
> > 
> > Pre-faults is also something that does not happen on real hardware.
> > And it's about security so a bigger issue.
> > 
> > If I had to choose between that and using non-power-of-2 in
> > the API, I'd go for non-power-of-2. Let backends that can only
> > support power of 2 split it up to multiple transactions.
> 
> The problem is that when I was fixing the problem that vhost had with
> PT (a764040, "exec: abstract address_space_do_translate()"), I did
> broke the IOTLB translation a bit (it was using page masks). IMHO we
> need to fix it first for correctness (patch 1/2).
> 
> For patch 3, if we can have Jason's patch to allow dynamic
> iommu_platform switching, that'll be the best, then I can rewrite
> patch 3 with the switching logic rather than caching anything. But
> IMHO that can be separated from patch 1/2 if you like.
> 
> Or do you have better suggestion on how should we fix it?
> 
> Thanks,

Can we drop masks completely and replace them with a length? I think
we should do that instead of trying to fix masks.

> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-08 18:59                 ` Michael S. Tsirkin
@ 2017-06-09  1:58                   ` Peter Xu
  2017-06-09  2:37                     ` David Gibson
  2017-06-11 10:09                     ` Michael S. Tsirkin
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-09  1:58 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Thu, Jun 08, 2017 at 09:59:50PM +0300, Michael S. Tsirkin wrote:
> On Thu, Jun 08, 2017 at 02:11:50PM +0800, Peter Xu wrote:
> > On Wed, Jun 07, 2017 at 04:07:20PM +0300, Michael S. Tsirkin wrote:
> > > On Wed, Jun 07, 2017 at 11:44:43AM +0800, Peter Xu wrote:
> > > > On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:
> > > > > On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> > > > > > 
> > > > > > 
> > > > > > On 05/06/2017 05:07, Peter Xu wrote:
> > > > > > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > > > > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > > > > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > > > > > better just use page_mask/addr_mask.
> > > > > > > 
> > > > > > > That's also how I know about IOMMU in general - I assume it do the
> > > > > > > translations always with page masks (never arbitary length), though
> > > > > > > page size can differ from platfrom to platform, that's why here the
> > > > > > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > > > > > don't know whether I'm 100% correct here though.
> > > > > > > 
> > > > > > > Maybe David/Paolo/... would comment as well?
> > > > > > 
> > > > > > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > > > > > arbitrarily-sized windows (not necessarily power of two),
> > > > > 
> > > > > Uh.. I'm not sure what you mean here.  You might be thinking of the
> > > > > BATs which really old (32-bit) PowerPC MMUs had - those allow
> > > > > arbitrary large block translations, but they do have to be a power of
> > > > > two.
> > > > > 
> > > > > > so maybe the
> > > > > > IOMMUs can do the same.
> > > > > 
> > > > > The only Power IOMMU I know about uses a fixed, power-of-two page size
> > > > > per DMA window.
> > > > 
> > > > If so, I would still be inclined to keep using masks for QEMU IOTLB.
> > > > Then, my first two patches should still stand.
> > > > 
> > > > I am just afraid that not using masks will diverge the emulation from
> > > > real hardware and brings trouble one day.
> > > > 
> > > > For vhost IOTLB interface, it does not need to be strictly aligned to
> > > > QEMU IOMMU IOTLB definition, and that's how it's working now (current
> > > > vhost iotlb allows arbitary length, and I think it's good). So imho we
> > > > don't really need to worry about the performance - after all, we can
> > > > do everything customized for vhost, just like what patch 3 did (yeah,
> > > > it can be better...).
> > > > 
> > > > Thanks,
> > > 
> > > Pre-faults is also something that does not happen on real hardware.
> > > And it's about security so a bigger issue.
> > > 
> > > If I had to choose between that and using non-power-of-2 in
> > > the API, I'd go for non-power-of-2. Let backends that can only
> > > support power of 2 split it up to multiple transactions.
> > 
> > The problem is that when I was fixing the problem that vhost had with
> > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > need to fix it first for correctness (patch 1/2).
> > 
> > For patch 3, if we can have Jason's patch to allow dynamic
> > iommu_platform switching, that'll be the best, then I can rewrite
> > patch 3 with the switching logic rather than caching anything. But
> > IMHO that can be separated from patch 1/2 if you like.
> > 
> > Or do you have better suggestion on how should we fix it?
> > 
> > Thanks,
> 
> Can we drop masks completely and replace with length? I think we
> should do that instead of trying to fix masks.

Do you mean to modify IOMMUTLBEntry.addr_mask into a length?

Again, I am not sure this is good... At least we need to get an ack from
David since spapr should be the initial user of it, and possibly also
Alex since vfio should be assuming (IIUC both in QEMU and the kernel)
that addr_mask is a page mask rather than an arbitrary length.

(CC Alex)

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-09  1:58                   ` Peter Xu
@ 2017-06-09  2:37                     ` David Gibson
  2017-06-11 10:09                     ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: David Gibson @ 2017-06-09  2:37 UTC (permalink / raw)
  To: Peter Xu
  Cc: Michael S. Tsirkin, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 4656 bytes --]

On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> On Thu, Jun 08, 2017 at 09:59:50PM +0300, Michael S. Tsirkin wrote:
> > On Thu, Jun 08, 2017 at 02:11:50PM +0800, Peter Xu wrote:
> > > On Wed, Jun 07, 2017 at 04:07:20PM +0300, Michael S. Tsirkin wrote:
> > > > On Wed, Jun 07, 2017 at 11:44:43AM +0800, Peter Xu wrote:
> > > > > On Wed, Jun 07, 2017 at 09:47:05AM +1000, David Gibson wrote:
> > > > > > On Tue, Jun 06, 2017 at 04:34:30PM +0200, Paolo Bonzini wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 05/06/2017 05:07, Peter Xu wrote:
> > > > > > > > I don't sure whether it'll be a good interface for IOTLB. AFAIU at
> > > > > > > > least for VT-d, the IOMMU translation is page aligned which is defined
> > > > > > > > by spec, so it makes sense that (again at least for VT-d) here we'd
> > > > > > > > better just use page_mask/addr_mask.
> > > > > > > > 
> > > > > > > > That's also how I know about IOMMU in general - I assume it do the
> > > > > > > > translations always with page masks (never arbitary length), though
> > > > > > > > page size can differ from platfrom to platform, that's why here the
> > > > > > > > IOTLB interface used addr_mask, then it works for all platforms. I
> > > > > > > > don't know whether I'm 100% correct here though.
> > > > > > > > 
> > > > > > > > Maybe David/Paolo/... would comment as well?
> > > > > > > 
> > > > > > > I would ask David.  There are PowerPC MMUs that allow fast lookup of
> > > > > > > arbitrarily-sized windows (not necessarily power of two),
> > > > > > 
> > > > > > Uh.. I'm not sure what you mean here.  You might be thinking of the
> > > > > > BATs which really old (32-bit) PowerPC MMUs had - those allow
> > > > > > arbitrary large block translations, but they do have to be a power of
> > > > > > two.
> > > > > > 
> > > > > > > so maybe the
> > > > > > > IOMMUs can do the same.
> > > > > > 
> > > > > > The only Power IOMMU I know about uses a fixed, power-of-two page size
> > > > > > per DMA window.
> > > > > 
> > > > > If so, I would still be inclined to keep using masks for QEMU IOTLB.
> > > > > Then, my first two patches should still stand.
> > > > > 
> > > > > I am just afraid that not using masks will diverge the emulation from
> > > > > real hardware and brings trouble one day.
> > > > > 
> > > > > For vhost IOTLB interface, it does not need to be strictly aligned to
> > > > > QEMU IOMMU IOTLB definition, and that's how it's working now (current
> > > > > vhost iotlb allows arbitary length, and I think it's good). So imho we
> > > > > don't really need to worry about the performance - after all, we can
> > > > > do everything customized for vhost, just like what patch 3 did (yeah,
> > > > > it can be better...).
> > > > > 
> > > > > Thanks,
> > > > 
> > > > Pre-faults is also something that does not happen on real hardware.
> > > > And it's about security so a bigger issue.
> > > > 
> > > > If I had to choose between that and using non-power-of-2 in
> > > > the API, I'd go for non-power-of-2. Let backends that can only
> > > > support power of 2 split it up to multiple transactions.
> > > 
> > > The problem is that when I was fixing the problem that vhost had with
> > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > need to fix it first for correctness (patch 1/2).
> > > 
> > > For patch 3, if we can have Jason's patch to allow dynamic
> > > iommu_platform switching, that'll be the best, then I can rewrite
> > > patch 3 with the switching logic rather than caching anything. But
> > > IMHO that can be separated from patch 1/2 if you like.
> > > 
> > > Or do you have better suggestion on how should we fix it?
> > > 
> > > Thanks,
> > 
> > Can we drop masks completely and replace with length? I think we
> > should do that instead of trying to fix masks.
> 
> Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> 
> Again, I am not sure this is good... At least we need to get ack from
> David since spapr should be the initial user of it, and possibly also
> Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> addr_mask is page masks rather than arbirary length.

So, I don't see that using size instead of mask would be a particular
problem for spapr.  However, I also don't see any advantage to
switching.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-09  1:58                   ` Peter Xu
  2017-06-09  2:37                     ` David Gibson
@ 2017-06-11 10:09                     ` Michael S. Tsirkin
  2017-06-11 12:10                       ` David Gibson
  1 sibling, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-11 10:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > The problem is that when I was fixing the problem that vhost had with
> > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > need to fix it first for correctness (patch 1/2).
> > > 
> > > For patch 3, if we can have Jason's patch to allow dynamic
> > > iommu_platform switching, that'll be the best, then I can rewrite
> > > patch 3 with the switching logic rather than caching anything. But
> > > IMHO that can be separated from patch 1/2 if you like.
> > > 
> > > Or do you have better suggestion on how should we fix it?
> > > 
> > > Thanks,
> > 
> > Can we drop masks completely and replace with length? I think we
> > should do that instead of trying to fix masks.
> 
> Do you mean to modify IOMMUTLBEntry.addr_mask into length?

I think it's better than the alternatives.

> Again, I am not sure this is good... At least we need to get ack from
> David since spapr should be the initial user of it, and possibly also
> Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> addr_mask is page masks rather than arbirary length.
> 
> (CC Alex)
> 
> Thanks,

Callbacks that need powers of two can easily split up the range.


> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-11 10:09                     ` Michael S. Tsirkin
@ 2017-06-11 12:10                       ` David Gibson
  2017-06-12  2:34                         ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: David Gibson @ 2017-06-11 12:10 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Peter Xu, Paolo Bonzini, qemu-devel, Maxime Coquelin, Jason Wang,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 1809 bytes --]

On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > The problem is that when I was fixing the problem that vhost had with
> > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > need to fix it first for correctness (patch 1/2).
> > > > 
> > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > patch 3 with the switching logic rather than caching anything. But
> > > > IMHO that can be separated from patch 1/2 if you like.
> > > > 
> > > > Or do you have better suggestion on how should we fix it?
> > > > 
> > > > Thanks,
> > > 
> > > Can we drop masks completely and replace with length? I think we
> > > should do that instead of trying to fix masks.
> > 
> > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> 
> I think it's better than alternatives.
> 
> > Again, I am not sure this is good... At least we need to get ack from
> > David since spapr should be the initial user of it, and possibly also
> > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > addr_mask is page masks rather than arbirary length.
> > 
> > (CC Alex)
> > 
> > Thanks,
> 
> Callbacks that need powers of two can easily split up the range.

I think I missed part of the thread.  What's the original use case for
non-power-of-two IOTLB entries?  It certainly won't happen on Power.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-11 12:10                       ` David Gibson
@ 2017-06-12  2:34                         ` Peter Xu
  2017-06-12  3:07                           ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-12  2:34 UTC (permalink / raw)
  To: David Gibson, Michael S. Tsirkin
  Cc: Paolo Bonzini, qemu-devel, Maxime Coquelin, Jason Wang, Alex Williamson

On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > need to fix it first for correctness (patch 1/2).
> > > > > 
> > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > 
> > > > > Or do you have better suggestion on how should we fix it?
> > > > > 
> > > > > Thanks,
> > > > 
> > > > Can we drop masks completely and replace with length? I think we
> > > > should do that instead of trying to fix masks.
> > > 
> > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > 
> > I think it's better than alternatives.
> > 
> > > Again, I am not sure this is good... At least we need to get ack from
> > > David since spapr should be the initial user of it, and possibly also
> > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > addr_mask is page masks rather than arbirary length.
> > > 
> > > (CC Alex)
> > > 
> > > Thanks,
> > 
> > Callbacks that need powers of two can easily split up the range.
> 
> I think I missed part of the thread.  What's the original use case for
> non-power-of-two IOTLB entries?  It certainly won't happen on Power.

Currently address_space_get_iotlb_entry() doesn't really follow the
rule: addr_mask can be an arbitrary length. This series tried to fix it,
while Michael was questioning whether we should really fix that
at all.

Michael,

Even just for performance's sake, I still think we should fix it.
Let's consider a simple worst case: we have a single page mapped
with this IOVA range (a 2M page):

 [0x0, 0x200000)

And if the guest accesses IOVAs using the following pattern:

 0x1fffff, 0x1ffffe, 0x1ffffd, ...

Then now we'll get this:

- request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
- request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
- request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
- ...

We'll get cache misses all along the way until we access 0x0. Whereas
if we use the page mask, we'll get:

- request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
- request 0x1ffffe, cache hit
- request 0x1ffffd, cache hit
- ...

We'll only miss at the first IO.
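
To put numbers on it, here is a tiny self-contained model of that access
pattern (plain C, hypothetical - it only mimics a single-entry device
IOTLB cache, not the real vhost one):

  #include <stdint.h>
  #include <inttypes.h>
  #include <stdbool.h>
  #include <stdio.h>

  int main(void)
  {
      const uint64_t page = 0x200000;    /* one 2M mapping at IOVA 0 */

      for (int use_page_mask = 0; use_page_mask <= 1; use_page_mask++) {
          uint64_t ent_iova = 0, ent_size = 0, misses = 0;
          bool valid = false;

          /* access 0x1fffff, 0x1ffffe, ..., 0x0 */
          for (uint64_t a = page - 1; a != UINT64_MAX; a--) {
              if (!(valid && a >= ent_iova && a < ent_iova + ent_size)) {
                  misses++;
                  if (use_page_mask) {
                      ent_iova = a & ~(page - 1);   /* [0x0, 0x200000) */
                      ent_size = page;
                  } else {
                      ent_iova = a;                 /* [a, 0x200000)   */
                      ent_size = page - a;
                  }
                  valid = true;
              }
          }
          printf("%s: %" PRIu64 " misses\n",
                 use_page_mask ? "page mask" : "arbitrary length", misses);
      }
      return 0;
  }

It reports 2097152 (0x200000) misses for the arbitrary-length entries
and exactly 1 miss for the page-mask entry, matching the description
above.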

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-12  2:34                         ` Peter Xu
@ 2017-06-12  3:07                           ` Michael S. Tsirkin
  2017-06-12  4:04                             ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-12  3:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > 
> > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > 
> > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > 
> > > > > > Thanks,
> > > > > 
> > > > > Can we drop masks completely and replace with length? I think we
> > > > > should do that instead of trying to fix masks.
> > > > 
> > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > 
> > > I think it's better than alternatives.
> > > 
> > > > Again, I am not sure this is good... At least we need to get ack from
> > > > David since spapr should be the initial user of it, and possibly also
> > > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > > addr_mask is page masks rather than arbirary length.
> > > > 
> > > > (CC Alex)
> > > > 
> > > > Thanks,
> > > 
> > > Callbacks that need powers of two can easily split up the range.
> > 
> > I think I missed part of the thread.  What's the original use case for
> > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> 
> Currently address_space_get_iotlb_entry() didn't really follow the
> rule, addr_mask can be arbitary length. This series tried to fix it,
> while Michael was questioning about whether we should really fix that
> at all.
> 
> Michael,
> 
> Even if for performance's sake, I should still think we should fix it.
> Let's consider a most simple worst case: we have a single page mapped
> with IOVA range (2M page):
> 
>  [0x0, 0x200000)
> 
> And if guest access IOVA using the following patern:
> 
>  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> 
> Then now we'll get this:
> 
> - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> - ...

We pass an offset too, do we not? So the callee can figure out that
the region starts at 0x0 and avoid the 2nd and 3rd misses.


> We'll all cache miss along the way until we access 0x0. While if we
> are with page mask, we'll get:
> 
> - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> - request 0x1ffffe, cache hit
> - request 0x1ffffd, cache hit
> - ...
> 
> We'll only miss at the first IO.

I think we should send as much info as we can.
There should be a way to find the full region's start and length.

> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-12  3:07                           ` Michael S. Tsirkin
@ 2017-06-12  4:04                             ` Peter Xu
  2017-06-14 18:34                               ` Michael S. Tsirkin
  0 siblings, 1 reply; 34+ messages in thread
From: Peter Xu @ 2017-06-12  4:04 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Mon, Jun 12, 2017 at 06:07:04AM +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> > On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > > 
> > > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > > 
> > > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > > 
> > > > > > > Thanks,
> > > > > > 
> > > > > > Can we drop masks completely and replace with length? I think we
> > > > > > should do that instead of trying to fix masks.
> > > > > 
> > > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > > 
> > > > I think it's better than alternatives.
> > > > 
> > > > > Again, I am not sure this is good... At least we need to get ack from
> > > > > David since spapr should be the initial user of it, and possibly also
> > > > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > > > addr_mask is page masks rather than arbirary length.
> > > > > 
> > > > > (CC Alex)
> > > > > 
> > > > > Thanks,
> > > > 
> > > > Callbacks that need powers of two can easily split up the range.
> > > 
> > > I think I missed part of the thread.  What's the original use case for
> > > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> > 
> > Currently address_space_get_iotlb_entry() didn't really follow the
> > rule, addr_mask can be arbitary length. This series tried to fix it,
> > while Michael was questioning about whether we should really fix that
> > at all.
> > 
> > Michael,
> > 
> > Even if for performance's sake, I should still think we should fix it.
> > Let's consider a most simple worst case: we have a single page mapped
> > with IOVA range (2M page):
> > 
> >  [0x0, 0x200000)
> > 
> > And if guest access IOVA using the following patern:
> > 
> >  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> > 
> > Then now we'll get this:
> > 
> > - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> > - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> > - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> > - ...
> 
> We pass an offset too, do we not? So callee can figure out
> the region starts at 0x0 and avoid 2nd and 3rd misses.

Here when you say "offset", do you mean the offset in
MemoryRegionSection?

In address_space_get_iotlb_entry() we have this:

    section = address_space_do_translate(as, addr, &xlat, &plen,
                                         is_write, false);

One thing to mention is that, imho, we cannot really assume the xlat
is valid on the whole "section" range - the section can be a huge GPA
range, while the xlat may only be valid for a single 4K page. The only
safe region we can use here is (xlat, xlat+plen). Outside that, we
should assume nothing is valid.

Please correct me if I didn't really catch the point..

> 
> 
> > We'll all cache miss along the way until we access 0x0. While if we
> > are with page mask, we'll get:
> > 
> > - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> > - request 0x1ffffe, cache hit
> > - request 0x1ffffd, cache hit
> > - ...
> > 
> > We'll only miss at the first IO.
> 
> I think we should send as much info as we can.
> There should be a way to find full region start and length.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-12  4:04                             ` Peter Xu
@ 2017-06-14 18:34                               ` Michael S. Tsirkin
  2017-06-15  2:31                                 ` Peter Xu
  0 siblings, 1 reply; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-14 18:34 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Mon, Jun 12, 2017 at 12:04:58PM +0800, Peter Xu wrote:
> On Mon, Jun 12, 2017 at 06:07:04AM +0300, Michael S. Tsirkin wrote:
> > On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> > > On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > > > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > > > 
> > > > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > > > 
> > > > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > 
> > > > > > > Can we drop masks completely and replace with length? I think we
> > > > > > > should do that instead of trying to fix masks.
> > > > > > 
> > > > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > > > 
> > > > > I think it's better than alternatives.
> > > > > 
> > > > > > Again, I am not sure this is good... At least we need to get ack from
> > > > > > David since spapr should be the initial user of it, and possibly also
> > > > > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > > > > addr_mask is page masks rather than arbirary length.
> > > > > > 
> > > > > > (CC Alex)
> > > > > > 
> > > > > > Thanks,
> > > > > 
> > > > > Callbacks that need powers of two can easily split up the range.
> > > > 
> > > > I think I missed part of the thread.  What's the original use case for
> > > > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> > > 
> > > Currently address_space_get_iotlb_entry() didn't really follow the
> > > rule, addr_mask can be arbitary length. This series tried to fix it,
> > > while Michael was questioning about whether we should really fix that
> > > at all.
> > > 
> > > Michael,
> > > 
> > > Even if for performance's sake, I should still think we should fix it.
> > > Let's consider a most simple worst case: we have a single page mapped
> > > with IOVA range (2M page):
> > > 
> > >  [0x0, 0x200000)
> > > 
> > > And if guest access IOVA using the following patern:
> > > 
> > >  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> > > 
> > > Then now we'll get this:
> > > 
> > > - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> > > - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> > > - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> > > - ...
> > 
> > We pass an offset too, do we not? So callee can figure out
> > the region starts at 0x0 and avoid 2nd and 3rd misses.
> 
> Here when you say "offset", do you mean the offset in
> MemoryRegionSection?
> 
> In address_space_get_iotlb_entry() we have this:
> 
>     section = address_space_do_translate(as, addr, &xlat, &plen,
>                                          is_write, false);
> 
> One thing to mention is that, imho we cannot really assume the xlat is
> valid on the whole "section" range - the section can be a huge GPA
> range, while the xlat may only be valid on a single 4K page. The only
> safe region we can use here is (xlat, xlat+plen). Outside that, we
> should know nothing valid.
> 
> Please correct me if I didn't really catch the point..

IIUC the section is the translation result. If so, all of it is valid,
not just one page.

> > 
> > 
> > > We'll all cache miss along the way until we access 0x0. While if we
> > > are with page mask, we'll get:
> > > 
> > > - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> > > - request 0x1ffffe, cache hit
> > > - request 0x1ffffd, cache hit
> > > - ...
> > > 
> > > We'll only miss at the first IO.
> > 
> > I think we should send as much info as we can.
> > There should be a way to find full region start and length.
> 
> Thanks,
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-14 18:34                               ` Michael S. Tsirkin
@ 2017-06-15  2:31                                 ` Peter Xu
  2017-06-15  2:57                                   ` Peter Xu
  2017-06-16 15:33                                   ` Michael S. Tsirkin
  0 siblings, 2 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-15  2:31 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Wed, Jun 14, 2017 at 09:34:52PM +0300, Michael S. Tsirkin wrote:
> On Mon, Jun 12, 2017 at 12:04:58PM +0800, Peter Xu wrote:
> > On Mon, Jun 12, 2017 at 06:07:04AM +0300, Michael S. Tsirkin wrote:
> > > On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> > > > On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > > > > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > > > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > > > > 
> > > > > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > > > > 
> > > > > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > > > > 
> > > > > > > > > Thanks,
> > > > > > > > 
> > > > > > > > Can we drop masks completely and replace with length? I think we
> > > > > > > > should do that instead of trying to fix masks.
> > > > > > > 
> > > > > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > > > > 
> > > > > > I think it's better than alternatives.
> > > > > > 
> > > > > > > Again, I am not sure this is good... At least we need to get ack from
> > > > > > > David since spapr should be the initial user of it, and possibly also
> > > > > > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > > > > > addr_mask is page masks rather than arbirary length.
> > > > > > > 
> > > > > > > (CC Alex)
> > > > > > > 
> > > > > > > Thanks,
> > > > > > 
> > > > > > Callbacks that need powers of two can easily split up the range.
> > > > > 
> > > > > I think I missed part of the thread.  What's the original use case for
> > > > > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> > > > 
> > > > Currently address_space_get_iotlb_entry() didn't really follow the
> > > > rule, addr_mask can be arbitary length. This series tried to fix it,
> > > > while Michael was questioning about whether we should really fix that
> > > > at all.
> > > > 
> > > > Michael,
> > > > 
> > > > Even if for performance's sake, I should still think we should fix it.
> > > > Let's consider a most simple worst case: we have a single page mapped
> > > > with IOVA range (2M page):
> > > > 
> > > >  [0x0, 0x200000)
> > > > 
> > > > And if guest access IOVA using the following patern:
> > > > 
> > > >  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> > > > 
> > > > Then now we'll get this:
> > > > 
> > > > - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> > > > - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> > > > - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> > > > - ...
> > > 
> > > We pass an offset too, do we not? So callee can figure out
> > > the region starts at 0x0 and avoid 2nd and 3rd misses.
> > 
> > Here when you say "offset", do you mean the offset in
> > MemoryRegionSection?
> > 
> > In address_space_get_iotlb_entry() we have this:
> > 
> >     section = address_space_do_translate(as, addr, &xlat, &plen,
> >                                          is_write, false);
> > 
> > One thing to mention is that, imho we cannot really assume the xlat is
> > valid on the whole "section" range - the section can be a huge GPA
> > range, while the xlat may only be valid on a single 4K page. The only
> > safe region we can use here is (xlat, xlat+plen). Outside that, we
> > should know nothing valid.

[1]

> > 
> > Please correct me if I didn't really catch the point..
> 
> IIUC section is the translation result. If so all of it is valid,
> not just one page.

It should be only one page (though I think that depends on how we
implement the translation logic). Please check this gdb session on
current master:

(gdb) b address_space_get_iotlb_entry 
Breakpoint 2 at 0x55555576803a: file /root/git/qemu/exec.c, line 512.
... (until break)
(gdb) bt
#0  0x000055555576803a in address_space_get_iotlb_entry (as=0x55555841bbb0, addr=4294959104, is_write=false) at /root/git/qemu/exec.c:512
#1  0x00005555558322dd in vhost_device_iotlb_miss (dev=0x5555568b03a0, iova=4294959104, write=0) at /root/git/qemu/hw/virtio/vhost.c:982
#2  0x0000555555834ad9 in vhost_backend_handle_iotlb_msg (dev=0x5555568b03a0, imsg=0x7fffffffd428) at /root/git/qemu/hw/virtio/vhost-backend.c:334
#3  0x00005555558347be in vhost_kernel_iotlb_read (opaque=0x5555568b03a0) at /root/git/qemu/hw/virtio/vhost-backend.c:204
#4  0x0000555555c7e7c3 in aio_dispatch_handlers (ctx=0x5555568091b0) at /root/git/qemu/util/aio-posix.c:399
#5  0x0000555555c7e956 in aio_dispatch (ctx=0x5555568091b0) at /root/git/qemu/util/aio-posix.c:430
#6  0x0000555555c7a722 in aio_ctx_dispatch (source=0x5555568091b0, callback=0x0, user_data=0x0) at /root/git/qemu/util/async.c:261
#7  0x00007f98c62f3703 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#8  0x0000555555c7d36c in glib_pollfds_poll () at /root/git/qemu/util/main-loop.c:213
#9  0x0000555555c7d468 in os_host_main_loop_wait (timeout=29311167) at /root/git/qemu/util/main-loop.c:261
#10 0x0000555555c7d521 in main_loop_wait (nonblocking=0) at /root/git/qemu/util/main-loop.c:517
#11 0x00005555558fe234 in main_loop () at /root/git/qemu/vl.c:1918
#12 0x0000555555905fe8 in main (argc=21, argv=0x7fffffffda48, envp=0x7fffffffdaf8) at /root/git/qemu/vl.c:4752
(gdb) s
517         plen = (hwaddr)-1;
(gdb)
520         section = address_space_do_translate(as, addr, &xlat, &plen,
(gdb) n
524         if (section.mr == &io_mem_unassigned) {
(gdb) p/x xlat
$2 = 0x73900000
(gdb) p/x addr
$3 = 0xffffe000
(gdb) p/x plen
$4 = 0x1000
(gdb) p section
$5 = {mr = 0x5555569ad790, address_space = 0x5555562f9b40 <address_space_memory>, offset_within_region = 1048576, size = 0x0000000000000000000000007ff00000, 
  offset_within_address_space = 1048576, readonly = false}

Here $2-$4 show that we are translating IOVA 0xffffe000 into the range
[0x73900000, 0x73901000), while the section has size 0x7ff00000 and
covers [0x100000000, 0x17ff00000).  That's the upper RAM that the
guest has, corresponding to what we can see in "info mtree -f" (the
guest contains 4G RAM, this is the upper 2G):

  0000000100000000-000000017fffffff (prio 0, ram): pc.ram @0000000080000000

Obviously, we cannot assume we have a linear mapping over this whole
range.  So I don't think we can really use the section info (though
the section can be used to obtain some offset information, or the
address space the range belongs to) to build up the IOTLB - only the
plen.  So we need to fix the current code.
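
To illustrate the direction of the fix, here is a minimal sketch of how
a page-aligned IOTLB entry could be built from one translation result,
assuming the translate step can also report the page mask of the leaf
PTE it walked (the page_mask parameter and the helper name below are
illustrative only, not the exact code in this series):

  /* Illustrative sketch only: build a page-aligned IOMMUTLBEntry from a
   * single translation result.  page_mask is assumed to describe the
   * leaf PTE that was walked, e.g. ~(hwaddr)0x1fffff for a 2M page. */
  static IOMMUTLBEntry make_page_aligned_entry(hwaddr iova, hwaddr xlat,
                                               hwaddr page_mask)
  {
      IOMMUTLBEntry entry = { 0 };

      /* Align both the IOVA and the translated address down to the page
       * start, so the entry covers the whole page rather than just the
       * [xlat, xlat + plen) window the translation returned. */
      entry.iova = iova & page_mask;
      entry.translated_addr = xlat & page_mask;
      entry.addr_mask = ~page_mask;    /* 0x1fffff in the 2M page case */
      entry.perm = IOMMU_RW;           /* real code derives this from the PTE */

      return entry;
  }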

And this is exactly what I meant above at [1].  Hope this clarifies
the issue a bit.  Thanks,

> 
> > > 
> > > 
> > > > We'll all cache miss along the way until we access 0x0. While if we
> > > > are with page mask, we'll get:
> > > > 
> > > > - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> > > > - request 0x1ffffe, cache hit
> > > > - request 0x1ffffd, cache hit
> > > > - ...
> > > > 
> > > > We'll only miss at the first IO.
> > > 
> > > I think we should send as much info as we can.
> > > There should be a way to find full region start and length.
> > 
> > Thanks,
> > 
> > -- 
> > Peter Xu

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-15  2:31                                 ` Peter Xu
@ 2017-06-15  2:57                                   ` Peter Xu
  2017-06-16 15:33                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Peter Xu @ 2017-06-15  2:57 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Alex Williamson, qemu-devel, Maxime Coquelin,
	Paolo Bonzini, David Gibson

On Thu, Jun 15, 2017 at 10:31:11AM +0800, Peter Xu wrote:
> On Wed, Jun 14, 2017 at 09:34:52PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Jun 12, 2017 at 12:04:58PM +0800, Peter Xu wrote:
> > > On Mon, Jun 12, 2017 at 06:07:04AM +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> > > > > On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > > > > > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > > > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > > > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > > > > > 
> > > > > > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > > > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > > > > > 
> > > > > > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Can we drop masks completely and replace with length? I think we
> > > > > > > > > should do that instead of trying to fix masks.
> > > > > > > > 
> > > > > > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > > > > > 
> > > > > > > I think it's better than alternatives.
> > > > > > > 
> > > > > > > > Again, I am not sure this is good... At least we need to get ack from
> > > > > > > > David since spapr should be the initial user of it, and possibly also
> > > > > > > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > > > > > > addr_mask is page masks rather than arbirary length.
> > > > > > > > 
> > > > > > > > (CC Alex)
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > 
> > > > > > > Callbacks that need powers of two can easily split up the range.
> > > > > > 
> > > > > > I think I missed part of the thread.  What's the original use case for
> > > > > > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> > > > > 
> > > > > Currently address_space_get_iotlb_entry() didn't really follow the
> > > > > rule, addr_mask can be arbitary length. This series tried to fix it,
> > > > > while Michael was questioning about whether we should really fix that
> > > > > at all.
> > > > > 
> > > > > Michael,
> > > > > 
> > > > > Even if for performance's sake, I should still think we should fix it.
> > > > > Let's consider a most simple worst case: we have a single page mapped
> > > > > with IOVA range (2M page):
> > > > > 
> > > > >  [0x0, 0x200000)
> > > > > 
> > > > > And if guest access IOVA using the following patern:
> > > > > 
> > > > >  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> > > > > 
> > > > > Then now we'll get this:
> > > > > 
> > > > > - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> > > > > - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> > > > > - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> > > > > - ...
> > > > 
> > > > We pass an offset too, do we not? So callee can figure out
> > > > the region starts at 0x0 and avoid 2nd and 3rd misses.
> > > 
> > > Here when you say "offset", do you mean the offset in
> > > MemoryRegionSection?
> > > 
> > > In address_space_get_iotlb_entry() we have this:
> > > 
> > >     section = address_space_do_translate(as, addr, &xlat, &plen,
> > >                                          is_write, false);
> > > 
> > > One thing to mention is that, imho we cannot really assume the xlat is
> > > valid on the whole "section" range - the section can be a huge GPA
> > > range, while the xlat may only be valid on a single 4K page. The only
> > > safe region we can use here is (xlat, xlat+plen). Outside that, we
> > > should know nothing valid.
> 
> [1]
> 
> > > 
> > > Please correct me if I didn't really catch the point..
> > 
> > IIUC section is the translation result. If so all of it is valid,
> > not just one page.
> 
> It should be only one page (I think that depends on how we implemented
> the translation logic). Please check this (gdb session on current
> master):
> 
> (gdb) b address_space_get_iotlb_entry 
> Breakpoint 2 at 0x55555576803a: file /root/git/qemu/exec.c, line 512.
> ... (until break)
> (gdb) bt
> #0  0x000055555576803a in address_space_get_iotlb_entry (as=0x55555841bbb0, addr=4294959104, is_write=false) at /root/git/qemu/exec.c:512
> #1  0x00005555558322dd in vhost_device_iotlb_miss (dev=0x5555568b03a0, iova=4294959104, write=0) at /root/git/qemu/hw/virtio/vhost.c:982
> #2  0x0000555555834ad9 in vhost_backend_handle_iotlb_msg (dev=0x5555568b03a0, imsg=0x7fffffffd428) at /root/git/qemu/hw/virtio/vhost-backend.c:334
> #3  0x00005555558347be in vhost_kernel_iotlb_read (opaque=0x5555568b03a0) at /root/git/qemu/hw/virtio/vhost-backend.c:204
> #4  0x0000555555c7e7c3 in aio_dispatch_handlers (ctx=0x5555568091b0) at /root/git/qemu/util/aio-posix.c:399
> #5  0x0000555555c7e956 in aio_dispatch (ctx=0x5555568091b0) at /root/git/qemu/util/aio-posix.c:430
> #6  0x0000555555c7a722 in aio_ctx_dispatch (source=0x5555568091b0, callback=0x0, user_data=0x0) at /root/git/qemu/util/async.c:261
> #7  0x00007f98c62f3703 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #8  0x0000555555c7d36c in glib_pollfds_poll () at /root/git/qemu/util/main-loop.c:213
> #9  0x0000555555c7d468 in os_host_main_loop_wait (timeout=29311167) at /root/git/qemu/util/main-loop.c:261
> #10 0x0000555555c7d521 in main_loop_wait (nonblocking=0) at /root/git/qemu/util/main-loop.c:517
> #11 0x00005555558fe234 in main_loop () at /root/git/qemu/vl.c:1918
> #12 0x0000555555905fe8 in main (argc=21, argv=0x7fffffffda48, envp=0x7fffffffdaf8) at /root/git/qemu/vl.c:4752
> (gdb) s
> 517         plen = (hwaddr)-1;
> (gdb)
> 520         section = address_space_do_translate(as, addr, &xlat, &plen,
> (gdb) n
> 524         if (section.mr == &io_mem_unassigned) {
> (gdb) p/x xlat
> $2 = 0x73900000
> (gdb) p/x addr
> $3 = 0xffffe000
> (gdb) p/x plen
> $4 = 0x1000
> (gdb) p section
> $5 = {mr = 0x5555569ad790, address_space = 0x5555562f9b40 <address_space_memory>, offset_within_region = 1048576, size = 0x0000000000000000000000007ff00000, 
>   offset_within_address_space = 1048576, readonly = false}
> 
> Here $2-$4 shows that we are translating IOVA 0xffffe000 into
> [0x73900000, 0x73901000) range, while we can see the section is having
> size as 0x7ff00000,

> which is ranged from [0x100000000, 0x17ff00000).
> That's the upper RAM that the guest has, corresponds to what we can
> see in "info mtree -f" (the guest contains 4G RAM, this is upper 2G):
> 
>   0000000100000000-000000017fffffff (prio 0, ram): pc.ram @0000000080000000

Hmm, I made a mistake here... It should be the lower 2G of memory (not
the upper): offset_within_address_space is 0x100000 and the size is
0x7ff00000, so the section covers [0x100000, 0x80000000), i.e.:

  0000000000100000-000000007fffffff (prio 0, ram): pc.ram @0000000000100000

The main point below still stands, though...

> 
> Obvious, we cannot assume we have a linear mapping on this whole
> range. So I don't think we can really use section info (though the
> section can be used to obtain some offset information, or address
> space the range belongs) to build up the IOTLB, but only the plen.
> Then, we need to fix current codes.
> 
> And this is exactly what I meant above at [1].  Hope this clarifies a
> bit on the issue.  Thanks,
> 
> > 
> > > > 
> > > > 
> > > > > We'll all cache miss along the way until we access 0x0. While if we
> > > > > are with page mask, we'll get:
> > > > > 
> > > > > - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> > > > > - request 0x1ffffe, cache hit
> > > > > - request 0x1ffffd, cache hit
> > > > > - ...
> > > > > 
> > > > > We'll only miss at the first IO.
> > > > 
> > > > I think we should send as much info as we can.
> > > > There should be a way to find full region start and length.
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> 
> -- 
> Peter Xu
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry
  2017-06-15  2:31                                 ` Peter Xu
  2017-06-15  2:57                                   ` Peter Xu
@ 2017-06-16 15:33                                   ` Michael S. Tsirkin
  1 sibling, 0 replies; 34+ messages in thread
From: Michael S. Tsirkin @ 2017-06-16 15:33 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Gibson, Paolo Bonzini, qemu-devel, Maxime Coquelin,
	Jason Wang, Alex Williamson

On Thu, Jun 15, 2017 at 10:31:11AM +0800, Peter Xu wrote:
> On Wed, Jun 14, 2017 at 09:34:52PM +0300, Michael S. Tsirkin wrote:
> > On Mon, Jun 12, 2017 at 12:04:58PM +0800, Peter Xu wrote:
> > > On Mon, Jun 12, 2017 at 06:07:04AM +0300, Michael S. Tsirkin wrote:
> > > > On Mon, Jun 12, 2017 at 10:34:43AM +0800, Peter Xu wrote:
> > > > > On Sun, Jun 11, 2017 at 08:10:15PM +0800, David Gibson wrote:
> > > > > > On Sun, Jun 11, 2017 at 01:09:26PM +0300, Michael S. Tsirkin wrote:
> > > > > > > On Fri, Jun 09, 2017 at 09:58:47AM +0800, Peter Xu wrote:
> > > > > > > > > > The problem is that when I was fixing the problem that vhost had with
> > > > > > > > > > PT (a764040, "exec: abstract address_space_do_translate()"), I did
> > > > > > > > > > broke the IOTLB translation a bit (it was using page masks). IMHO we
> > > > > > > > > > need to fix it first for correctness (patch 1/2).
> > > > > > > > > > 
> > > > > > > > > > For patch 3, if we can have Jason's patch to allow dynamic
> > > > > > > > > > iommu_platform switching, that'll be the best, then I can rewrite
> > > > > > > > > > patch 3 with the switching logic rather than caching anything. But
> > > > > > > > > > IMHO that can be separated from patch 1/2 if you like.
> > > > > > > > > > 
> > > > > > > > > > Or do you have better suggestion on how should we fix it?
> > > > > > > > > > 
> > > > > > > > > > Thanks,
> > > > > > > > > 
> > > > > > > > > Can we drop masks completely and replace with length? I think we
> > > > > > > > > should do that instead of trying to fix masks.
> > > > > > > > 
> > > > > > > > Do you mean to modify IOMMUTLBEntry.addr_mask into length?
> > > > > > > 
> > > > > > > I think it's better than alternatives.
> > > > > > > 
> > > > > > > > Again, I am not sure this is good... At least we need to get ack from
> > > > > > > > David since spapr should be the initial user of it, and possibly also
> > > > > > > > Alex since vfio should be assuming that (IIUC both in QEMU and kernel)
> > > > > > > > addr_mask is page masks rather than arbirary length.
> > > > > > > > 
> > > > > > > > (CC Alex)
> > > > > > > > 
> > > > > > > > Thanks,
> > > > > > > 
> > > > > > > Callbacks that need powers of two can easily split up the range.
> > > > > > 
> > > > > > I think I missed part of the thread.  What's the original use case for
> > > > > > non-power-of-two IOTLB entries?  It certainly won't happen on Power.
> > > > > 
> > > > > Currently address_space_get_iotlb_entry() didn't really follow the
> > > > > rule, addr_mask can be arbitary length. This series tried to fix it,
> > > > > while Michael was questioning about whether we should really fix that
> > > > > at all.
> > > > > 
> > > > > Michael,
> > > > > 
> > > > > Even if for performance's sake, I should still think we should fix it.
> > > > > Let's consider a most simple worst case: we have a single page mapped
> > > > > with IOVA range (2M page):
> > > > > 
> > > > >  [0x0, 0x200000)
> > > > > 
> > > > > And if guest access IOVA using the following patern:
> > > > > 
> > > > >  0x1fffff, 0x1ffffe, 0x1ffffd, ...
> > > > > 
> > > > > Then now we'll get this:
> > > > > 
> > > > > - request 0x1fffff, cache miss, will get iotlb [0x1fffff, 0x200000)
> > > > > - request 0x1ffffe, cache miss, will get iotlb [0x1ffffe, 0x200000)
> > > > > - request 0x1ffffd, cache miss, will get iotlb [0x1ffffd, 0x200000)
> > > > > - ...
> > > > 
> > > > We pass an offset too, do we not? So callee can figure out
> > > > the region starts at 0x0 and avoid 2nd and 3rd misses.
> > > 
> > > Here when you say "offset", do you mean the offset in
> > > MemoryRegionSection?
> > > 
> > > In address_space_get_iotlb_entry() we have this:
> > > 
> > >     section = address_space_do_translate(as, addr, &xlat, &plen,
> > >                                          is_write, false);
> > > 
> > > One thing to mention is that, imho we cannot really assume the xlat is
> > > valid on the whole "section" range - the section can be a huge GPA
> > > range, while the xlat may only be valid on a single 4K page. The only
> > > safe region we can use here is (xlat, xlat+plen). Outside that, we
> > > should know nothing valid.
> 
> [1]
> 
> > > 
> > > Please correct me if I didn't really catch the point..
> > 
> > IIUC section is the translation result. If so all of it is valid,
> > not just one page.
> 
> It should be only one page (I think that depends on how we implemented
> the translation logic). Please check this (gdb session on current
> master):
> 
> (gdb) b address_space_get_iotlb_entry 
> Breakpoint 2 at 0x55555576803a: file /root/git/qemu/exec.c, line 512.
> ... (until break)
> (gdb) bt
> #0  0x000055555576803a in address_space_get_iotlb_entry (as=0x55555841bbb0, addr=4294959104, is_write=false) at /root/git/qemu/exec.c:512
> #1  0x00005555558322dd in vhost_device_iotlb_miss (dev=0x5555568b03a0, iova=4294959104, write=0) at /root/git/qemu/hw/virtio/vhost.c:982
> #2  0x0000555555834ad9 in vhost_backend_handle_iotlb_msg (dev=0x5555568b03a0, imsg=0x7fffffffd428) at /root/git/qemu/hw/virtio/vhost-backend.c:334
> #3  0x00005555558347be in vhost_kernel_iotlb_read (opaque=0x5555568b03a0) at /root/git/qemu/hw/virtio/vhost-backend.c:204
> #4  0x0000555555c7e7c3 in aio_dispatch_handlers (ctx=0x5555568091b0) at /root/git/qemu/util/aio-posix.c:399
> #5  0x0000555555c7e956 in aio_dispatch (ctx=0x5555568091b0) at /root/git/qemu/util/aio-posix.c:430
> #6  0x0000555555c7a722 in aio_ctx_dispatch (source=0x5555568091b0, callback=0x0, user_data=0x0) at /root/git/qemu/util/async.c:261
> #7  0x00007f98c62f3703 in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> #8  0x0000555555c7d36c in glib_pollfds_poll () at /root/git/qemu/util/main-loop.c:213
> #9  0x0000555555c7d468 in os_host_main_loop_wait (timeout=29311167) at /root/git/qemu/util/main-loop.c:261
> #10 0x0000555555c7d521 in main_loop_wait (nonblocking=0) at /root/git/qemu/util/main-loop.c:517
> #11 0x00005555558fe234 in main_loop () at /root/git/qemu/vl.c:1918
> #12 0x0000555555905fe8 in main (argc=21, argv=0x7fffffffda48, envp=0x7fffffffdaf8) at /root/git/qemu/vl.c:4752
> (gdb) s
> 517         plen = (hwaddr)-1;
> (gdb)
> 520         section = address_space_do_translate(as, addr, &xlat, &plen,
> (gdb) n
> 524         if (section.mr == &io_mem_unassigned) {
> (gdb) p/x xlat
> $2 = 0x73900000
> (gdb) p/x addr
> $3 = 0xffffe000
> (gdb) p/x plen
> $4 = 0x1000
> (gdb) p section
> $5 = {mr = 0x5555569ad790, address_space = 0x5555562f9b40 <address_space_memory>, offset_within_region = 1048576, size = 0x0000000000000000000000007ff00000, 
>   offset_within_address_space = 1048576, readonly = false}
> 
> Here $2-$4 shows that we are translating IOVA 0xffffe000 into
> [0x73900000, 0x73901000) range, while we can see the section is having
> size as 0x7ff00000, which is ranged from [0x100000000, 0x17ff00000).
> That's the upper RAM that the guest has, corresponds to what we can
> see in "info mtree -f" (the guest contains 4G RAM, this is upper 2G):
> 
>   0000000100000000-000000017fffffff (prio 0, ram): pc.ram @0000000080000000
> 
> Obvious, we cannot assume we have a linear mapping on this whole
> range. So I don't think we can really use section info (though the
> section can be used to obtain some offset information, or address
> space the range belongs) to build up the IOTLB, but only the plen.
> Then, we need to fix current codes.
> 
> And this is exactly what I meant above at [1].  Hope this clarifies a
> bit on the issue.  Thanks,

I agree we need to take the IOMMU PTE into account.  What I was saying
is that, e.g. with a huge PTE, we can figure out how it overlaps with
the section and pass that exact info.  Whatever is valid in both the
PTE and the section is always safe to access, but it might not be a
power of two.
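
Roughly something like the below (a sketch only - the pte_gpa/pte_len
inputs are hypothetical, the current translation code does not expose
them in this form):

  /* Sketch: intersect the GPA range covered by the leaf IOMMU PTE with
   * the MemoryRegionSection and report whatever is valid in both.  The
   * result is always safe to access but need not be a power of two. */
  static void clamp_pte_to_section(hwaddr pte_gpa, hwaddr pte_len,
                                   const MemoryRegionSection *section,
                                   hwaddr *out_start, hwaddr *out_len)
  {
      hwaddr sec_start = section->offset_within_address_space;
      hwaddr sec_end = sec_start + int128_get64(section->size);
      hwaddr start = MAX(pte_gpa, sec_start);
      hwaddr end = MIN(pte_gpa + pte_len, sec_end);

      *out_start = start;
      *out_len = end > start ? end - start : 0;
  }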

> > 
> > > > 
> > > > 
> > > > > We'll all cache miss along the way until we access 0x0. While if we
> > > > > are with page mask, we'll get:
> > > > > 
> > > > > - request 0x1fffff, cache miss, will get iotlb [0x0, 0x200000)
> > > > > - request 0x1ffffe, cache hit
> > > > > - request 0x1ffffd, cache hit
> > > > > - ...
> > > > > 
> > > > > We'll only miss at the first IO.
> > > > 
> > > > I think we should send as much info as we can.
> > > > There should be a way to find full region start and length.
> > > 
> > > Thanks,
> > > 
> > > -- 
> > > Peter Xu
> 
> -- 
> Peter Xu

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread (newest message: 2017-06-16 15:33 UTC)

Thread overview: 34+ messages
2017-06-02 11:50 [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Peter Xu
2017-06-02 11:50 ` [Qemu-devel] [PATCH 1/3] exec: add page_mask for address_space_do_translate Peter Xu
2017-06-02 16:45   ` Michael S. Tsirkin
2017-06-05  2:52     ` Peter Xu
2017-06-02 11:50 ` [Qemu-devel] [PATCH 2/3] exec: simplify address_space_get_iotlb_entry Peter Xu
2017-06-02 16:49   ` Michael S. Tsirkin
2017-06-05  3:07     ` Peter Xu
2017-06-06 14:34       ` Paolo Bonzini
2017-06-06 23:47         ` David Gibson
2017-06-07  3:44           ` Peter Xu
2017-06-07 13:07             ` Michael S. Tsirkin
2017-06-08  6:11               ` Peter Xu
2017-06-08 18:59                 ` Michael S. Tsirkin
2017-06-09  1:58                   ` Peter Xu
2017-06-09  2:37                     ` David Gibson
2017-06-11 10:09                     ` Michael S. Tsirkin
2017-06-11 12:10                       ` David Gibson
2017-06-12  2:34                         ` Peter Xu
2017-06-12  3:07                           ` Michael S. Tsirkin
2017-06-12  4:04                             ` Peter Xu
2017-06-14 18:34                               ` Michael S. Tsirkin
2017-06-15  2:31                                 ` Peter Xu
2017-06-15  2:57                                   ` Peter Xu
2017-06-16 15:33                                   ` Michael S. Tsirkin
2017-06-07 13:01           ` Paolo Bonzini
2017-06-02 11:50 ` [Qemu-devel] [PATCH 3/3] vhost: iommu: cache static mapping if there is Peter Xu
2017-06-02 15:45   ` Michael S. Tsirkin
2017-06-05  3:15     ` Peter Xu
2017-06-05  4:07       ` Jason Wang
2017-06-05 15:05       ` Michael S. Tsirkin
2017-06-02 16:51   ` Michael S. Tsirkin
2017-06-02 14:51 ` [Qemu-devel] [PATCH 0/3] exec: further refine address_space_get_iotlb_entry() Michael S. Tsirkin
2017-06-05  3:20   ` Peter Xu
2017-06-06 15:29     ` Michael S. Tsirkin
