* [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once()
@ 2019-01-08 11:47 Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space Peter Xu
                   ` (4 more replies)
  0 siblings, 5 replies; 10+ messages in thread
From: Peter Xu @ 2019-01-08 11:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger, peterx, Jason Wang

Recently we switched quite a few VT-d tracepoints to
error_report_once(), and this exposed some errors that we had not
detected before (tracepoints do not trigger unless they are enabled).
These errors are not severe: none of them affects the functionality of
the guest, otherwise we would have noticed them much earlier.  However,
they are still worth fixing.  This patchset fixes several such errors
(except the last patch, which also works around an error but has
nothing to do with the newly introduced error_report_once()).

All of the errors can easily be triggered by rebooting a guest with
both a vfio-pci device and a vIOMMU; errors like the following are
dumped to stderr:

qemu-kvm: vtd_iova_to_slpte: detected slpte permission error (iova=0xffd9ce00, level=0x2, slpte=0x0, write=1)
qemu-kvm: vtd_iommu_translate: detected translation failure (dev=02:00:00, iova=0x0)
qemu-kvm: vtd_interrupt_remap_msi: MSI address low 32 bit invalid: 0x0

Regarding the patchset itself:

Patch 1:    fixes the slpte permission error warning
Patch 2:    fixes the missing reset of the intr_enabled flag
Patch 3-4:  fix the MSI translation warning
Patch 5:    works around a kernel bug that could cause UNMAP failures

This series was tested to fix all the error messages observed in the
bugs below:

https://bugzilla.redhat.com/show_bug.cgi?id=1662270
https://bugzilla.redhat.com/show_bug.cgi?id=1662291

Please have a look, thanks.

Peter Xu (5):
  intel_iommu: fix operator in vtd_switch_address_space
  intel_iommu: reset intr_enabled when system reset
  pci/msi: export msi_is_masked()
  i386/kvm: ignore masked irqs when update msi routes
  vfio: retry one more time conditionally for type1 unmap

 hw/i386/intel_iommu.c |  3 ++-
 hw/pci/msi.c          |  2 +-
 hw/vfio/common.c      | 16 ++++++++++++++++
 hw/vfio/trace-events  |  1 +
 include/hw/pci/msi.h  |  1 +
 target/i386/kvm.c     | 14 +++++++++++---
 6 files changed, 32 insertions(+), 5 deletions(-)

-- 
2.17.1

* [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space
  2019-01-08 11:47 [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once() Peter Xu
@ 2019-01-08 11:47 ` Peter Xu
  2019-01-11  4:03   ` Jason Wang
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 2/5] intel_iommu: reset intr_enabled when system reset Peter Xu
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2019-01-08 11:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger, peterx, Jason Wang

When calculating use_iommu, we want to first check whether DMAR is
enabled, and only check whether PT is enabled if DMAR is enabled.
However, the current code uses "&" rather than "&&", so the ordering
requirement is lost: a bitwise AND always evaluates both operands,
while "&&" skips the second one when the first is false.  This can
cause errors to be dumped on the QEMU console when rebooting a guest
with both an assigned device and a vIOMMU, like:

  qemu-system-x86_64: vtd_dev_to_context_entry: invalid root entry:
  rsvd=0xf000ff53f000e2c3, val=0xf000ff53f000ff53 (reserved nonzero)

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 8b72735650..6d5cc1d039 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -1153,7 +1153,7 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
 
     assert(as);
 
-    use_iommu = as->iommu_state->dmar_enabled & !vtd_dev_pt_enabled(as);
+    use_iommu = as->iommu_state->dmar_enabled && !vtd_dev_pt_enabled(as);
 
     trace_vtd_switch_address_space(pci_bus_num(as->bus),
                                    VTD_PCI_SLOT(as->devfn),
-- 
2.17.1
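
To illustrate the operator difference described in the commit message
above, here is a minimal standalone sketch.  The names
dev_pt_enabled(), use_iommu_bitwise(), use_iommu_logical() and the call
counter are hypothetical stand-ins, not QEMU code; the sketch only
assumes that the passthrough check has some observable side effect.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical counter: how many times the passthrough check ran. */
static int pt_check_calls;

/* Stand-in for a check with side effects (e.g. error reporting). */
static bool dev_pt_enabled(void)
{
    pt_check_calls++;
    return false;
}

/* Bitwise AND: both operands are always evaluated. */
static bool use_iommu_bitwise(bool dmar_enabled)
{
    return dmar_enabled & !dev_pt_enabled();
}

/* Logical AND: the right operand is skipped when the left is false. */
static bool use_iommu_logical(bool dmar_enabled)
{
    return dmar_enabled && !dev_pt_enabled();
}

/* Self-check: same result, different number of evaluations. */
static void check_evaluation_order(void)
{
    pt_check_calls = 0;
    assert(!use_iommu_bitwise(false));
    assert(pt_check_calls == 1);    /* check ran although DMAR is off */

    pt_check_calls = 0;
    assert(!use_iommu_logical(false));
    assert(pt_check_calls == 0);    /* check was skipped */
}
```

Both variants return the same value here; the only difference is
whether the second check runs at all while DMAR is disabled.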

* [Qemu-devel] [PATCH 2/5] intel_iommu: reset intr_enabled when system reset
  2019-01-08 11:47 [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once() Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space Peter Xu
@ 2019-01-08 11:47 ` Peter Xu
  2019-01-11  4:04   ` Jason Wang
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 3/5] pci/msi: export msi_is_masked() Peter Xu
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2019-01-08 11:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger, peterx, Jason Wang

This was found while I was debugging another problem.  No bug has been
reported against it so far, but we had better reset the IR status
correctly after a system reset.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/i386/intel_iommu.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
index 6d5cc1d039..ee22e754f0 100644
--- a/hw/i386/intel_iommu.c
+++ b/hw/i386/intel_iommu.c
@@ -3138,6 +3138,7 @@ static void vtd_init(IntelIOMMUState *s)
     s->root = 0;
     s->root_extended = false;
     s->dmar_enabled = false;
+    s->intr_enabled = false;
     s->iq_head = 0;
     s->iq_tail = 0;
     s->iq = 0;
-- 
2.17.1

* [Qemu-devel] [PATCH 3/5] pci/msi: export msi_is_masked()
  2019-01-08 11:47 [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once() Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 2/5] intel_iommu: reset intr_enabled when system reset Peter Xu
@ 2019-01-08 11:47 ` Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 4/5] i386/kvm: ignore masked irqs when update msi routes Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap Peter Xu
  4 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2019-01-08 11:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger, peterx, Jason Wang

It will be used later, outside the MSI code, to detect whether an MSI
vector is masked.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/pci/msi.c         | 2 +-
 include/hw/pci/msi.h | 1 +
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/pci/msi.c b/hw/pci/msi.c
index 5e05ce5ec2..47d2b0f33c 100644
--- a/hw/pci/msi.c
+++ b/hw/pci/msi.c
@@ -286,7 +286,7 @@ void msi_reset(PCIDevice *dev)
     MSI_DEV_PRINTF(dev, "reset\n");
 }
 
-static bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
+bool msi_is_masked(const PCIDevice *dev, unsigned int vector)
 {
     uint16_t flags = pci_get_word(dev->config + msi_flags_off(dev));
     uint32_t mask, data;
diff --git a/include/hw/pci/msi.h b/include/hw/pci/msi.h
index 4837bcf490..8440eaee11 100644
--- a/include/hw/pci/msi.h
+++ b/include/hw/pci/msi.h
@@ -39,6 +39,7 @@ int msi_init(struct PCIDevice *dev, uint8_t offset,
              bool msi_per_vector_mask, Error **errp);
 void msi_uninit(struct PCIDevice *dev);
 void msi_reset(PCIDevice *dev);
+bool msi_is_masked(const PCIDevice *dev, unsigned int vector);
 void msi_notify(PCIDevice *dev, unsigned int vector);
 void msi_send_message(PCIDevice *dev, MSIMessage msg);
 void msi_write_config(PCIDevice *dev, uint32_t addr, uint32_t val, int len);
-- 
2.17.1

* [Qemu-devel] [PATCH 4/5] i386/kvm: ignore masked irqs when update msi routes
  2019-01-08 11:47 [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once() Peter Xu
                   ` (2 preceding siblings ...)
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 3/5] pci/msi: export msi_is_masked() Peter Xu
@ 2019-01-08 11:47 ` Peter Xu
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap Peter Xu
  4 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2019-01-08 11:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger, peterx, Jason Wang

When an intel-iommu device is present with IR on, KVM registers an IEC
notifier to detect interrupt updates from the guest, and we kick off
kvm_update_msi_routes_all() when that happens to make sure the kernel
IRQ cache matches the latest state.

However, kvm_update_msi_routes_all() is buggy in that it ignores the
mask bit of MSI/MSIX messages and tries to translate a message even if
that message was already masked by the guest driver (hence the
MSI/MSIX message will be invalid).

Without this patch, we can receive an error message when we reboot a
guest with both an assigned vfio-pci device and intel-iommu enabled:

  qemu-system-x86_64: vtd_interrupt_remap_msi: MSI address low 32 bit invalid: 0x0

The error does not affect the functionality of the guest, since when
we fail to translate we just silently continue (which makes sense,
since crashing the VM for this would be even worse), but it is still
better to fix it.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 target/i386/kvm.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/target/i386/kvm.c b/target/i386/kvm.c
index 739cf8c8ea..08e211c70e 100644
--- a/target/i386/kvm.c
+++ b/target/i386/kvm.c
@@ -3889,7 +3889,7 @@ static QLIST_HEAD(, MSIRouteEntry) msi_route_list = \
 static void kvm_update_msi_routes_all(void *private, bool global,
                                       uint32_t index, uint32_t mask)
 {
-    int cnt = 0;
+    int cnt = 0, vector;
     MSIRouteEntry *entry;
     MSIMessage msg;
     PCIDevice *dev;
@@ -3897,11 +3897,19 @@ static void kvm_update_msi_routes_all(void *private, bool global,
     /* TODO: explicit route update */
     QLIST_FOREACH(entry, &msi_route_list, list) {
         cnt++;
+        vector = entry->vector;
         dev = entry->dev;
-        if (!msix_enabled(dev) && !msi_enabled(dev)) {
+        if (msix_enabled(dev) && !msix_is_masked(dev, vector)) {
+            msg = msix_get_message(dev, vector);
+        } else if (msi_enabled(dev) && !msi_is_masked(dev, vector)) {
+            msg = msi_get_message(dev, vector);
+        } else {
+            /*
+             * Either MSI/MSIX is disabled for the device, or the
+             * specific message was masked out.  Skip this one.
+             */
             continue;
         }
-        msg = pci_get_msi_message(dev, entry->vector);
         kvm_irqchip_update_msi_route(kvm_state, entry->virq, msg, dev);
     }
     kvm_irqchip_commit_routes(kvm_state);
-- 
2.17.1
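
The skip logic in the hunk above can be sketched in isolation as
follows.  This is a simplified model under stated assumptions: struct
route and count_routes_to_update() are hypothetical and do not exist in
QEMU; they only mirror the idea that a disabled or masked vector is
skipped rather than translated.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified model of one routed MSI vector. */
struct route {
    bool enabled;       /* MSI/MSI-X is enabled on the device */
    bool masked;        /* this vector is masked by the guest driver */
    uint64_t address;   /* message address; may be stale while masked */
};

/* Count the routes that would be refreshed; others are skipped. */
static size_t count_routes_to_update(const struct route *r, size_t n)
{
    size_t updated = 0;

    assert(r != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!r[i].enabled || r[i].masked) {
            continue;   /* message may be invalid, do not translate */
        }
        updated++;
    }
    return updated;
}
```

In this model, a masked vector with a zero address is never looked at,
which is the situation that produced the "MSI address low 32 bit
invalid: 0x0" message.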

* [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap
  2019-01-08 11:47 [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once() Peter Xu
                   ` (3 preceding siblings ...)
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 4/5] i386/kvm: ignore masked irqs when update msi routes Peter Xu
@ 2019-01-08 11:47 ` Peter Xu
  2019-01-08 15:23   ` Alex Williamson
  4 siblings, 1 reply; 10+ messages in thread
From: Peter Xu @ 2019-01-08 11:47 UTC (permalink / raw)
  To: qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger, peterx, Jason Wang

Linux v4.15+ has a bug (introduced in 71a7d3d78e3c, "vfio/type1:
silence integer overflow warning", 2017-10-20) that can reject a valid
unmap region that ends exactly at the top of the u64 address space
(like iova=0xfef00000, size=2^64-0xfef00000).  Besides a fix on the
kernel side, QEMU also needs to work well with the old kernels.  When
booting a guest with both a vfio-pci and an intel-iommu device, we can
see errors dumped:

  qemu-kvm: VFIO_UNMAP_DMA: -22
  qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000,
            0xffffffff01100000) = -22 (Invalid argument)

This patch retries the UNMAP ioctl once if that specific error is
detected; in the second UNMAP ioctl we skip the last page, assuming
that it is never used.  In our case, currently only Intel VT-d uses
this code, and it should never use the IOVA 2^64-4096 (so far the
largest supported GAW is 57 bits), so ignoring that page should be
fine.

With this patch applied, the errors go away.

Reported-by: Pei Zhang <pezhang@redhat.com>
Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
Suggested-by: Alex Williamson <alex.williamson@redhat.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 hw/vfio/common.c     | 16 ++++++++++++++++
 hw/vfio/trace-events |  1 +
 2 files changed, 17 insertions(+)

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7c185e5a2e..7f8de5b7a5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -221,6 +221,22 @@ static int vfio_dma_unmap(VFIOContainer *container,
     };
 
     if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        /*
+         * Give it another shot due to a bug in kernel (v4.15-v4.20)
+         * that could potentially reject a region that exactly covers
+         * the whole u64 address space (71a7d3d78e3c, "vfio/type1:
+         * silence integer overflow warning", 2017-10-20).  If that
+         * happens, we retry for one more time assuming that the last
+         * page of the address space (2^64-getpagesize()) has already
+         * been dropped.
+         */
+        if (errno == EINVAL && unmap.size && unmap.iova + unmap.size == 0) {
+            trace_vfio_dma_unmap_workaround_overflow();
+            unmap.size -= getpagesize();
+            if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap) == 0) {
+                return 0;
+            }
+        }
         error_report("VFIO_UNMAP_DMA: %d", -errno);
         return -errno;
     }
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index a85e8662ea..2c9db4fb9a 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -110,6 +110,7 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
 vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
 vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
 vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
+vfio_dma_unmap_workaround_overflow(void) ""
 
 # hw/vfio/platform.c
 vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
-- 
2.17.1
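
The arithmetic behind the retry condition can be checked with a small
standalone sketch.  The helper names ends_at_top_of_u64() and
retry_size() are hypothetical, and the 4096-byte page size used in the
note below is an assumption; only the iova and size values come from
the error log in the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * True when [iova, iova + size) ends exactly at 2^64, i.e. the
 * unsigned 64-bit sum wraps around to 0.
 */
static bool ends_at_top_of_u64(uint64_t iova, uint64_t size)
{
    return size != 0 && (uint64_t)(iova + size) == 0;
}

/* Size for the retry: the same region minus its last page. */
static uint64_t retry_size(uint64_t size, uint64_t page_size)
{
    assert(size >= page_size);
    return size - page_size;
}
```

With iova=0xfef00000 and size=0xffffffff01100000 the sum wraps to 0, so
the condition matches; after dropping one 4096-byte page the region
ends at 0xfffffffffffff000 and no longer wraps.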

* Re: [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap Peter Xu
@ 2019-01-08 15:23   ` Alex Williamson
  2019-01-09  2:53     ` Peter Xu
  0 siblings, 1 reply; 10+ messages in thread
From: Alex Williamson @ 2019-01-08 15:23 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Eric Auger, Jason Wang

On Tue,  8 Jan 2019 19:47:20 +0800
Peter Xu <peterx@redhat.com> wrote:

> In Linux version v4.15+ there's a bug (introduced in 71a7d3d78e3c,
> "vfio/type1: silence integer overflow warning", 2017-10-20) that could
> potentially reject a valid unmap region that covers exactly the whole
> u64 address space (like iova=0xfef00000, size=2^64-0xfef00000).
> Besides a fix on the kernel side, QEMU also needs to live well even
> with the old kernels.  When booting a guest with both vfio-pci and
> intel-iommu device, we can see error dumped:
> 
>   qemu-kvm: VFIO_UNMAP_DMA: -22
>   qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000,
>             0xffffffff01100000) = -22 (Invalid argument)
> 
> This patch gives another shot of the UNMAP ioctl if the specific error
> was detected, while in the second UNMAP ioctl we skip the last page
> assuming that it's never used.  In our case, currently only Intel VT-d
> is using this code and it should never use the iova address
> 2^64-4096 (so far largest supported GAW is 57 bits) so ignoring that
> page should be fine.
> 
> After this patch is applied, the errors go away.
> 
> Reported-by: Pei Zhang <pezhang@redhat.com>
> Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
> Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  hw/vfio/common.c     | 16 ++++++++++++++++
>  hw/vfio/trace-events |  1 +
>  2 files changed, 17 insertions(+)
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7c185e5a2e..7f8de5b7a5 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -221,6 +221,22 @@ static int vfio_dma_unmap(VFIOContainer *container,
>      };
>  
>      if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        /*
> +         * Give it another shot due to a bug in kernel (v4.15-v4.20)
> +         * that could potentially reject a region that exactly covers
> +         * the whole u64 address space (71a7d3d78e3c, "vfio/type1:
> +         * silence integer overflow warning", 2017-10-20).  If that
> +         * happens, we retry for one more time assuming that the last
> +         * page of the address space (2^64-getpagesize()) has already
> +         * been dropped.
> +         */
> +        if (errno == EINVAL && unmap.size && unmap.iova + unmap.size == 0) {
> +            trace_vfio_dma_unmap_workaround_overflow();
> +            unmap.size -= getpagesize();
> +            if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap) == 0) {
> +                return 0;
> +            }
> +        }
>          error_report("VFIO_UNMAP_DMA: %d", -errno);
>          return -errno;
>      }
> diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> index a85e8662ea..2c9db4fb9a 100644
> --- a/hw/vfio/trace-events
> +++ b/hw/vfio/trace-events
> @@ -110,6 +110,7 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
>  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
>  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
>  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> +vfio_dma_unmap_workaround_overflow(void) ""
>  
>  # hw/vfio/platform.c
>  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"

Hi Peter,

I was working on a slightly different version:

diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 7c185e5a2e79..9f5a140cb1c3 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -220,7 +220,24 @@ static int vfio_dma_unmap(VFIOContainer *container,
         .size = size,
     };
 
-    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
+        /*
+         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
+         * v4.15) where its overflow check prevents us from unmapping the last
+         * page of the address space.  Test for the error condition and re-try
+         * the unmap excluding the last page.  The expectation is that we've
+         * never mapped the last page anyway and this unmap request comes via
+         * vIOMMU support which also makes it unlikely that this page is used.
+         * This bug was introduced well after type1 v2 support was introduced,
+         * so we shouldn't need to test for v1.  A fix is proposed for kernel
+         * v5.0 so this workaround can be removed once affected kernels are
+         * sufficiently deprecated.
+         */
+        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
+            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
+            unmap.size -= 1ULL << ctz64(container->pgsizes);
+            continue;
+        }
         error_report("VFIO_UNMAP_DMA: %d", -errno);
         return -errno;
     }

I like your addition of tracing, but I prefer the type1v2 test (the
bug is specific to the type1 backend) and using the iommu minimum page
size rather than the cpu page size.  Do you want to incorporate or
would you prefer I post mine?  Thanks,

Alex
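
For reference, the "iommu minimum page size" computation can be
sketched on its own.  This assumes pgsizes is a bitmap in which bit N
set means a page size of 2^N bytes is supported; min_pgsize() is a
hypothetical helper that isolates the lowest set bit, which gives the
same value as 1ULL << ctz64(pgsizes) for a non-zero bitmap without
relying on a QEMU helper.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest supported page size in a bitmap of power-of-two sizes. */
static uint64_t min_pgsize(uint64_t pgsizes)
{
    assert(pgsizes != 0);               /* empty bitmap: no page size */
    return pgsizes & (~pgsizes + 1);    /* isolate the lowest set bit */
}
```

For a bitmap advertising 4K, 2M and 1G pages (0x40201000) this yields
0x1000, regardless of the CPU page size.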

* Re: [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap
  2019-01-08 15:23   ` Alex Williamson
@ 2019-01-09  2:53     ` Peter Xu
  0 siblings, 0 replies; 10+ messages in thread
From: Peter Xu @ 2019-01-09  2:53 UTC (permalink / raw)
  To: Alex Williamson
  Cc: qemu-devel, Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Eric Auger, Jason Wang

On Tue, Jan 08, 2019 at 08:23:50AM -0700, Alex Williamson wrote:
> On Tue,  8 Jan 2019 19:47:20 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > In Linux version v4.15+ there's a bug (introduced in 71a7d3d78e3c,
> > "vfio/type1: silence integer overflow warning", 2017-10-20) that could
> > potentially reject a valid unmap region that covers exactly the whole
> > u64 address space (like iova=0xfef00000, size=2^64-0xfef00000).
> > Besides a fix on the kernel side, QEMU also needs to live well even
> > with the old kernels.  When booting a guest with both vfio-pci and
> > intel-iommu device, we can see error dumped:
> > 
> >   qemu-kvm: VFIO_UNMAP_DMA: -22
> >   qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000,
> >             0xffffffff01100000) = -22 (Invalid argument)
> > 
> > This patch gives another shot of the UNMAP ioctl if the specific error
> > was detected, while in the second UNMAP ioctl we skip the last page
> > assuming that it's never used.  In our case, currently only Intel VT-d
> > is using this code and it should never use the iova address
> > 2^64-4096 (so far largest supported GAW is 57 bits) so ignoring that
> > page should be fine.
> > 
> > After this patch is applied, the errors go away.
> > 
> > Reported-by: Pei Zhang <pezhang@redhat.com>
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/vfio/common.c     | 16 ++++++++++++++++
> >  hw/vfio/trace-events |  1 +
> >  2 files changed, 17 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 7c185e5a2e..7f8de5b7a5 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -221,6 +221,22 @@ static int vfio_dma_unmap(VFIOContainer *container,
> >      };
> >  
> >      if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> > +        /*
> > +         * Give it another shot due to a bug in kernel (v4.15-v4.20)
> > +         * that could potentially reject a region that exactly covers
> > +         * the whole u64 address space (71a7d3d78e3c, "vfio/type1:
> > +         * silence integer overflow warning", 2017-10-20).  If that
> > +         * happens, we retry for one more time assuming that the last
> > +         * page of the address space (2^64-getpagesize()) has already
> > +         * been dropped.
> > +         */
> > +        if (errno == EINVAL && unmap.size && unmap.iova + unmap.size == 0) {
> > +            trace_vfio_dma_unmap_workaround_overflow();
> > +            unmap.size -= getpagesize();
> > +            if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap) == 0) {
> > +                return 0;
> > +            }
> > +        }
> >          error_report("VFIO_UNMAP_DMA: %d", -errno);
> >          return -errno;
> >      }
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index a85e8662ea..2c9db4fb9a 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -110,6 +110,7 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
> >  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
> >  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
> >  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> > +vfio_dma_unmap_workaround_overflow(void) ""
> >  
> >  # hw/vfio/platform.c
> >  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> 
> Hi Peter,
> 
> I was working on a slightly different version:
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7c185e5a2e79..9f5a140cb1c3 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -220,7 +220,24 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> -    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        /*
> +         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> +         * v4.15) where its overflow check prevents us from unmapping the last
> +         * page of the address space.  Test for the error condition and re-try
> +         * the unmap excluding the last page.  The expectation is that we've
> +         * never mapped the last page anyway and this unmap request comes via
> +         * vIOMMU support which also makes it unlikely that this page is used.
> +         * This bug was introduced well after type1 v2 support was introduced,
> +         * so we shouldn't need to test for v1.  A fix is proposed for kernel
> +         * v5.0 so this workaround can be removed once affected kernels are
> +         * sufficiently deprecated.
> +         */
> +        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
> +            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
> +            unmap.size -= 1ULL << ctz64(container->pgsizes);
> +            continue;
> +        }
>          error_report("VFIO_UNMAP_DMA: %d", -errno);
>          return -errno;
>      }
> 
> I like your addition of tracing, but I prefer the type1v2 test (the
> bug is specific to the type1 backend) and using the iommu minimum page
> size rather than the cpu page size.  Do you want to incorporate or
> would you prefer I post mine?  Thanks,

Hi, Alex,

I think the type check and using the container->pgsizes are better!
Please use your version, I'll simply drop mine from the series.

Thanks,

-- 
Peter Xu

* Re: [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space Peter Xu
@ 2019-01-11  4:03   ` Jason Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Wang @ 2019-01-11  4:03 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger


On 2019/1/8 7:47 PM, Peter Xu wrote:
> When calculating use_iommu, we wanted to first detect whether DMAR is
> enabled, then check whether PT is enabled if DMAR is enabled.  However
> in the current code we used "&" rather than "&&" so the ordering
> requirement is lost (instead it'll be an "AND" operation).  This could
> introduce errors dumped in QEMU console when rebooting a guest with
> both assigned device and vIOMMU, like:
>
>    qemu-system-x86_64: vtd_dev_to_context_entry: invalid root entry:
>    rsvd=0xf000ff53f000e2c3, val=0xf000ff53f000ff53 (reserved nonzero)
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 8b72735650..6d5cc1d039 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -1153,7 +1153,7 @@ static bool vtd_switch_address_space(VTDAddressSpace *as)
>   
>       assert(as);
>   
> -    use_iommu = as->iommu_state->dmar_enabled & !vtd_dev_pt_enabled(as);
> +    use_iommu = as->iommu_state->dmar_enabled && !vtd_dev_pt_enabled(as);
>   
>       trace_vtd_switch_address_space(pci_bus_num(as->bus),
>                                      VTD_PCI_SLOT(as->devfn),


Acked-by: Jason Wang <jasowang@redhat.com>

* Re: [Qemu-devel] [PATCH 2/5] intel_iommu: reset intr_enabled when system reset
  2019-01-08 11:47 ` [Qemu-devel] [PATCH 2/5] intel_iommu: reset intr_enabled when system reset Peter Xu
@ 2019-01-11  4:04   ` Jason Wang
  0 siblings, 0 replies; 10+ messages in thread
From: Jason Wang @ 2019-01-11  4:04 UTC (permalink / raw)
  To: Peter Xu, qemu-devel
  Cc: Michael S . Tsirkin, Paolo Bonzini, Marcel Apfelbaum,
	Alex Williamson, Eric Auger


On 2019/1/8 7:47 PM, Peter Xu wrote:
> This is found when I was debugging another problem.  Until now no bug
> is reported with this but we'd better reset the IR status correctly
> after a system reset.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>   hw/i386/intel_iommu.c | 1 +
>   1 file changed, 1 insertion(+)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 6d5cc1d039..ee22e754f0 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -3138,6 +3138,7 @@ static void vtd_init(IntelIOMMUState *s)
>       s->root = 0;
>       s->root_extended = false;
>       s->dmar_enabled = false;
> +    s->intr_enabled = false;
>       s->iq_head = 0;
>       s->iq_tail = 0;
>       s->iq = 0;


Acked-by: Jason Wang <jasowang@redhat.com>
