All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-devel@nongnu.org, "Michael S . Tsirkin" <mst@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Marcel Apfelbaum <marcel.apfelbaum@gmail.com>,
	Eric Auger <eric.auger@redhat.com>,
	Jason Wang <jasowang@redhat.com>
Subject: Re: [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap
Date: Wed, 9 Jan 2019 10:53:48 +0800	[thread overview]
Message-ID: <20190109025348.GA12837@xz-x1> (raw)
In-Reply-To: <20190108082350.1f4902f4@x1.home>

On Tue, Jan 08, 2019 at 08:23:50AM -0700, Alex Williamson wrote:
> On Tue,  8 Jan 2019 19:47:20 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > In Linux version v4.15+ there's a bug (introduced in 71a7d3d78e3c,
> > "vfio/type1: silence integer overflow warning", 2017-10-20) that could
> > potentially reject a valid unmap region that covers exactly the whole
> > u64 address space (like iova=0xfef00000, size=2^64-0xfef00000).
> > Besides a fix on the kernel side, QEMU also needs to live well even
> > with the old kernels.  When booting a guest with both vfio-pci and
> > intel-iommu device, we can see error dumped:
> > 
> >   qemu-kvm: VFIO_UNMAP_DMA: -22
> >   qemu-kvm: vfio_dma_unmap(0x561f059948f0, 0xfef00000,
> >             0xffffffff01100000) = -22 (Invalid argument)
> > 
> > This patch gives another shot of the UNMAP ioctl if the specific error
> > was detected, while in the second UNMAP ioctl we skip the last page
> > assuming that it's never used.  In our case, currently only Intel VT-d
> > is using this code and it should never use the iova address
> > 2^64-4096 (so far largest supported GAW is 57 bits) so ignoring that
> > page should be fine.
> > 
> > After this patch is applied, the errors go away.
> > 
> > Reported-by: Pei Zhang <pezhang@redhat.com>
> > Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1662291
> > Suggested-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Peter Xu <peterx@redhat.com>
> > ---
> >  hw/vfio/common.c     | 16 ++++++++++++++++
> >  hw/vfio/trace-events |  1 +
> >  2 files changed, 17 insertions(+)
> > 
> > diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> > index 7c185e5a2e..7f8de5b7a5 100644
> > --- a/hw/vfio/common.c
> > +++ b/hw/vfio/common.c
> > @@ -221,6 +221,22 @@ static int vfio_dma_unmap(VFIOContainer *container,
> >      };
> >  
> >      if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> > +        /*
> > +         * Give it another shot due to a bug in kernel (v4.15-v4.20)
> > +         * that could potentially reject a region that exactly covers
> > +         * the whole u64 address space (71a7d3d78e3c, "vfio/type1:
> > +         * silence integer overflow warning", 2017-10-20).  If that
> > +         * happens, we retry for one more time assuming that the last
> > +         * page of the address space (2^64-getpagesize()) has already
> > +         * been dropped.
> > +         */
> > +        if (errno == EINVAL && unmap.size && unmap.iova + unmap.size == 0) {
> > +            trace_vfio_dma_unmap_workaround_overflow();
> > +            unmap.size -= getpagesize();
> > +            if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap) == 0) {
> > +                return 0;
> > +            }
> > +        }
> >          error_report("VFIO_UNMAP_DMA: %d", -errno);
> >          return -errno;
> >      }
> > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
> > index a85e8662ea..2c9db4fb9a 100644
> > --- a/hw/vfio/trace-events
> > +++ b/hw/vfio/trace-events
> > @@ -110,6 +110,7 @@ vfio_region_mmaps_set_enabled(const char *name, bool enabled) "Region %s mmaps e
> >  vfio_region_sparse_mmap_header(const char *name, int index, int nr_areas) "Device %s region %d: %d sparse mmap entries"
> >  vfio_region_sparse_mmap_entry(int i, unsigned long start, unsigned long end) "sparse entry %d [0x%lx - 0x%lx]"
> >  vfio_get_dev_region(const char *name, int index, uint32_t type, uint32_t subtype) "%s index %d, %08x/%0x8"
> > +vfio_dma_unmap_workaround_overflow(void) ""
> >  
> >  # hw/vfio/platform.c
> >  vfio_platform_base_device_init(char *name, int groupid) "%s belongs to group #%d"
> 
> Hi Peter,
> 
> I was working on a slightly different version:
> 
> diff --git a/hw/vfio/common.c b/hw/vfio/common.c
> index 7c185e5a2e79..9f5a140cb1c3 100644
> --- a/hw/vfio/common.c
> +++ b/hw/vfio/common.c
> @@ -220,7 +220,24 @@ static int vfio_dma_unmap(VFIOContainer *container,
>          .size = size,
>      };
>  
> -    if (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +    while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
> +        /*
> +         * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
> +         * v4.15) where its overflow check prevents us from unmapping the last
> +         * page of the address space.  Test for the error condition and re-try
> +         * the unmap excluding the last page.  The expectation is that we've
> +         * never mapped the last page anyway and this unmap request comes via
> +         * vIOMMU support which also makes it unlikely that this page is used.
> +         * This bug was introduced well after type1 v2 support was introduced,
> +         * so we shouldn't need to test for v1.  A fix is proposed for kernel
> +         * v5.0 so this workaround can be removed once affected kernels are
> +         * sufficiently deprecated.
> +         */
> +        if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
> +            container->iommu_type == VFIO_TYPE1v2_IOMMU) {
> +            unmap.size -= 1ULL << ctz64(container->pgsizes);
> +            continue;
> +        }
>          error_report("VFIO_UNMAP_DMA: %d", -errno);
>          return -errno;
>      }
> 
> I like your addition of tracing, but I prefer the type1v2 test (the
> bug is specific to the type1 backend) and using the iommu minimum page
> size rather than the cpu page size.  Do you want to incorporate or
> would you prefer I post mine?  Thanks,

Hi, Alex,

I think the type check and using the container->pgsizes are better!
Please use your version, I'll simply drop mine from the series.

Thanks,

-- 
Peter Xu

      reply	other threads:[~2019-01-09  2:54 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-01-08 11:47 [Qemu-devel] [PATCH 0/5] intel_iommu: misc fixes for error exposed after error_report_once() Peter Xu
2019-01-08 11:47 ` [Qemu-devel] [PATCH 1/5] intel_iommu: fix operator in vtd_switch_address_space Peter Xu
2019-01-11  4:03   ` Jason Wang
2019-01-08 11:47 ` [Qemu-devel] [PATCH 2/5] intel_iommu: reset intr_enabled when system reset Peter Xu
2019-01-11  4:04   ` Jason Wang
2019-01-08 11:47 ` [Qemu-devel] [PATCH 3/5] pci/msi: export msi_is_masked() Peter Xu
2019-01-08 11:47 ` [Qemu-devel] [PATCH 4/5] i386/kvm: ignore masked irqs when update msi routes Peter Xu
2019-01-08 11:47 ` [Qemu-devel] [PATCH 5/5] vfio: retry one more time conditionally for type1 unmap Peter Xu
2019-01-08 15:23   ` Alex Williamson
2019-01-09  2:53     ` Peter Xu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190109025348.GA12837@xz-x1 \
    --to=peterx@redhat.com \
    --cc=alex.williamson@redhat.com \
    --cc=eric.auger@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=marcel.apfelbaum@gmail.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.