All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: "Tian, Kevin" <kevin.tian@intel.com>
Cc: Jason Wang <jasowang@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	Jintack Lim <jintack@cs.columbia.edu>
Subject: Re: [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges
Date: Fri, 27 Apr 2018 19:40:29 +0800	[thread overview]
Message-ID: <20180427114029.GF13269@xz-mi> (raw)
In-Reply-To: <20180427095527.GE13269@xz-mi>

On Fri, Apr 27, 2018 at 05:55:27PM +0800, Peter Xu wrote:
> On Fri, Apr 27, 2018 at 07:44:07AM +0000, Tian, Kevin wrote:
> > > From: Peter Xu [mailto:peterx@redhat.com]
> > > Sent: Friday, April 27, 2018 3:28 PM
> > > 
> > > On Fri, Apr 27, 2018 at 07:02:14AM +0000, Tian, Kevin wrote:
> > > > > From: Jason Wang [mailto:jasowang@redhat.com]
> > > > > Sent: Friday, April 27, 2018 2:08 PM
> > > > >
> > > > > On 2018年04月25日 12:51, Peter Xu wrote:
> > > > > > For each VTDAddressSpace, now we maintain what IOVA ranges we
> > > have
> > > > > > mapped and what we have not.  With that information, now we only
> > > > > send
> > > > > > MAP or UNMAP when necessary.  Say, we don't send MAP notifies if
> > > we
> > > > > know
> > > > > > we have already mapped the range, meanwhile we don't send
> > > UNMAP
> > > > > notifies
> > > > > > if we know we never mapped the range at all.
> > > > > >
> > > > > > Signed-off-by: Peter Xu <peterx@redhat.com>
> > > > > > ---
> > > > > >   include/hw/i386/intel_iommu.h |  2 ++
> > > > > >   hw/i386/intel_iommu.c         | 28 ++++++++++++++++++++++++++++
> > > > > >   hw/i386/trace-events          |  2 ++
> > > > > >   3 files changed, 32 insertions(+)
> > > > > >
> > > > > > diff --git a/include/hw/i386/intel_iommu.h
> > > > > b/include/hw/i386/intel_iommu.h
> > > > > > index 486e205e79..09a2e94404 100644
> > > > > > --- a/include/hw/i386/intel_iommu.h
> > > > > > +++ b/include/hw/i386/intel_iommu.h
> > > > > > @@ -27,6 +27,7 @@
> > > > > >   #include "hw/i386/ioapic.h"
> > > > > >   #include "hw/pci/msi.h"
> > > > > >   #include "hw/sysbus.h"
> > > > > > +#include "qemu/interval-tree.h"
> > > > > >
> > > > > >   #define TYPE_INTEL_IOMMU_DEVICE "intel-iommu"
> > > > > >   #define INTEL_IOMMU_DEVICE(obj) \
> > > > > > @@ -95,6 +96,7 @@ struct VTDAddressSpace {
> > > > > >       QLIST_ENTRY(VTDAddressSpace) next;
> > > > > >       /* Superset of notifier flags that this address space has */
> > > > > >       IOMMUNotifierFlag notifier_flags;
> > > > > > +    ITTree *iova_tree;          /* Traces mapped IOVA ranges */
> > > > > >   };
> > > > > >
> > > > > >   struct VTDBus {
> > > > > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > > > > index a19c18b8d4..8f396a5d13 100644
> > > > > > --- a/hw/i386/intel_iommu.c
> > > > > > +++ b/hw/i386/intel_iommu.c
> > > > > > @@ -768,12 +768,37 @@ typedef struct {
> > > > > >   static int vtd_page_walk_one(IOMMUTLBEntry *entry, int level,
> > > > > >                                vtd_page_walk_info *info)
> > > > > >   {
> > > > > > +    VTDAddressSpace *as = info->as;
> > > > > >       vtd_page_walk_hook hook_fn = info->hook_fn;
> > > > > >       void *private = info->private;
> > > > > > +    ITRange *mapped = it_tree_find(as->iova_tree, entry->iova,
> > > > > > +                                   entry->iova + entry->addr_mask);
> > > > > >
> > > > > >       assert(hook_fn);
> > > > > > +
> > > > > > +    /* Update local IOVA mapped ranges */
> > > > > > +    if (entry->perm) {
> > > > > > +        if (mapped) {
> > > > > > +            /* Skip since we have already mapped this range */
> > > > > > +            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > > > >addr_mask,
> > > > > > +                                             mapped->start, mapped->end);
> > > > > > +            return 0;
> > > > > > +        }
> > > > > > +        it_tree_insert(as->iova_tree, entry->iova,
> > > > > > +                       entry->iova + entry->addr_mask);
> > > > >
> > > > > I was consider a case e.g:
> > > > >
> > > > > 1) map A (iova) to B (pa)
> > > > > 2) invalidate A
> > > > > 3) map A (iova) to C (pa)
> > > > > 4) invalidate A
> > > > >
> > > > > In this case, we will probably miss a walk here. But I'm not sure it was
> > > > > allowed by the spec (though I think so).
> > > > >
> > > 
> > > Hi, Kevin,
> > > 
> > > Thanks for joining the discussion.
> > > 
> > > >
> > > > I thought it was wrong in a glimpse, but then changed my mind after
> > > > another thinking. As long as device driver can quiescent the device
> > > > to not access A (iova) within above window, then above sequence
> > > > has no problem since any stale mappings (A->B) added before step 4)
> > > > won't be used and then will get flushed after step 4). Regarding to
> > > > that actually the 1st invalidation can be skipped:
> > > >
> > > > 1) map A (iova) to B (pa)
> > > > 2) driver programs device to use A
> > > > 3) driver programs device to not use A
> > > > 4) map A (iova) to C (pa)
> > > > 	A->B may be still valid in IOTLB
> > > > 5) invalidate A
> > > > 6) driver programs device to use A
> > > 
> > > Note that IMHO this is a bit different from Jason's example, and it'll
> > > be fine.  Current code should work well with this scenario since the
> > > emulation code will not aware of the map A until step (5).  Then we'll
> > > have the correct mapping.
> > 
> > you are right. we still need the extra PSI otherwise the 1st mapping
> > is problematic for use. So back to Jason's example.
> > 
> > > 
> > > While for Jason's example it's exactly the extra PSI that might cause
> > > stale mappings (though again I think it's still problematic...).
> > 
> > problematic in software side (e.g. that way IOMMU core relies on
> > device drivers which conflict with the value of using IOMMU) but
> > it is OK from hardware p.o.v. btw the extra PSI itself doesn't cause
> > stale mapping. Instead it is device activity after that PSI may cause it.
> > 
> > > 
> > > Actually I think I can just fix up the code even if Jason's case
> > > happens by unmapping that first then remap:
> > > 
> > > diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> > > index 31e9b52452..2a9584f9d8 100644
> > > --- a/hw/i386/intel_iommu.c
> > > +++ b/hw/i386/intel_iommu.c
> > > @@ -778,13 +778,21 @@ static int vtd_page_walk_one(IOMMUTLBEntry
> > > *entry, int level,
> > >      /* Update local IOVA mapped ranges */
> > >      if (entry->perm) {
> > >          if (mapped) {
> > > -            /* Skip since we have already mapped this range */
> > > -            trace_vtd_page_walk_one_skip_map(entry->iova, entry-
> > > >addr_mask,
> > > -                                             mapped->start, mapped->end);
> > > -            return 0;
> > > +            int ret;
> > > +            /* Cache the write permission */
> > > +            IOMMUAccessFlags flags = entry->perm;
> > > +
> > > +            /* UNMAP the old first then remap.  No need to touch IOVA tree */
> > > +            entry->perm = IOMMU_NONE;
> > > +            ret = hook_fn(entry, private);
> > > +            if (ret) {
> > > +                return ret;
> > > +            }
> > > +            entry->perm = flags;
> > > +        } else {
> > > +            it_tree_insert(as->iova_tree, entry->iova,
> > > +                           entry->iova + entry->addr_mask);
> > >          }
> > > -        it_tree_insert(as->iova_tree, entry->iova,
> > > -                       entry->iova + entry->addr_mask);
> > >      } else {
> > >          if (!mapped) {
> > >              /* Skip since we didn't map this range at all */
> > > 
> > > If we really think it necessary, I can squash this in, though this is
> > > a bit ugly.  But I just want to confirm whether this would be anything
> > > we want...
> > > 
> > 
> > I didn’t look into your actual code yet. If others think above
> > change is OK then it's definitely good as we conform to hardware
> > behavior here. Otherwise if there is a way to detect such unusual 
> > usage and then adopt some action (say kill the VM), it's also fine 
> > since user knows he runs a bad OS which is not supported by 
> > Qemu. It's just not good if such situation is not handled, which 
> > leads to some undefined behavior which nobody knows the reason 
> > w/o hard debug into. :-)
> 
> Yeah, then let me do this. :)
> 
> Jason, would you be good with above change squashed?

Self NAK on this...

More than half of the whole series tries to solve the solo problem
that we unmapped some pages that were already mapped, which proved to
be wrong.  Now if we squash the change we will do the same wrong
thing, so we'll still have a very small window that the remapped page
be missing from a device's POV.

Now to solve this I suppose we'll need to cache every translation then
we can know whether a mapping has changed, and we only remap when it
really has changed.  But I'm afraid that can be a big amount of data
for nested guests.  For a most common 4G L2 guest, I think the worst
case (e.g., no huge page at all, no continuous pages) is 4G/4K=1M
entries in that tree.

Is it really worth it to solve this possibly-buggy-guest-OS problem
with such a overhead?  I don't know..

I'm not sure whether it's still acceptable that we put this issue
aside.  We should know that normal OSs should not do this, and if they
do, IMHO it proves a buggy OS already (so even from hardware POV we
allow this, from software POV it should still be problematic), then
it'll have problem for sure, but only within the VM itself, and it
won't affect other VMs or the host.  That sounds still reasonable to
me so far.

Of course I'm always willing to listen to more advice on this.

Thanks,

-- 
Peter Xu

  reply	other threads:[~2018-04-27 11:40 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-04-25  4:51 [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 01/10] intel-iommu: send PSI always even if across PDEs Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 02/10] intel-iommu: remove IntelIOMMUNotifierNode Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 03/10] intel-iommu: add iommu lock Peter Xu
2018-04-25 16:26   ` Emilio G. Cota
2018-04-26  5:45     ` Peter Xu
2018-04-27  5:13   ` Jason Wang
2018-04-27  6:26     ` Peter Xu
2018-04-27  7:19       ` Tian, Kevin
2018-04-27  9:53         ` Peter Xu
2018-04-28  1:54           ` Tian, Kevin
2018-04-28  1:43       ` Jason Wang
2018-04-28  2:24         ` Peter Xu
2018-04-28  2:42           ` Jason Wang
2018-04-28  3:06             ` Peter Xu
2018-04-28  3:11               ` Jason Wang
2018-04-28  3:14             ` Peter Xu
2018-04-28  3:16               ` Jason Wang
2018-04-30  7:22               ` Paolo Bonzini
2018-04-30  7:20           ` Paolo Bonzini
2018-05-03  5:39             ` Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 04/10] intel-iommu: only do page walk for MAP notifiers Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 05/10] intel-iommu: introduce vtd_page_walk_info Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 06/10] intel-iommu: pass in address space when page walk Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 07/10] util: implement simple interval tree logic Peter Xu
2018-04-27  5:53   ` Jason Wang
2018-04-27  6:27     ` Peter Xu
2018-05-03  7:10     ` Peter Xu
2018-05-03  7:21       ` Jason Wang
2018-05-03  7:30         ` Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 08/10] intel-iommu: maintain per-device iova ranges Peter Xu
2018-04-27  6:07   ` Jason Wang
2018-04-27  6:34     ` Peter Xu
2018-04-27  7:02     ` Tian, Kevin
2018-04-27  7:28       ` Peter Xu
2018-04-27  7:44         ` Tian, Kevin
2018-04-27  9:55           ` Peter Xu
2018-04-27 11:40             ` Peter Xu [this message]
2018-04-27 23:37               ` Tian, Kevin
2018-05-03  6:04                 ` Peter Xu
2018-05-03  7:20                   ` Jason Wang
2018-05-03  7:28                     ` Peter Xu
2018-05-03  7:43                       ` Jason Wang
2018-05-03  7:53                         ` Peter Xu
2018-05-03  9:22                           ` Jason Wang
2018-05-03  9:53                             ` Peter Xu
2018-05-03 12:01                               ` Peter Xu
2018-04-28  1:49               ` Jason Wang
2018-04-25  4:51 ` [Qemu-devel] [PATCH 09/10] intel-iommu: don't unmap all for shadow page table Peter Xu
2018-04-25  4:51 ` [Qemu-devel] [PATCH 10/10] intel-iommu: remove notify_unmap for page walk Peter Xu
2018-04-25  5:05 ` [Qemu-devel] [PATCH 00/10] intel-iommu: nested vIOMMU, cleanups, bug fixes no-reply
2018-04-25  5:34   ` Peter Xu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20180427114029.GF13269@xz-mi \
    --to=peterx@redhat.com \
    --cc=alex.williamson@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=jintack@cs.columbia.edu \
    --cc=kevin.tian@intel.com \
    --cc=mst@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.