From: Peter Xu <peterx@redhat.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: qemu-devel@nongnu.org, tianyu.lan@intel.com,
	kevin.tian@intel.com, mst@redhat.com, jan.kiszka@siemens.com,
	jasowang@redhat.com, bd.aviv@gmail.com
Subject: Re: [Qemu-devel] [PATCH RFC v3 13/14] intel_iommu: allow dynamic switch of IOMMU region
Date: Wed, 18 Jan 2017 15:49:44 +0800
Message-ID: <20170118074944.GR30108@pxdev.xzpeter.org>
In-Reply-To: <20170117084604.2b1f5e50@t450s.home>

On Tue, Jan 17, 2017 at 08:46:04AM -0700, Alex Williamson wrote:
> On Tue, 17 Jan 2017 22:00:00 +0800
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Mon, Jan 16, 2017 at 12:53:57PM -0700, Alex Williamson wrote:
> > > On Fri, 13 Jan 2017 11:06:39 +0800
> > > Peter Xu <peterx@redhat.com> wrote:
> > >   
> > > > This is preparation work to finally enable dynamic switching ON/OFF for
> > > > VT-d protection. The old VT-d code uses a static IOMMU address space,
> > > > and that won't satisfy vfio-pci device listeners.
> > > > 
> > > > Let me explain.
> > > > 
> > > > vfio-pci devices depend on the memory region listener and IOMMU replay
> > > > mechanism to make sure the device mapping is coherent with the guest
> > > > even if there are domain switches. And there are two kinds of domain
> > > > switches:
> > > > 
> > > >   (1) switch from domain A -> B
> > > >   (2) switch from domain A -> no domain (e.g., turn DMAR off)
> > > > 
> > > > Case (1) is handled by the context entry invalidation handling by the
> > > > VT-d replay logic. What the replay function should do here is to replay
> > > > the existing page mappings in domain B.  
> > > 
> > > There's really 2 steps here, right?  Invalidate A, replay B.  I think
> > > the code handles this, but I want to make sure.  We don't want to end
> > > up with a superset of both A & B.  
> > 
> > First of all, this discussion should be beyond this patch's scope,
> > since this patch is currently only handling the case when guest
> > disables DMAR in general.
> > 
> > Then, my understanding for above question: when we do A -> B domain
> > switch, guest will not send specific context entry invalidations for
> > A, but will for sure send one when context entry is ready for B. In
> > that sense, IMO we don't have a clear "two steps", only one, which is
> > the latter "replay B". We do correct unmap based on the PSIs
> > (page-selective invalidations) of A when guest unmaps the pages in A.
> > 
> > So, for the use case of nested device assignment (which is the goal of
> > this series for now):
> > 
> > - L1 guest puts devices D1,D2,... of the L2 guest into domain A
> > - L1 guest maps the L2 memory into L1 address space (L2GPA -> L1GPA)
> > - ... (L2 guest runs, until it stops running)
> > - L1 guest unmaps all the pages in domain A
> > - L1 guest moves devices D1,D2,... of the L2 guest out of domain A
> > 
> > This series should work for the above, since before any device leaves
> > its domain, the domain will be clean, with no pages left mapped.
> > 
> > However, if we have the following scenario (which I don't know whether
> > this's achievable):
> > 
> > - guest iommu domain A has device D1, D2
> > - guest iommu domain B has device D3
> > - move device D2 from domain A into B
> > 
> > Here when D2 moves from A to B, IIUC our current Linux IOMMU driver
> > code will not send any PSIs (page-selective invalidations) for D2 or
> > domain A, because domain A still has a device in it. The guest should
> > only send a context entry invalidation for device D2, telling us that
> > D2 has switched to domain B. In that case, I am not sure whether the
> > current series can work properly, and IMHO we may need to have the
> > domain knowledge in the VT-d emulation code (which we don't have yet)
> > in the future to further support this kind of domain switch.
> 
> This is a serious issue that needs to be resolved.  The context entry
> invalidation when D2 is switched from A->B must unmap anything from
> domain A before the replay of domain B.  Your example is easily
> achieved, for instance what if domain A is the SI (static identity)
> domain for the L1 guest, domain B is the device assignment domain for
> the L2 guest with current device D3.  The user hot adds device D2 into
> the L2 guest moving it from the L1 SI domain to the device assignment
> domain.  vfio will not override existing mappings on replay, it will
> error, giving the L2 guest a device with access to the static identity
> mappings of the L1 host.  This isn't acceptable.
>  
> > > On the invalidation, a future optimization when disabling an entire
> > > memory region might also be to invalidate the entire range at once
> > > rather than each individual mapping within the range, which I think is
> > > what happens now, right?  
> > 
> > Right. IIUC this can be an enhancement to the current page walk logic - we
> > can coalesce contiguous IOTLB entries with the same properties and notify
> > only once for these coalesced entries.
> > 
> > Noted in my todo list.
> 
> A context entry invalidation as in the example above might make use of
> this to skip any sort of page walk logic, simply invalidate the entire
> address space.

Alex, I have one more thing to ask:

I was trying to invalidate the entire address space by sending a big
IOTLB notification to vfio-pci, which looks like:

  IOMMUTLBEntry entry = {
      .target_as = &address_space_memory,
      .iova = 0,
      .translated_addr = 0,
      .addr_mask = (1ULL << 63) - 1,
      .perm = IOMMU_NONE,     /* UNMAP */
  };
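
A side note for anyone reproducing the entry above: the mask has to be
computed in 64-bit arithmetic. Where int is 32 bits wide, shifting a
plain int constant 1 by 63 is undefined behaviour in C, so the constant
needs a 64-bit type. The sketch below is only an illustration;
full_unmap_mask() is a hypothetical helper and not part of QEMU, and it
assumes the address type is a 64-bit unsigned integer.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not a QEMU function): build the "everything
 * below bit 63" mask using a 64-bit unsigned constant, so the shift
 * is well defined regardless of the width of int.
 */
static uint64_t full_unmap_mask(void)
{
    return (UINT64_C(1) << 63) - 1;
}
```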

Then I feed this entry to the vfio-pci IOMMU notifier.

However, this was blocked in vfio_iommu_map_notify(), with the error:

  qemu-system-x86_64: iommu has granularity incompatible with target AS

Since we have:

  /*
   * The IOMMU TLB entry we have just covers translation through
   * this IOMMU to its immediate target.  We need to translate
   * it the rest of the way through to memory.
   */
  rcu_read_lock();
  mr = address_space_translate(&address_space_memory,
                               iotlb->translated_addr,
                               &xlat, &len, iotlb->perm & IOMMU_WO);
  if (!memory_region_is_ram(mr)) {
      error_report("iommu map to non memory area %"HWADDR_PRIx"",
                   xlat);
      goto out;
  }
  /*
   * Translation truncates length to the IOMMU page size,
   * check that it did not truncate too much.
   */
  if (len & iotlb->addr_mask) {
      error_report("iommu has granularity incompatible with target AS");
      goto out;
  }

In my case len == 0xa0000 (that's the translation result), and
iotlb->addr_mask == (1<<63)-1. So it looks like the translation above
split the big region, and a simple big UNMAP doesn't work for me.
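
To make the failure concrete, here is a small standalone model of the
check, not the QEMU code itself. It assumes the requested length is
addr_mask + 1 and that the translation clips the length to the end of
the containing section. struct toy_section, toy_translate_len() and
granularity_ok() are hypothetical names, and the 0xa0000 section size
is chosen only to reproduce the length reported above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one flat RAM section; not a QEMU type. */
struct toy_section {
    uint64_t start;
    uint64_t size;
};

/*
 * Toy model of the length clipping: the returned length never extends
 * past the end of the section containing addr. Returns 0 when addr
 * lies outside the section.
 */
static uint64_t toy_translate_len(const struct toy_section *s,
                                  uint64_t addr, uint64_t want)
{
    if (addr < s->start || addr - s->start >= s->size) {
        return 0;
    }
    uint64_t avail = s->size - (addr - s->start);
    return want < avail ? want : avail;
}

/* Same test as the quoted check: any length bit under the mask fails. */
static bool granularity_ok(uint64_t len, uint64_t addr_mask)
{
    return (len & addr_mask) == 0;
}
```

With a section { .start = 0, .size = 0xa0000 } and a request of 1ULL << 63
bytes at address 0, toy_translate_len() returns 0xa0000. Every set bit of
that length lies under the (1ULL << 63) - 1 mask, so granularity_ok()
rejects it, while the same length passes against a 4K mask of 0xfff.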

Do you have any suggestions on how I can solve this? In what cases do
we need the above address_space_translate()?

> 
> > >   
> > > > However for case (2), we don't want to replay any domain mappings - we
> > > > just need the default GPA->HPA mappings (the address_space_memory
> > > > mapping). And this patch helps on case (2) to build up the mapping
> > > > automatically by leveraging the vfio-pci memory listeners.  
> > > 
> > > Have you thought about using this address space switching to emulate
> > > ecap.PT?  ie. advertise hardware based passthrough so that the guest
> > > doesn't need to waste pagetable entries for a direct mapped, static
> > > identity domain.  
> > 
> > Kind of. Currently we still don't have iommu=pt for the emulated code.
> > We can achieve that by leveraging this patch.
> 
> Well, we have iommu=pt, but the L1 guest will implement this as a fully
> populated SI domain rather than as a bit in the context entry to do
> hardware direct translation.  Given the mapping overhead through vfio,
> the L1 guest will always want to use iommu=pt as dynamic mapping
> performance is going to be horrid.  Thanks,

I see, so we have iommu=pt in the guest even though the VT-d emulation
does not provide that bit. Anyway, supporting ecap.PT is on my todo list.

Thanks,

-- peterx
