Re: Xen virtual IOMMU high level design doc V2

From: "Tian, Kevin" <kevin.tian@intel.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
	"Lan, Tianyu" <tianyu.lan@intel.com>
Cc: "yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
	"xuquan8@huawei.com" <xuquan8@huawei.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Jan Beulich <JBeulich@suse.com>,
	Andrew Cooper <andrew.cooper3@citrix.com>,
	"ian.jackson@eu.citrix.com" <ian.jackson@eu.citrix.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	"anthony.perard@citrix.com" <anthony.perard@citrix.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: Xen virtual IOMMU high level design doc V2
Date: Thu, 20 Oct 2016 10:11:01 +0000
Message-ID: <AADFC41AFE54684AB9EE6CBC0274A5D18DFD4F3F@SHSMSX101.ccr.corp.intel.com>
In-Reply-To: <20161018202631.GO30736@x230.dumpdata.com>

> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, October 19, 2016 4:27 AM
> >
> > 2) For physical PCI device
> > DMA operations go though physical IOMMU directly and IO page table for
> > IOVA->HPA should be loaded into physical IOMMU. When guest updates
> > l2 Page-table pointer field, it provides IO page table for
> > IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
> > GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
> > Page-table pointer to context entry of physical IOMMU.
> >
> > Now all PCI devices in same hvm domain share one IO page table
> > (GPA->HPA) in physical IOMMU driver of Xen. To support l2
> > translation of vIOMMU, IOMMU driver need to support multiple address
> > spaces per device entry. Using existing IO page table(GPA->HPA)
> > defaultly and switch to shadow IO page table(IOVA->HPA) when l2
> 
> defaultly?
> 
> > translation function is enabled. These change will not affect current
> > P2M logic.
> 
> What happens if the guests IO page tables have incorrect values?
> 
> For example the guest sets up the pagetables to cover some section
> of HPA ranges (which are all good and permitted). But then during execution
> the guest kernel decides to muck around with the pagetables and adds an HPA
> range that is outside what the guest has been allocated.
> 
> What then?

The shadow PTE is controlled by the hypervisor. Any IOVA->GPA mapping in
a guest PTE must be validated (IOVA->GPA->HPA) before it is propagated
into the shadow PTE. So no matter when the guest mucks with its PTEs, the
operation is always trapped and validated. Why do you think there is a
problem?

Also, the guest only sees GPAs. All it can operate on is GPA ranges.
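
To make the validation step concrete, here is a minimal sketch in C. It
is not Xen code: toy_p2m, shadow_pte, toy_gpa_to_hpa() and
shadow_update() are hypothetical names, and the flat gfn->mfn array only
stands in for the real P2M lookup. The point is that the shadow entry is
written only after the guest-supplied GPA has been translated through the
domain's own P2M, so a GPA the guest does not own never reaches the
physical IOMMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12
#define INVALID_HPA    UINT64_MAX

/* Hypothetical stand-in for the domain's P2M: guest frame -> host frame. */
struct toy_p2m {
    const uint64_t *gfn_to_mfn;   /* INVALID_HPA marks an unpopulated gfn */
    size_t nr_gfns;
};

/* Hypothetical shadow entry (IOVA->HPA) owned by the hypervisor. */
struct shadow_pte {
    uint64_t hpa;
    bool present;
};

/* Translate a GPA through the toy P2M; INVALID_HPA if the guest does not own it. */
static uint64_t toy_gpa_to_hpa(const struct toy_p2m *p2m, uint64_t gpa)
{
    uint64_t gfn = gpa >> TOY_PAGE_SHIFT;
    uint64_t offset = gpa & ((1ULL << TOY_PAGE_SHIFT) - 1);

    if (gfn >= p2m->nr_gfns || p2m->gfn_to_mfn[gfn] == INVALID_HPA)
        return INVALID_HPA;

    return (p2m->gfn_to_mfn[gfn] << TOY_PAGE_SHIFT) | offset;
}

/*
 * Called when a guest write to an IOVA->GPA entry has been trapped.
 * The shadow entry becomes present only if the GPA validates.
 */
static bool shadow_update(const struct toy_p2m *p2m, struct shadow_pte *spte,
                          uint64_t guest_gpa)
{
    uint64_t hpa;

    assert(p2m != NULL && spte != NULL);

    hpa = toy_gpa_to_hpa(p2m, guest_gpa);
    if (hpa == INVALID_HPA) {
        /* Bogus guest mapping: leave the shadow entry not present. */
        spte->hpa = 0;
        spte->present = false;
        return false;
    }

    spte->hpa = hpa;
    spte->present = true;
    return true;
}
```

In this sketch a bad guest entry simply leaves the shadow entry not
present, so DMA through that IOVA would fault rather than reach memory
outside the domain. How the fault is reported back to the guest is out of
scope here.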

> >
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> > according interrupt remapping table.
> >
> >
> > 3.4 l1 translation
> > When nested translation is enabled, any address generated by l1
> > translation is used as the input address for nesting with l2
> > translation. Physical IOMMU needs to enable both l1 and l2 translation
> > in nested translation mode(GVA->GPA->HPA) for passthrough
> > device.
> >
> > VT-d context entry points to guest l1 translation table which
> > will be nest-translated by l2 translation table and so it
> > can be directly linked to context entry of physical IOMMU.
> 
> I think this means that the shared_ept will be disabled?
> >
> What about different versions of contexts? Say the V1 is exposed
> to guest but the hardware supports V2? Are there any flags that have
> swapped positions? Or is it pretty backwards compatible?

Yes, it is backward compatible.

> >
> >
> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exits and fail to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> 
> I am lost on that sentence. Are you saying that it tries to load
> the IOVA and if they fail.. then it keeps on going? What is the result
> of this? That you can't do IOVA (so can't use vfio ?)

It's about VT-d capability reporting. VT-d supports both 1st-level and
2nd-level translation, but only 1st-level translation is optional and
reported through a capability bit. There is no capability bit to say
that a version doesn't support 2nd-level translation. The implication is
that, as long as a vIOMMU is exposed, the guest IOMMU driver always
assumes that IOVA capability is available through 2nd-level translation.

So we can first emulate a vIOMMU with only 2nd-level capability, and
then extend it to support 1st-level translation and interrupt remapping,
but we cannot go in the reverse direction. I think Tianyu's point was
more to describe the enabling sequence based on this fact. :-)
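
A rough sketch of that constraint, in C. The flag names and helpers are
hypothetical and are not VT-d register bits or Xen interfaces; they only
model which feature sets a vIOMMU could sensibly expose, given that the
guest driver always assumes 2nd-level translation is there.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature flags for an emulated vIOMMU (not real register bits). */
#define VIOMMU_F_L2        (1u << 0)  /* 2nd-level translation (IOVA) */
#define VIOMMU_F_L1        (1u << 1)  /* 1st-level translation */
#define VIOMMU_F_IRQ_REMAP (1u << 2)  /* interrupt remapping */
#define VIOMMU_F_ALL       (VIOMMU_F_L2 | VIOMMU_F_L1 | VIOMMU_F_IRQ_REMAP)

/*
 * A feature set is exposable only if it includes 2nd-level translation,
 * because there is no way to tell the guest that it is absent.
 */
static bool viommu_features_valid(uint32_t features)
{
    if (features & ~VIOMMU_F_ALL)
        return false;               /* unknown flag */
    return (features & VIOMMU_F_L2) != 0;
}

/* Extend an already valid feature set with further optional features. */
static uint32_t viommu_extend(uint32_t features, uint32_t extra)
{
    assert(viommu_features_valid(features));
    return features | extra;
}
```

So starting from VIOMMU_F_L2 alone and later calling viommu_extend() with
VIOMMU_F_L1 | VIOMMU_F_IRQ_REMAP models the enabling sequence, while a
set holding only the latter two is rejected.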

> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create virtual IOMMU(e.g. virtual intel VTD and
> > AMD IOMMU) and report in guest ACPI table. So for Xen side, a dummy
> > xen-vIOMMU wrapper is required to connect with actual vIOMMU in Xen.
> > Especially for l2 translation of virtual PCI device because
> > emulations of virtual PCI devices are in the Qemu. Qemu's vIOMMU
> > framework provides callback to deal with l2 translation when
> > DMA operations of virtual PCI devices happen.
> 
> You say AMD and Intel. This sounds quite OS agnostic. Does it mean you
> could expose an vIOMMU to a guest and actually use the AMD IOMMU
> in the hypervisor?

Did you mean "expose an Intel vIOMMU to the guest and then use the
physical AMD IOMMU in the hypervisor"? I hadn't thought about this, but
what's the value of doing so? :-)

Thanks
Kevin
