From: Lan Tianyu <tianyu.lan@intel.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>,
	Jan Beulich <JBeulich@suse.com>,
	Kevin Tian <kevin.tian@intel.com>,
	"yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
	Jun Nakajima <jun.nakajima@intel.com>,
	Stefano Stabellini <sstabellini@kernel.org>
Cc: "anthony.perard@citrix.com" <anthony.perard@citrix.com>,
	xuquan8@huawei.com,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	"ian.jackson@eu.citrix.com" <ian.jackson@eu.citrix.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: Xen virtual IOMMU high level design doc V2
Date: Thu, 20 Oct 2016 22:17:34 +0800
Message-ID: <884d6ac3-5f8a-a01a-f87d-26037b1069e3@intel.com>
In-Reply-To: <4a8616a2-a576-aadc-993f-3d349f91f310@citrix.com>

Hi Andrew:
	Thanks for your review.

On 19/10/16 03:17, Andrew Cooper wrote:
> On 18/10/16 15:14, Lan Tianyu wrote:
>> Change since V1:
>>     1) Update motivation for Xen vIOMMU - 288 vcpus support part
>>     2) Change definition of struct xen_sysctl_viommu_op
>>     3) Update "3.5 Implementation consideration" to explain why we
>> need to enable l2 translation first.
>>     4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
>> on the emulated I440 chipset.
>>     5) Remove stale statement in the "3.3 Interrupt remapping"
>>
>> Content:
>> ===============================================================================
>>
>> 1. Motivation of vIOMMU
>>     1.1 Enable more than 255 vcpus
>>     1.2 Support VFIO-based user space driver
>>     1.3 Support guest Shared Virtual Memory (SVM)
>> 2. Xen vIOMMU Architecture
>>     2.1 l2 translation overview
>>     2.2 Interrupt remapping overview
>> 3. Xen hypervisor
>>     3.1 New vIOMMU hypercall interface
>>     3.2 l2 translation
>>     3.3 Interrupt remapping
>>     3.4 l1 translation
>>     3.5 Implementation consideration
>> 4. Qemu
>>     4.1 Qemu vIOMMU framework
>>     4.2 Dummy xen-vIOMMU driver
>>     4.3 Q35 vs. i440x
>>     4.4 Report vIOMMU to hvmloader
>>
>>
>> 1 Motivation for Xen vIOMMU
>> ===============================================================================
>>
>> 1.1 Enable more than 255 vcpu support
>> HPC cloud service requires VM provides high performance parallel
>> computing and we hope to create a huge VM with >255 vcpu on one machine
>> to meet such requirement.Ping each vcpus on separated pcpus. More than
>
> Pin ?
>

Sorry, it's a typo.

> Also, grammatically speaking, I think you mean "each vcpu to separate
> pcpus".


Yes.

>
>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>> there is no interrupt remapping function which is present by vIOMMU.
>> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
>
> This is only a requirement for xapic interrupt sources.  x2apic
> interrupt sources already deliver correctly.

The key is the APIC ID. The introduction of x2APIC did not modify the
existing PCI MSI and IOAPIC formats: PCI MSI/IOAPIC can only send an
interrupt message containing an 8-bit APIC ID, which cannot address more
than 255 CPUs. Interrupt remapping supports a 32-bit APIC ID, so it is
necessary for enabling more than 255 CPUs in x2APIC mode.

If the LAPIC is in x2APIC mode while interrupt remapping is disabled,
the IOAPIC cannot deliver interrupts to all CPUs in a system with more
than 255 CPUs.
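
For illustration, here is roughly how the destination is encoded in a
compatibility-format MSI address versus a remappable-format one (field
positions follow the VT-d spec; the macro names are just ours for this
sketch):

/* Compatibility format: bits 19:12 of the MSI address carry only an
 * 8-bit destination APIC ID. */
#define MSI_ADDR_COMPAT_DEST_ID(addr)   (((addr) >> 12) & 0xff)

/* Remappable format: bit 4 is set and bits 19:5 plus bit 2 form a
 * 16-bit IRTE handle; the IRTE holds a full 32-bit destination ID,
 * which is what makes CPUs above 255 reachable. */
#define MSI_ADDR_FORMAT_REMAPPABLE      (1u << 4)
#define MSI_ADDR_IRTE_HANDLE(addr)      ((((addr) >> 5) & 0x7fff) | \
                                         ((((addr) >> 2) & 0x1) << 15))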


>
>>
>>
>> 1.3 Support guest SVM (Shared Virtual Memory)
>> It relies on the l1 translation table capability (GVA->GPA) on
>> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
>> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
>> is the main usage today (to support OpenCL 2.0 SVM feature). In the
>> future SVM might be used by other I/O devices too.
>
> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
> ATS/PASID, or something rather more magic as IGD is on the same piece of
> silicon?

IGD on Skylake supports PCIe PASID.


>
>>
>> 2. Xen vIOMMU Architecture
>> ================================================================================
>>
>>
>> * vIOMMU will be inside Xen hypervisor for following factors
>>     1) Avoid round trips between Qemu and Xen hypervisor
>>     2) Ease of integration with the rest of the hypervisor
>>     3) HVMlite/PVH doesn't use Qemu
>> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
>> /destroy vIOMMU in hypervisor and deal with virtual PCI device's l2
>> translation.
>>
>> 2.1 l2 translation overview
>> For Virtual PCI device, dummy xen-vIOMMU does translation in the
>> Qemu via new hypercall.
>>
>> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
>> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
>>
>> The following diagram shows l2 translation architecture.
>
> Which scenario is this?  Is this the passthrough case where the Qemu
> Virtual PCI device is a shadow of the real PCI device in hardware?
>

No, this covers both the traditional virtual PCI devices emulated by
Qemu and passthrough PCI devices.


>> +---------------------------------------------------------+
>> |Qemu                                +----------------+   |
>> |                                    |     Virtual    |   |
>> |                                    |   PCI device   |   |
>> |                                    |                |   |
>> |                                    +----------------+   |
>> |                                            |DMA         |
>> |                                            V            |
>> |  +--------------------+   Request  +----------------+   |
>> |  |                    +<-----------+                |   |
>> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
>> |  |                    +----------->+                |   |
>> |  +---------+----------+            +-------+--------+   |
>> |            |                               |            |
>> |            |Hypercall                      |            |
>> +--------------------------------------------+------------+
>> |Hypervisor  |                               |            |
>> |            |                               |            |
>> |            v                               |            |
>> |     +------+------+                        |            |
>> |     |   vIOMMU    |                        |            |
>> |     +------+------+                        |            |
>> |            |                               |            |
>> |            v                               |            |
>> |     +------+------+                        |            |
>> |     | IOMMU driver|                        |            |
>> |     +------+------+                        |            |
>> |            |                               |            |
>> +--------------------------------------------+------------+
>> |HW          v                               V            |
>> |     +------+------+                 +-------------+     |
>> |     |   IOMMU     +---------------->+  Memory     |     |
>> |     +------+------+                 +-------------+     |
>> |            ^                                            |
>> |            |                                            |
>> |     +------+------+                                     |
>> |     | PCI Device  |                                     |
>> |     +-------------+                                     |
>> +---------------------------------------------------------+
>>
>> 2.2 Interrupt remapping overview.
>> Interrupts from virtual devices and physical devices will be delivered
>> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this
>> procedure.
>>
>> +---------------------------------------------------+
>> |Qemu                       |VM                     |
>> |                           | +----------------+    |
>> |                           | |  Device driver |    |
>> |                           | +--------+-------+    |
>> |                           |          ^            |
>> |       +----------------+  | +--------+-------+    |
>> |       | Virtual device |  | |  IRQ subsystem |    |
>> |       +-------+--------+  | +--------+-------+    |
>> |               |           |          ^            |
>> |               |           |          |            |
>> +---------------------------+-----------------------+
>> |hypervisor     |                      | VIRQ       |
>> |               |            +---------+--------+   |
>> |               |            |      vLAPIC      |   |
>> |               |            +---------+--------+   |
>> |               |                      ^            |
>> |               |                      |            |
>> |               |            +---------+--------+   |
>> |               |            |      vIOMMU      |   |
>> |               |            +---------+--------+   |
>> |               |                      ^            |
>> |               |                      |            |
>> |               |            +---------+--------+   |
>> |               |            |   vIOAPIC/vMSI   |   |
>> |               |            +----+----+--------+   |
>> |               |                 ^    ^            |
>> |               +-----------------+    |            |
>> |                                      |            |
>> +---------------------------------------------------+
>> HW                                     |IRQ
>>                                +-------------------+
>>                                |   PCI Device      |
>>                                +-------------------+
>>
>>
>>
>>
>> 3 Xen hypervisor
>> ==========================================================================
>>
>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>> This hypercall should also support pv IOMMU which is still under RFC
>> review. Here only covers non-pv part.
>>
>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
>> parameter.
>
> Why did you choose sysctl?  As these are per-domain, domctl would be a
> more logical choice.  However, neither of these should be usable by
> Qemu, and we are trying to split out "normal qemu operations" into dmops
> which can be safely deprivileged.
>

Do you know the current status of dmop? I only found some design
discussions on the mailing list. Could we use domctl first and move to
dmop when it's ready?

>
>>
>> struct xen_sysctl_viommu_op {
>>     u32 cmd;
>>     u32 domid;
>>     union {
>>         struct {
>>             u32 capabilities;
>>         } query_capabilities;
>>         struct {
>>             u32 capabilities;
>>             u64 base_address;
>>         } create_iommu;
>>         struct {
>>             /* IN parameters. */
>>             u16 segment;
>>             u8  bus;
>>             u8  devfn;
>
> I think this would be cleaner as u32 vsbdf, which makes it clear which
> address space to look for sbdf in.

Ok. Will update.
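
Just to confirm, the l2_translation parameters would then look roughly
like this (a sketch of the revised layout, not a final ABI):

    struct {
        /* IN parameters. */
        u32 vsbdf;           /* virtual (segment << 16) | (bus << 8) | devfn */
        u64 iova;
        /* OUT parameters. */
        u64 translated_addr;
        u64 addr_mask;       /* Translation page size */
        u32 permission;      /* plain u32 flags instead of the enum, see below */
    } l2_translation;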

>
>>             u64 iova;
>>             /* Out parameters. */
>>             u64 translated_addr;
>>             u64 addr_mask; /* Translation page size */
>>             IOMMUAccessFlags permission;
>
> How is this translation intended to be used?  How do you plan to avoid
> race conditions where qemu requests a translation, receives one, the
> guest invalidated the mapping, and then qemu tries to use its translated
> address?
>
> There are only two ways I can see of doing this race-free.  One is to
> implement a "memcpy with translation" hypercall, and the other is to
> require the use of ATS in the vIOMMU, where the guest OS is required to
> wait for a positive response from the vIOMMU before it can safely reuse
> the mapping.
>
> The former behaves like real hardware in that an intermediate entity
> performs the translation without interacting with the DMA source.  The
> latter explicitly exposing the fact that caching is going on at the
> endpoint to the OS.

The former seems to move the DMA operation into the hypervisor, but
Qemu's vIOMMU framework only passes the IOVA to the dummy xen-vIOMMU,
without the input data and access length. I will dig into this more to
figure out a solution.
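
For reference, a rough sketch of how the dummy xen-vIOMMU's translate
callback could wrap the translation hypercall as currently proposed
(XenVIOMMUState and xc_viommu_l2_translate() are hypothetical names, not
existing interfaces); it also shows where the race you describe bites:

static IOMMUTLBEntry xen_viommu_translate(MemoryRegion *iommu,
                                          hwaddr iova, bool is_write)
{
    XenVIOMMUState *s = container_of(iommu, XenVIOMMUState, iommu_mr);
    IOMMUTLBEntry entry = {
        .target_as = &address_space_memory,
        .iova = iova,
        .perm = IOMMU_NONE,
    };
    uint64_t gpa, mask;
    uint32_t perm;

    /* Ask Xen to walk the guest's l2 (IOVA->GPA) table for this vsbdf.
     * Nothing stops the guest invalidating the mapping right after this
     * returns, which is the race discussed above. */
    if (!xc_viommu_l2_translate(xen_xc, xen_domid, s->vsbdf, iova,
                                &gpa, &mask, &perm)) {
        entry.translated_addr = gpa;
        entry.addr_mask = mask;
        entry.perm = perm;
    }
    return entry;
}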

>
>>         } l2_translation;
>> };
>>
>> typedef enum {
>>     IOMMU_NONE = 0,
>>     IOMMU_RO   = 1,
>>     IOMMU_WO   = 2,
>>     IOMMU_RW   = 3,
>> } IOMMUAccessFlags;
>
> No enumerations in an ABI please.  They are not stable in C.  Please us
> a u32 and more #define's


Ok. Will update.
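
Something along these lines, keeping the same values (a sketch; the
final names can of course differ):

    /* Access permissions, carried in a u32 field. */
    #define XEN_VIOMMU_ACCESS_NONE   0
    #define XEN_VIOMMU_ACCESS_RO     1
    #define XEN_VIOMMU_ACCESS_WO     2
    #define XEN_VIOMMU_ACCESS_RW     3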

>
>>
>>
>> Definition of VIOMMU subops:
>> #define XEN_SYSCTL_viommu_query_capability        0
>> #define XEN_SYSCTL_viommu_create            1
>> #define XEN_SYSCTL_viommu_destroy            2
>> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
>>
>> Definition of VIOMMU capabilities
>> #define XEN_VIOMMU_CAPABILITY_l1_translation    (1 << 0)
>> #define XEN_VIOMMU_CAPABILITY_l2_translation    (1 << 1)
>> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
>
> How are vIOMMUs going to be modelled to guests?  On real hardware, they
> all seem to end associated with a PCI device of some sort, even if it is
> just the LPC bridge.


This design assumes one vIOMMU that covers all PCI devices under its
specified PCI segment; the "INCLUDE_PCI_ALL" bit of the DRHD structure
is set for the vIOMMU.
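
For reference, the DRHD structure the dummy xen-vIOMMU would emit looks
roughly like this (layout per the VT-d spec; the C names here are only
illustrative):

    struct acpi_dmar_drhd {
        uint16_t type;           /* 0 = DRHD */
        uint16_t length;
        uint8_t  flags;          /* bit 0: INCLUDE_PCI_ALL */
        uint8_t  reserved;
        uint16_t segment;        /* PCI segment covered by this vIOMMU */
        uint64_t base_address;   /* register base given to viommu_create */
        /* device scope entries follow when INCLUDE_PCI_ALL is clear */
    } __attribute__((packed));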

> 	
> How do we deal with multiple vIOMMUs in a single guest?

For multiple vIOMMUs, we would need to add a new field to the viommu_op
structure to designate the device scope of each vIOMMU if they are under
the same PCI segment; one possible shape is sketched below. This also
requires changes to the DMAR table.
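
A sketch only, letting the create call carry an explicit device scope
instead of relying on INCLUDE_PCI_ALL (field names are illustrative):

    struct {
        u32 capabilities;
        u64 base_address;
        u16 segment;                         /* PCI segment of this vIOMMU */
        u32 nr_devices;                      /* 0 => INCLUDE_PCI_ALL */
        XEN_GUEST_HANDLE_64(uint32) vsbdfs;  /* covered devices otherwise */
    } create_iommu;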

>
>>
>>
>> 2) Design for subops
>> - XEN_SYSCTL_viommu_query_capability
>>        Get vIOMMU capabilities(l1/l2 translation and interrupt
>> remapping).
>>
>> - XEN_SYSCTL_viommu_create
>>       Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
>> base address.
>>
>> - XEN_SYSCTL_viommu_destroy
>>       Destroy vIOMMU in Xen hypervisor with dom_id as parameter.
>>
>> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>>       Translate IOVA to GPA for specified virtual PCI device with dom id,
>> PCI device's bdf and IOVA and xen hypervisor returns translated GPA,
>> address mask and access permission.
>>
>>
>> 3.2 l2 translation
>> 1) For virtual PCI device
>> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
>> hypercall when DMA operation happens.
>>
>> 2) For physical PCI device
>> DMA operations go through physical IOMMU directly and IO page table for
>> IOVA->HPA should be loaded into physical IOMMU. When guest updates
>> l2 Page-table pointer field, it provides IO page table for
>> IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
>> GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
>> Page-table pointer to context entry of physical IOMMU.
>
> How are you proposing to do this shadowing?  Do we need to trap and
> emulate all writes to the vIOMMU pagetables, or is there a better way to
> know when the mappings need invalidating?

No, we don't need to trap all writes to the IO page table.
From VT-d spec 6.1: "Reporting the Caching Mode as Set for the
virtual hardware requires the guest software to explicitly issue
invalidation operations on the virtual hardware for any/all updates to
the guest remapping structures. The virtualizing software may trap these
guest invalidation operations to keep the shadow translation structures
consistent to guest translation structure modifications, without
resorting to other less efficient techniques."
So every update to the IO page table is followed by an invalidation
operation, and we use those invalidations to do the shadowing.
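
A minimal sketch of that shadowing step, driven by a trapped guest
IOTLB invalidation; walk_guest_l2(), gpa_to_hpa(), shadow_map() and
shadow_unmap() are hypothetical helpers standing in for the real guest
page-table walk, p2m lookup and shadow-table update:

static void viommu_shadow_range(struct domain *d, uint32_t vsbdf,
                                uint64_t iova, uint64_t npages)
{
    uint64_t i;

    for ( i = 0; i < npages; i++ )
    {
        uint64_t cur = iova + (i << PAGE_SHIFT);
        uint64_t gpa, hpa;

        if ( walk_guest_l2(d, vsbdf, cur, &gpa) &&   /* IOVA -> GPA */
             gpa_to_hpa(d, gpa, &hpa) )              /* GPA  -> HPA */
            shadow_map(d, vsbdf, cur, hpa);          /* update shadow */
        else
            shadow_unmap(d, vsbdf, cur);             /* drop stale entry */
    }

    /* Finally flush the physical IOMMU's IOTLB for this device scope. */
}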

>
>>
>> Now all PCI devices in same hvm domain share one IO page table
>> (GPA->HPA) in physical IOMMU driver of Xen. To support l2
>> translation of vIOMMU, the IOMMU driver needs to support multiple address
>> spaces per device entry: use the existing IO page table (GPA->HPA) by
>> default and switch to the shadow IO page table (IOVA->HPA) when the l2
>> translation function is enabled. These changes will not affect current
>> P2M logic.
>>
>> 3.3 Interrupt remapping
>> Interrupts from virtual devices and physical devices will be delivered
>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>> according to the interrupt remapping table.
>>
>>
>> 3.4 l1 translation
>> When nested translation is enabled, any address generated by l1
>> translation is used as the input address for nesting with l2
>> translation. Physical IOMMU needs to enable both l1 and l2 translation
>> in nested translation mode(GVA->GPA->HPA) for passthrough
>> device.
>
> All these l1 and l2 translations are getting confusing.  Could we
> perhaps call them guest translation and host translation, or is that
> likely to cause other problems?

Definitions of l1 and l2 translation from the VT-d spec:
first-level translation remaps a virtual address to an intermediate
(guest) physical address;
second-level translation remaps an intermediate physical address to a
machine (host) physical address.
So "guest" and "host" translation may not be suitable names for them?

>
>>
>> VT-d context entry points to guest l1 translation table which
>> will be nest-translated by l2 translation table and so it
>> can be directly linked to context entry of physical IOMMU.
>>
>> To enable l1 translation in VM
>> 1) Xen IOMMU driver enables nested translation mode
>> 2) Update GPA root of guest l1 translation table to context entry
>> of physical IOMMU.
>>
>> All handles are in hypervisor and no interaction with Qemu.
>>
>>
>> 3.5 Implementation consideration
>> VT-d spec doesn't define a capability bit for the l2 translation.
>> Architecturally there is no way to tell guest that l2 translation
>> capability is not available. Linux Intel IOMMU driver thinks l2
>> translation is always available when VT-d exists, and it fails to load
>> without l2 translation support even if interrupt remapping and l1
>> translation are available. So it needs to enable l2 translation first
>> before other functions.
>
> What then is the purpose of the nested translation support bit in the
> extended capability register?

It is used to translate the GPA output of the first-level translation (IOVA->GPA) into an HPA.

For details, please see the VT-d spec, 3.8 Nested Translation:
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such extended-context-
entries contain both the pointer to the PASID-table (which contains the
pointer to the first-level translation structures), and the pointer to
the second-level translation structures."


>> 4.4 Report vIOMMU to hvmloader
>> Hvmloader is in charge of building ACPI tables for Guest OS and OS
>> probes IOMMU via ACPI DMAR table. So hvmloader needs to know whether
>> vIOMMU is enabled or not and its capability to prepare ACPI DMAR table
>> for Guest OS.
>>
>> There are three ways to do that.
>> 1) Extend struct hvm_info_table and add variables in the struct
>> hvm_info_table to pass vIOMMU information to hvmloader. But this
>> requires to add new xc interface to use struct hvm_info_table in the
>> Qemu.
>>
>> 2) Pass vIOMMU information to hvmloader via Xenstore
>>
>> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
>> This solution is already present in the vNVDIMM design(4.3.1
>> Building Guest ACPI Tables
>> http://www.gossamer-threads.com/lists/xen/devel/439766).
>>
>> The third option seems more clear and hvmloader doesn't need to deal
>> with vIOMMU stuffs and just pass through DMAR table to Guest OS. All
>> vIOMMU specific stuffs will be processed in the dummy xen-vIOMMU driver.
>
> Part of ACPI table building has now moved into the toolstack.  Unless
> the table needs creating dynamically (which doesn't appear to be the
> case), it can be done without any further communication.
>

The DMAR table needs to be created according to input parameters.
E.g. when interrupt remapping is enabled, the INTR_REMAP bit in the DMAR
structure needs to be set. So we need to create the table dynamically
during VM creation; a rough sketch is below.
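
The struct layout follows the VT-d/ACPI spec; the helper and struct
names here are only illustrative:

    struct acpi_table_dmar {
        struct acpi_table_header header;   /* signature "DMAR" */
        uint8_t  host_address_width;       /* MGAW - 1 */
        uint8_t  flags;                    /* bit 0: INTR_REMAP */
        uint8_t  reserved[10];
        /* DRHD/RMRR/... remapping structures follow */
    } __attribute__((packed));

    static void build_dmar_header(struct acpi_table_dmar *dmar, uint32_t caps)
    {
        memset(dmar, 0, sizeof(*dmar));
        memcpy(dmar->header.signature, "DMAR", 4);
        dmar->host_address_width = 48 - 1;            /* e.g. 48-bit MGAW */
        if ( caps & XEN_VIOMMU_CAPABILITY_interrupt_remapping )
            dmar->flags |= 1;                         /* INTR_REMAP */
    }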

-- 
Best regards
Tianyu Lan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel
