From: "Lan, Tianyu" <tianyu.lan@intel.com>
To: Andrew Cooper <andrew.cooper3@citrix.com>,
Stefano Stabellini <sstabellini@kernel.org>
Cc: "yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
Kevin Tian <kevin.tian@intel.com>,
"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
Jan Beulich <JBeulich@suse.com>,
"ian.jackson@eu.citrix.com" <ian.jackson@eu.citrix.com>,
xuquan8@huawei.com, Jun Nakajima <jun.nakajima@intel.com>,
"anthony.perard@citrix.com" <anthony.perard@citrix.com>,
Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: Xen virtual IOMMU high level design doc
Date: Thu, 15 Sep 2016 22:22:36 +0800 [thread overview]
Message-ID: <eb1ef7ed-eeae-2545-b685-015c5052a43a@intel.com> (raw)
In-Reply-To: <b17f9cc8-955d-c7e7-f944-a20a77f75d64@intel.com>
Hi Andrew:
Sorry to bother you. To make sure we are heading in the right direction,
it would be good to get your feedback before we take further steps.
Could you have a look? Thanks.
On 8/17/2016 8:05 PM, Lan, Tianyu wrote:
> Hi All:
> The following is our Xen vIOMMU high level design for detailed
> discussion. Please have a look; your comments are much appreciated.
> This design doesn't cover the changes needed when the root port is
> moved into the hypervisor; we may design that later.
>
>
> Content:
> ===============================================================================
>
> 1. Motivation of vIOMMU
> 1.1 Enable more than 255 vcpus
> 1.2 Support VFIO-based user space driver
> 1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 2.1 2nd level translation overview
> 2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 3.1 New vIOMMU hypercall interface
> 3.2 2nd level translation
> 3.3 Interrupt remapping
> 3.4 1st level translation
> 3.5 Implementation consideration
> 4. Qemu
> 4.1 Qemu vIOMMU framework
> 4.2 Dummy xen-vIOMMU driver
> 4.3 Q35 vs. i440x
> 4.4 Report vIOMMU to hvmloader
>
>
> 1 Motivation for Xen vIOMMU
> ===============================================================================
>
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires support for more than 255 vcpus in a single
> VM to meet parallel computing requirements. Supporting more than 255
> vcpus requires the interrupt remapping capability of the vIOMMU to
> deliver interrupts to vcpus with APIC IDs above 255; without interrupt
> remapping, a Linux guest fails to boot with more than 255 vcpus.
>
>
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the 2nd level translation capability (IOVA->GPA) of the
> vIOMMU. The pIOMMU's 2nd level becomes a shadow structure of the
> vIOMMU to isolate DMA requests initiated by the user space driver.
>
>
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the 1st level translation table capability (GVA->GPA) of
> the vIOMMU. The pIOMMU needs to enable both 1st level and 2nd level
> translation in nested mode (GVA->GPA->HPA) for the passthrough device.
> IGD passthrough is the main usage today (to support the OpenCL 2.0 SVM
> feature). In the future SVM might be used by other I/O devices too.
>
> 2. Xen vIOMMU Architecture
> ================================================================================
>
>
> * vIOMMU will be inside the Xen hypervisor for the following reasons:
> 1) Avoid round trips between Qemu and the Xen hypervisor
> 2) Ease of integration with the rest of the hypervisor
> 3) HVMlite/PVH doesn't use Qemu
> * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercall
> to create/destroy the vIOMMU in the hypervisor and to handle virtual
> PCI devices' 2nd level translation.
>
> 2.1 2nd level translation overview
> For a virtual PCI device, the dummy xen-vIOMMU performs the translation
> in Qemu via the new hypercall.
>
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads the shadow page table
> into the physical IOMMU.
>
> The following diagram shows the 2nd level translation architecture.
> +---------------------------------------------------------+
> |Qemu +----------------+ |
> | | Virtual | |
> | | PCI device | |
> | | | |
> | +----------------+ |
> | |DMA |
> | V |
> | +--------------------+ Request +----------------+ |
> | | +<-----------+ | |
> | | Dummy xen vIOMMU | Target GPA | Memory region | |
> | | +----------->+ | |
> | +---------+----------+ +-------+--------+ |
> | | | |
> | |Hypercall | |
> +--------------------------------------------+------------+
> |Hypervisor | | |
> | | | |
> | v | |
> | +------+------+ | |
> | | vIOMMU | | |
> | +------+------+ | |
> | | | |
> | v | |
> | +------+------+ | |
> | | IOMMU driver| | |
> | +------+------+ | |
> | | | |
> +--------------------------------------------+------------+
> |HW v V |
> | +------+------+ +-------------+ |
> | | IOMMU +---------------->+ Memory | |
> | +------+------+ +-------------+ |
> | ^ |
> | | |
> | +------+------+ |
> | | PCI Device | |
> | +-------------+ |
> +---------------------------------------------------------+
>
> 2.2 Interrupt remapping overview
> Interrupts from virtual devices and physical devices are delivered
> to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU remaps interrupts
> during this procedure.
>
> +---------------------------------------------------+
> |Qemu |VM |
> | | +----------------+ |
> | | | Device driver | |
> | | +--------+-------+ |
> | | ^ |
> | +----------------+ | +--------+-------+ |
> | | Virtual device | | | IRQ subsystem | |
> | +-------+--------+ | +--------+-------+ |
> | | | ^ |
> | | | | |
> +---------------------------+-----------------------+
> |hypervisor | | VIRQ |
> | | +---------+--------+ |
> | | | vLAPIC | |
> | | +---------+--------+ |
> | | ^ |
> | | | |
> | | +---------+--------+ |
> | | | vIOMMU | |
> | | +---------+--------+ |
> | | ^ |
> | | | |
> | | +---------+--------+ |
> | | | vIOAPIC/vMSI | |
> | | +----+----+--------+ |
> | | ^ ^ |
> | +-----------------+ | |
> | | |
> +---------------------------------------------------+
> HW |IRQ
> +-------------------+
> | PCI Device |
> +-------------------+
>
>
>
>
>
> 3 Xen hypervisor
> ==========================================================================
>
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as the new hypercall
> parameter. (The member originally named "2th_level_translation" is not
> a valid C identifier, so it is renamed here.)
>
> struct xen_sysctl_viommu_op {
>     u32 cmd;
>     u32 domid;
>     union {
>         struct {
>             u32 capabilities;
>         } query_capabilities;
>         struct {
>             u32 capabilities;
>             u64 base_address;
>         } create_iommu;
>         struct {
>             u8 bus;
>             u8 devfn;
>             u64 iova;
>             u64 translated_addr;
>             u64 addr_mask;              /* Translation page size */
>             IOMMUAccessFlags permission;
>         } second_level_translation;
>     } u;
> };
>
> typedef enum {
> IOMMU_NONE = 0,
> IOMMU_RO = 1,
> IOMMU_WO = 2,
> IOMMU_RW = 3,
> } IOMMUAccessFlags;
>
>
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability 0
> #define XEN_SYSCTL_viommu_create 1
> #define XEN_SYSCTL_viommu_destroy 2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3
>
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_1st_level_translation (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping (1 << 2)
>
>
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
> Get vIOMMU capabilities (1st/2nd level translation and interrupt
> remapping).
>
> - XEN_SYSCTL_viommu_create
> Create a vIOMMU in the Xen hypervisor with the domain id, capabilities
> and register base address as parameters.
>
> - XEN_SYSCTL_viommu_destroy
> Destroy the vIOMMU in the Xen hypervisor with the domain id as
> parameter.
>
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
> Translate an IOVA to a GPA for a specified virtual PCI device, given
> the domain id, the PCI device's BDF and the IOVA; the hypervisor
> returns the translated GPA, address mask and access permission.
>
>
> 3.2 2nd level translation
> 1) For a virtual PCI device
> The dummy xen-vIOMMU in Qemu translates the IOVA to the target GPA via
> the new hypercall when a DMA operation happens.
>
> 2) For a physical PCI device
> DMA operations go through the physical IOMMU directly, so an IO page
> table for IOVA->HPA must be loaded into the physical IOMMU. When the
> guest updates the Second-level Page-table Pointer field, it provides an
> IO page table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level
> translation table, translate GPA->HPA, and write the shadow page table
> (IOVA->HPA) pointer into the Second-level Page-table Pointer of the
> physical IOMMU's context entry.
>
> Currently all PCI devices in the same HVM domain share one IO page
> table (GPA->HPA) in Xen's physical IOMMU driver. To support the
> vIOMMU's 2nd level translation, the IOMMU driver needs to support
> multiple address spaces per device entry: use the existing IO page
> table (GPA->HPA) by default, and switch to the shadow IO page table
> (IOVA->HPA) when 2nd level translation is enabled. These changes will
> not affect the current P2M logic.
>
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices are delivered to
> the vLAPIC from the vIOAPIC and vMSI. Interrupt remapping hooks need to
> be added in vmsi_deliver() and ioapic_deliver() to find the target
> vLAPIC according to the interrupt remapping table. The diagram in
> section 2.2 shows this flow.
>
>
> 3.4 1st level translation
> When nested translation is enabled, any address generated by
> first-level translation is used as the input address for nesting with
> second-level translation. The physical IOMMU needs to enable both 1st
> level and 2nd level translation in nested translation mode
> (GVA->GPA->HPA) for the passthrough device.
>
> The VT-d context entry points to the guest's 1st level translation
> table, which will be nest-translated by the 2nd level translation
> table, so it can be linked directly into the context entry of the
> physical IOMMU.
>
> To enable 1st level translation in a VM:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) The GPA root of the guest's 1st level translation table is written
> to the context entry of the physical IOMMU.
>
> All handling is in the hypervisor; no interaction with Qemu is required.
>
>
> 3.5 Implementation consideration
> The Linux Intel IOMMU driver fails to load without 2nd level
> translation support, even if interrupt remapping and 1st level
> translation are available. This means 2nd level translation must be
> enabled before the other functions.
>
>
> 4 Qemu
> ==============================================================================
>
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel
> VT-d or AMD IOMMU) and report it in the guest ACPI tables. On the Xen
> side, a dummy xen-vIOMMU wrapper is required to connect with the actual
> vIOMMU in Xen. This matters especially for 2nd level translation of
> virtual PCI devices, because virtual PCI devices are emulated in Qemu.
> Qemu's vIOMMU framework provides a callback to handle 2nd level
> translation when DMA operations of virtual PCI devices happen.
>
>
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capability(E,G DMA translation, Interrupt remapping and
> Share Virtual Memory) via hypercall.
>
> 2) Create vIOMMU in Xen hypervisor via new hypercall with DRHU register
> address and desired capability as parameters. Destroy vIOMMU when VM is
> closed.
>
> 3) Virtual PCI device's 2th level translation
> Qemu already provides DMA translation hook. It's called when DMA
> translation of virtual PCI device happens. The dummy xen-vIOMMU passes
> device bdf and IOVA into Xen hypervisor via new iommu hypercall and
> return back translated GPA.
>
>
> 4.3 Q35 vs i440x
> VT-d has been present since the Q35 chipset. The previous concern was
> that IOMMU drivers assume VT-d only exists on Q35 and newer chipsets,
> which would force us to emulate Q35 first.
>
> We consulted Linux/Windows IOMMU driver experts and learned that these
> drivers make no such assumption, so we may skip the Q35 implementation
> and emulate the vIOMMU on the i440x chipset. KVM already has vIOMMU
> support with virtual PCI device DMA translation and interrupt
> remapping. We are using KVM to experiment with adding a vIOMMU on i440x
> and testing Linux/Windows guests, and will report back when we have
> results.
>
>
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building the ACPI tables for the guest OS,
> and the OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs
> to know whether the vIOMMU is enabled and what its capabilities are, in
> order to prepare the ACPI DMAR table for the guest OS.
>
> There are three ways to do that.
> 1) Extend struct hvm_info_table, adding variables to pass vIOMMU
> information to hvmloader. But this requires a new xc interface so that
> Qemu can use struct hvm_info_table.
>
> 2) Pass vIOMMU information to hvmloader via Xenstore
>
> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design(4.3.1
> Building Guest ACPI Tables
> http://www.gossamer-threads.com/lists/xen/devel/439766).
>
> The third option seems cleanest: hvmloader doesn't need to deal with
> vIOMMU details and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling is done in the dummy xen-vIOMMU driver.
>
>
>