From: Andrew Cooper <andrew.cooper3@citrix.com>
To: "Tian, Kevin" <kevin.tian@intel.com>,
	"Lan, Tianyu" <tianyu.lan@intel.com>,
	"jbeulich@suse.com" <jbeulich@suse.com>,
	"sstabellini@kernel.org" <sstabellini@kernel.org>,
	"ian.jackson@eu.citrix.com" <ian.jackson@eu.citrix.com>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	"Dong, Eddie" <eddie.dong@intel.com>,
	"Nakajima, Jun" <jun.nakajima@intel.com>,
	"yang.zhang.wz@gmail.com" <yang.zhang.wz@gmail.com>,
	"anthony.perard@citrix.com" <anthony.perard@citrix.com>,
	Roger Pau Monne <roger.pau@citrix.com>
Subject: Re: Discussion about virtual iommu support for Xen guest
Date: Fri, 3 Jun 2016 14:51:52 +0100	[thread overview]
Message-ID: <57518B78.6060604@citrix.com> (raw)
In-Reply-To: <AADFC41AFE54684AB9EE6CBC0274A5D15F88B441@SHSMSX101.ccr.corp.intel.com>

On 03/06/16 12:17, Tian, Kevin wrote:
>> Very sorry for the delay.
>>
>> There are multiple interacting issues here.  On the one side, it would
>> be useful if we could have a central point of coordination on
>> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
>> would you mind organising that?
>>
>> For the qemu/xen interaction, the current state is woeful and a tangled
>> mess.  I wish to ensure that we don't make any development decisions
>> which make the situation worse.
>>
>> In your case, the two motivations are quite different, so I would
>> recommend dealing with them independently.
>>
>> IIRC, the issue with more than 255 cpus and interrupt remapping is that
>> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
>> can't be programmed to generate x2apic interrupts?  In principle, if you
>> don't have an IOAPIC, are there any other issues to be considered?  What
>> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
>> deliver xapic interrupts?
> The key is the APIC ID. The introduction of x2apic did not modify the
> existing PCI MSI and IOAPIC formats: they can only send interrupt
> messages containing an 8-bit APIC ID, which cannot address >255 cpus.
> Interrupt remapping supports a 32-bit APIC ID, so it is necessary for
> enabling >255 cpus in x2apic mode.

Thanks for clarifying.
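
As a concrete illustration of that 8-bit limit, here is a minimal sketch
of the xAPIC-compatible MSI address encoding described in the Intel SDM.
The names are only for this sketch (not Xen code); with interrupt
remapping enabled, the destination field below is instead replaced by a
handle into the interrupt remapping table, which is what lifts the
255-ID ceiling.

#include <stdint.h>

#define MSI_ADDR_BASE        0xfee00000u
#define MSI_ADDR_DEST_SHIFT  12
#define MSI_ADDR_DEST_MASK   0xffu   /* only 8 bits of destination APIC ID */

/* Compatibility-format MSI address: cannot name an APIC ID >= 256. */
static inline uint32_t msi_compat_addr(uint8_t dest_apic_id)
{
    return MSI_ADDR_BASE |
           ((uint32_t)(dest_apic_id & MSI_ADDR_DEST_MASK)
            << MSI_ADDR_DEST_SHIFT);
}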

>
> If the LAPICs are in x2apic mode while interrupt remapping is
> disabled, the IOAPIC cannot deliver interrupts to all cpus in the
> system if #cpus > 255.

Ok.  So not ideal (and we certainly want to address it), but this isn't
a complete show stopper for a guest.

>> On the other side of things, what is IGD passthrough going to look like
>> in Skylake?  Is there any device-model interaction required (i.e. the
>> opregion), or will it work as a completely standalone device?  What are
>> your plans with the interaction of virtual graphics and shared virtual
>> memory?
>>
> The plan is to use a so-called universal pass-through driver in the
> guest which only accesses standard PCI resources (without the
> opregion, PCH/MCH, etc.)

This is fantastic news.

>
> ----
> Here is a brief of potential usages relying on vIOMMU:
>
> a) enable >255 vcpus on Xeon Phi, the initial purpose of this thread.
> It requires the interrupt remapping capability to be present on the
> vIOMMU;
>
> b) support guest SVM (Shared Virtual Memory), which relies on the
> 1st level translation table capability (GVA->GPA) on the vIOMMU. The
> pIOMMU needs to enable both 1st level and 2nd level translation in
> nested mode (GVA->GPA->HPA) for passthrough devices. IGD passthrough
> is the main usage today (to support the OpenCL 2.0 SVM feature). In
> the future SVM might be used by other I/O devices too;
>
> c) support VFIO-based user space drivers (e.g. DPDK) in the guest,
> which rely on the 2nd level translation capability (IOVA->GPA) on the
> vIOMMU. The pIOMMU 2nd level becomes a shadow of the vIOMMU 2nd level,
> with GPA replaced by HPA (i.e. IOVA->HPA);

All of these look like interesting things to do.  I know there is a lot
of interest for b).
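
To make the shadowing in c) concrete, here is a minimal sketch of the
composition involved.  All names are hypothetical stand-ins for the real
table walks and map updates, not Xen internals.

#include <stdint.h>

typedef uint64_t iova_t, gpa_t, hpa_t;

/* Stand-ins for the real lookups/updates. */
extern gpa_t viommu_translate(iova_t iova);   /* guest vIOMMU 2nd level */
extern hpa_t p2m_lookup(gpa_t gpa);           /* Xen p2m (GPA->HPA)     */
extern void  piommu_map(iova_t iova, hpa_t hpa);

/* Re-shadow one mapping; must run whenever either the vIOMMU table or
 * the p2m changes, which is why Xen needs to see (IOVA->GPA) updates. */
static inline void shadow_update(iova_t iova)
{
    piommu_map(iova, p2m_lookup(viommu_translate(iova)));
}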

As a quick aside, does Xen currently boot on a Phi?  Last time I looked
at the Phi manual, I would expect Xen to crash on boot because of MCXSR
differences from more-common x86 hardware.

>
> ----
> And below are my thoughts on the viability of implementing vIOMMU in Qemu:
>
> a) enable >255 vcpus:
>
> 	o Enable Q35 in Qemu-Xen;
> 	o Add interrupt remapping in Qemu vIOMMU;
> 	o Virtual interrupt injection in the hypervisor needs to know the
> virtual interrupt remapping (IR) structure, since IR sits behind
> vIOAPIC/vMSI, which requires new hypervisor interfaces as Andrew
> pointed out:
> 		* either for the hypervisor to query IR from Qemu, which is
> not good;
> 		* or for Qemu to register IR info with the hypervisor, which
> means partial IR knowledge implemented in the hypervisor (then why not
> put the whole IR emulation in Xen?)
>
> b) support SVM
>
> 	o Enable Q35 in Qemu-Xen;
> 	o Add 1st level translation capability in Qemu vIOMMU;
> 	o The VT-d context entry points to the guest 1st level translation
> table, which is nest-translated through the 2nd level translation
> table, so the vIOMMU structure can be linked directly. That means:
> 		* the Xen IOMMU driver enables nested mode;
> 		* a new hypercall is introduced so the Qemu vIOMMU can
> register the GPA root of the guest 1st level translation table, which
> is then written into the context entry in the pIOMMU;
>
> c) support VFIO-based user space driver
>
> 	o Enable Q35 in Qemu-Xen;
> 	o Leverage existing 2nd level translation implementation in Qemu 
> vIOMMU;
> 	o Change the Xen IOMMU code to support (IOVA->HPA) translation,
> which means decoupling the current logic from the P2M layer (which
> only covers GPA->HPA);
> 	o As this is a shadowing approach, the Xen IOMMU driver needs to
> know both the (IOVA->GPA) and (GPA->HPA) info to update the (IOVA->HPA)
> mapping whenever either one changes. So a new interface is required
> for the Qemu vIOMMU to propagate (IOVA->GPA) info into the Xen
> hypervisor, where it may need to be further cached.
>
> ----
>
> After writing down the above details, it looks clear that putting
> vIOMMU in Qemu is not a clean design for a) and c). For b) the
> hypervisor change is not that hacky, but b) alone does not seem a
> strong enough reason to pursue the Qemu path. It seems we may have to
> go with a hypervisor-based approach...
>
> Anyway, I'll stop here. With the above background, let's see whether
> others have a better thought on how to accelerate the TTM
> (time-to-market) of those usages in Xen. Xen was once a leading
> hypervisor for many new features, but recently that has not been
> sustained. If the above usages can be enabled decoupled from the
> HVMlite/virtual_root_port effort, then we can have a staged plan to
> move faster (first for HVM, later for HVMLite). :-)

I dislike that we are in this situation, but I am glad to see that I am
not the only one who thinks that the current situation is unsustainable.

The problem is that things were hacked up in the past on the assumption
that qemu could deal with everything like this.  Later, performance
sucked sufficiently that bits of qemu were moved back up into the
hypervisor, which is why the vIOAPIC is currently located there.  The
result is a completely tangled rat's nest.


Xen has 3 common uses for qemu, which are:
1) Emulation of legacy devices
2) PCI Passthrough
3) PV backends

3) isn't really relevant here.  For 1), we are basically just using Qemu
to provide an LPC implementation (with some populated slots for
disk/network devices).

I think it would be far cleaner to re-engineer the current Xen/qemu
interaction to more closely resemble real hardware, including
considering having multiple vIOAPICs/vIOMMUs/etc. when architecturally
appropriate.  I expect the result would be a much cleaner interface to
use and extend.  I also realise that what I am suggesting isn't a simple
task, but I don't see any other viable way out.

Another issue in the mix is support for multiple device emulators, in
which case Xen is already performing first-level redirection of MMIO
requests.
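
As a rough sketch of what that first-level redirection amounts to
(hypothetical types and names, not Xen's actual ioreq code): Xen keeps
per-emulator claimed ranges and forwards each trapped access to
whichever emulator claims it, falling back to internal handling
otherwise.

#include <stddef.h>
#include <stdint.h>

struct ioreq_server;                  /* opaque handle to one emulator */

struct mmio_claim {
    uint64_t start, end;              /* inclusive guest-physical range */
    struct ioreq_server *server;
};

/* Return the emulator claiming this address, or NULL if unclaimed. */
static struct ioreq_server *select_server(const struct mmio_claim *claims,
                                          unsigned int nr, uint64_t addr)
{
    for ( unsigned int i = 0; i < nr; i++ )
        if ( addr >= claims[i].start && addr <= claims[i].end )
            return claims[i].server;

    return NULL;
}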

For HVMLite, there is specifically no qemu, and we need something which
can function when we want PCI Passthrough to work.  I am quite confident
that the correct solution here is to have a basic host bridge/root port
implementation in Xen (we already have about 80% of this), at which
point we don't need any qemu interaction for PCI Passthrough at all,
even for HVM guests.
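
To give a feel for how small the core of such a host bridge is, here is
a minimal sketch of the legacy 0xCF8/0xCFC config-space decode it would
need to intercept (MMCFG decode is analogous).  Illustrative names only,
not a proposed Xen interface.

#include <stdbool.h>
#include <stdint.h>

#define PCI_CF8_ENABLE   (1u << 31)

struct pci_bdf_reg {
    uint8_t bus, dev, fn, reg;
};

/* Decode a value written to port 0xCF8 into bus/device/function and the
 * dword-aligned config register offset accessed via port 0xCFC. */
static bool decode_cf8(uint32_t cf8, struct pci_bdf_reg *out)
{
    if ( !(cf8 & PCI_CF8_ENABLE) )
        return false;

    out->bus = (cf8 >> 16) & 0xff;
    out->dev = (cf8 >> 11) & 0x1f;
    out->fn  = (cf8 >>  8) & 0x07;
    out->reg =  cf8        & 0xfc;

    return true;
}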

From this perspective, it would make sense to have emulators map IOVAs
rather than GPAs.  We already have the mapcache_invalidate
infrastructure to flush mappings as they are changed by the guest.
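
A minimal sketch of that idea, assuming an emulator-side mapcache keyed
by IOVA rather than GPA (hypothetical structures, not qemu's actual
mapcache code): when the guest rewrites its vIOMMU tables, the existing
invalidate path would just drop the affected entries.

#include <stdbool.h>
#include <stdint.h>

struct mapcache_entry {
    uint64_t iova;       /* key: device-visible address */
    void    *hva;        /* host mapping backing that IOVA */
    bool     valid;
};

/* Called when the guest changes vIOMMU mappings covering
 * [start, start + len): stale entries must not be reused. */
static void mapcache_invalidate_iova(struct mapcache_entry *cache,
                                     unsigned int nr,
                                     uint64_t start, uint64_t len)
{
    for ( unsigned int i = 0; i < nr; i++ )
        if ( cache[i].valid &&
             cache[i].iova >= start && cache[i].iova < start + len )
            cache[i].valid = false;
}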


For the HVMLite side of things, my key concern is not to try and do any
development which we realistically expect to have to undo/change.  As
you said yourself, we are struggling to sustain, and really aren't
helping ourselves by doing lots of work, and subsequently redoing it
when it doesn't work; PVH is the most obvious recent example here.

If others agree, I think that it is well worth making some concrete
plans for improvements in this area for Xen 4.8.  I think the only
viable way forward is to try and get out of the current hole we are in.

Thoughts?  (especially Stefano/Anthony)

~Andrew

