xen-devel.lists.xenproject.org archive mirror
* Discussion about virtual iommu support for Xen guest
@ 2016-05-26  8:29 Lan Tianyu
  2016-05-26  8:42 ` Dong, Eddie
                   ` (2 more replies)
  0 siblings, 3 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-05-26  8:29 UTC (permalink / raw)
  To: jbeulich, sstabellini, ian.jackson, xen-devel, kevin.tian, Dong,
	Eddie, Nakajima, Jun, yang.zhang.wz, anthony.perard

Hi All:
We are trying to push virtual iommu support for Xen guests, and there
are some features blocked on it.

Motivation:
-----------------------
1) Add SVM (Shared Virtual Memory) support for Xen guests.
Supporting iGFX pass-through for SVM-enabled devices requires virtual
iommu support to emulate the related registers and to intercept/handle
guest SVM configuration in the VMM.

2) Increase max vcpu support for one VM.

So far, the max vcpu count for a Xen HVM guest is 128. HPC (High
Performance Computing) cloud computing requires support for more vcpus
in a single VM. The usage model is to create just one VM on a machine,
with the same number of vcpus as logical cpus on the host, and to pin
each vcpu to a logical cpu in order to get good compute performance.
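
To illustrate the usage model, such a guest could be described with an
xl config roughly like this (a sketch; the names and values are
illustrative, and exact 1:1 pinning can also be done per-vcpu with
`xl vcpu-pin`):

```
# illustrative xl config for a single large HVM guest
builder = "hvm"
name    = "hpc-guest"
vcpus   = 288           # one vcpu per host logical cpu
cpus    = "0-287"       # restrict vcpus to these pcpus
memory  = 262144
```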

Intel Xeon Phi KNL (Knights Landing) is dedicated to the HPC market
and supports 288 logical cpus, so we hope a VM can support 288 vcpus
to meet the HPC requirement.

The current Linux kernel requires IR (interrupt remapping) when the
max APIC ID is > 255, because without IR interrupts can only be
delivered to cpus 0~255. IR in a VM relies on virtual iommu support.

KVM Virtual iommu support status
------------------------
Currently, Qemu has a basic virtual iommu that does address translation
for virtual devices, and it only works with the Q35 machine type. KVM
reuses it, and Red Hat is adding IR support to allow more than 255 vcpus.
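
For reference, that Qemu vIOMMU is enabled on KVM roughly like this
(based on Qemu's `intel-iommu` device; option spelling may vary by
Qemu version):

```
# Q35 machine type plus the emulated VT-d unit with interrupt remapping
qemu-system-x86_64 -M q35 -device intel-iommu,intremap=on
```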

How to add virtual iommu for Xen?
-------------------------
The first idea that came to my mind is to reuse the Qemu virtual iommu,
but Xen doesn't support Q35 so far, and enabling Q35 for Xen doesn't
look like a short-term task. Anthony did some related work on this before.

I'd like to see your comments about how to implement virtual iommu for Xen.

1) Reuse Qemu virtual iommu or write a separate one for Xen?
2) Enable Q35 for Xen to reuse Qemu virtual iommu?

Your comments would be much appreciated. Thanks a lot.
-- 
Best regards
Tianyu Lan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26  8:29 Discussion about virtual iommu support for Xen guest Lan Tianyu
@ 2016-05-26  8:42 ` Dong, Eddie
  2016-05-27  2:26   ` Lan Tianyu
  2016-05-26 11:35 ` Andrew Cooper
  2016-05-27  2:26 ` Yang Zhang
  2 siblings, 1 reply; 86+ messages in thread
From: Dong, Eddie @ 2016-05-26  8:42 UTC (permalink / raw)
  To: Lan, Tianyu, jbeulich, sstabellini, ian.jackson, xen-devel, Tian,
	Kevin, Nakajima, Jun, yang.zhang.wz, anthony.perard

If enabling virtual Q35 solves the problem, it has an advantage: as more and more virtual IOMMU features come along (which is likely), we can reuse the KVM code for Xen.
How big is the effort for virtual Q35?

Thx Eddie


* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26  8:29 Discussion about virtual iommu support for Xen guest Lan Tianyu
  2016-05-26  8:42 ` Dong, Eddie
@ 2016-05-26 11:35 ` Andrew Cooper
  2016-05-27  8:19   ` Lan Tianyu
                     ` (2 more replies)
  2016-05-27  2:26 ` Yang Zhang
  2 siblings, 3 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-05-26 11:35 UTC (permalink / raw)
  To: Lan Tianyu, jbeulich, sstabellini, ian.jackson, xen-devel,
	kevin.tian, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard

On 26/05/16 09:29, Lan Tianyu wrote:

To be viable going forwards, any solution must work with PVH/HVMLite as
much as HVM.  This alone negates qemu as a viable option.

From a design point of view, having Xen needing to delegate to qemu to
inject an interrupt into a guest seems backwards.


A whole lot of this would be easier to reason about if/when we get a
basic root port implementation in Xen, which is necessary for HVMLite,
and which will make the interaction with qemu rather more clean.  It is
probably worth coordinating work in this area.


As for the individual issue of 288vcpu support, there are already issues
with 64vcpu guests at the moment.  While it is certainly fine to remove
the hard limit at 255 vcpus, there is a lot of other work required to
even get 128vcpu guests stable.

~Andrew


* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26  8:29 Discussion about virtual iommu support for Xen guest Lan Tianyu
  2016-05-26  8:42 ` Dong, Eddie
  2016-05-26 11:35 ` Andrew Cooper
@ 2016-05-27  2:26 ` Yang Zhang
  2016-05-27  8:13   ` Tian, Kevin
  2 siblings, 1 reply; 86+ messages in thread
From: Yang Zhang @ 2016-05-27  2:26 UTC (permalink / raw)
  To: Lan Tianyu, jbeulich, sstabellini, ian.jackson, xen-devel,
	kevin.tian, Dong, Eddie, Nakajima, Jun, anthony.perard

On 2016/5/26 16:29, Lan Tianyu wrote:
> Hi All:
> We try pushing virtual iommu support for Xen guest and there are some
> features blocked by it.
>
> Motivation:
> -----------------------
> 1) Add SVM(Shared Virtual Memory) support for Xen guest
> To support iGFX pass-through for SVM enabled devices, it requires
> virtual iommu support to emulate related registers and intercept/handle
> guest SVM configure in the VMM.

IIRC, SVM needs nested IOMMU support, not only a virtual iommu.
Correct me if I am wrong.


-- 
best regards
yang


* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26  8:42 ` Dong, Eddie
@ 2016-05-27  2:26   ` Lan Tianyu
  2016-05-27  8:11     ` Tian, Kevin
  0 siblings, 1 reply; 86+ messages in thread
From: Lan Tianyu @ 2016-05-27  2:26 UTC (permalink / raw)
  To: Dong, Eddie, jbeulich, sstabellini, ian.jackson, xen-devel, Tian,
	Kevin, Nakajima, Jun, yang.zhang.wz, anthony.perard

On 2016-05-26 16:42, Dong, Eddie wrote:
> If enabling virtual Q35 solves the problem, it has the advantage: When more and more virtual IOMMU feature comes (likely), we can reuse the KVM code for Xen.
> How big is the effort for virtual Q35?

I think most of the effort is in rebuilding all the ACPI tables for Q35
and adding Q35 support in hvmloader. My concern is about compatibility
issues with the new ACPI tables, especially with Windows guests.

-- 
Best regards
Tianyu Lan


* Re: Discussion about virtual iommu support for Xen guest
  2016-05-27  2:26   ` Lan Tianyu
@ 2016-05-27  8:11     ` Tian, Kevin
  0 siblings, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-05-27  8:11 UTC (permalink / raw)
  To: Lan, Tianyu, Dong, Eddie, jbeulich, sstabellini, ian.jackson,
	xen-devel, Nakajima, Jun, yang.zhang.wz, anthony.perard

> From: Lan, Tianyu
> Sent: Friday, May 27, 2016 10:27 AM
> 
> On 2016-05-26 16:42, Dong, Eddie wrote:
> > If enabling virtual Q35 solves the problem, it has the advantage: When more and more
> virtual IOMMU feature comes (likely), we can reuse the KVM code for Xen.
> > How big is the effort for virtual Q35?
> 
> I think the most effort are to rebuild all ACPI tables for Q35 and add
> Q35 support in the hvmloader. My concern is about new ACPI tables'
> compatibility issue. Especially with Windows guest.
> 

Another question: how tightly is this vIOMMU implementation bound to
Q35? Can it work with the old chipset too, and if so, how big is the
effort compared to the other options?

Thanks
Kevin

* Re: Discussion about virtual iommu support for Xen guest
  2016-05-27  2:26 ` Yang Zhang
@ 2016-05-27  8:13   ` Tian, Kevin
  0 siblings, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-05-27  8:13 UTC (permalink / raw)
  To: Yang Zhang, Lan, Tianyu, jbeulich, sstabellini, ian.jackson,
	xen-devel, Dong, Eddie, Nakajima, Jun, anthony.perard

> From: Yang Zhang [mailto:yang.zhang.wz@gmail.com]
> Sent: Friday, May 27, 2016 10:26 AM
> 
> On 2016/5/26 16:29, Lan Tianyu wrote:
> > 1) Add SVM(Shared Virtual Memory) support for Xen guest
> > To support iGFX pass-through for SVM enabled devices, it requires
> > virtual iommu support to emulate related registers and intercept/handle
> > guest SVM configure in the VMM.
> 
> IIRC, SVM needs the nested IOMMU support not only virtual iommu. Correct
> me if i am wrong.
> 

Nesting is needed in the physical IOMMU; you don't need to present a nesting capability in the vIOMMU.

Thanks
Kevin

* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26 11:35 ` Andrew Cooper
@ 2016-05-27  8:19   ` Lan Tianyu
  2016-06-02 15:03     ` Lan, Tianyu
  2016-08-02 15:15     ` Lan, Tianyu
  2016-05-27  8:35   ` Tian, Kevin
  2016-05-31  9:43   ` George Dunlap
  2 siblings, 2 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-05-27  8:19 UTC (permalink / raw)
  To: Andrew Cooper, jbeulich, sstabellini, ian.jackson, xen-devel,
	kevin.tian, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard

On 2016-05-26 19:35, Andrew Cooper wrote:
> On 26/05/16 09:29, Lan Tianyu wrote:
> 
> To be viable going forwards, any solution must work with PVH/HVMLite as
> much as HVM.  This alone negates qemu as a viable option.
> 
> From a design point of view, having Xen needing to delegate to qemu to
> inject an interrupt into a guest seems backwards.
>

Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu, so
the qemu virtual iommu can't work for it. We would have to implement a
virtual iommu in Xen itself, right?

> 
> A whole lot of this would be easier to reason about if/when we get a
> basic root port implementation in Xen, which is necessary for HVMLite,
> and which will make the interaction with qemu rather more clean.  It is
> probably worth coordinating work in this area.

The virtual iommu should also sit under the basic root port in Xen, right?

> 
> As for the individual issue of 288vcpu support, there are already issues
> with 64vcpu guests at the moment. While it is certainly fine to remove
> the hard limit at 255 vcpus, there is a lot of other work required to
> even get 128vcpu guests stable.


Could you give some pointers to these issues? We are enabling support
for more vcpus, and we can basically boot 255 vcpus without IR support.
It would be very helpful to learn about the known issues.

We will also add more 128-vcpu tests to our regular testing to find
related bugs. Increasing the max vcpu count to 255 should be a good start.
-- 
Best regards
Tianyu Lan


* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26 11:35 ` Andrew Cooper
  2016-05-27  8:19   ` Lan Tianyu
@ 2016-05-27  8:35   ` Tian, Kevin
  2016-05-27  8:46     ` Paul Durrant
  2016-05-31  9:43   ` George Dunlap
  2 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-05-27  8:35 UTC (permalink / raw)
  To: Andrew Cooper, Lan, Tianyu, jbeulich, sstabellini, ian.jackson,
	xen-devel, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Thursday, May 26, 2016 7:36 PM
> 
> To be viable going forwards, any solution must work with PVH/HVMLite as
> much as HVM.  This alone negates qemu as a viable option.

KVM wants things done in Qemu as much as possible, while Xen may now
have more things moved into the hypervisor instead for HVMLite. The end
result is that many new platform features from IHVs will require double
the effort in the future (nvdimm is another example), which means a much
longer enabling path to bring those new features to customers.

I can understand the importance of covering HVMLite in the Xen
community, but is it really the only factor negating the Qemu option?

> 
> From a design point of view, having Xen needing to delegate to qemu to
> inject an interrupt into a guest seems backwards.
> 
> 
> A whole lot of this would be easier to reason about if/when we get a
> basic root port implementation in Xen, which is necessary for HVMLite,
> and which will make the interaction with qemu rather more clean.  It is
> probably worth coordinating work in this area.

Wouldn't it make Xen too complex? Qemu also has its own root port
implementation, so you would need some tricks within Qemu to not use
its own root port and instead register with the Xen root port. Why is
such a move cleaner?

Thanks
Kevin

* Re: Discussion about virtual iommu support for Xen guest
  2016-05-27  8:35   ` Tian, Kevin
@ 2016-05-27  8:46     ` Paul Durrant
  2016-05-27  9:39       ` Tian, Kevin
  0 siblings, 1 reply; 86+ messages in thread
From: Paul Durrant @ 2016-05-27  8:46 UTC (permalink / raw)
  To: Kevin Tian, Andrew Cooper, Lan, Tianyu, jbeulich, sstabellini,
	Ian Jackson, xen-devel, Eddie Dong, Nakajima, Jun, yang.zhang.wz,
	Anthony Perard

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Tian, Kevin
> Sent: 27 May 2016 09:35
> To: Andrew Cooper; Lan, Tianyu; jbeulich@suse.com; sstabellini@kernel.org;
> Ian Jackson; xen-devel@lists.xensource.com; Eddie Dong; Nakajima, Jun;
> yang.zhang.wz@gmail.com; Anthony Perard
> Subject: Re: [Xen-devel] Discussion about virtual iommu support for Xen
> guest
> 
> > From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> > Sent: Thursday, May 26, 2016 7:36 PM
> >
> > To be viable going forwards, any solution must work with PVH/HVMLite as
> > much as HVM.  This alone negates qemu as a viable option.
> 
> KVM wants things done in Qemu as much as possible. Now Xen may
> have more things moved into hypervisor instead for HVMLite. The end
> result is that many new platform features from IHVs will require
> double effort in the future (nvdimm is another example) which means
> much longer enabling path to bring those new features to customers.
> 
> I can understand the importance of covering HVMLite in Xen community,
> but is it really the only factor to negate Qemu option?
> 
> >
> > From a design point of view, having Xen needing to delegate to qemu to
> > inject an interrupt into a guest seems backwards.
> >
> >
> > A whole lot of this would be easier to reason about if/when we get a
> > basic root port implementation in Xen, which is necessary for HVMLite,
> > and which will make the interaction with qemu rather more clean.  It is
> > probably worth coordinating work in this area.
> 
> Would it make Xen too complex? Qemu also has its own root port
> implementation, and then you need some tricks within Qemu to not
> use its own root port but instead registering to Xen root port. Why is
> such movement more clean?
> 

Upstream QEMU already registers PCI BDFs with Xen, and Xen already handles cf8 and cfc accesses (to turn them into single config space read/write ioreqs). So, it really isn't much of a leap to put the root port implementation in Xen.

  Paul


* Re: Discussion about virtual iommu support for Xen guest
  2016-05-27  8:46     ` Paul Durrant
@ 2016-05-27  9:39       ` Tian, Kevin
  0 siblings, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-05-27  9:39 UTC (permalink / raw)
  To: Paul Durrant, Andrew Cooper, Lan, Tianyu, jbeulich, sstabellini,
	Ian Jackson, xen-devel, Dong, Eddie, Nakajima, Jun,
	yang.zhang.wz, Anthony Perard

> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Friday, May 27, 2016 4:47 PM
> > >
> > > A whole lot of this would be easier to reason about if/when we get a
> > > basic root port implementation in Xen, which is necessary for HVMLite,
> > > and which will make the interaction with qemu rather more clean.  It is
> > > probably worth coordinating work in this area.
> >
> > Would it make Xen too complex? Qemu also has its own root port
> > implementation, and then you need some tricks within Qemu to not
> > use its own root port but instead registering to Xen root port. Why is
> > such movement more clean?
> >
> 
> Upstream QEMU already registers PCI BDFs with Xen, and Xen already handles cf8 and cfc
> accesses (to turn them into single config space read/write ioreqs). So, it really isn't much
> of a leap to put the root port implementation in Xen.
> 
>   Paul
> 

Thanks for the information; I didn't realize that.

Curious: is anyone already working on basic root port support in Xen?
If so, what's the current progress?

Thanks
Kevin

* Re: Discussion about virtual iommu support for Xen guest
  2016-05-26 11:35 ` Andrew Cooper
  2016-05-27  8:19   ` Lan Tianyu
  2016-05-27  8:35   ` Tian, Kevin
@ 2016-05-31  9:43   ` George Dunlap
  2 siblings, 0 replies; 86+ messages in thread
From: George Dunlap @ 2016-05-31  9:43 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan Tianyu, yang.zhang.wz, Tian, Kevin, Stefano Stabellini,
	Nakajima, Jun, Dong, Eddie, Ian Jackson, xen-devel, Jan Beulich,
	Anthony Perard

On Thu, May 26, 2016 at 12:35 PM, Andrew Cooper
<andrew.cooper3@citrix.com> wrote:
> To be viable going forwards, any solution must work with PVH/HVMLite as
> much as HVM.  This alone negates qemu as a viable option.

There's a big difference between "suboptimal" and "not viable".
Obviously it would be nice to be able to have HVMLite do graphics
pass-through, but if this functionality ends up being HVM-only, is
that really such a huge issue?

If, as Paul seems to indicate, the extra work to get the functionality
into Xen isn't very large, then it's worth pursuing; but I don't think
we should take other options off the table.

 -George

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Discussion about virtual iommu support for Xen guest
  2016-05-27  8:19   ` Lan Tianyu
@ 2016-06-02 15:03     ` Lan, Tianyu
  2016-06-02 18:58       ` Andrew Cooper
  2016-08-02 15:15     ` Lan, Tianyu
  1 sibling, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-06-02 15:03 UTC (permalink / raw)
  To: Andrew Cooper, jbeulich, sstabellini, ian.jackson, xen-devel,
	kevin.tian, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard

On 5/27/2016 4:19 PM, Lan Tianyu wrote:
> On 2016年05月26日 19:35, Andrew Cooper wrote:
>> On 26/05/16 09:29, Lan Tianyu wrote:
>>
>> To be viable going forwards, any solution must work with PVH/HVMLite as
>> much as HVM.  This alone negates qemu as a viable option.
>>
>> From a design point of view, having Xen needing to delegate to qemu to
>> inject an interrupt into a guest seems backwards.
>>
>
> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu and
> the qemu virtual iommu can't work for it. We have to rewrite virtual
> iommu in the Xen, right?
>
>>
>> A whole lot of this would be easier to reason about if/when we get a
>> basic root port implementation in Xen, which is necessary for HVMLite,
>> and which will make the interaction with qemu rather more clean.  It is
>> probably worth coordinating work in this area.
>
> The virtual iommu also should be under basic root port in Xen, right?
>
>>
>> As for the individual issue of 288vcpu support, there are already issues
>> with 64vcpu guests at the moment. While it is certainly fine to remove
>> the hard limit at 255 vcpus, there is a lot of other work required to
>> even get 128vcpu guests stable.
>
>
> Could you give some points to these issues? We are enabling more vcpus
> support and it can boot up 255 vcpus without IR support basically. It's
> very helpful to learn about known issues.
>
> We will also add more tests for 128 vcpus into our regular test to find
> related bugs. Increasing max vcpu to 255 should be a good start.

Hi Andrew:
Could you give more input on the issues with 64 vcpus and what needs to
be done to make 128-vcpu guests stable? We hope to do something to
improve them.

What's the progress of the PCI host bridge work in Xen? In your opinion,
should we do that first? Thanks.


>>
>> ~Andrew

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-02 15:03     ` Lan, Tianyu
@ 2016-06-02 18:58       ` Andrew Cooper
  2016-06-03 11:01         ` Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest) Roger Pau Monne
  2016-06-03 11:17         ` Discussion about virtual iommu support for Xen guest Tian, Kevin
  0 siblings, 2 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-06-02 18:58 UTC (permalink / raw)
  To: Lan, Tianyu, jbeulich, sstabellini, ian.jackson, xen-devel,
	kevin.tian, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard, Roger Pau Monne

On 02/06/16 16:03, Lan, Tianyu wrote:
> On 5/27/2016 4:19 PM, Lan Tianyu wrote:
>> [...]
>
> Hi Andrew:
> Could you give more inputs about issues with 64 vcpus and what needs to
> be done to make 128vcpu guest stable? We hope to do somethings to
> improve them.
>
> What's progress of PCI host bridge in Xen? From your opinion, we should
> do that first, right? Thanks.

Very sorry for the delay.

There are multiple interacting issues here.  On the one side, it would
be useful if we could have a central point of coordination on
PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
would you mind organising that?

For the qemu/xen interaction, the current state is woeful and a tangled
mess.  I wish to ensure that we don't make any development decisions
which makes the situation worse.

In your case, the two motivations are quite different, so I would
recommend dealing with them independently.

IIRC, the issue with more than 255 cpus and interrupt remapping is that
you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
can't be programmed to generate x2apic interrupts?  In principle, if you
don't have an IOAPIC, are there any other issues to be considered?  What
happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
deliver xapic interrupts?

On the other side of things, what is IGD passthrough going to look like
in Skylake?  Is there any device-model interaction required (i.e. the
opregion), or will it work as a completely standalone device?  What are
your plans with the interaction of virtual graphics and shared virtual
memory?

~Andrew

* Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest)
  2016-06-02 18:58       ` Andrew Cooper
@ 2016-06-03 11:01         ` Roger Pau Monne
  2016-06-03 11:21           ` Tian, Kevin
  2016-06-03 11:17         ` Discussion about virtual iommu support for Xen guest Tian, Kevin
  1 sibling, 1 reply; 86+ messages in thread
From: Roger Pau Monne @ 2016-06-03 11:01 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan, Tianyu, yang.zhang.wz, kevin.tian, sstabellini, Nakajima,
	Jun, Dong, Eddie, ian.jackson, xen-devel, jbeulich,
	anthony.perard, boris.ostrovsky

On Thu, Jun 02, 2016 at 07:58:48PM +0100, Andrew Cooper wrote:
> On 02/06/16 16:03, Lan, Tianyu wrote:
> > On 5/27/2016 4:19 PM, Lan Tianyu wrote:
> >> On 2016年05月26日 19:35, Andrew Cooper wrote:
> >>> On 26/05/16 09:29, Lan Tianyu wrote:
> >>>
> >>> To be viable going forwards, any solution must work with PVH/HVMLite as
> >>> much as HVM.  This alone negates qemu as a viable option.
> >>>
> >>> From a design point of view, having Xen needing to delegate to qemu to
> >>> inject an interrupt into a guest seems backwards.
> >>>
> >>
> >> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu and
> >> the qemu virtual iommu can't work for it. We have to rewrite virtual
> >> iommu in the Xen, right?
> >>
> >>>
> >>> A whole lot of this would be easier to reason about if/when we get a
> >>> basic root port implementation in Xen, which is necessary for HVMLite,
> >>> and which will make the interaction with qemu rather more clean.  It is
> >>> probably worth coordinating work in this area.
> >>
> >> The virtual iommu also should be under basic root port in Xen, right?
> >>
[...]
> > What's progress of PCI host bridge in Xen? From your opinion, we should
> > do that first, right? Thanks.
> 
> Very sorry for the delay.
> 
> There are multiple interacting issues here.  On the one side, it would
> be useful if we could have a central point of coordination on
> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> would you mind organising that?

Sure. Adding Boris and Konrad.

AFAIK, the current status is that Boris posted an RFC to provide some basic 
ACPI tables to PVH/HVMlite guests, and I'm currently working on rebasing my 
half-baked HVMlite Dom0 series on top of that. Neither of those projects 
requires the presence of an emulated PCI root complex inside of Xen, so 
there's nobody working on it ATM that I'm aware of.

Speaking about the PVH/HVMlite roadmap, after those two items are done we 
had plans to work on having full PCI root complex emulation inside of Xen, 
so that we could do passthrough of PCI devices to PVH/HVMlite guests without 
QEMU (and of course without pcifront inside of the guest). I don't foresee 
any of us working on it for at least the next 6 months, so I think there's a 
good chance that this can be done in parallel to the work that Boris and I 
are doing, without any clashes. Is anyone at Intel interested in picking 
this up?

Roger.

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-02 18:58       ` Andrew Cooper
  2016-06-03 11:01         ` Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest) Roger Pau Monne
@ 2016-06-03 11:17         ` Tian, Kevin
  2016-06-03 13:09           ` Lan, Tianyu
  2016-06-03 13:51           ` Andrew Cooper
  1 sibling, 2 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-06-03 11:17 UTC (permalink / raw)
  To: Andrew Cooper, Lan, Tianyu, jbeulich, sstabellini, ian.jackson,
	xen-devel, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard, Roger Pau Monne

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Friday, June 03, 2016 2:59 AM
> 
> On 02/06/16 16:03, Lan, Tianyu wrote:
> > On 5/27/2016 4:19 PM, Lan Tianyu wrote:
> >> [...]
> >
> > Hi Andrew:
> > Could you give more inputs about issues with 64 vcpus and what needs to
> > be done to make 128vcpu guest stable? We hope to do somethings to
> > improve them.
> >
> > What's progress of PCI host bridge in Xen? From your opinion, we should
> > do that first, right? Thanks.
> 
> Very sorry for the delay.
> 
> There are multiple interacting issues here.  On the one side, it would
> be useful if we could have a central point of coordination on
> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> would you mind organising that?
> 
> For the qemu/xen interaction, the current state is woeful and a tangled
> mess.  I wish to ensure that we don't make any development decisions
> which makes the situation worse.
> 
> In your case, the two motivations are quite different I would recommend
> dealing with them independently.
> 
> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> can't be programmed to generate x2apic interrupts?  In principle, if you
> don't have an IOAPIC, are there any other issues to be considered?  What
> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> deliver xapic interrupts?

The key is the APIC ID. There was no modification to the existing PCI MSI
and IOAPIC formats with the introduction of x2apic: PCI MSI/IOAPIC can only
send an interrupt message containing an 8-bit APIC ID, which cannot address
>255 cpus. Interrupt remapping supports a 32-bit APIC ID, so it is
necessary for enabling >255 cpus with x2apic mode.

If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
IOAPIC cannot deliver interrupts to all cpus in the system if #cpu > 255.
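The 8-bit limit can be illustrated with a toy sketch (not Xen code; the helper names are hypothetical, but the bit layout follows the compatibility-format MSI address and the remappable-format IRTE destination field):

```python
def make_msi_addr(apic_id: int) -> int:
    # Compatibility-format MSI address: 0xFEE00000 base, with the
    # destination APIC ID in bits 19:12 -- only 8 bits are available,
    # so IDs above 255 are silently truncated.
    return 0xFEE00000 | ((apic_id & 0xFF) << 12)

def msi_compat_dest(msi_addr: int) -> int:
    # Destination APIC ID as decoded from the address.
    return (msi_addr >> 12) & 0xFF

def irte_dest(dest_field: int) -> int:
    # With interrupt remapping, the IRTE carries a full 32-bit
    # destination ID, enough for x2APIC IDs above 255.
    return dest_field & 0xFFFFFFFF

assert msi_compat_dest(make_msi_addr(100)) == 100   # fits in 8 bits
assert msi_compat_dest(make_msi_addr(287)) == 31    # 287 & 0xFF: wrong cpu!
assert irte_dest(287) == 287                        # IR can address it
```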

> 
> On the other side of things, what is IGD passthrough going to look like
> in Skylake?  Is there any device-model interaction required (i.e. the
> opregion), or will it work as a completely standalone device?  What are
> your plans with the interaction of virtual graphics and shared virtual
> memory?
> 

The plan is to use a so-called universal pass-through driver in the guest
which only accesses standard PCI resources (w/o opregion, PCH/MCH, etc.).

----
Here is a brief of potential usages relying on vIOMMU:

a) enable >255 vcpus on Xeon Phi, the initial purpose of this thread.
It requires the interrupt remapping capability to be present on the vIOMMU;

b) support guest SVM (Shared Virtual Memory), which relies on the
1st level translation table capability (GVA->GPA) on vIOMMU. pIOMMU
needs to enable both 1st level and 2nd level translation in nested
mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is
the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too;

c) support VFIO-based user space driver (e.g. DPDK) in the guest,
which relies on the 2nd level translation capability (IOVA->GPA) on 
vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
vIOMMU 2nd level by replacing GPA with HPA (becomes IOVA->HPA);
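As a side note, the translation chains for b) and c) can be sketched as a toy, page-granular model (hypothetical names, not actual Xen structures):

```python
gva_to_gpa  = {0x1000: 0x8000}    # guest 1st-level table (SVM, usage b)
iova_to_gpa = {0x2000: 0x8000}    # guest 2nd-level table (VFIO, usage c)
gpa_to_hpa  = {0x8000: 0x40000}   # P2M / pIOMMU 2nd-level

def svm_translate(gva: int) -> int:
    # b) nested mode: GVA -> GPA via the guest table, then GPA -> HPA.
    return gpa_to_hpa[gva_to_gpa[gva]]

def dma_translate(iova: int) -> int:
    # c) DMA translation: IOVA -> GPA via the guest table, then GPA -> HPA.
    return gpa_to_hpa[iova_to_gpa[iova]]

assert svm_translate(0x1000) == 0x40000
assert dma_translate(0x2000) == 0x40000
```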

----
And below are my thoughts on the viability of implementing vIOMMU in Qemu:

a) enable >255 vcpus:

	o Enable Q35 in Qemu-Xen;
	o Add interrupt remapping in Qemu vIOMMU;
	o Virtual interrupt injection in the hypervisor needs to know the
virtual interrupt remapping (IR) structure, since IR sits behind
vIOAPIC/vMSI; this requires new hypervisor interfaces, as Andrew pointed
out:
		* either for the hypervisor to query the IR from Qemu, which
is not good;
		* or for Qemu to register the IR info with the hypervisor,
which means partial IR knowledge implemented in the hypervisor (then why
not put the whole IR emulation in Xen?)

b) support SVM

	o Enable Q35 in Qemu-Xen;
	o Add 1st level translation capability in Qemu vIOMMU;
	o The VT-d context entry points to the guest 1st level translation
table, which is nest-translated via the 2nd level translation table, so
the vIOMMU structure can be directly linked. This means:
		* the Xen IOMMU driver enables nested mode;
		* a new hypercall is introduced so the Qemu vIOMMU can
register the GPA root of the guest 1st level translation table, which is
then written to the context entry in the pIOMMU;

c) support VFIO-based user space driver

	o Enable Q35 in Qemu-Xen;
	o Leverage existing 2nd level translation implementation in Qemu 
vIOMMU;
	o Change the Xen IOMMU to support (IOVA->HPA) translation, which
means decoupling the current logic from the P2M layer (which only handles
GPA->HPA);
	o As a shadowing approach, the Xen IOMMU driver needs to know both
the (IOVA->GPA) and (GPA->HPA) info to update the (IOVA->HPA) mapping in
case either one changes. So a new interface is required for the Qemu
vIOMMU to propagate (IOVA->GPA) info into the Xen hypervisor, where it
may need to be further cached.
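The shadowing described above amounts to composing the two mappings, and redoing the composition whenever either side changes; a minimal sketch (hypothetical helper, page-granular toy model):

```python
def shadow_iova_to_hpa(iova_to_gpa: dict, gpa_to_hpa: dict) -> dict:
    # Build the shadow (IOVA -> HPA) table the pIOMMU actually walks by
    # substituting each GPA with its HPA; entries whose GPA is not (yet)
    # mapped are left out. This must be re-run, or incrementally
    # updated, whenever either input mapping changes.
    return {iova: gpa_to_hpa[gpa]
            for iova, gpa in iova_to_gpa.items()
            if gpa in gpa_to_hpa}

iova_to_gpa = {0x2000: 0x8000}
gpa_to_hpa = {0x8000: 0x40000}
assert shadow_iova_to_hpa(iova_to_gpa, gpa_to_hpa) == {0x2000: 0x40000}

# If the (GPA -> HPA) side changes (e.g. a page is migrated), the
# shadow must be rebuilt to stay consistent:
gpa_to_hpa[0x8000] = 0x50000
assert shadow_iova_to_hpa(iova_to_gpa, gpa_to_hpa) == {0x2000: 0x50000}
```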

----

After writing down the above detail, it looks clear that putting vIOMMU
in Qemu is not a clean design for a) and c). For b) the hypervisor
change is not that hacky, but b) alone seems too weak a reason to pursue
the Qemu path. It seems we may have to go with a hypervisor-based
approach...

Anyway, I'll stop here. With the above background, let's see whether
others have a better thought on how to accelerate the time-to-market of
those usages in Xen. Xen was once a leading hypervisor for many new
features, but recently it has not been sustaining that lead. If the above
usages can be enabled decoupled from the HVMlite/virtual_root_port
effort, then we can have a staged plan to move faster (first for HVM,
later for HVMLite). :-)

Thanks
Kevin

* Re: Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest)
  2016-06-03 11:01         ` Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest) Roger Pau Monne
@ 2016-06-03 11:21           ` Tian, Kevin
  2016-06-03 11:52             ` Roger Pau Monne
  0 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-06-03 11:21 UTC (permalink / raw)
  To: Roger Pau Monne, Andrew Cooper
  Cc: Lan, Tianyu, yang.zhang.wz, sstabellini, Nakajima, Jun,
	ian.jackson, Dong, Eddie, xen-devel, jbeulich, anthony.perard,
	boris.ostrovsky

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: Friday, June 03, 2016 7:02 PM
> 
> On Thu, Jun 02, 2016 at 07:58:48PM +0100, Andrew Cooper wrote:
> [...]
> > > What's progress of PCI host bridge in Xen? From your opinion, we should
> > > do that first, right? Thanks.
> >
> > Very sorry for the delay.
> >
> > There are multiple interacting issues here.  On the one side, it would
> > be useful if we could have a central point of coordination on
> > PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> > would you mind organising that?
> 
> Sure. Adding Boris and Konrad.
> 
> AFAIK, the current status is that Boris posted a RFC to provide some basic
> ACPI tables to PVH/HVMlite guests, and I'm currently working on rebasing my
> half-backed HVMlite Dom0 series on top of that. None of those two projects
> require the presence of an emulated PCI root complex inside of Xen, so
> there's nobody working on it ATM that I'm aware of.
> 
> Speaking about the PVH/HVMlite roadmap, after those two items are done we
> had plans to work on having full PCI root complex emulation inside of Xen,
> so that we could do passthrough of PCI devices to PVH/HVMlite guests without
> QEMU (and of course without pcifront inside of the guest). I don't foresee
> any of us working on it for at least the next 6 months, so I think there's a
> good chance that this can be done in parallel to the work that Boris and I
> are doing, without any clashes. Is anyone at Intel interested in picking
> this up?

How stable is HVMLite today? Is it already in production use?

I wonder whether you have thought in any detail about how full PCI root
complex emulation will be done in Xen (including how it will interact
with Qemu)...

As I just wrote in another mail, if we aim for HVM first, will it work if
we implement vIOMMU in Xen but still rely on the Qemu root complex to
report it to the guest?

Thanks
Kevin

* Re: Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest)
  2016-06-03 11:21           ` Tian, Kevin
@ 2016-06-03 11:52             ` Roger Pau Monne
  2016-06-03 12:11               ` Tian, Kevin
  0 siblings, 1 reply; 86+ messages in thread
From: Roger Pau Monne @ 2016-06-03 11:52 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Lan, Tianyu, yang.zhang.wz, sstabellini, Dong, Eddie, Nakajima,
	Jun, Andrew Cooper, ian.jackson, xen-devel, jbeulich,
	anthony.perard, boris.ostrovsky

On Fri, Jun 03, 2016 at 11:21:20AM +0000, Tian, Kevin wrote:
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: Friday, June 03, 2016 7:02 PM
> > 
> > On Thu, Jun 02, 2016 at 07:58:48PM +0100, Andrew Cooper wrote:
> > [...]
> > > > What's progress of PCI host bridge in Xen? From your opinion, we should
> > > > do that first, right? Thanks.
> > >
> > > Very sorry for the delay.
> > >
> > > There are multiple interacting issues here.  On the one side, it would
> > > be useful if we could have a central point of coordination on
> > > PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> > > would you mind organising that?
> > 
> > Sure. Adding Boris and Konrad.
> > 
> > AFAIK, the current status is that Boris posted a RFC to provide some basic
> > ACPI tables to PVH/HVMlite guests, and I'm currently working on rebasing my
> > half-backed HVMlite Dom0 series on top of that. None of those two projects
> > require the presence of an emulated PCI root complex inside of Xen, so
> > there's nobody working on it ATM that I'm aware of.
> > 
> > Speaking about the PVH/HVMlite roadmap, after those two items are done we
> > had plans to work on having full PCI root complex emulation inside of Xen,
> > so that we could do passthrough of PCI devices to PVH/HVMlite guests without
> > QEMU (and of course without pcifront inside of the guest). I don't foresee
> > any of us working on it for at least the next 6 months, so I think there's a
> > good chance that this can be done in parallel to the work that Boris and I
> > are doing, without any clashes. Is anyone at Intel interested in picking
> > this up?
> 
> How stable is the HVMLite today? Is it already in production usage?
> 
> Wonder whether you have some detail thought how full PCI root complex
> emulation will be done in Xen (including how to interact with Qemu)...

I haven't looked into much detail regarding all this since, as I said, it's 
still a little bit far away in the PVH/HVMlite roadmap; we have more 
pressing issues to solve before getting to the point of implementing 
PCI passthrough. I expect Xen is going to intercept all PCI accesses and 
then forward them to the ioreq servers that have been registered for that 
specific config space, but this of course needs much more thought and a 
proper design document.

> As I just wrote in another mail, if we just hit for HVM first, will it work if
> we implement vIOMMU in Xen but still relies on Qemu root complex to
> report to the guest? 

This seems quite inefficient IMHO (but I don't know that much about all this 
vIOMMU stuff). If you implement vIOMMU inside of Xen, but the PCI root 
complex is inside of Qemu aren't you going to perform quite a lot of jumps 
between Xen and QEMU just to access the vIOMMU?

I expect something like:

Xen traps PCI access -> QEMU -> Xen vIOMMU implementation

Roger.

* Re: Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest)
  2016-06-03 11:52             ` Roger Pau Monne
@ 2016-06-03 12:11               ` Tian, Kevin
  2016-06-03 16:56                 ` Stefano Stabellini
  0 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-06-03 12:11 UTC (permalink / raw)
  To: Roger Pau Monne
  Cc: Lan, Tianyu, yang.zhang.wz, sstabellini, Dong, Eddie, Nakajima,
	Jun, Andrew Cooper, ian.jackson, xen-devel, jbeulich,
	anthony.perard, boris.ostrovsky

> From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> Sent: Friday, June 03, 2016 7:53 PM
> 
> On Fri, Jun 03, 2016 at 11:21:20AM +0000, Tian, Kevin wrote:
> > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > Sent: Friday, June 03, 2016 7:02 PM
> > >
> > > On Thu, Jun 02, 2016 at 07:58:48PM +0100, Andrew Cooper wrote:
> > > [...]
> > > > > What's progress of PCI host bridge in Xen? From your opinion, we should
> > > > > do that first, right? Thanks.
> > > >
> > > > Very sorry for the delay.
> > > >
> > > > There are multiple interacting issues here.  On the one side, it would
> > > > be useful if we could have a central point of coordination on
> > > > PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> > > > would you mind organising that?
> > >
> > > Sure. Adding Boris and Konrad.
> > >
> > > AFAIK, the current status is that Boris posted a RFC to provide some basic
> > > ACPI tables to PVH/HVMlite guests, and I'm currently working on rebasing my
> > > half-backed HVMlite Dom0 series on top of that. None of those two projects
> > > require the presence of an emulated PCI root complex inside of Xen, so
> > > there's nobody working on it ATM that I'm aware of.
> > >
> > > Speaking about the PVH/HVMlite roadmap, after those two items are done we
> > > had plans to work on having full PCI root complex emulation inside of Xen,
> > > so that we could do passthrough of PCI devices to PVH/HVMlite guests without
> > > QEMU (and of course without pcifront inside of the guest). I don't foresee
> > > any of us working on it for at least the next 6 months, so I think there's a
> > > good chance that this can be done in parallel to the work that Boris and I
> > > are doing, without any clashes. Is anyone at Intel interested in picking
> > > this up?
> >
> > How stable is the HVMLite today? Is it already in production usage?
> >
> > Wonder whether you have some detail thought how full PCI root complex
> > emulation will be done in Xen (including how to interact with Qemu)...
> 
> I haven't looked into much detail regarding all this, since as I said, it's
> still a little bit far away in the PVH/HVMlite roadmap, we have more
> pressing issues to solve before getting to the point of implementing
> PCI-passthrough. I expect Xen is going to intercept all PCI accesses and is
> then going to forward them to the ioreq servers that have been registered
> for that specific config space, but this of course needs much more thought
> and a proper design document.
> 
> > As I just wrote in another mail, if we just hit for HVM first, will it work if
> > we implement vIOMMU in Xen but still relies on Qemu root complex to
> > report to the guest?
> 
> This seems quite inefficient IMHO (but I don't know that much about all this
> vIOMMU stuff). If you implement vIOMMU inside of Xen, but the PCI root
> complex is inside of Qemu aren't you going to perform quite a lot of jumps
> between Xen and QEMU just to access the vIOMMU?
> 
> I expect something like:
> 
> Xen traps PCI access -> QEMU -> Xen vIOMMU implementation
> 

I hope the role of Qemu is just to report vIOMMU-related information,
such as the DMAR table, so the guest can enumerate the presence of the
vIOMMU, while the actual emulation is done by the vIOMMU in the
hypervisor w/o going through Qemu.

However, I just realized that even for the above purpose there's still
some interaction required between Qemu and the Xen vIOMMU: e.g. the
register base of the vIOMMU and the devices behind it are reported
through the ACPI DRHD structure, which means the Xen vIOMMU needs to know
the configuration in Qemu, and defining such interfaces between Qemu and
the hypervisor might be dirty. :/

Thanks
Kevin

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 11:17         ` Discussion about virtual iommu support for Xen guest Tian, Kevin
@ 2016-06-03 13:09           ` Lan, Tianyu
  2016-06-03 14:00             ` Andrew Cooper
  2016-06-03 13:51           ` Andrew Cooper
  1 sibling, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-06-03 13:09 UTC (permalink / raw)
  To: Tian, Kevin, Andrew Cooper, jbeulich, sstabellini, ian.jackson,
	xen-devel, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard, Roger Pau Monne



On 6/3/2016 7:17 PM, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> Sent: Friday, June 03, 2016 2:59 AM
>>
>> On 02/06/16 16:03, Lan, Tianyu wrote:
>>> On 5/27/2016 4:19 PM, Lan Tianyu wrote:
>>>> On 2016年05月26日 19:35, Andrew Cooper wrote:
>>>>> On 26/05/16 09:29, Lan Tianyu wrote:
>>>>>
>>>>> To be viable going forwards, any solution must work with PVH/HVMLite as
>>>>> much as HVM.  This alone negates qemu as a viable option.
>>>>>
>>>>> From a design point of view, having Xen needing to delegate to qemu to
>>>>> inject an interrupt into a guest seems backwards.
>>>>>
>>>>
>>>> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu, so
>>>> the qemu virtual iommu can't work for it. We would have to rewrite the
>>>> virtual iommu in Xen, right?
>>>>
>>>>>
>>>>> A whole lot of this would be easier to reason about if/when we get a
>>>>> basic root port implementation in Xen, which is necessary for HVMLite,
>>>>> and which will make the interaction with qemu rather more clean.  It is
>>>>> probably worth coordinating work in this area.
>>>>
>>>> The virtual iommu also should be under basic root port in Xen, right?
>>>>
>>>>>
>>>>> As for the individual issue of 288vcpu support, there are already
>>>>> issues
>>>>> with 64vcpu guests at the moment. While it is certainly fine to remove
>>>>> the hard limit at 255 vcpus, there is a lot of other work required to
>>>>> even get 128vcpu guests stable.
>>>>
>>>>
>>>> Could you give some pointers to these issues? We are enabling support
>>>> for more vcpus, and basically it can boot up 255 vcpus without IR. It
>>>> would be very helpful to learn about the known issues.
>>>>
>>>> We will also add more tests for 128 vcpus into our regular testing to
>>>> find related bugs. Increasing the max vcpu count to 255 should be a
>>>> good start.
>>>
>>> Hi Andrew:
>>> Could you give more input about the issues with 64 vcpus and what needs
>>> to be done to make 128vcpu guests stable? We hope to do something to
>>> improve them.
>>>
>>> What's the progress of the PCI host bridge in Xen? In your opinion, we
>>> should do that first, right? Thanks.
>>
>> Very sorry for the delay.
>>
>> There are multiple interacting issues here.  On the one side, it would
>> be useful if we could have a central point of coordination on
>> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
>> would you mind organising that?
>>
>> For the qemu/xen interaction, the current state is woeful and a tangled
>> mess.  I wish to ensure that we don't make any development decisions
>> which make the situation worse.
>>
>> In your case, the two motivations are quite different; I would recommend
>> dealing with them independently.
>>
>> IIRC, the issue with more than 255 cpus and interrupt remapping is that
>> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
>> can't be programmed to generate x2apic interrupts?  In principle, if you
>> don't have an IOAPIC, are there any other issues to be considered?  What
>> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
>> deliver xapic interrupts?
>
> The key is the APIC ID. There is no modification to existing PCI MSI and
> IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send an
> interrupt message containing an 8-bit APIC ID, which cannot address >255
> cpus. Interrupt remapping supports a 32-bit APIC ID, so it is necessary
> for enabling >255 cpus with x2apic mode.
>
> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC cannot
> deliver interrupts to all cpus in the system if #cpu > 255.

Another key factor: the Linux kernel disables x2apic mode when the maximum
APIC ID is > 255 and no interrupt remapping capability is present, for the
reason Kevin gave. So booting up >255 cpus relies on interrupt remapping.
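To make the addressing limit concrete, here is a small illustrative sketch (not Xen code; the helper names are made up, field layout per the classic MSI address format) of why a plain MSI message cannot target APIC IDs above 255 while a remapped interrupt can:

```python
# Classic (non-remapped) MSI address format: the destination APIC ID
# occupies bits 19:12, so only 8 bits (IDs 0-255) are available.
def msi_address(dest_apic_id):
    if dest_apic_id > 0xFF:
        raise ValueError("APIC ID %#x does not fit the 8-bit "
                         "destination field" % dest_apic_id)
    return 0xFEE00000 | (dest_apic_id << 12)

# With interrupt remapping, the MSI address instead carries an index
# (handle) into the Interrupt Remapping Table, and the IRTE itself holds
# a full 32-bit destination, so any x2APIC ID is reachable.
def irte_destination(x2apic_id):
    assert 0 <= x2apic_id < 2**32
    return {"destination": x2apic_id}   # grossly simplified IRTE

print(hex(msi_address(255)))   # highest ID reachable without IR
print(irte_destination(287))   # e.g. the 288th cpu on a KNL
```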


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 11:17         ` Discussion about virtual iommu support for Xen guest Tian, Kevin
  2016-06-03 13:09           ` Lan, Tianyu
@ 2016-06-03 13:51           ` Andrew Cooper
  2016-06-03 14:31             ` Jan Beulich
                               ` (2 more replies)
  1 sibling, 3 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-06-03 13:51 UTC (permalink / raw)
  To: Tian, Kevin, Lan, Tianyu, jbeulich, sstabellini, ian.jackson,
	xen-devel, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard, Roger Pau Monne

On 03/06/16 12:17, Tian, Kevin wrote:
>> Very sorry for the delay.
>>
>> There are multiple interacting issues here.  On the one side, it would
>> be useful if we could have a central point of coordination on
>> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
>> would you mind organising that?
>>
>> For the qemu/xen interaction, the current state is woeful and a tangled
>> mess.  I wish to ensure that we don't make any development decisions
>> which make the situation worse.
>>
>> In your case, the two motivations are quite different; I would recommend
>> dealing with them independently.
>>
>> IIRC, the issue with more than 255 cpus and interrupt remapping is that
>> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
>> can't be programmed to generate x2apic interrupts?  In principle, if you
>> don't have an IOAPIC, are there any other issues to be considered?  What
>> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
>> deliver xapic interrupts?
> The key is the APIC ID. There is no modification to existing PCI MSI and
> IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
> interrupt message containing 8bit APIC ID, which cannot address >255
> cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
> enable >255 cpus with x2apic mode.

Thanks for clarifying.

>
> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC cannot
> deliver interrupts to all cpus in the system if #cpu > 255.

Ok.  So not ideal (and we certainly want to address it), but this isn't
a complete show stopper for a guest.

>> On the other side of things, what is IGD passthrough going to look like
>> in Skylake?  Is there any device-model interaction required (i.e. the
>> opregion), or will it work as a completely standalone device?  What are
>> your plans with the interaction of virtual graphics and shared virtual
>> memory?
>>
> The plan is to use a so-called universal pass-through driver in the guest
> which only accesses standard PCI resource (w/o opregion, PCH/MCH, etc.)

This is fantastic news.

>
> ----
> Here is a brief list of potential usages relying on vIOMMU:
>
> a) enable >255 vcpus on Xeon Phi, as the initial purpose of this thread. 
> It requires interrupt remapping capability present on vIOMMU;
>
> b) support guest SVM (Shared Virtual Memory), which relies on the
> 1st level translation table capability (GVA->GPA) on vIOMMU. pIOMMU
> needs to enable both 1st level and 2nd level translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is
> the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too;
>
> c) support VFIO-based user space driver (e.g. DPDK) in the guest,
> which relies on the 2nd level translation capability (IOVA->GPA) on 
> vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> vIOMMU 2nd level by replacing GPA with HPA (becomes IOVA->HPA);

All of these look like interesting things to do.  I know there is a lot
of interest for b).
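For b), "nested mode" means the IOMMU walks the guest-controlled 1st-level table, with every GPA produced (and every table pointer fetched) along the way translated again by the Xen-controlled 2nd level. A toy sketch of the two stages, using flat dicts in place of real multi-level page tables:

```python
# Toy model of nested translation for SVM: GVA -> GPA (guest 1st level)
# followed by GPA -> HPA (Xen 2nd level).  Real hardware also translates
# each page-table pointer fetched during the 1st-level walk through the
# 2nd level; that detail is omitted here.
def nested_translate(gva, first_level, second_level):
    gpa = first_level[gva]        # guest-controlled 1st-level table
    return second_level[gpa]      # Xen-controlled 2nd-level table (P2M)

first_level  = {0x7f001000: 0x20000}   # GVA -> GPA, set up by the guest
second_level = {0x20000: 0x94000}      # GPA -> HPA, set up by Xen
assert nested_translate(0x7f001000, first_level, second_level) == 0x94000
```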

As a quick aside, does Xen currently boot on a Phi?  Last time I looked
at the Phi manual, I would expect Xen to crash on boot because of MCXSR
differences from more-common x86 hardware.

>
> ----
> And below are my thoughts on the viability of implementing vIOMMU in Qemu:
>
> a) enable >255 vcpus:
>
> 	o Enable Q35 in Qemu-Xen;
> 	o Add interrupt remapping in Qemu vIOMMU;
> 	o Virtual interrupt injection in hypervisor needs to know virtual
> interrupt remapping (IR) structure, since IR is behind vIOAPIC/vMSI,
> which requires new hypervisor interfaces as Andrew pointed out:
> 		* either for hypervisor to query IR from Qemu which is not
> good;
> 		* or for Qemu to register IR info to hypervisor which means
> partial IR knowledge implemented in hypervisor (then why not putting
> whole IR emulation in Xen?)
>
> b) support SVM
>
> 	o Enable Q35 in Qemu-Xen;
> 	o Add 1st level translation capability in Qemu vIOMMU;
> 	o VT-d context entry points to guest 1st level translation table
> which is nest-translated by 2nd level translation table so vIOMMU
> structure can be directly linked. It means:
> 		* Xen IOMMU driver enables nested mode;
> 		* Introduce a new hypercall so Qemu vIOMMU can register
> GPA root of guest 1st level translation table which is then written
> to context entry in pIOMMU;
>
> c) support VFIO-based user space driver
>
> 	o Enable Q35 in Qemu-Xen;
> 	o Leverage existing 2nd level translation implementation in Qemu 
> vIOMMU;
> 	o Change Xen IOMMU to support (IOVA->HPA) translation which
> means decouple current logic from P2M layer (only for GPA->HPA);
> 	o As a means of shadowing approach, Xen IOMMU driver needs to
> know both (IOVA->GPA) and (GPA->HPA) info to update (IOVA->HPA)
> mapping in case of any one is changed. So new interface is required
> for Qemu vIOMMU to propagate (IOVA->GPA) info into Xen hypervisor
> which may need to be further cached. 
>
> ----
>
> After writing down the above detail, it looks clear that putting the
> vIOMMU in Qemu is not a clean design for a) and c). For b) the hypervisor
> change is not that hacky, but b) alone seems not a strong enough reason
> to pursue the Qemu path. It seems we may have to go with the
> hypervisor-based approach...
>
> Anyway, stopping here. With the above background, let's see whether others
> have a better thought on how to accelerate TTM (time-to-market) of those
> usages in Xen. Xen was once a leading hypervisor for many new features,
> but recently it has not been sustaining that. If the above usages can be
> enabled decoupled from the HVMlite/virtual_root_port effort, then we can
> have a staged plan to move faster (first for HVM, later for HVMLite). :-)

I dislike that we are in this situation, but I am glad to see that I am not
the only one who thinks that the current situation is unsustainable.

The problem is that things were hacked up in the past on the assumption
that qemu could deal with everything like this.  Later, performance sucked
sufficiently that bits of qemu were moved back up into the hypervisor,
which is why the vIOAPIC is currently located there.  The result is a
completely tangled rat's nest.


Xen has 3 common uses for qemu, which are:
1) Emulation of legacy devices
2) PCI Passthrough
3) PV backends

3 isn't really relevant here.  For 1, we are basically just using Qemu
to provide an LPC implementation (with some populated slots for
disk/network devices).

I think it would be far cleaner to re-engineer the current Xen/qemu
interaction to more closely resemble real hardware, including
considering having multiple vIOAPICs/vIOMMUs/etc when architecturally
appropriate.  I expect that it would be a far cleaner interface to use
and extend.  I also realise that this isn't a simple task I am
suggesting, but I don't see any other viable way out.

Another issue in the mix is support for multiple device emulators, in
which case Xen is already performing first-level redirection of MMIO
requests.

For HVMLite, there is specifically no qemu, and we need something which
can function when we want PCI Passthrough to work.  I am quite confident
that the correct solution here is to have a basic host bridge/root port
implementation in Xen (as we already have 80% of this), at which point
we don't need any qemu interaction for PCI Passthrough at all, even for
HVM guests.

From this perspective, it would make sense to have emulators map IOVAs,
not GPAs.  We already have mapcache_invalidate infrastructure to flush
mappings as they are changed by the guest.
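The shadowing described for c) is essentially the composition of the two mappings, and the composed table must be recomputed or invalidated whenever either input changes. A minimal sketch (hypothetical helper name, flat dicts standing in for the real structures):

```python
# Shadowing for case c): the pIOMMU 2nd level must hold IOVA -> HPA,
# i.e. the guest's vIOMMU mapping (IOVA -> GPA) composed with Xen's
# P2M (GPA -> HPA).  If either input mapping changes, the affected
# shadow entries must be recomputed/invalidated.
def shadow_2nd_level(iova_to_gpa, gpa_to_hpa):
    return {iova: gpa_to_hpa[gpa] for iova, gpa in iova_to_gpa.items()}

iova_to_gpa = {0x1000: 0x20000, 0x2000: 0x21000}   # from the vIOMMU
gpa_to_hpa  = {0x20000: 0x94000, 0x21000: 0x95000} # from the P2M
shadow = shadow_2nd_level(iova_to_gpa, gpa_to_hpa)
assert shadow == {0x1000: 0x94000, 0x2000: 0x95000}
```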


For the HVMLite side of things, my key concern is not to try and do any
development which we realistically expect to have to undo/change.  As
you said yourself, we are struggling to sustain, and really aren't
helping ourselves by doing lots of work, and subsequently redoing it
when it doesn't work; PVH is the most obvious recent example here.

If others agree, I think that it is well worth making some concrete
plans for improvements in this area for Xen 4.8.  I think the only
viable way forward is to try and get out of the current hole we are in.

Thoughts?  (especially Stefano/Anthony)

~Andrew


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 13:09           ` Lan, Tianyu
@ 2016-06-03 14:00             ` Andrew Cooper
  0 siblings, 0 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-06-03 14:00 UTC (permalink / raw)
  To: Lan, Tianyu, Tian, Kevin, jbeulich, sstabellini, ian.jackson,
	xen-devel, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard, Roger Pau Monne

On 03/06/16 14:09, Lan, Tianyu wrote:
>
>
> On 6/3/2016 7:17 PM, Tian, Kevin wrote:
>>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>>> Sent: Friday, June 03, 2016 2:59 AM
>>>
>>> On 02/06/16 16:03, Lan, Tianyu wrote:
>>>> On 5/27/2016 4:19 PM, Lan Tianyu wrote:
>>>>> On 2016年05月26日 19:35, Andrew Cooper wrote:
>>>>>> On 26/05/16 09:29, Lan Tianyu wrote:
>>>>>>
>>>>>> To be viable going forwards, any solution must work with
>>>>>> PVH/HVMLite as
>>>>>> much as HVM.  This alone negates qemu as a viable option.
>>>>>>
>>>>>> From a design point of view, having Xen needing to delegate to
>>>>>> qemu to
>>>>>> inject an interrupt into a guest seems backwards.
>>>>>>
>>>>>
>>>>> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu and
>>>>> the qemu virtual iommu can't work for it. We have to rewrite virtual
>>>>> iommu in the Xen, right?
>>>>>
>>>>>>
>>>>>> A whole lot of this would be easier to reason about if/when we get a
>>>>>> basic root port implementation in Xen, which is necessary for
>>>>>> HVMLite,
>>>>>> and which will make the interaction with qemu rather more clean. 
>>>>>> It is
>>>>>> probably worth coordinating work in this area.
>>>>>
>>>>> The virtual iommu also should be under basic root port in Xen, right?
>>>>>
>>>>>>
>>>>>> As for the individual issue of 288vcpu support, there are already
>>>>>> issues
>>>>>> with 64vcpu guests at the moment. While it is certainly fine to
>>>>>> remove
>>>>>> the hard limit at 255 vcpus, there is a lot of other work
>>>>>> required to
>>>>>> even get 128vcpu guests stable.
>>>>>
>>>>>
>>>>> Could you give some points to these issues? We are enabling more
>>>>> vcpus
>>>>> support and it can boot up 255 vcpus without IR support basically.
>>>>> It's
>>>>> very helpful to learn about known issues.
>>>>>
>>>>> We will also add more tests for 128 vcpus into our regular test to
>>>>> find
>>>>> related bugs. Increasing max vcpu to 255 should be a good start.
>>>>
>>>> Hi Andrew:
>>>> Could you give more inputs about issues with 64 vcpus and what
>>>> needs to
>>>> be done to make 128vcpu guest stable? We hope to do somethings to
>>>> improve them.
>>>>
>>>> What's progress of PCI host bridge in Xen? From your opinion, we
>>>> should
>>>> do that first, right? Thanks.
>>>
>>> Very sorry for the delay.
>>>
>>> There are multiple interacting issues here.  On the one side, it would
>>> be useful if we could have a central point of coordination on
>>> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
>>> would you mind organising that?
>>>
>>> For the qemu/xen interaction, the current state is woeful and a tangled
>>> mess.  I wish to ensure that we don't make any development decisions
>>> which make the situation worse.
>>>
>>> In your case, the two motivations are quite different; I would recommend
>>> dealing with them independently.
>>>
>>> IIRC, the issue with more than 255 cpus and interrupt remapping is that
>>> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
>>> can't be programmed to generate x2apic interrupts?  In principle, if
>>> you
>>> don't have an IOAPIC, are there any other issues to be considered? 
>>> What
>>> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
>>> deliver xapic interrupts?
>>
>> The key is the APIC ID. There is no modification to existing PCI MSI and
>> IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
>> interrupt message containing 8bit APIC ID, which cannot address >255
>> cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
>> enable >255 cpus with x2apic mode.
>>
>> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
>> cannot
>> deliver interrupts to all cpus in the system if #cpu > 255.
>
> Another key factor: the Linux kernel disables x2apic mode when the maximum
> APIC ID is > 255 and no interrupt remapping capability is present, for the
> reason Kevin gave. So booting up >255 cpus relies on interrupt remapping.

That is an implementation decision of Linux, not an architectural
requirement.

We need to carefully distinguish the two (even if it doesn't affect the
planned outcome from Xen's point of view), as Linux is not the only
operating system we virtualise.


One interesting issue in this area is plain, no-frills HVMLite domains,
which have an LAPIC but no IOAPIC, as they have no legacy devices/PCI
bus/etc.  In this scenario, no vIOMMU would be required for x2apic mode,
even if the domain had >255 vcpus.
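This works because in x2apic mode the ICR destination field is a full 32 bits, so vcpu-to-vcpu IPIs are unaffected by the 8-bit limit; only MSI/IOAPIC-sourced interrupts need remapping. A quick illustrative sketch (not real APIC code):

```python
# In x2apic mode an IPI is a single 64-bit ICR MSR write, with the
# destination APIC ID in bits 63:32 - wide enough for any cpu count,
# so no interrupt remapping is needed for IPIs.
def x2apic_icr(dest_apic_id, vector):
    assert 0 <= dest_apic_id < 2**32 and 0 <= vector <= 0xFF
    return (dest_apic_id << 32) | vector

# A domain with 288 vcpus can still IPI vcpu 287 directly:
assert x2apic_icr(287, 0x30) >> 32 == 287
```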

~Andrew


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 13:51           ` Andrew Cooper
@ 2016-06-03 14:31             ` Jan Beulich
  2016-06-03 17:14             ` Stefano Stabellini
  2016-06-03 19:51             ` Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest Konrad Rzeszutek Wilk
  2 siblings, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2016-06-03 14:31 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: yang.zhang.wz, Tianyu Lan, Kevin Tian, sstabellini, ian.jackson,
	xen-devel, Jun Nakajima, anthony.perard, Roger Pau Monne

>>> On 03.06.16 at 15:51, <andrew.cooper3@citrix.com> wrote:
> As a quick aside, does Xen currently boot on a Phi?  Last time I looked
> at the Phi manual, I would expect Xen to crash on boot because of MCXSR
> differences from more-common x86 hardware.

It does boot, as per the reports we've got. Perhaps, much like I did
until I was explicitly told there's a significant difference, you're
confusing the earlier non-self-booting one with KNL?

Jan



^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest)
  2016-06-03 12:11               ` Tian, Kevin
@ 2016-06-03 16:56                 ` Stefano Stabellini
  2016-06-07  5:48                   ` Tian, Kevin
  0 siblings, 1 reply; 86+ messages in thread
From: Stefano Stabellini @ 2016-06-03 16:56 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Lan, Tianyu, yang.zhang.wz, sstabellini, jbeulich, Andrew Cooper,
	Dong, Eddie, xen-devel, Nakajima, Jun, anthony.perard,
	boris.ostrovsky, ian.jackson, Roger Pau Monne


On Fri, 3 Jun 2016, Tian, Kevin wrote:
> > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > Sent: Friday, June 03, 2016 7:53 PM
> > 
> > On Fri, Jun 03, 2016 at 11:21:20AM +0000, Tian, Kevin wrote:
> > > > From: Roger Pau Monne [mailto:roger.pau@citrix.com]
> > > > Sent: Friday, June 03, 2016 7:02 PM
> > > >
> > > > On Thu, Jun 02, 2016 at 07:58:48PM +0100, Andrew Cooper wrote:
> > > > > On 02/06/16 16:03, Lan, Tianyu wrote:
> > > > > > On 5/27/2016 4:19 PM, Lan Tianyu wrote:
> > > > > >> On 2016年05月26日 19:35, Andrew Cooper wrote:
> > > > > >>> On 26/05/16 09:29, Lan Tianyu wrote:
> > > > > >>>
> > > > > >>> To be viable going forwards, any solution must work with PVH/HVMLite as
> > > > > >>> much as HVM.  This alone negates qemu as a viable option.
> > > > > >>>
> > > > > >>> From a design point of view, having Xen needing to delegate to qemu to
> > > > > >>> inject an interrupt into a guest seems backwards.
> > > > > >>>
> > > > > >>
> > > > > >> Sorry, I am not familiar with HVMlite. HVMlite doesn't use Qemu and
> > > > > >> the qemu virtual iommu can't work for it. We have to rewrite virtual
> > > > > >> iommu in the Xen, right?
> > > > > >>
> > > > > >>>
> > > > > >>> A whole lot of this would be easier to reason about if/when we get a
> > > > > >>> basic root port implementation in Xen, which is necessary for HVMLite,
> > > > > >>> and which will make the interaction with qemu rather more clean.  It is
> > > > > >>> probably worth coordinating work in this area.
> > > > > >>
> > > > > >> The virtual iommu also should be under basic root port in Xen, right?
> > > > > >>
> > > > [...]
> > > > > > What's progress of PCI host bridge in Xen? From your opinion, we should
> > > > > > do that first, right? Thanks.
> > > > >
> > > > > Very sorry for the delay.
> > > > >
> > > > > There are multiple interacting issues here.  On the one side, it would
> > > > > be useful if we could have a central point of coordination on
> > > > > PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> > > > > would you mind organising that?
> > > >
> > > > Sure. Adding Boris and Konrad.
> > > >
> > > > AFAIK, the current status is that Boris posted an RFC to provide some basic
> > > > ACPI tables to PVH/HVMlite guests, and I'm currently working on rebasing my
> > > > half-baked HVMlite Dom0 series on top of that. Neither of those two projects
> > > > requires the presence of an emulated PCI root complex inside of Xen, so
> > > > there's nobody working on it ATM that I'm aware of.
> > > >
> > > > Speaking about the PVH/HVMlite roadmap, after those two items are done we
> > > > had plans to work on having full PCI root complex emulation inside of Xen,
> > > > so that we could do passthrough of PCI devices to PVH/HVMlite guests without
> > > > QEMU (and of course without pcifront inside of the guest). I don't foresee
> > > > any of us working on it for at least the next 6 months, so I think there's a
> > > > good chance that this can be done in parallel to the work that Boris and I
> > > > are doing, without any clashes. Is anyone at Intel interested in picking
> > > > this up?
> > >
> > > How stable is HVMLite today? Is it already in production usage?
> > >
> > > I wonder whether you have given some detailed thought to how full PCI root
> > > complex emulation will be done in Xen (including how to interact with Qemu)...
> > 
> > I haven't looked into much detail regarding all this, since as I said, it's
> > still a little bit far away in the PVH/HVMlite roadmap, we have more
> > pressing issues to solve before getting to the point of implementing
> > PCI-passthrough. I expect Xen is going to intercept all PCI accesses and is
> > then going to forward them to the ioreq servers that have been registered
> > for that specific config space, but this of course needs much more thought
> > and a proper design document.
> > 
> > > As I just wrote in another mail, if we just aim for HVM first, will it work if
> > > we implement vIOMMU in Xen but still rely on the Qemu root complex to
> > > report to the guest?
> > 
> > This seems quite inefficient IMHO (but I don't know that much about all this
> > vIOMMU stuff). If you implement vIOMMU inside of Xen, but the PCI root
> > complex is inside of Qemu aren't you going to perform quite a lot of jumps
> > between Xen and QEMU just to access the vIOMMU?
> > 
> > I expect something like:
> > 
> > Xen traps PCI access -> QEMU -> Xen vIOMMU implementation
> > 
> 
> I hope the role of Qemu is just to report vIOMMU related information, such
> as the DMAR table, etc., so the guest can enumerate the presence of the
> vIOMMU, while the actual emulation is done by the vIOMMU in the hypervisor
> without going through Qemu.
> 
> However, I just realized that even for the above purpose there's still some
> interaction required between Qemu and the Xen vIOMMU: e.g. the register base
> of the vIOMMU and the devices behind it are reported through the ACPI DRHD,
> which means the Xen vIOMMU needs to know the configuration in Qemu. Defining
> such interfaces between Qemu and the hypervisor might be messy. :/

PCI accesses don't need to be particularly fast; they should not be on
the hot path.

How bad would this interface between QEMU and the vIOMMU in Xen look?
Can we make a short list of the basic operations that we would need to
support, to get a clearer idea?


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 13:51           ` Andrew Cooper
  2016-06-03 14:31             ` Jan Beulich
@ 2016-06-03 17:14             ` Stefano Stabellini
  2016-06-07  5:14               ` Tian, Kevin
  2016-06-03 19:51             ` Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest Konrad Rzeszutek Wilk
  2 siblings, 1 reply; 86+ messages in thread
From: Stefano Stabellini @ 2016-06-03 17:14 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan, Tianyu, yang.zhang.wz, Tian, Kevin, sstabellini, Nakajima,
	Jun, Dong, Eddie, ian.jackson, xen-devel, jbeulich,
	anthony.perard, Roger Pau Monne

On Fri, 3 Jun 2016, Andrew Cooper wrote:
> On 03/06/16 12:17, Tian, Kevin wrote:
> >> Very sorry for the delay.
> >>
> >> There are multiple interacting issues here.  On the one side, it would
> >> be useful if we could have a central point of coordination on
> >> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> >> would you mind organising that?
> >>
> >> For the qemu/xen interaction, the current state is woeful and a tangled
> >> mess.  I wish to ensure that we don't make any development decisions
> >> which make the situation worse.
> >>
> >> In your case, the two motivations are quite different; I would recommend
> >> dealing with them independently.
> >>
> >> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> >> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> >> can't be programmed to generate x2apic interrupts?  In principle, if you
> >> don't have an IOAPIC, are there any other issues to be considered?  What
> >> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> >> deliver xapic interrupts?
> > The key is the APIC ID. There is no modification to existing PCI MSI and
> > IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
> > interrupt message containing 8bit APIC ID, which cannot address >255
> > cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
> > enable >255 cpus with x2apic mode.
> 
> Thanks for clarifying.
> 
> >
> > If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC cannot
> > deliver interrupts to all cpus in the system if #cpu > 255.
> 
> Ok.  So not ideal (and we certainly want to address it), but this isn't
> a complete show stopper for a guest.
> 
> >> On the other side of things, what is IGD passthrough going to look like
> >> in Skylake?  Is there any device-model interaction required (i.e. the
> >> opregion), or will it work as a completely standalone device?  What are
> >> your plans with the interaction of virtual graphics and shared virtual
> >> memory?
> >>
> > The plan is to use a so-called universal pass-through driver in the guest
> > which only accesses standard PCI resource (w/o opregion, PCH/MCH, etc.)
> 
> This is fantastic news.
> 
> >
> > ----
> > Here is a brief list of potential usages relying on vIOMMU:
> >
> > a) enable >255 vcpus on Xeon Phi, as the initial purpose of this thread. 
> > It requires interrupt remapping capability present on vIOMMU;
> >
> > b) support guest SVM (Shared Virtual Memory), which relies on the
> > 1st level translation table capability (GVA->GPA) on vIOMMU. pIOMMU
> > needs to enable both 1st level and 2nd level translation in nested
> > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is
> > the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too;
> >
> > c) support VFIO-based user space driver (e.g. DPDK) in the guest,
> > which relies on the 2nd level translation capability (IOVA->GPA) on 
> > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> > vIOMMU 2nd level by replacing GPA with HPA (becomes IOVA->HPA);
> 
> All of these look like interesting things to do.  I know there is a lot
> of interest for b).
> 
> As a quick aside, does Xen currently boot on a Phi?  Last time I looked
> at the Phi manual, I would expect Xen to crash on boot because of MCXSR
> differences from more-common x86 hardware.
> 
> >
> > ----
> > And below are my thoughts on the viability of implementing vIOMMU in Qemu:
> >
> > a) enable >255 vcpus:
> >
> > 	o Enable Q35 in Qemu-Xen;
> > 	o Add interrupt remapping in Qemu vIOMMU;
> > 	o Virtual interrupt injection in hypervisor needs to know virtual
> > interrupt remapping (IR) structure, since IR is behind vIOAPIC/vMSI,
> > which requires new hypervisor interfaces as Andrew pointed out:
> > 		* either for hypervisor to query IR from Qemu which is not
> > good;
> > 		* or for Qemu to register IR info to hypervisor which means
> > partial IR knowledge implemented in hypervisor (then why not putting
> > whole IR emulation in Xen?)
> >
> > b) support SVM
> >
> > 	o Enable Q35 in Qemu-Xen;
> > 	o Add 1st level translation capability in Qemu vIOMMU;
> > 	o VT-d context entry points to guest 1st level translation table
> > which is nest-translated by 2nd level translation table so vIOMMU
> > structure can be directly linked. It means:
> > 		* Xen IOMMU driver enables nested mode;
> > 		* Introduce a new hypercall so Qemu vIOMMU can register
> > GPA root of guest 1st level translation table which is then written
> > to context entry in pIOMMU;
> >
> > c) support VFIO-based user space driver
> >
> > 	o Enable Q35 in Qemu-Xen;
> > 	o Leverage existing 2nd level translation implementation in Qemu 
> > vIOMMU;
> > 	o Change Xen IOMMU to support (IOVA->HPA) translation which
> > means decouple current logic from P2M layer (only for GPA->HPA);
> > 	o As a means of shadowing approach, Xen IOMMU driver needs to
> > know both (IOVA->GPA) and (GPA->HPA) info to update (IOVA->HPA)
> > mapping in case of any one is changed. So new interface is required
> > for Qemu vIOMMU to propagate (IOVA->GPA) info into Xen hypervisor
> > which may need to be further cached. 
> >
> > ----
> >
> > After writing down the above detail, it looks clear that putting
> > vIOMMU in Qemu is not a clean design for a) and c). For b) the
> > hypervisor change is not that hacky, but on its own it seems
> > insufficient reason to pursue the Qemu path. It seems we may have to
> > go with a hypervisor-based approach...
> >
> > Anyway, I'll stop here. With the above background let's see whether
> > others have a better thought on how to accelerate the time-to-market
> > of those usages in Xen. Xen was once a leading hypervisor for many
> > new features, but recently it has not been keeping pace. If the above
> > usages can be enabled decoupled from the HVMlite/virtual_root_port
> > effort, then we can have a staged plan to move faster (first for HVM,
> > later for HVMLite). :-)
> 
> I dislike that we are in this situation, but I am glad to see that I am
> not the only one who thinks that the current situation is unsustainable.
> 
> The problem is that things were hacked up in the past to assume qemu
> could deal with everything like this.  Later, performance sucked
> sufficiently that bits of qemu were moved back up into the hypervisor,
> which is why the vIOAPIC is currently located there.  The result is a
> completely tangled rat's nest.
> 
> 
> Xen has 3 common uses for qemu, which are:
> 1) Emulation of legacy devices
> 2) PCI Passthrough
> 3) PV backends
> 
> 3 isn't really relevant here.  For 1, we are basically just using Qemu
> to provide an LPC implementation (with some populated slots for
> disk/network devices).
> 
> I think it would be far cleaner to re-engineer the current Xen/qemu
> interaction to more closely resemble real hardware, including
> considering having multiple vIOAPICs/vIOMMUs/etc when architecturally
> appropriate.  I expect that it would be a far cleaner interface to use
> and extend.  I also realise that this isn't a simple task I am
> suggesting, but I don't see any other viable way out.
> 
> Another issue in the mix is support for multiple device emulators, in
> which case Xen is already performing first-level redirection of MMIO
> requests.
> 
> For HVMLite, there is specifically no qemu, and we need something which
> can function when we want PCI Passthrough to work.  I am quite confident
> that the correct solution here is to have a basic host bridge/root port
> implementation in Xen (as we already have 80% of this already), at which
> point we don't need any qemu interaction for PCI Passthough at all, even
> for HVM guests.
> 
> From this perspective, it would make sense to have emulators map IOVAs,
> not GPAs.  We already have mapcache_invalidate infrastructure to flush
> mappings as they are changed by the guest.
> 
> 
> For the HVMLite side of things, my key concern is not to try and do any
> development which we realistically expect to have to undo/change.  As
> you said yourself, we are struggling to sustain, and really aren't
> helping ourselves by doing lots of work, and subsequently redoing it
> when it doesn't work; PVH is the most obvious recent example here.
> 
> If others agree, I think that it is well worth making some concrete
> plans for improvements in this area for Xen 4.8.  I think the only
> viable way forward is to try and get out of the current hole we are in.
> 
> Thoughts?  (especially Stefano/Anthony)

Going back to the beginning of the discussion, whether we should enable
Q35 in QEMU or not is a distraction: of course we should enable it, but
even with Q35 in QEMU, it might not be a good idea to place the vIOMMU
emulation there.

I agree with Andrew that the current model is flawed: the boundary
between Xen and QEMU emulation is not clear enough. In addition using
QEMU on Xen introduces latency and security issues (the work to run QEMU
as non-root and using unprivileged interfaces is not complete yet).

I think of QEMU as a provider of complex, high level emulators, such as
the e1000, Cirrus VGA, SCSI controllers, etc., which don't necessarily
need to be fast.

For core x86 components, such as the vIOMMU, for performance and ease of
integration with the rest of the hypervisor, it seems to me that Xen
is the right place to implement them. As a comparison, I would
certainly argue in favor of implementing vSMMU in the hypervisor on ARM.


However the issue is the PCI root-complex, which today is in QEMU. I
don't think it is a particularly bad fit there, although I can also see
the benefit of moving it to the hypervisor. It is relevant here only
insofar as it causes problems for implementing vIOMMU in Xen.

From a software engineering perspective, it would be nice to keep the
two projects (implementing vIOMMU and moving the PCI root complex to
Xen) separate, especially given that the PCI root complex one is without
an owner and a timeline. I don't think it is fair to ask Tianyu or Kevin
to move the PCI root complex from QEMU to Xen in order to enable vIOMMU
on Xen systems.

If vIOMMU in Xen and root complex in QEMU cannot be made to work
together, then we are at an impasse. I cannot see any good way forward
unless somebody volunteers to start working on the PCI root complex
project soon, to provide Kevin and Tianyu with a branch to base their
work upon.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


* Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 13:51           ` Andrew Cooper
  2016-06-03 14:31             ` Jan Beulich
  2016-06-03 17:14             ` Stefano Stabellini
@ 2016-06-03 19:51             ` Konrad Rzeszutek Wilk
  2016-06-06  9:55               ` Jan Beulich
  2 siblings, 1 reply; 86+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-06-03 19:51 UTC (permalink / raw)
  To: Andrew Cooper
  Cc: Lan, Tianyu, yang.zhang.wz, Tian, Kevin, sstabellini, Nakajima,
	Jun, Dong, Eddie, ian.jackson, xen-devel, jbeulich,
	anthony.perard, Roger Pau Monne

[-- Attachment #1: Type: text/plain, Size: 913 bytes --]

> For HVMLite, there is specifically no qemu, and we need something which
> can function when we want PCI Passthrough to work.  I am quite confident
> that the correct solution here is to have a basic host bridge/root port
> implementation in Xen (as we already have 80% of this already), at which
> point we don't need any qemu interaction for PCI Passthough at all, even
> for HVM guests.

Could you expand on this a bit?

I am asking b/c some time ago I wrote code in Xen to construct a full view
of the bridges->devices tree (with its various branches) so that I could
renumber the bus values and their devices (expand them) on bridges. This was
done solely so that I could use SR-IOV devices on non-SR-IOV-capable BIOSes.
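For context on why SR-IOV forces the renumbering: per the PCIe SR-IOV spec, VF N (1-based) gets routing ID `PF_RID + First_VF_Offset + (N - 1) * VF_Stride`, so a PF with many VFs can spill past its own bus number and the upstream bridge must reserve extra subordinate buses. A minimal sketch of that arithmetic (function names are made up; note the attached patches instead use a cruder `total_vfs / 8` estimate):

```c
#include <stdint.h>

/* Routing ID (bus << 8 | devfn) of VF N, per the SR-IOV capability's
 * First VF Offset and VF Stride registers. */
static uint16_t sriov_vf_rid(uint16_t pf_rid, uint16_t vf_offset,
                             uint16_t vf_stride, uint16_t n)
{
    return pf_rid + vf_offset + (n - 1) * vf_stride;
}

/* Extra bus numbers needed beyond the PF's own bus to reach the last
 * VF -- the amount the bridge's subordinate bus range must grow by. */
static unsigned int sriov_extra_buses(uint16_t pf_rid, uint16_t vf_offset,
                                      uint16_t vf_stride, uint16_t total_vfs)
{
    uint16_t last = sriov_vf_rid(pf_rid, vf_offset, vf_stride, total_vfs);

    return (last >> 8) - (pf_rid >> 8);
}
```

E.g. a PF at 05:00.0 (RID 0x0500) with offset 0x80 and stride 2 fits 64 VFs on its own bus, but 128 VFs spill one bus further, which a BIOS unaware of SR-IOV will not have reserved.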

I am wondering how much of the basic functionality (enumeration, keeping
track, etc.) could be worked into this 'basic host bridge/root port'
implementation idea of yours.

Attaching the patches.

[-- Attachment #2: 0001-pci-On-PCI-dump-device-keyhandler-include-Device-and.patch --]
[-- Type: text/plain, Size: 1505 bytes --]

From 4ea2d880c0250c1278995e5ee7d9e48151c4e4e1 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 4 Feb 2014 12:52:35 -0500
Subject: [PATCH 1/5] pci: On PCI dump device keyhandler include Device and
 Vendor ID

This helps in troubleshooting when the initial domain has
re-numbered the buses and what Xen sees does not match
reality.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/pci.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index cdbabc2..5e5097e 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1211,9 +1211,12 @@ static int _dump_pci_devices(struct pci_seg *pseg, void *arg)
 
     list_for_each_entry ( pdev, &pseg->alldevs_list, alldevs_list )
     {
-        printk("%04x:%02x:%02x.%u - dom %-3d - node %-3d - MSIs < ",
+        int id = pci_conf_read32(pseg->nr, pdev->bus, PCI_SLOT(pdev->devfn),
+                                 PCI_FUNC(pdev->devfn), 0);
+        printk("%04x:%02x:%02x.%u (%04x:%04x)- dom %-3d - node %-3d - MSIs < ",
                pseg->nr, pdev->bus,
                PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
+               id & 0xffff, (id >> 16) & 0xffff,
                pdev->domain ? pdev->domain->domain_id : -1,
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
-- 
2.5.5


[-- Attachment #3: 0002-DEBUG-Include-upstream-bridge-information.patch --]
[-- Type: text/plain, Size: 1696 bytes --]

From ac79b6cdd20765d30adbff40514e729a2c33e74e Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 4 Feb 2014 17:01:42 -0500
Subject: [PATCH 2/5] DEBUG: Include upstream bridge information.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/passthrough/pci.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 5e5097e..ae6df78 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1213,6 +1213,9 @@ static int _dump_pci_devices(struct pci_seg *pseg, void *arg)
     {
         int id = pci_conf_read32(pseg->nr, pdev->bus, PCI_SLOT(pdev->devfn),
                                  PCI_FUNC(pdev->devfn), 0);
+        int rc = 0;
+        u8 bus, devfn, secbus;
+
         printk("%04x:%02x:%02x.%u (%04x:%04x)- dom %-3d - node %-3d - MSIs < ",
                pseg->nr, pdev->bus,
                PCI_SLOT(pdev->devfn), PCI_FUNC(pdev->devfn),
@@ -1221,7 +1224,14 @@ static int _dump_pci_devices(struct pci_seg *pseg, void *arg)
                (pdev->node != NUMA_NO_NODE) ? pdev->node : -1);
         list_for_each_entry ( msi, &pdev->msi_list, list )
                printk("%d ", msi->irq);
-        printk(">\n");
+        bus = pdev->bus;
+        devfn = pdev->devfn;
+
+        rc = find_upstream_bridge( pseg->nr, &bus, &devfn, &secbus );
+        if ( rc < 0)
+            printk(">\n");
+        else
+            printk(">[%02x:%02x.%u]\n", bus, PCI_SLOT(devfn), PCI_FUNC(devfn));
     }
     printk("==== Bus2Bridge %04x ====\n", pseg->nr);
     spin_lock(&pseg->bus2bridge_lock);
-- 
2.5.5


[-- Attachment #4: 0003-xen-pci-assign-buses-Renumber-the-bus-if-there-is-a-.patch --]
[-- Type: text/plain, Size: 22023 bytes --]

From 954e04936a7fdbd5a10807b901812f09b382f07a Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Mon, 12 Jan 2015 16:37:32 -0500
Subject: [PATCH 3/5] xen/pci=assign-buses: Renumber the bus if there is a need
 to (v6).

Xen can re-number the PCI buses if there are SR-IOV devices present
and the BIOS hasn't done its job.

Use pci=assign-buses,verbose to see it work.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/setup.c          |   2 +
 xen/drivers/passthrough/pci.c | 647 ++++++++++++++++++++++++++++++++++++++++++
 xen/include/xen/pci.h         |   1 +
 3 files changed, 650 insertions(+)

diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 90b1b6c..81a7d6d 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1474,6 +1474,8 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     acpi_mmcfg_init();
 
+    early_pci_reassign_busses();
+
     early_msi_init();
 
     iommu_setup();    /* setup iommu if available */
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index ae6df78..62b5f85 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -178,6 +178,8 @@ custom_param("pci-phantom", parse_phantom_dev);
 
 static u16 __read_mostly command_mask;
 static u16 __read_mostly bridge_ctl_mask;
+static unsigned int __initdata assign_busses;
+static unsigned int __initdata verbose;
 
 /*
  * The 'pci' parameter controls certain PCI device aspects.
@@ -213,6 +215,10 @@ static void __init parse_pci_param(char *s)
             cmd_mask = PCI_COMMAND_PARITY;
             brctl_mask = PCI_BRIDGE_CTL_PARITY;
         }
+        else if ( !strcmp(s, "assign-buses") )
+            assign_busses = 1;
+        else if ( !strcmp(s, "verbose") )
+            verbose = 1;
 
         if ( on )
         {
@@ -1091,6 +1097,647 @@ static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg
     return 0;
 }
 
+/* Move this to its own file */
+#define DEBUG 1
+
+struct early_pci_bus;
+
+struct early_pci_dev {
+    struct list_head bus_list;  /* Linked against 'devices */
+    unsigned int is_serial:1;
+    unsigned int is_ehci:1;
+    unsigned int is_sriov:1;
+    unsigned int is_bridge:1;
+    u16 vendor;
+    u16 device;
+    u8 devfn;
+    u16 total_vfs;
+    u16 revision;
+    u16 class;
+    struct early_pci_bus *bus; /* On what bus we are. */
+    struct early_pci_bus *bridge; /* Ourselves if we are a bridge */
+};
+struct early_pci_bus {
+    struct list_head next;
+    struct list_head devices;
+    struct list_head children;
+    struct early_pci_bus *parent; /* Bus upstream of us. */
+    struct early_pci_dev *self; /* The PCI device that controls this bus. */
+    u8 primary; /* The (parent) bus number */
+    u8 number;
+    u8 start;
+    u8 end;
+    u8 new_end; /* To be updated too */
+    u8 new_start;
+    u8 new_primary;
+    u8 old_number;
+};
+
+static struct list_head __initdata early_buses_list;
+#define PCI_CLASS_SERIAL_USB_EHCI 0x0c0320
+
+static __init struct early_pci_dev *early_alloc_pci_dev(struct early_pci_bus *bus,
+                                                        u8 devfn)
+{
+    struct early_pci_dev *dev;
+    u8 type;
+    u16 class_dev, total;
+    u32 class, id;
+    unsigned int pos;
+
+    if ( !bus )
+        return NULL;
+
+    dev = xzalloc(struct early_pci_dev);
+    if ( !dev )
+        return NULL;
+
+    INIT_LIST_HEAD(&dev->bus_list);
+    dev->devfn = devfn;
+    dev->bus = bus;
+    class = pci_conf_read32(0, bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                            PCI_CLASS_REVISION);
+
+    dev->revision = class & 0xff;
+    dev->class = class >> 8;
+    if ( dev->class == PCI_CLASS_SERIAL_USB_EHCI )
+        dev->is_ehci = 1;
+
+    class_dev = pci_conf_read16(0, bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                                PCI_CLASS_DEVICE);
+    switch ( class_dev )
+    {
+        case 0x0700: /* single port serial */
+        case 0x0702: /* multi port serial */
+        case 0x0780: /* other (e.g serial+parallel) */
+            dev->is_serial = 1;
+        default:
+            break;
+    }
+    type = pci_conf_read8(0, bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                          PCI_HEADER_TYPE);
+    switch ( type & 0x7f )
+    {
+        case PCI_HEADER_TYPE_BRIDGE:
+        case PCI_HEADER_TYPE_CARDBUS:
+            dev->is_bridge = 1;
+            break;
+        case PCI_HEADER_TYPE_NORMAL:
+            pos = pci_find_cap_offset(0, bus->number, PCI_SLOT(devfn),
+                                      PCI_FUNC(devfn), PCI_CAP_ID_EXP);
+            if (!pos)   /* Not PCIe */
+                break;
+            pos = pci_find_ext_capability(0, bus->number, devfn,
+                                          PCI_EXT_CAP_ID_SRIOV);
+            if (!pos)   /* Not SR-IOV */
+                break;
+            total = pci_conf_read16(0, bus->number, PCI_SLOT(devfn),
+                                    PCI_FUNC(devfn), pos + PCI_SRIOV_TOTAL_VF);
+            if (!total)
+                break;
+            dev->is_sriov = 1;
+            dev->total_vfs = total;
+            /* Fall through */
+        default:
+            break;
+    }
+    id = pci_conf_read32(0, bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn),
+                         PCI_VENDOR_ID);
+    dev->vendor = id & 0xffff;
+    dev->device = (id >> 16) & 0xffff;
+    /* In case MCFG is not configured we have our blacklist */
+    switch ( dev->vendor )
+    {
+        case 0x8086: /* Intel */
+            switch ( dev->device )
+            {
+                case 0x10c9: /* Intel Corporation 82576 Gigabit Network Connection (rev 01) */
+                    if ( dev->is_sriov )
+                        break;
+                    dev->is_sriov = 1;
+                    dev->total_vfs = 8;
+            }
+        default:
+            break;
+    }
+    return dev;
+}
+
+static __init struct early_pci_bus *__find_bus(struct early_pci_bus *parent,
+                                               u8 nr)
+{
+    struct early_pci_bus *child, *bus;
+
+    if ( parent->number == nr )
+        return parent;
+
+    list_for_each_entry ( child, &parent->children, next )
+    {
+        if ( child->number == nr )
+            return child;
+        bus = __find_bus(child, nr);
+        if ( bus )
+            return bus;
+    }
+    return NULL;
+}
+
+static __init struct early_pci_bus *find_bus(u8 nr)
+{
+    struct early_pci_bus *bus, *child;
+
+    list_for_each_entry ( bus, &early_buses_list, next )
+    {
+       child = __find_bus(bus, nr);
+       if ( child )
+            return child;
+    }
+    return NULL;
+}
+
+static __init struct early_pci_dev *find_dev(u8 nr, u8 devfn)
+{
+    struct early_pci_bus *bus = NULL;
+
+    bus = find_bus(nr);
+    if ( bus ) {
+        struct early_pci_dev *dev = NULL;
+
+        list_for_each_entry ( dev, &bus->devices, bus_list )
+            if ( dev->devfn == devfn )
+                return dev;
+    }
+    return NULL;
+}
+
+static __init struct early_pci_bus *early_alloc_pci_bus(struct early_pci_dev *dev, u8 nr)
+{
+    struct early_pci_bus *bus;
+
+    bus = xzalloc(struct early_pci_bus);
+    if ( !bus )
+        return NULL;
+
+    INIT_LIST_HEAD(&bus->next);
+    INIT_LIST_HEAD(&bus->devices);
+    INIT_LIST_HEAD(&bus->children);
+    bus->number = nr;
+    bus->old_number = nr;
+    bus->self = dev;
+    if ( dev )
+        if ( !dev->bridge )
+            dev->bridge = bus;
+    return bus;
+}
+
+static void __init early_free_pci_bus(struct early_pci_bus *bus)
+{
+    struct early_pci_dev *dev, *d_tmp;
+    struct early_pci_bus *b, *b_tmp;
+
+    list_for_each_entry_safe ( b, b_tmp, &bus->children, next )
+    {
+        early_free_pci_bus (b);
+        list_del ( &b->next );
+    }
+    list_for_each_entry_safe ( dev, d_tmp, &bus->devices, bus_list )
+    {
+        list_del ( &dev->bus_list );
+        xfree ( dev );
+    }
+}
+
+static void __init early_free_all(void)
+{
+    struct early_pci_bus *bus, *tmp;
+
+    list_for_each_entry_safe( bus, tmp, &early_buses_list, next )
+    {
+        early_free_pci_bus (bus);
+        list_del( &bus->next );
+        xfree(bus);
+    }
+}
+
+unsigned int __init pci_iov_scan(struct early_pci_bus *bus)
+{
+    struct early_pci_dev *dev;
+    unsigned int max = 0;
+    u8 busnr;
+
+    list_for_each_entry ( dev, &bus->devices, bus_list )
+    {
+        if ( !dev->is_sriov )
+            continue;
+        if ( !dev->total_vfs )
+            continue;
+        busnr = (dev->total_vfs) / 8; /* How many buses we will need */
+        if ( busnr > max )
+            max = busnr;
+    }
+    /* Do we have enough space for them ? */
+    if ( (bus->end - bus->start) >= max )
+        return 0;
+    return max;
+}
+
+#ifdef DEBUG
+static __init const char *spaces(unsigned int lvl)
+{
+    if (lvl == 0)
+        return " ";
+    if (lvl == 1)
+        return " +--+";
+    if (lvl == 2)
+        return "    +-+";
+    if (lvl == 3)
+        return "       +-+";
+    return "         +...+";
+}
+
+static void __init print_devs(struct early_pci_bus *parent, int lvl)
+{
+    struct early_pci_dev *dev;
+    struct early_pci_bus *bus;
+
+    list_for_each_entry( dev, &parent->devices, bus_list )
+    {
+        printk("%s%04x:%02x:%u [%04x:%04x] class %06x", spaces(lvl), parent->number,
+               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), dev->vendor,
+               dev->device, dev->class);
+        if ( dev->is_bridge )
+        {
+            printk(" BRIDGE");
+            if ( dev->bridge )
+            {
+                struct early_pci_bus *bridge = dev->bridge;
+                printk(" to BUS %x [spans %x->%x] primary BUS %x", bridge->number, bridge->start, bridge->end, bridge->primary);
+                printk(" (primary: %x spans %x->%x)", bridge->new_primary, bridge->new_start, bridge->new_end);
+            }
+        }
+        if ( dev->is_sriov )
+            printk(" sriov: %d", dev->total_vfs);
+        if ( dev->is_ehci )
+            printk (" EHCI DEBUG ");
+        if ( dev->is_serial )
+            printk (" SERIAL ");
+        printk("\n");
+    }
+    list_for_each_entry( bus, &parent->children, next )
+        print_devs(bus, lvl + 1);
+}
+#endif
+
+static void __init print_devices(void)
+{
+#ifdef DEBUG
+    struct early_pci_bus *bus;
+
+    if ( !verbose )
+        return;
+
+    list_for_each_entry( bus, &early_buses_list, next )
+        print_devs(bus, 0);
+#endif
+}
+
+unsigned int pci_scan_bus( struct early_pci_bus *bus);
+unsigned int __init pci_scan_slot(struct early_pci_bus *bus, unsigned int devfn)
+{
+    struct early_pci_dev *dev;
+
+    if ( find_dev(bus->number, devfn) )
+        return 0;
+
+    if ( !pci_device_detect (0, bus->number, PCI_SLOT(devfn), PCI_FUNC(devfn)) )
+        return 0;
+
+    dev = early_alloc_pci_dev(bus, devfn);
+    if ( !dev )
+        return -ENODEV;
+
+    list_add_tail(&dev->bus_list, &bus->devices);
+    return 0;
+}
+
+static int __init pci_scan_bridge(struct early_pci_bus *bus,
+                                  struct early_pci_dev *dev,
+                                  unsigned int max)
+{
+    struct early_pci_bus *child;
+    u32 buses;
+    u8 primary, secondary, subordinate;
+    unsigned int cmax = 0;
+
+    buses = pci_conf_read32(0, bus->number, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                            PCI_PRIMARY_BUS);
+
+    primary = buses & 0xFF;
+    secondary = (buses >> 8) & 0xFF;
+    subordinate = (buses >> 16) & 0xFF;
+
+    if (!primary && (primary != bus->number) && secondary && subordinate) {
+        printk("Primary bus is hard wired to 0\n");
+        primary = bus->number;
+    }
+
+    child = find_bus(secondary);
+    if ( !child )
+    {
+        child = early_alloc_pci_bus(dev, secondary);
+        if ( !child )
+            goto out;
+        /* Add to the parent's bus list */
+        list_add_tail(&child->next, &bus->children);
+        /* The primary is the upstream bus number. */
+        child->primary = primary;
+        child->start = secondary;
+        child->end = subordinate;
+        child->parent = bus;
+    }
+    cmax = pci_scan_bus(child);
+    if ( cmax > max )
+        max = cmax;
+
+    if ( child->end > max )
+        max = child->end;
+out:
+    return max;
+}
+
+unsigned int __init pci_scan_bus( struct early_pci_bus *bus)
+{
+    unsigned int max = 0, devfn;
+    struct early_pci_dev *dev;
+
+    for ( devfn = 0; devfn < 0x100; devfn++ )
+        pci_scan_slot (bus, devfn);
+
+    /* Walk all devices and create the bus structs */
+    list_for_each_entry ( dev, &bus->devices, bus_list )
+    {
+        if ( !dev->is_bridge )
+            continue;
+        if ( verbose )
+            printk("Scanning bridge %04x:%02x.%u [%04x:%04x] class %06x\n", bus->number,
+                   PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), dev->vendor, dev->device,
+                   dev->class);
+        max = pci_scan_bridge(bus, dev, max);
+    }
+    if ( max > bus->end )
+        bus->end = max;
+    return max;
+}
+
+static __init unsigned int adjust_span(struct early_pci_bus *bus,
+                                       unsigned int offset)
+{
+    struct early_pci_bus *child = NULL;
+    unsigned int scan;
+
+    bus->new_start = bus->start;
+    bus->new_end = bus->end;
+    /* We can't check against offset as the loop might have altered it. */
+    /* N.B. Ignore host bridges. */
+    if ( offset && bus->parent )
+        bus->new_start += offset;
+
+    scan = pci_iov_scan(bus);
+    offset += scan;
+
+    list_for_each_entry( child, &bus->children, next )
+    {
+        unsigned int new_offset;
+
+        new_offset = adjust_span(child , offset);
+        if ( new_offset > offset )
+            /* A new contender ! */
+            offset = new_offset;
+    }
+    bus->new_end += offset;
+    return offset;
+}
+
+static __init void adjust_primary(struct early_pci_bus *bus,
+                                  unsigned int offset)
+{
+    struct early_pci_bus *child;
+
+    list_for_each_entry( child, &bus->children, next )
+    {
+        child->new_primary = bus->new_start;
+        adjust_primary(child, offset);
+
+    }
+}
+
+static void __init pci_disable_forwarding(struct early_pci_bus *parent)
+{
+    struct early_pci_dev *dev;
+    u32 buses;
+
+    list_for_each_entry ( dev, &parent->devices, bus_list )
+    {
+        u8 bus;
+        u16 bctl;
+
+        if ( !dev->is_bridge )
+            continue;
+
+        bus = dev->bus->number;
+        buses = pci_conf_read32(0, bus, PCI_SLOT(dev->devfn),
+                            PCI_FUNC(dev->devfn), PCI_PRIMARY_BUS);
+        if ( verbose )
+            printk("%04x:%02x.%u PCI_PRIMARY_BUS read %x [%s]\n", bus,
+                   PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), buses, __func__);
+        /* Lifted from Linux but not sure if this MasterAbort masking is
+         * still needed. */
+
+        bctl = pci_conf_read32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                               PCI_BRIDGE_CONTROL);
+
+        pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                         PCI_BRIDGE_CONTROL, bctl & ~PCI_BRIDGE_CTL_MASTER_ABORT);
+
+        if ( verbose )
+            printk("%04x:%02x.%u clearing PCI_PRIMARY_BUS %x\n",  bus, PCI_SLOT(dev->devfn),
+                   PCI_FUNC(dev->devfn), buses & ~0xffffff);
+
+        /* Disable forwarding */
+        pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                         PCI_PRIMARY_BUS, buses &  ~0xffffff);
+
+        pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                         PCI_BRIDGE_CONTROL, bctl);
+    }
+}
+
+static void __init __pci_program_bridge(struct early_pci_dev *dev, u8 bus)
+{
+    u16 bctl;
+    u32 buses;
+    struct early_pci_bus *child, *bridges;
+    u8 primary, secondary, subordinate;
+
+    child = dev->bridge; /* The bridge we are serving and don't use parent. */
+    ASSERT( child );
+
+    buses = pci_conf_read32(0, bus, PCI_SLOT(dev->devfn),
+                            PCI_FUNC(dev->devfn), PCI_PRIMARY_BUS);
+    if ( verbose )
+        printk("%04x:%02x.%u PCI_PRIMARY_BUS read %x [%s]\n", bus,
+               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), buses, __func__);
+
+    /* Lifted from Linux but not sure if this MasterAbort masking is
+     * still needed. */
+    bctl = pci_conf_read32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                           PCI_BRIDGE_CONTROL);
+    pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     PCI_BRIDGE_CONTROL, bctl & ~PCI_BRIDGE_CTL_MASTER_ABORT);
+
+    pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     PCI_STATUS, 0xffff);
+
+    buses = (buses & 0xff000000)
+                | ((unsigned int)(child->new_primary)     <<  0)
+                | ((unsigned int)(child->new_start)   <<  8)
+                | ((unsigned int)(child->new_end) << 16);
+    if ( verbose )
+        printk("%04x:%02x.%u wrote to PCI_PRIMARY_BUS %x\n",  bus, PCI_SLOT(dev->devfn),
+               PCI_FUNC(dev->devfn), buses);
+
+    pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     PCI_PRIMARY_BUS, buses);
+
+    pci_conf_write32(0, bus, PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                     PCI_BRIDGE_CONTROL, bctl);
+
+    /* Double check that it is correct. */
+    buses = pci_conf_read32(0, bus, PCI_SLOT(dev->devfn),
+                            PCI_FUNC(dev->devfn), PCI_PRIMARY_BUS);
+    if ( verbose )
+        printk("%04x:%02x.%u PCI_PRIMARY_BUS read %x\n", bus,
+               PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn), buses);
+
+    primary = buses & 0xFF;
+    secondary = (buses >> 8) & 0xFF;
+    subordinate = (buses >> 16) & 0xFF;
+
+    ASSERT(primary == child->new_primary);
+    ASSERT(secondary == child->new_start);
+    ASSERT(subordinate == child->new_end);
+
+    child->number = child->new_start;
+    child->primary = child->new_primary;
+    child->start = child->new_start;
+    child->end = child->new_end;
+
+    pci_disable_forwarding( child ); /* Bridges below us */
+
+    list_for_each_entry ( bridges, &child->children, next )
+    {
+        if ( bridges->self )
+            __pci_program_bridge(bridges->self, child->number);
+    }
+}
+
+static void __init pci_program_bridge(struct early_pci_bus *bus)
+{
+    struct early_pci_dev *dev;
+
+    list_for_each_entry ( dev, &bus->devices, bus_list )
+    {
+        if ( !dev->is_bridge )
+            continue;
+        __pci_program_bridge(dev, bus->number);
+    }
+}
+static void __init update_console_devices(struct early_pci_bus *parent)
+{
+    struct early_pci_dev *dev;
+    struct early_pci_bus *bus;
+
+    list_for_each_entry( dev, &parent->devices, bus_list )
+    {
+        if ( dev->is_ehci || dev->is_serial || dev->is_bridge )
+        {
+            ;/* TODO */
+        }
+    }
+    list_for_each_entry( bus, &parent->children, next )
+        update_console_devices(bus);
+}
+
+void __init early_pci_reassign_busses(void)
+{
+    unsigned int nr;
+    struct early_pci_bus *bus;
+    unsigned int max = 0, adjust = 0, last_end;
+
+    if ( !assign_busses )
+        return;
+
+    INIT_LIST_HEAD(&early_buses_list);
+    for ( nr = 0; nr < 256; nr++ )
+    {
+        if ( !pci_device_detect (0, nr, 0, 0) )
+            continue;
+        if ( find_bus(nr) )
+            continue;
+        /* Host bridges do not have any parent devices ! */
+        bus = early_alloc_pci_bus(NULL, nr);
+        if ( !bus )
+            goto out;
+        bus->start = nr;
+        bus->primary = 0;   /* Points to host, which is zero */
+        max = pci_scan_bus(bus);
+        list_add_tail(&bus->next, &early_buses_list);
+    }
+    /* Walk all the devices, figure out what will be the _new_
+     * max if any. */
+    last_end = 0;
+    list_for_each_entry( bus, &early_buses_list, next )
+    {
+        unsigned int offset;
+        /* Oh no, the previous end bus number overlaps! */
+        if ( last_end > bus->start )
+        {
+            bus->new_start = last_end;
+            bus->new_end = bus->new_end + last_end;
+        }
+        last_end = bus->end;
+        offset = adjust_span(bus, 0 /* no offset ! */);
+        if (offset > adjust) {
+            adjust = offset;
+            last_end = bus->new_end;
+        }
+        adjust_primary(bus, 0);
+    }
+
+    print_devices();
+    if ( !adjust )
+    {
+        printk("No need to reassign busses.\n");
+        goto out;
+    }
+    printk("Re-assigning busses to make space for %d bus numbers.\n", adjust);
+
+    /* Walk all the bridges, disable forwarding */
+    /* Walk all bridges, reprogram with max (so new primary, secondary and such. */
+    list_for_each_entry( bus, &early_buses_list, next )
+    {
+        pci_disable_forwarding(bus);
+        pci_program_bridge(bus);
+    }
+    /* Walk all devices, re-enable serial, ehci with new bus number */
+    list_for_each_entry( bus, &early_buses_list, next )
+        update_console_devices(bus);
+
+    print_devices();
+out:
+    early_free_all();
+}
+
 void __hwdom_init setup_hwdom_pci_devices(
     struct domain *d, int (*handler)(u8 devfn, struct pci_dev *))
 {
diff --git a/xen/include/xen/pci.h b/xen/include/xen/pci.h
index 6ed29dd..ad09cce 100644
--- a/xen/include/xen/pci.h
+++ b/xen/include/xen/pci.h
@@ -111,6 +111,7 @@ struct pci_dev *pci_lock_domain_pdev(
 void setup_hwdom_pci_devices(struct domain *,
                             int (*)(u8 devfn, struct pci_dev *));
 int pci_release_devices(struct domain *d);
+void early_pci_reassign_busses(void);
 int pci_add_segment(u16 seg);
 const unsigned long *pci_get_ro_map(u16 seg);
 int pci_add_device(u16 seg, u8 bus, u8 devfn,
-- 
2.5.5


[-- Attachment #5: 0004-pci-assign-buses-Suspend-resume-the-console-device-a.patch --]
[-- Type: text/plain, Size: 7790 bytes --]

>From fa86138d42b1976b61485c886f1ba280dd23c29d Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Fri, 21 Feb 2014 11:43:51 -0500
Subject: [PATCH 4/5] pci/assign-buses: Suspend/resume the console device and
 update bus (v2).

When we suspend and resume the console devices we need the
proper bus number. Since we alter the bus numbers, the console
drivers must be updated with the new values; otherwise the resume
path might reprogram the wrong device.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/drivers/char/ehci-dbgp.c  | 24 +++++++++++++++++++++++-
 xen/drivers/char/ns16550.c    | 37 +++++++++++++++++++++++++++++++++++++
 xen/drivers/char/serial.c     | 17 +++++++++++++++++
 xen/drivers/passthrough/pci.c | 17 ++++++++++++++++-
 xen/include/xen/serial.h      |  7 +++++++
 5 files changed, 100 insertions(+), 2 deletions(-)

diff --git a/xen/drivers/char/ehci-dbgp.c b/xen/drivers/char/ehci-dbgp.c
index 3feeafe..3266307 100644
--- a/xen/drivers/char/ehci-dbgp.c
+++ b/xen/drivers/char/ehci-dbgp.c
@@ -1437,7 +1437,27 @@ static void ehci_dbgp_resume(struct serial_port *port)
     ehci_dbgp_setup_preirq(dbgp);
     ehci_dbgp_setup_postirq(dbgp);
 }
+static int __init ehci_dbgp_is_owner(struct serial_port *port, u8 bus, u8 devfn)
+{
+    struct ehci_dbgp *dbgp = port->uart;
 
+    if ( dbgp->bus == bus && dbgp->slot == PCI_SLOT(devfn) &&
+        dbgp->func == PCI_FUNC(devfn))
+        return 1;
+    return -ENODEV;
+}
+static int __init ehci_dbgp_update_bus(struct serial_port *port, u8 old_bus,
+                                       u8 devfn, u8 new_bus)
+{
+    struct ehci_dbgp *dbgp;
+
+    if ( ehci_dbgp_is_owner (port, old_bus, devfn) < 0 )
+        return -ENODEV;
+
+    dbgp = port->uart;
+    dbgp->bus = new_bus;
+    return 1;
+}
 static struct uart_driver __read_mostly ehci_dbgp_driver = {
     .init_preirq  = ehci_dbgp_init_preirq,
     .init_postirq = ehci_dbgp_init_postirq,
@@ -1447,7 +1467,9 @@ static struct uart_driver __read_mostly ehci_dbgp_driver = {
     .tx_ready     = ehci_dbgp_tx_ready,
     .putc         = ehci_dbgp_putc,
     .flush        = ehci_dbgp_flush,
-    .getc         = ehci_dbgp_getc
+    .getc         = ehci_dbgp_getc,
+    .is_owner     = ehci_dbgp_is_owner,
+    .update_bus   = ehci_dbgp_update_bus,
 };
 
 static struct ehci_dbgp ehci_dbgp = { .state = dbgp_unsafe, .phys_port = 1 };
diff --git a/xen/drivers/char/ns16550.c b/xen/drivers/char/ns16550.c
index b2b5f56..51b71ee 100644
--- a/xen/drivers/char/ns16550.c
+++ b/xen/drivers/char/ns16550.c
@@ -821,7 +821,40 @@ static const struct vuart_info *ns16550_vuart_info(struct serial_port *port)
     return &uart->vuart;
 }
 #endif
+#ifdef HAS_PCI
+static int __init ns16550_is_owner(struct serial_port *port, u8 bus, u8 devfn)
+{
+    struct ns16550 *uart = port->uart;
+
+    if ( uart->ps_bdf_enable )
+    {
+        if ( (bus == uart->ps_bdf[0]) && (uart->ps_bdf[1] == PCI_SLOT(devfn)) &&
+             (uart->ps_bdf[2] == PCI_FUNC(devfn)) )
+            return 1;
+    }
+    if ( uart->pb_bdf_enable )
+    {
+        if ( (bus == uart->pb_bdf[0]) && (uart->pb_bdf[1] == PCI_SLOT(devfn)) &&
+             (uart->pb_bdf[2] == PCI_FUNC(devfn)) )
+            return 1;
+    }
+    return -ENODEV;
+}
+static int __init ns16550_update_bus(struct serial_port *port, u8 old_bus,
+                                      u8 devfn, u8 new_bus)
+{
+    struct ns16550 *uart;
 
+    if ( ns16550_is_owner(port, old_bus, devfn ) < 0 )
+        return -ENODEV;
+    uart = port->uart;
+    if ( uart->ps_bdf_enable )
+        uart->ps_bdf[0]= new_bus;
+    if ( uart->pb_bdf_enable )
+        uart->pb_bdf[0] = new_bus;
+    return 1;
+}
+#endif
 static struct uart_driver __read_mostly ns16550_driver = {
     .init_preirq  = ns16550_init_preirq,
     .init_postirq = ns16550_init_postirq,
@@ -835,6 +868,10 @@ static struct uart_driver __read_mostly ns16550_driver = {
 #ifdef CONFIG_ARM
     .vuart_info   = ns16550_vuart_info,
 #endif
+#ifdef HAS_PCI
+    .is_owner     = ns16550_is_owner,
+    .update_bus   = ns16550_update_bus,
+#endif
 };
 
 static int __init parse_parity_char(int c)
diff --git a/xen/drivers/char/serial.c b/xen/drivers/char/serial.c
index c583a48..f7b8178 100644
--- a/xen/drivers/char/serial.c
+++ b/xen/drivers/char/serial.c
@@ -543,6 +543,23 @@ const struct vuart_info *serial_vuart_info(int idx)
     return NULL;
 }
 
+int __init serial_is_owner(u8 bus, u8 devfn)
+{
+    int i;
+    for ( i = 0; i < ARRAY_SIZE(com); i++ )
+        if ( com[i].driver->is_owner )
+            return com[i].driver->is_owner(&com[i], bus, devfn);
+
+    return 0;
+}
+int __init serial_update_bus(u8 old_bus, u8 devfn, u8 new_bus)
+{
+    int i;
+    for ( i = 0; i < ARRAY_SIZE(com); i++ )
+        if ( com[i].driver->update_bus )
+            return com[i].driver->update_bus(&com[i], old_bus, devfn, new_bus);
+    return 0;
+}
 void serial_suspend(void)
 {
     int i;
diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index 62b5f85..d7bdbd5 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1098,6 +1098,7 @@ static int __hwdom_init _setup_hwdom_pci_devices(struct pci_seg *pseg, void *arg
 }
 
 /* Move this to its own file */
+#include <xen/serial.h>
 #define DEBUG 1
 
 struct early_pci_bus;
@@ -1661,7 +1662,14 @@ static void __init update_console_devices(struct early_pci_bus *parent)
     {
         if ( dev->is_ehci || dev->is_serial || dev->is_bridge )
         {
-            ;/* TODO */
+            int rc = 0;
+            if ( serial_is_owner(parent->old_number , dev->devfn ) < 0 )
+                continue;
+            rc = serial_update_bus(parent->old_number, dev->devfn, parent->number);
+            if ( verbose )
+                printk("%02x:%02x.%u bus %x -> %x, rc=%d\n", parent->number,
+                       PCI_SLOT(dev->devfn), PCI_FUNC(dev->devfn),
+                       parent->old_number, parent->number, rc);
         }
     }
     list_for_each_entry( bus, &parent->children, next )
@@ -1722,6 +1730,10 @@ void __init early_pci_reassign_busses(void)
     }
     printk("Re-assigning busses to make space for %d bus numbers.\n", adjust);
 
+    /* Walk all the devices, disable serial and ehci */
+    if ( !verbose)
+        serial_suspend();
+
     /* Walk all the bridges, disable forwarding */
     /* Walk all bridges, reprogram with max (so new primary, secondary and such. */
     list_for_each_entry( bus, &early_buses_list, next )
@@ -1733,6 +1745,9 @@ void __init early_pci_reassign_busses(void)
     list_for_each_entry( bus, &early_buses_list, next )
         update_console_devices(bus);
 
+    if ( !verbose )
+        serial_resume();
+
     print_devices();
 out:
     early_free_all();
diff --git a/xen/include/xen/serial.h b/xen/include/xen/serial.h
index 1212a12..2ba9da7 100644
--- a/xen/include/xen/serial.h
+++ b/xen/include/xen/serial.h
@@ -87,6 +87,10 @@ struct uart_driver {
     void  (*stop_tx)(struct serial_port *);
     /* Get serial information */
     const struct vuart_info *(*vuart_info)(struct serial_port *);
+    /* Check if the BDF matches this device */
+    int (*is_owner)(struct serial_port *, u8 , u8);
+    /* Update its BDF due to bus number changing. devfn still same. */
+    int (*update_bus)(struct serial_port *, u8, u8, u8);
 };
 
 /* 'Serial handles' are composed from the following fields. */
@@ -140,6 +144,9 @@ int serial_irq(int idx);
 /* Retrieve basic UART information to emulate it (base address, size...) */
 const struct vuart_info* serial_vuart_info(int idx);
 
+int serial_is_owner(u8 bus, u8 devfn);
+int serial_update_bus(u8 old_bus, u8 devfn, u8 bus);
+
 /* Serial suspend/resume. */
 void serial_suspend(void);
 void serial_resume(void);
-- 
2.5.5


[-- Attachment #6: 0005-pci-assign-busses-Add-Mellenox.patch --]
[-- Type: text/plain, Size: 1237 bytes --]

>From 867032964e63165019487112b6317c6e75437bee Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Thu, 24 Apr 2014 20:29:50 -0400
Subject: [PATCH 5/5] pci/assign-busses: Add Mellanox

---
 xen/drivers/passthrough/pci.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/xen/drivers/passthrough/pci.c b/xen/drivers/passthrough/pci.c
index d7bdbd5..455aed5 100644
--- a/xen/drivers/passthrough/pci.c
+++ b/xen/drivers/passthrough/pci.c
@@ -1217,7 +1217,20 @@ static __init struct early_pci_dev *early_alloc_pci_dev(struct early_pci_bus *bu
                         break;
                     dev->is_sriov = 1;
                     dev->total_vfs = 8;
+                    break;
+            }
+            break;
+        case 0x15b3:
+            switch ( dev->device )
+            {
+                case 0x673c: /* InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] */
+                    if ( dev->is_sriov )
+                        break;
+                    dev->is_sriov = 1;
+                    dev->total_vfs = 64;
+                    break;
             }
+            break;
         default:
             break;
     }
-- 
2.5.5


[-- Attachment #7: Type: text/plain, Size: 126 bytes --]

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 19:51             ` Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest Konrad Rzeszutek Wilk
@ 2016-06-06  9:55               ` Jan Beulich
  2016-06-06 17:25                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2016-06-06  9:55 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: yang.zhang.wz, Tianyu Lan, Kevin Tian, sstabellini,
	Andrew Cooper, ian.jackson, xen-devel, Jun Nakajima,
	anthony.perard, Roger Pau Monne

>>> On 03.06.16 at 21:51, <konrad.wilk@oracle.com> wrote:
>>  For HVMLite, there is specifically no qemu, and we need something which
>> can function when we want PCI Passthrough to work.  I am quite confident
>> that the correct solution here is to have a basic host bridge/root port
>> implementation in Xen (as we already have 80% of this already), at which
>> point we don't need any qemu interaction for PCI Passthough at all, even
>> for HVM guests.
> 
> Could you expand on this a bit?
> 
> I am asking b/c some time ago I wrote in Xen code to construct a full view
> of the bridges->devices (and various in branching) so that I could renumber
> the bus values and its devices (expand them) on bridges. This was solely 
> done
> so that I could use SR-IOV devices on non-SR-IOV capable BIOSes.
> 
> I am wondering how much of the basic functionality (enumeration, keeping
> track, etc) could be worked in this 'basic host bridge/root port' 
> implementation
> idea of yours.

Keep in mind that your work was for the host's PCI, while here
we're talking about the guest's.

Jan




* Re: Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest
  2016-06-06  9:55               ` Jan Beulich
@ 2016-06-06 17:25                 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 86+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-06-06 17:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: yang.zhang.wz, Tianyu Lan, Kevin Tian, sstabellini,
	Andrew Cooper, ian.jackson, xen-devel, Jun Nakajima,
	anthony.perard, Roger Pau Monne

On Mon, Jun 06, 2016 at 03:55:06AM -0600, Jan Beulich wrote:
> >>> On 03.06.16 at 21:51, <konrad.wilk@oracle.com> wrote:
> >>  For HVMLite, there is specifically no qemu, and we need something which
> >> can function when we want PCI Passthrough to work.  I am quite confident
> >> that the correct solution here is to have a basic host bridge/root port
> >> implementation in Xen (as we already have 80% of this already), at which
> >> point we don't need any qemu interaction for PCI Passthough at all, even
> >> for HVM guests.
> > 
> > Could you expand on this a bit?
> > 
> > I am asking b/c some time ago I wrote in Xen code to construct a full view
> > of the bridges->devices (and various in branching) so that I could renumber
> > the bus values and its devices (expand them) on bridges. This was solely 
> > done
> > so that I could use SR-IOV devices on non-SR-IOV capable BIOSes.
> > 
> > I am wondering how much of the basic functionality (enumeration, keeping
> > track, etc) could be worked in this 'basic host bridge/root port' 
> > implementation
> > idea of yours.
> 
> Keep in mind that your work was for the host's PCI, while here
> we're talking about the guest's.

Right, but some of this accounting code could be added in Xen to help
with keeping track of physical vs virtual layouts.
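The accounting Konrad describes — building a view of the bridge hierarchy and renumbering buses to make room for extra (e.g. SR-IOV) bus numbers — can be sketched in miniature as follows. This is a hypothetical, heavily simplified model (a single child chain per bridge, invented struct and function names); the actual code in the attached patches tracks far more state:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified model: each bridge forwards a [secondary, subordinate]
 * window of bus numbers, and may need extra bus numbers behind it
 * (e.g. for SR-IOV virtual functions). */
struct bridge {
    int secondary;        /* first bus number behind this bridge */
    int subordinate;      /* last bus number behind this bridge */
    int extra_buses;      /* additional bus numbers to reserve */
    struct bridge *child; /* single child chain, for simplicity */
};

/* Renumber the chain depth-first, widening each bridge's window so
 * the extra buses fit; returns the first free bus number after it. */
static int renumber(struct bridge *b, int next_bus)
{
    if ( !b )
        return next_bus;
    b->secondary = next_bus;
    next_bus = renumber(b->child, next_bus + 1);
    next_bus += b->extra_buses;      /* reserve room for extra buses */
    b->subordinate = next_bus - 1;
    return next_bus;
}
```

For a root bridge with one child needing two extra buses, the walk assigns the child secondary bus 2 and subordinate 4, leaving buses 3-4 free for virtual functions behind it.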

> 
> Jan
> 



* Re: Discussion about virtual iommu support for Xen guest
  2016-06-03 17:14             ` Stefano Stabellini
@ 2016-06-07  5:14               ` Tian, Kevin
  2016-06-07  7:26                 ` Jan Beulich
  2016-06-07 10:07                 ` Stefano Stabellini
  0 siblings, 2 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-06-07  5:14 UTC (permalink / raw)
  To: Stefano Stabellini, Andrew Cooper
  Cc: Lan, Tianyu, yang.zhang.wz, xen-devel, jbeulich, ian.jackson,
	Dong, Eddie, Nakajima, Jun, anthony.perard, Roger Pau Monne

> From: Stefano Stabellini
> Sent: Saturday, June 04, 2016 1:15 AM
> 
> On Fri, 3 Jun 2016, Andrew Cooper wrote:
> > On 03/06/16 12:17, Tian, Kevin wrote:
> > >> Very sorry for the delay.
> > >>
> > >> There are multiple interacting issues here.  On the one side, it would
> > >> be useful if we could have a central point of coordination on
> > >> PVH/HVMLite work.  Roger - as the person who last did HVMLite work,
> > >> would you mind organising that?
> > >>
> > >> For the qemu/xen interaction, the current state is woeful and a tangled
> > >> mess.  I wish to ensure that we don't make any development decisions
> > >> which makes the situation worse.
> > >>
> > >> In your case, the two motivations are quite different I would recommend
> > >> dealing with them independently.
> > >>
> > >> IIRC, the issue with more than 255 cpus and interrupt remapping is that
> > >> you can only use x2apic mode with more than 255 cpus, and IOAPIC RTEs
> > >> can't be programmed to generate x2apic interrupts?  In principle, if you
> > >> don't have an IOAPIC, are there any other issues to be considered?  What
> > >> happens if you configure the LAPICs in x2apic mode, but have the IOAPIC
> > >> deliver xapic interrupts?
> > > The key is the APIC ID. There is no modification to existing PCI MSI and
> > > IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
> > > interrupt message containing 8bit APIC ID, which cannot address >255
> > > cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
> > > enable >255 cpus with x2apic mode.
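As a concrete illustration of the 8-bit limit described above, here is a simplified sketch of the two MSI address encodings (field layouts per the Intel SDM and the VT-d specification; this is illustrative code, not anything from the patches in this thread):

```c
#include <stdint.h>

/* Compatibility-format MSI address (Intel SDM): bits 19:12 carry the
 * destination APIC ID, so only IDs 0..255 can be encoded. */
static uint32_t msi_compat_addr(uint32_t dest_apic_id)
{
    return 0xfee00000u | ((dest_apic_id & 0xffu) << 12);
}

/* Remappable-format MSI address (VT-d spec): bit 4 selects the
 * remappable format, and bits 19:5 plus bit 2 carry a 16-bit handle
 * indexing the interrupt remapping table.  The IRTE itself holds a
 * 32-bit destination ID, which is how >255 CPUs become addressable. */
static uint32_t msi_remap_addr(uint16_t irte_handle)
{
    return 0xfee00000u
           | ((uint32_t)(irte_handle & 0x7fffu) << 5) /* handle[14:0] */
           | (((uint32_t)irte_handle >> 15) << 2)     /* handle[15] */
           | (1u << 4);                               /* remappable */
}
```

APIC ID 300, for example, silently truncates to 44 in the compatibility format, while any 16-bit remapping handle fits in the remappable format.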
> >
> > Thanks for clarifying.
> >
> > >
> > > If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC cannot
> > > deliver interrupts to all cpus in the system if #cpu > 255.
> >
> > Ok.  So not ideal (and we certainly want to address it), but this isn't
> > a complete show stopper for a guest.
> >
> > >> On the other side of things, what is IGD passthrough going to look like
> > >> in Skylake?  Is there any device-model interaction required (i.e. the
> > >> opregion), or will it work as a completely standalone device?  What are
> > >> your plans with the interaction of virtual graphics and shared virtual
> > >> memory?
> > >>
> > > The plan is to use a so-called universal pass-through driver in the guest
> > > which only accesses standard PCI resource (w/o opregion, PCH/MCH, etc.)
> >
> > This is fantastic news.
> >
> > >
> > > ----
> > > Here is a brief of potential usages relying on vIOMMU:
> > >
> > > a) enable >255 vcpus on Xeon Phi, as the initial purpose of this thread.
> > > It requires interrupt remapping capability present on vIOMMU;
> > >
> > > b) support guest SVM (Shared Virtual Memory), which relies on the
> > > 1st level translation table capability (GVA->GPA) on vIOMMU. pIOMMU
> > > needs to enable both 1st level and 2nd level translation in nested
> > > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough is
> > > the main usage today (to support OpenCL 2.0 SVM feature). In the
> > > future SVM might be used by other I/O devices too;
> > >
> > > c) support VFIO-based user space driver (e.g. DPDK) in the guest,
> > > which relies on the 2nd level translation capability (IOVA->GPA) on
> > > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> > > vIOMMU 2nd level by replacing GPA with HPA (becomes IOVA->HPA);
> >
> > All of these look like interesting things to do.  I know there is a lot
> > of interest for b).
> >
> > As a quick aside, does Xen currently boot on a Phi?  Last time I looked
> > at the Phi manual, I would expect Xen to crash on boot because of MCXSR
> > differences from more-common x86 hardware.

Tianyu can correct me on the details. Xen can boot on Xeon Phi; however,
we need a hacky patch in the guest Linux kernel to disable the dependency
check around interrupt remapping, otherwise the guest kernel fails to boot.

Now we're suffering from a performance issue. While that analysis is
ongoing, could you elaborate on the limitation you see with a 64-vcpu
guest? It would be helpful to know whether we are hunting the same
problem or not...

> >
> > >
> > > ----
> > > And below is my thought viability of implementing vIOMMU in Qemu:
> > >
> > > a) enable >255 vcpus:
> > >
> > > 	o Enable Q35 in Qemu-Xen;
> > > 	o Add interrupt remapping in Qemu vIOMMU;
> > > 	o Virtual interrupt injection in hypervisor needs to know virtual
> > > interrupt remapping (IR) structure, since IR is behind vIOAPIC/vMSI,
> > > which requires new hypervisor interfaces as Andrew pointed out:
> > > 		* either for hypervisor to query IR from Qemu which is not
> > > good;
> > > 		* or for Qemu to register IR info to hypervisor which means
> > > partial IR knowledge implemented in hypervisor (then why not putting
> > > whole IR emulation in Xen?)
> > >
> > > b) support SVM
> > >
> > > 	o Enable Q35 in Qemu-Xen;
> > > 	o Add 1st level translation capability in Qemu vIOMMU;
> > > 	o VT-d context entry points to guest 1st level translation table
> > > which is nest-translated by 2nd level translation table so vIOMMU
> > > structure can be directly linked. It means:
> > > 		* Xen IOMMU driver enables nested mode;
> > > 		* Introduce a new hypercall so Qemu vIOMMU can register
> > > GPA root of guest 1st level translation table which is then written
> > > to context entry in pIOMMU;
> > >
> > > c) support VFIO-based user space driver
> > >
> > > 	o Enable Q35 in Qemu-Xen;
> > > 	o Leverage existing 2nd level translation implementation in Qemu
> > > vIOMMU;
> > > 	o Change Xen IOMMU to support (IOVA->HPA) translation which
> > > means decouple current logic from P2M layer (only for GPA->HPA);
> > > 	o As a means of shadowing approach, Xen IOMMU driver needs to
> > > know both (IOVA->GPA) and (GPA->HPA) info to update (IOVA->HPA)
> > > mapping in case of any one is changed. So new interface is required
> > > for Qemu vIOMMU to propagate (IOVA->GPA) info into Xen hypervisor
> > > which may need to be further cached.
> > >
> > > ----
> > >
> > > After writing down above detail, looks it's clear that putting vIOMMU
> > > in Qemu is not a clean design for a) and c). For b) the hypervisor
> > > change is not that hacky, but for it alone seems not strong to pursue
> > > Qemu path. Seems we may have to go with hypervisor based
> > > approach...
> > >
> > > Anyway stop here. With above background let's see whether others
> > > may have a better thought how to accelerate TTM of those usages
> > > in Xen. Xen once is a leading hypervisor for many new features, but
> > > recently it is not sustaining. If above usages can be enabled decoupled
> > > from HVMlite/virtual_root_port effort, then we can have staged plan
> > > to move faster (first for HVM, later for HVMLite). :-)
> >
> > I dislike that we are in this situation, but I glad to see that I am not
> > the only one who thinks that the current situation is unsustainable.
> >
> > The problem is things were hacked up in the past to assume qemu could
> > deal with everything like this.  Later, performance sucked sufficiently
> > that bit of qemu were moved back up into the hypervisor, which is why
> > the vIOAPIC is currently located there.  The result is a complete
> > tangled ratsnest.
> >
> >
> > Xen has 3 common uses for qemu, which are:
> > 1) Emulation of legacy devices
> > 2) PCI Passthrough
> > 3) PV backends

4) Mediated passthrough as for XenGT

> >
> > 3 isn't really relevant here.  For 1, we are basically just using Qemu
> > to provide an LPC implementation (with some populated slots for
> > disk/network devices).
> >
> > I think it would be far cleaner to re-engineer the current Xen/qemu
> > interaction to more closely resemble real hardware, including
> > considering having multiple vIOAPICs/vIOMMUs/etc when architecturally
> > appropriate.  I expect that it would be a far cleaner interface to use
> > and extend.  I also realise that this isn't a simple task I am
> > suggesting, but I don't see any other viable way out.

Could you give some examples of why the current Xen/Qemu interface is not
good, and why moving the root port into Xen would make it far cleaner?

> >
> > Other issues in the mix is support for multiple device emulators, in
> > which case Xen is already performing first-level redirection of MMIO
> > requests.
> >
> > For HVMLite, there is specifically no qemu, and we need something which
> > can function when we want PCI Passthrough to work.  I am quite confident
> > that the correct solution here is to have a basic host bridge/root port
> > implementation in Xen (as we already have 80% of this already), at which
> > point we don't need any qemu interaction for PCI Passthough at all, even
> > for HVM guests.
> >
> > >From this perspective, it would make sense to have emulators map IOVAs,
> > not GPAs.  We already have mapcache_invalidate infrastructure to flush
> > mappings as they are changed by the guest.
> >
> >
> > For the HVMLite side of things, my key concern is not to try and do any
> > development which we realistically expect to have to undo/change.  As
> > you said yourself, we are struggling to sustain, and really aren't
> > helping ourselves by doing lots of work, and subsequently redoing it
> > when it doesn't work; PVH is the most obvious recent example here.
> >
> > If others agree, I think that it is well worth making some concrete
> > plans for improvements in this area for Xen 4.8.  I think the only
> > viable way forward is to try and get out of the current hole we are in.
> >
> > Thoughts?  (especially Stefano/Anthony)
> 
> Going back to the beginning of the discussion, whether we should enable
> Q35 in QEMU or not is a distraction: of course we should enable it, but
> even with Q35 in QEMU, it might not be a good idea to place the vIOMMU
> emulation there.
> 
> I agree with Andrew that the current model is flawed: the boundary
> between Xen and QEMU emulation is not clear enough. In addition using
> QEMU on Xen introduces latency and security issues (the work to run QEMU
> as non-root and using unprivileged interfaces is not complete yet).
> 
> I think of QEMU as a provider of complex, high level emulators, such as
> the e1000, Cirrus VGA, SCSI controllers, etc., which don't necessarily
> need to be fast.

Earlier you said Qemu imposes security issues, yet here you say Qemu can
still provide complex emulators. Does that mean the security issue in Qemu
comes only from the parts which should be moved into Xen? Could you
elaborate on that?

> 
> For core x86 components, such as the vIOMMU, for performance and ease of
> integration with the rest of the hypervisor, it seems to me that Xen
> would is the right place to implement them. As a comparison, I would
> certainly argue in favor of implementing vSMMU in the hypervisor on ARM.
> 

After some internal discussion with Tianyu/Eddie, I realized my earlier
description was incomplete: it took only passthrough devices into
consideration (as you saw, it was mainly about the interaction between
vIOMMU and pIOMMU). However, from the guest's point of view, all devices
should be covered by the vIOMMU to match today's physical platforms,
including:

1) DMA-capable virtual devices in Qemu, in Dom0 user space
2) PV devices, in Dom0 kernel space
3) Passthrough devices, in the Xen hypervisor

A natural implementation is to place each vIOMMU alongside where the
DMA is emulated, which leads to a possible design with multiple vIOMMUs
in multiple layers:

1) vIOMMU in Dom0 user
2) vIOMMU in Dom0 kernel
3) vIOMMU in Xen hypervisor

Of course we may come up with an option to still keep all vIOMMUs in the
Xen hypervisor, which however means every vDMA operation in Qemu or a
BE driver needs to issue a Xen hypercall to get the vIOMMU's approval. I
haven't thought through how big/complex this issue is, but from a quick
look it does appear to be a limitation.
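To make the cost concern concrete: if all vIOMMUs lived in Xen, something like the following lookup would have to be consulted for every emulated DMA — i.e. one hypercall round trip per transaction. This is an entirely hypothetical sketch (no such interface exists today); it just models the per-DMA "approval" step against the guest's second-level translation:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device IOVA -> GPA map, standing in for the
 * guest's second-level vIOMMU translation tables. */
struct viommu_map {
    uint64_t iova;   /* address programmed by the guest driver */
    uint64_t gpa;    /* guest-physical address it maps to */
    uint64_t size;
};

/* Model of the approval step: every emulated DMA would need one such
 * lookup (a hypercall round trip if the vIOMMU lives in Xen). */
static int viommu_translate(const struct viommu_map *map, size_t n,
                            uint64_t iova, uint64_t *gpa)
{
    size_t i;

    for ( i = 0; i < n; i++ )
        if ( iova >= map[i].iova && iova < map[i].iova + map[i].size )
        {
            *gpa = map[i].gpa + (iova - map[i].iova);
            return 0;
        }
    return -EPERM;  /* DMA outside any mapping must be blocked */
}
```

A device model would call this on every DMA descriptor it processes, which is exactly the overhead the paragraph above worries about.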

So we'll likely have to consider the presence of multiple vIOMMUs, each
in a different layer, regardless of whether the root complex lives in
Qemu or Xen. There need to be some interface abstractions to allow the
vIOMMUs and the root complex to communicate with each other. Well, not
an easy task...

In the meantime, we are confirming internally whether, as an intermediate
step, we can compose a virtual platform with only some devices covered
by the vIOMMU while others are not. No physical platform looks like
that, and it may break guest OS assumptions, but it could make a staged
plan viable if such a configuration is possible.

> 
> However the issue is the PCI root-complex, which today is in QEMU. I
> don't think it is a particularly bad fit there, although I can also see
> the benefit of moving it to the hypervisor. It is relevant here if it
> causes problems to implementing vIOMMU in Xen.
> 
> From a software engineering perspective, it would be nice to keep the
> two projects (implementing vIOMMU and moving the PCI root complex to
> Xen) separate, especially given that the PCI root complex one is without
> an owner and a timeline. I don't think it is fair to ask Tianyu or Kevin
> to move the PCI root complex from QEMU to Xen in order to enable vIOMMU
> on Xen systems.
> 
> If vIOMMU in Xen and root complex in QEMU cannot be made to work
> together, then we are at an impasse. I cannot see any good way forward
> unless somebody volunteers to start working on the PCI root complex
> project soon to provide Kevin and Tianyu with a branch to based their
> work upon.
> 


* Re: Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest)
  2016-06-03 16:56                 ` Stefano Stabellini
@ 2016-06-07  5:48                   ` Tian, Kevin
  0 siblings, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-06-07  5:48 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Lan, Tianyu, yang.zhang.wz, xen-devel, Nakajima, Jun,
	Andrew Cooper, Dong, Eddie, jbeulich, anthony.perard,
	boris.ostrovsky, ian.jackson, Roger Pau Monne

> From: Stefano Stabellini
> Sent: Saturday, June 04, 2016 12:57 AM
> 
> > > >
> > > > How stable is the HVMLite today? Is it already in production usage?
> > > >
> > > > Wonder whether you have some detail thought how full PCI root complex
> > > > emulation will be done in Xen (including how to interact with Qemu)...
> > >
> > > I haven't looked into much detail regarding all this, since as I said, it's
> > > still a little bit far away in the PVH/HVMlite roadmap, we have more
> > > pressing issues to solve before getting to the point of implementing
> > > PCI-passthrough. I expect Xen is going to intercept all PCI accesses and is
> > > then going to forward them to the ioreq servers that have been registered
> > > for that specific config space, but this of course needs much more thought
> > > and a proper design document.
> > >
> > > > As I just wrote in another mail, if we just hit for HVM first, will it work if
> > > > we implement vIOMMU in Xen but still relies on Qemu root complex to
> > > > report to the guest?
> > >
> > > This seems quite inefficient IMHO (but I don't know that much about all this
> > > vIOMMU stuff). If you implement vIOMMU inside of Xen, but the PCI root
> > > complex is inside of Qemu aren't you going to perform quite a lot of jumps
> > > between Xen and QEMU just to access the vIOMMU?
> > >
> > > I expect something like:
> > >
> > > Xen traps PCI access -> QEMU -> Xen vIOMMU implementation
> > >
> >
> > I hope the role of Qemu is just to report vIOMMU related information, such
> > as DMAR, etc. so guest can enumerate the presence of vIOMMU, while
> > the actual emulation is done by vIOMMU in hypervisor w/o going through
> > Qemu.
> >
> > However just realized even for above purpose, there's still some interaction
> > required between Qemu and Xen vIOMMU, e.g. register base of vIOMMU and
> > devices behind vIOMMU are reported thru ACPI DRHD which means Xen vIOMMU
> > needs to know the configuration in Qemu which might be dirty to define such
> > interfaces between Qemu and hypervisor. :/
> 
> PCI accesses don't need to be particularly fast, they should not be on
> the hot path.
> 
> How bad this interface between QEMU and vIOMMU in Xen would look like?
> Can we make a short list of basic operations that we would need to
> support to get a clearer idea?

Below is a quick sketch of the basic operations between a vIOMMU and a
PCI root complex, if they are not put together (derived from the VT-d
spec, section 8, "BIOS Considerations"):

1) The vIOMMU reports its presence, including capabilities, to the
root complex:
	- interrupt remapping, ATS, etc.
	- types of devices (virtual, PV, passthrough)... (???TBD)

2) The root complex notifies the vIOMMU about:
	- the base of the vIOMMU registers (DRHD)
	- the devices attached to this vIOMMU (DRHD)
		* with dynamic updates due to hotplug or PCI resource rebalancing

3) Additionally, as I mentioned in another thread, Qemu needs to query
the vIOMMU whether a virtual DMA should be blocked, if the vIOMMU for
virtual devices is also put in Xen.

Other BIOS structures (ATS, RMRR, hotplug, etc.) are optional, so I
haven't thought carefully about them yet. They may require additional
interactions.
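The DRHD reporting in step 2) is well defined by the VT-d specification's ACPI DMAR table; a simplified C rendering of the remapping-unit entry (field names per the spec; the surrounding DMAR table header and the device-scope path entries are omitted) looks like:

```c
#include <stdint.h>

/* DRHD (DMA Remapping Hardware Unit Definition) structure from the
 * ACPI DMAR table, as defined by the VT-d specification.  This is how
 * firmware (or here, the root-complex side) tells the OS where a
 * remapping unit's registers live and which PCI segment it covers;
 * device-scope entries listing the attached devices follow the fixed
 * part. */
struct acpi_drhd {
    uint16_t type;          /* 0 = DRHD */
    uint16_t length;        /* total length incl. device scopes */
    uint8_t  flags;         /* bit 0: INCLUDE_PCI_ALL */
    uint8_t  reserved;
    uint16_t segment;       /* PCI segment number */
    uint64_t address;       /* register base address of this unit */
} __attribute__((packed));

/* Fixed part of one device-scope entry; PCI path (device/function
 * pairs) entries follow. */
struct acpi_dev_scope {
    uint8_t  type;          /* 1 = PCI endpoint, 2 = PCI bridge, ... */
    uint8_t  length;
    uint16_t reserved;
    uint8_t  enumeration_id;
    uint8_t  start_bus;
} __attribute__((packed));
```

So an interface between Qemu's root complex and a Xen vIOMMU would essentially have to carry the contents of these structures (register base, segment, device scopes) and keep them in sync across hotplug or resource rebalancing.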

Thanks
Kevin


* Re: Discussion about virtual iommu support for Xen guest
  2016-06-07  5:14               ` Tian, Kevin
@ 2016-06-07  7:26                 ` Jan Beulich
  2016-06-07 10:07                 ` Stefano Stabellini
  1 sibling, 0 replies; 86+ messages in thread
From: Jan Beulich @ 2016-06-07  7:26 UTC (permalink / raw)
  To: Kevin Tian
  Cc: yang.zhang.wz, Tianyu Lan, Stefano Stabellini, Eddie Dong,
	Andrew Cooper, ian.jackson, xen-devel, Jun Nakajima,
	anthony.perard, Roger Pau Monne

>>> On 07.06.16 at 07:14, <kevin.tian@intel.com> wrote:
> After some internal discussion with Tianyu/Eddie, I realized my earlier
> description is incomplete, as it takes only passthrough devices into
> consideration (as you saw, it's mainly around the interaction between vIOMMU
> and pIOMMU). However, from the guest's p.o.v., all the devices should be
> covered by vIOMMU to match today's physical platform, including:
> 
> 1) DMA-capable virtual device in Qemu, in Dom0 user space
> 2) PV devices, in Dom0 kernel space
> 3) Passthrough devices, in Xen hypervisor
> 
> A natural implementation is to have a vIOMMU together with where the
> DMA is emulated, which ends up as a possible design with multiple vIOMMUs
> in multiple layers:
> 
> 1) vIOMMU in Dom0 user
> 2) vIOMMU in Dom0 kernel
> 3) vIOMMU in Xen hypervisor
> 
> Of course we may come up with an option to still keep all vIOMMUs in the
> Xen hypervisor, which however means every vDMA operation in Qemu or a
> BE driver needs to issue a Xen hypercall to get the vIOMMU's approval. I
> haven't thought through how big/complex this issue is, but from a quick
> look it does seem to be a limitation.
> 
> So, likely we'll have to consider presence of multiple vIOMMUs, each in 
> different layers, regardless of root-complex in Qemu or Xen. There
> needs to be some interface abstractions to allow vIOMMU/root-complex
> communicating with each other. Well, not an easy task...

Right - for DMA-capable devices emulated in qemu, it would seem
natural to have them go through a vIOMMU in qemu. Whether
that vIOMMU implementation would have to consult the hypervisor
(or perhaps even just be a wrapper around various hypercalls, i.e.
backed by an implementation in the hypervisor) would be an
independent aspect.

Otoh, having vIOMMU in only qemu, and requiring round trips
through qemu for any of the hypervisor's internal purposes doesn't
seem like a good idea to me.

And finally I don't see the relevance of PV devices here: Their
nature makes it that they could easily be left completely independent
of a vIOMMU (as long as there's no plan to bypass a virtualization
level in the nested case, i.e. a PV frontend in L2 with a backend
living in L0).

Jan



* Re: Discussion about virtual iommu support for Xen guest
  2016-06-07  5:14               ` Tian, Kevin
  2016-06-07  7:26                 ` Jan Beulich
@ 2016-06-07 10:07                 ` Stefano Stabellini
  2016-06-08  8:11                   ` Tian, Kevin
  1 sibling, 1 reply; 86+ messages in thread
From: Stefano Stabellini @ 2016-06-07 10:07 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Lan, Tianyu, yang.zhang.wz, Stefano Stabellini, jbeulich,
	Andrew Cooper, Dong, Eddie, xen-devel, Nakajima, Jun,
	anthony.perard, ian.jackson, Roger Pau Monne

On Tue, 7 Jun 2016, Tian, Kevin wrote:
> > I think of QEMU as a provider of complex, high level emulators, such as
> > the e1000, Cirrus VGA, SCSI controllers, etc., which don't necessarily
> > need to be fast.
> 
> Earlier you said Qemu imposes security issues. Here you said Qemu can 
> still provide complex emulators. Does it mean that security issue in Qemu
> simply comes from the part which should be moved into Xen? Any
> elaboration here?

It imposes security issues because, although it doesn't have to run as
root anymore, QEMU still has to run with fully privileged libxc and
xenstore handles. In other words, a malicious guest breaking into QEMU
would have relatively easy access to the whole host. There is a design
to solve this, see Ian Jackson's talk at FOSDEM this year:

https://fosdem.org/2016/schedule/event/virt_iaas_qemu_for_xen_secure_by_default/
https://fosdem.org/2016/schedule/event/virt_iaas_qemu_for_xen_secure_by_default/attachments/other/921/export/events/attachments/virt_iaas_qemu_for_xen_secure_by_default/other/921/talk.txt

Other solutions to solve this issue are stubdoms or simply using PV
guests and HVMlite guests only.

Irrespective of the problematic security angle, which is unsolved, I
think of QEMU as a provider of complex emulators, as I wrote above.

Does it make sense?


* Re: Discussion about virtual iommu support for Xen guest
  2016-06-07 10:07                 ` Stefano Stabellini
@ 2016-06-08  8:11                   ` Tian, Kevin
  2016-06-26 13:42                     ` Lan, Tianyu
  0 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-06-08  8:11 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Lan, Tianyu, yang.zhang.wz, xen-devel, jbeulich, Andrew Cooper,
	Dong, Eddie, Nakajima, Jun, anthony.perard, ian.jackson,
	Roger Pau Monne

> From: Stefano Stabellini [mailto:sstabellini@kernel.org]
> Sent: Tuesday, June 07, 2016 6:07 PM
> 
> On Tue, 7 Jun 2016, Tian, Kevin wrote:
> > > I think of QEMU as a provider of complex, high level emulators, such as
> > > the e1000, Cirrus VGA, SCSI controllers, etc., which don't necessarily
> > > need to be fast.
> >
> > Earlier you said Qemu imposes security issues. Here you said Qemu can
> > still provide complex emulators. Does it mean that security issue in Qemu
> > simply comes from the part which should be moved into Xen? Any
> > elaboration here?
> 
> It imposes security issues because, although it doesn't have to run as
> root anymore, QEMU still has to run with fully privileged libxc and
> xenstore handles. In other words, a malicious guest breaking into QEMU
> would have relatively easy access to the whole host. There is a design
> to solve this, see Ian Jackson's talk at FOSDEM this year:
> 
> https://fosdem.org/2016/schedule/event/virt_iaas_qemu_for_xen_secure_by_default/
> https://fosdem.org/2016/schedule/event/virt_iaas_qemu_for_xen_secure_by_default/a
> ttachments/other/921/export/events/attachments/virt_iaas_qemu_for_xen_secure_by_
> default/other/921/talk.txt
> 
> Other solutions to solve this issue are stubdoms or simply using PV
> guests and HVMlite guests only.
> 
> Irrespective of the problematic security angle, which is unsolved, I
> think of QEMU as a provider of complex emulators, as I wrote above.
> 
> Does it make sense?

It makes sense... I thought you used this security issue as an argument
against placing the vIOMMU in Qemu, which made me a bit confused earlier. :-)

We are still evaluating the feasibility of a staging plan, e.g. first
implementing some vIOMMU features w/o a dependency on the root-complex in Xen
(HVM only) and then later enabling the full vIOMMU feature w/ the root-complex
in Xen (covering HVMLite). If we can reuse most code between the two stages
while shortening time-to-market by half (e.g. from 2yr to 1yr), it's still
worth pursuing. Will report back soon once the idea is consolidated...

Thanks
Kevin


* Re: Discussion about virtual iommu support for Xen guest
  2016-06-08  8:11                   ` Tian, Kevin
@ 2016-06-26 13:42                     ` Lan, Tianyu
  2016-06-29  3:04                       ` Tian, Kevin
  0 siblings, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-06-26 13:42 UTC (permalink / raw)
  To: Tian, Kevin, Stefano Stabellini
  Cc: yang.zhang.wz, xen-devel, jbeulich, Andrew Cooper, Dong, Eddie,
	Nakajima, Jun, anthony.perard, ian.jackson, Roger Pau Monne

On 6/8/2016 4:11 PM, Tian, Kevin wrote:
> It makes sense... I thought you used this security issue against
> placing vIOMMU in Qemu, which made me a bit confused earlier. :-)
>
> We are still thinking feasibility of some staging plan, e.g. first
> implementing some vIOMMU features w/o dependency on root-complex in
> Xen (HVM only) and then later enabling full vIOMMU feature w/
> root-complex in Xen (covering HVMLite). If we can reuse most code
> between two stages while shorten time-to-market by half (e.g. from
> 2yr to 1yr), it's still worthy of pursuing. will report back soon
> once the idea is consolidated...
>
> Thanks Kevin


After discussion with Kevin, we have drafted a staging plan for implementing
vIOMMU in Xen based on the Qemu host bridge. Both virtual devices and
passthrough devices use one vIOMMU in Xen. Your comments are much
appreciated.

1. Enable Q35 support in hvmloader.
In the real world, VT-d support starts from Q35, and an OS may assume
that VT-d only exists on Q35 or newer platforms. Q35 support therefore
seems necessary for vIOMMU support.

Regardless of whether the Q35 host bridge is in Qemu or the Xen hypervisor,
hvmloader needs to be compatible with Q35 and build Q35 ACPI tables.

Qemu already has Q35 emulation, so the hvmloader job can start with Qemu.
When the host bridge in Xen is ready, these changes can also be reused.

2. Implement vIOMMU in Xen based on the Qemu host bridge.
Add a new device type "Xen iommu" in Qemu as a wrapper of the vIOMMU
hypercalls to communicate with the Xen vIOMMU.

It's in charge of:
1) Querying vIOMMU capabilities (e.g. interrupt remapping, DMA translation,
SVM and so on)
2) Creating the vIOMMU with a predefined base address for the IOMMU unit regs
3) Notifying hvmloader to populate the related content in the ACPI DMAR
table (add vIOMMU info to struct hvm_info_table)
4) Dealing with DMA translation requests from virtual devices and returning
the translated address
5) Attaching/detaching hotplug devices to/from the vIOMMU


New hypercalls for the vIOMMU, which are also necessary when the host bridge
is in Xen:
1) Query vIOMMU capabilities
2) Create the vIOMMU (IOMMU unit reg base as a parameter)
3) Translate a virtual device's DMA
4) Attach/detach hotplug devices to/from the vIOMMU


All IOMMU emulation will be done in Xen:
1) DMA translation
2) Interrupt remapping
3) Shared Virtual Memory (SVM)


* Re: Discussion about virtual iommu support for Xen guest
  2016-06-26 13:42                     ` Lan, Tianyu
@ 2016-06-29  3:04                       ` Tian, Kevin
  2016-07-05 13:37                         ` Lan, Tianyu
  0 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-06-29  3:04 UTC (permalink / raw)
  To: Lan, Tianyu, Stefano Stabellini
  Cc: yang.zhang.wz, xen-devel, jbeulich, Andrew Cooper, Dong, Eddie,
	Nakajima, Jun, anthony.perard, ian.jackson, Roger Pau Monne

> From: Lan, Tianyu
> Sent: Sunday, June 26, 2016 9:43 PM
> 
> On 6/8/2016 4:11 PM, Tian, Kevin wrote:
> > It makes sense... I thought you used this security issue against
> > placing vIOMMU in Qemu, which made me a bit confused earlier. :-)
> >
> > We are still thinking feasibility of some staging plan, e.g. first
> > implementing some vIOMMU features w/o dependency on root-complex in
> > Xen (HVM only) and then later enabling full vIOMMU feature w/
> > root-complex in Xen (covering HVMLite). If we can reuse most code
> > between two stages while shorten time-to-market by half (e.g. from
> > 2yr to 1yr), it's still worthy of pursuing. will report back soon
> > once the idea is consolidated...
> >
> > Thanks Kevin
> 
> 
> After discussion with Kevin, we draft a staging plan of implementing
> vIOMMU in Xen based on Qemu host bridge. Both virtual devices and
> passthough devices use one vIOMMU in Xen. Your comments are very
> appreciated.

The rationale here is to separate the BIOS structures from the actual vIOMMU
emulation. The vIOMMU will always be emulated in the Xen hypervisor, regardless
of where the Q35 emulation is done or whether the guest is HVM or HVMLite. The
staging plan is more about the BIOS structure reporting, which is Q35 specific.
For now we first target Qemu Q35 emulation, with a set of vIOMMU ops introduced
as Tianyu listed below to handle the interaction between Qemu and Xen. Later,
when Xen Q35 emulation is ready, the reporting can be done in Xen.

The main limitation of this model is the DMA emulation of Qemu virtual
devices, which needs to query the Xen vIOMMU for every virtual DMA. That is
probably fine for virtual devices, which are normally not used for
performance-critical purposes. Also there may be a chance to cache some
translations within Qemu, e.g. through ATS (may not be worth it though...).

> 
> 1. Enable Q35 support in the hvmloader.
> In the real world, VTD support starts from Q35 and OS may have such
> assumption that VTD only exists on the Q35 or newer platform.
> Q35 support seems necessary for vIOMMU support.
> 
> In regardless of Q35 host bridge in the Qemu or Xen hypervisor,
> hvmloader needs to be compatible with Q35 and build Q35 ACPI tables.
> 
> Qemu already has Q35 emulation and so the hvmloader job can start with
> Qemu. When host bridge in Xen is ready, these changes also can be reused.
> 
> 2. Implement vIOMMU in Xen based on Qemu host bridge.
> Add a new device type "Xen iommu" in the Qemu as a wrapper of vIOMMU
> hypercalls to communicate with Xen vIOMMU.
> 
> It's in charge of:
> 1) Query vIOMMU capability(E,G interrupt remapping, DMA translation, SVM
> and so on)
> 2) Create vIOMMU with predefined base address of IOMMU unit regs
> 3) Notify hvmloader to populate related content in the ACPI DMAR
> table.(Add vIOMMU info to struct hvm_info_table)
> 4) Deal with DMA translation request of virtual devices and return
> back translated address.
> 5) Attach/detach hotplug device from vIOMMU
> 
> 
> New hypercalls for vIOMMU that are also necessary when host bridge in Xen.
> 1) Query vIOMMU capability
> 2) Create vIOMMU(IOMMU unit reg base as params)
> 3) Virtual device's DMA translation
> 4) Attach/detach hotplug device from VIOMMU

We don't need 4). Hotplug devices are automatically handled by a vIOMMU
with the INCLUDE_ALL flag set (which should be the case if we only have one
vIOMMU in Xen), so we don't need to further notify Xen's vIOMMU of this event.

And once we have Xen Q35 emulation in place, possibly only 3) is required
then.

> 
> 
> All IOMMU emulations will be done in Xen
> 1) DMA translation
> 2) Interrupt remapping
> 3) Shared Virtual Memory (SVM)

Please let us know your thoughts. If no one has explicit objection based 
on above rough idea, we'll go to write the high level design doc for more
detail discussion.

Thanks
Kevin


* Re: Discussion about virtual iommu support for Xen guest
  2016-06-29  3:04                       ` Tian, Kevin
@ 2016-07-05 13:37                         ` Lan, Tianyu
  2016-07-05 13:57                           ` Jan Beulich
  0 siblings, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-07-05 13:37 UTC (permalink / raw)
  To: Tian, Kevin, Stefano Stabellini, Andrew Cooper, jbeulich
  Cc: yang.zhang.wz, xen-devel, Dong, Eddie, ian.jackson, Nakajima,
	Jun, anthony.perard, Roger Pau Monne

Hi Stefano, Andrew and Jan:
Could you give us more guidance here to move virtual iommu
development forward? Thanks.

On 6/29/2016 11:04 AM, Tian, Kevin wrote:
>> From: Lan, Tianyu
>> Sent: Sunday, June 26, 2016 9:43 PM
>>
>> On 6/8/2016 4:11 PM, Tian, Kevin wrote:
>>> It makes sense... I thought you used this security issue against
>>> placing vIOMMU in Qemu, which made me a bit confused earlier. :-)
>>>
>>> We are still thinking feasibility of some staging plan, e.g. first
>>> implementing some vIOMMU features w/o dependency on root-complex in
>>> Xen (HVM only) and then later enabling full vIOMMU feature w/
>>> root-complex in Xen (covering HVMLite). If we can reuse most code
>>> between two stages while shorten time-to-market by half (e.g. from
>>> 2yr to 1yr), it's still worthy of pursuing. will report back soon
>>> once the idea is consolidated...
>>>
>>> Thanks Kevin
>>
>>
>> After discussion with Kevin, we draft a staging plan of implementing
>> vIOMMU in Xen based on Qemu host bridge. Both virtual devices and
>> passthough devices use one vIOMMU in Xen. Your comments are very
>> appreciated.
>
> The rationale here is to separate BIOS structures from actual vIOMMU
> emulation. vIOMMU will be always emulated in Xen hypervisor, regardless of
> where Q35 emulation is done or whether it's HVM or HVMLite. The staging
> plan is more for the BIOS structure reporting which is Q35 specific. For now
> we first target Qemu Q35 emulation, with a set of vIOMMU ops introduced
> as Tianyu listed below to help interact between Qemu and Xen. Later when
> Xen Q35 emulation is ready, the reporting can be done in Xen.
>
> The main limitation of this model is on DMA emulation of Qemu virtual
> devices, which needs to query Xen vIOMMU for every virtual DMA. It is
> possibly fine for virtual devices which are normally not for performance
> critical usages. Also there may be some chance to cache some translations
> within Qemu like thru ATS (may not worthy of it though...).
>
>>
>> 1. Enable Q35 support in the hvmloader.
>> In the real world, VTD support starts from Q35 and OS may have such
>> assumption that VTD only exists on the Q35 or newer platform.
>> Q35 support seems necessary for vIOMMU support.
>>
>> In regardless of Q35 host bridge in the Qemu or Xen hypervisor,
>> hvmloader needs to be compatible with Q35 and build Q35 ACPI tables.
>>
>> Qemu already has Q35 emulation and so the hvmloader job can start with
>> Qemu. When host bridge in Xen is ready, these changes also can be reused.
>>
>> 2. Implement vIOMMU in Xen based on Qemu host bridge.
>> Add a new device type "Xen iommu" in the Qemu as a wrapper of vIOMMU
>> hypercalls to communicate with Xen vIOMMU.
>>
>> It's in charge of:
>> 1) Query vIOMMU capability(E,G interrupt remapping, DMA translation, SVM
>> and so on)
>> 2) Create vIOMMU with predefined base address of IOMMU unit regs
>> 3) Notify hvmloader to populate related content in the ACPI DMAR
>> table.(Add vIOMMU info to struct hvm_info_table)
>> 4) Deal with DMA translation request of virtual devices and return
>> back translated address.
>> 5) Attach/detach hotplug device from vIOMMU
>>
>>
>> New hypercalls for vIOMMU that are also necessary when host bridge in Xen.
>> 1) Query vIOMMU capability
>> 2) Create vIOMMU(IOMMU unit reg base as params)
>> 3) Virtual device's DMA translation
>> 4) Attach/detach hotplug device from VIOMMU
>
> We don't need 4). Hotplug device is automatically handled by the vIOMMU
> with INCLUDE_ALL flag set (which should be the case if we only have one
> vIOMMU in Xen). We don't need further notify this event to Xen vIOMMU.
>
> And once we have Xen Q35 emulation in place, possibly only 3) is required
> then.
>
>>
>>
>> All IOMMU emulations will be done in Xen
>> 1) DMA translation
>> 2) Interrupt remapping
>> 3) Shared Virtual Memory (SVM)
>
> Please let us know your thoughts. If no one has explicit objection based
> on above rough idea, we'll go to write the high level design doc for more
> detail discussion.
>
> Thanks
> Kevin
>


* Re: Discussion about virtual iommu support for Xen guest
  2016-07-05 13:37                         ` Lan, Tianyu
@ 2016-07-05 13:57                           ` Jan Beulich
  2016-07-05 14:19                             ` Lan, Tianyu
                                               ` (3 more replies)
  0 siblings, 4 replies; 86+ messages in thread
From: Jan Beulich @ 2016-07-05 13:57 UTC (permalink / raw)
  To: Kevin Tian, Tianyu Lan
  Cc: yang.zhang.wz, Stefano Stabellini, Andrew Cooper, ian.jackson,
	xen-devel, Jun Nakajima, anthony.perard, Roger Pau Monne

>>> On 05.07.16 at 15:37, <tianyu.lan@intel.com> wrote:
> Hi Stefano, Andrew and Jan:
> Could you give us more guides here to move forward virtual iommu 
> development? Thanks.

Due to ...

> On 6/29/2016 11:04 AM, Tian, Kevin wrote:
>> Please let us know your thoughts. If no one has explicit objection based
>> on above rough idea, we'll go to write the high level design doc for more
>> detail discussion.

... this I actually expected we'd get to see something, rather than
our input being waited for.

Jan



* Re: Discussion about virtual iommu support for Xen guest
  2016-07-05 13:57                           ` Jan Beulich
@ 2016-07-05 14:19                             ` Lan, Tianyu
  2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
                                               ` (2 subsequent siblings)
  3 siblings, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-07-05 14:19 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian
  Cc: yang.zhang.wz, Stefano Stabellini, Andrew Cooper, ian.jackson,
	xen-devel, Jun Nakajima, anthony.perard, Roger Pau Monne



On 7/5/2016 9:57 PM, Jan Beulich wrote:
>>>> On 05.07.16 at 15:37, <tianyu.lan@intel.com> wrote:
>> Hi Stefano, Andrew and Jan:
>> Could you give us more guides here to move forward virtual iommu
>> development? Thanks.
>
> Due to ...
>
>> On 6/29/2016 11:04 AM, Tian, Kevin wrote:
>>> Please let us know your thoughts. If no one has explicit objection based
>>> on above rough idea, we'll go to write the high level design doc for more
>>> detail discussion.
>
> ... this I actually expected we'd get to see something, rather than
> our input being waited for.

OK, I get it. Since there was no response, we wanted to double-check
that we are on the right track.



* Re: Discussion about virtual iommu support for Xen guest
  2016-05-27  8:19   ` Lan Tianyu
  2016-06-02 15:03     ` Lan, Tianyu
@ 2016-08-02 15:15     ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-08-02 15:15 UTC (permalink / raw)
  To: Andrew Cooper, jbeulich, sstabellini, ian.jackson, xen-devel,
	kevin.tian, Dong, Eddie, Nakajima, Jun, yang.zhang.wz,
	anthony.perard

On 5/27/2016 4:19 PM, Lan Tianyu wrote:
>> > As for the individual issue of 288vcpu support, there are already issues
>> > with 64vcpu guests at the moment. While it is certainly fine to remove
>> > the hard limit at 255 vcpus, there is a lot of other work required to
>> > even get 128vcpu guests stable.
>
> Could you give some points to these issues? We are enabling more vcpus
> support and it can boot up 255 vcpus without IR support basically. It's
> very helpful to learn about known issues.

Hi Andrew:
We are designing vIOMMU support for Xen. Increasing the vcpu count
from 128 to 255 can also be implemented in parallel, since it doesn't
need vIOMMU support. From your previous comment, "there is a lot of other
work required to even get 128vcpu guests stable", you have some concerns
about the stability of 128-vcpu guests. I wonder what we need to do before
starting the work of increasing the vcpu count from 128 to 255?


* Xen virtual IOMMU high level design doc
  2016-07-05 13:57                           ` Jan Beulich
  2016-07-05 14:19                             ` Lan, Tianyu
@ 2016-08-17 12:05                             ` Lan, Tianyu
  2016-08-17 12:42                               ` Paul Durrant
                                                 ` (3 more replies)
  2016-10-18 14:14                             ` Xen virtual IOMMU high level design doc V2 Lan Tianyu
  2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
  3 siblings, 4 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-08-17 12:05 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian, Andrew Cooper, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

Hi All:
      The following is our Xen vIOMMU high level design for detailed
discussion. Please have a look; your comments are much appreciated.
This design doesn't cover the changes needed when the root port is moved
into the hypervisor. We may design that later.


Content:
===============================================================================
1. Motivation of vIOMMU
	1.1 Enable more than 255 vcpus
	1.2 Support VFIO-based user space driver
	1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
	2.1 2nd level translation overview
	2.2 Interrupt remapping overview
3. Xen hypervisor
	3.1 New vIOMMU hypercall interface
	3.2 2nd level translation
	3.3 Interrupt remapping
	3.4 1st level translation
	3.5 Implementation consideration
4. Qemu
	4.1 Qemu vIOMMU framework
	4.2 Dummy xen-vIOMMU driver
	4.3 Q35 vs. i440x
	4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===============================================================================
1.1 Enable more than 255 vcpu support
HPC virtualization requires support for more than 255 vcpus in a single VM
to meet parallel computing requirements. Supporting more than 255 vcpus
requires the interrupt remapping capability to be present on the vIOMMU, so
that interrupts can be delivered to vcpus with APIC IDs above 255; otherwise
a Linux guest fails to boot with more than 255 vcpus.


1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the 2nd level translation capability (IOVA->GPA) on the
vIOMMU. The pIOMMU's 2nd level becomes a shadow structure of the
vIOMMU's, to isolate DMA requests initiated by the user space driver.


1.3 Support guest SVM (Shared Virtual Memory)
It relies on the 1st level translation table capability (GVA->GPA) on
vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
is the main usage today (to support OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.

2. Xen vIOMMU Architecture
================================================================================

* The vIOMMU will be inside the Xen hypervisor for the following reasons:
	1) Avoid round trips between Qemu and the Xen hypervisor
	2) Ease of integration with the rest of the hypervisor
	3) HVMlite/PVH doesn't use Qemu
* A dummy xen-vIOMMU in Qemu acts as a wrapper of the new hypercalls to
create/destroy the vIOMMU in the hypervisor and deal with virtual PCI
devices' 2nd level translation.

2.1 2nd level translation overview
For virtual PCI devices, the dummy xen-vIOMMU does the translation in
Qemu via the new hypercall.

For physical PCI devices, the vIOMMU in the hypervisor shadows the IO page
table from IOVA->GPA to IOVA->HPA and loads that page table into the
physical IOMMU.

The following diagram shows the 2nd level translation architecture.
+---------------------------------------------------------+
|Qemu                                +----------------+   |
|                                    |     Virtual    |   |
|                                    |   PCI device   |   |
|                                    |                |   |
|                                    +----------------+   |
|                                            |DMA         |
|                                            V            |
|  +--------------------+   Request  +----------------+   |
|  |                    +<-----------+                |   |
|  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
|  |                    +----------->+                |   |
|  +---------+----------+            +-------+--------+   |
|            |                               |            |
|            |Hypercall                      |            |
+--------------------------------------------+------------+
|Hypervisor  |                               |            |
|            |                               |            |
|            v                               |            |
|     +------+------+                        |            |
|     |   vIOMMU    |                        |            |
|     +------+------+                        |            |
|            |                               |            |
|            v                               |            |
|     +------+------+                        |            |
|     | IOMMU driver|                        |            |
|     +------+------+                        |            |
|            |                               |            |
+--------------------------------------------+------------+
|HW          v                               V            |
|     +------+------+                 +-------------+     |
|     |   IOMMU     +---------------->+  Memory     |     |
|     +------+------+                 +-------------+     |
|            ^                                            |
|            |                                            |
|     +------+------+                                     |
|     | PCI Device  |                                     |
|     +-------------+                                     |
+---------------------------------------------------------+

2.2 Interrupt remapping overview
Interrupts from virtual devices and physical devices will be delivered
to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU will remap interrupts
during this procedure.

+---------------------------------------------------+
|Qemu                       |VM                     |
|                           | +----------------+    |
|                           | |  Device driver |    |
|                           | +--------+-------+    |
|                           |          ^            |
|       +----------------+  | +--------+-------+    |
|       | Virtual device |  | |  IRQ subsystem |    |
|       +-------+--------+  | +--------+-------+    |
|               |           |          ^            |
|               |           |          |            |
+---------------------------+-----------------------+
|hypervisor     |                      | VIRQ       |
|               |            +---------+--------+   |
|               |            |      vLAPIC      |   |
|               |            +---------+--------+   |
|               |                      ^            |
|               |                      |            |
|               |            +---------+--------+   |
|               |            |      vIOMMU      |   |
|               |            +---------+--------+   |
|               |                      ^            |
|               |                      |            |
|               |            +---------+--------+   |
|               |            |   vIOAPIC/vMSI   |   |
|               |            +----+----+--------+   |
|               |                 ^    ^            |
|               +-----------------+    |            |
|                                      |            |
+---------------------------------------------------+
HW                                     |IRQ
                               +-------------------+
                               |   PCI Device      |
                               +-------------------+





3 Xen hypervisor
==========================================================================

3.1 New hypercall XEN_SYSCTL_viommu_op
1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.

struct xen_sysctl_viommu_op {
	u32 cmd;
	u32 domid;
	union {
		struct {
			u32 capabilities;
		} query_capabilities;
		struct {
			u32 capabilities;
			u64 base_address;
		} create_iommu;
		struct {
			u8  bus;
			u8  devfn;
			u64 iova;
			u64 translated_addr;
			u64 addr_mask; /* Translation page size */
			IOMMUAccessFlags permission;
		} 2th_level_translation;
	};
};

typedef enum {
	IOMMU_NONE = 0,
	IOMMU_RO   = 1,
	IOMMU_WO   = 2,
	IOMMU_RW   = 3,
} IOMMUAccessFlags;


Definition of VIOMMU subops:
#define XEN_SYSCTL_viommu_query_capability		0
#define XEN_SYSCTL_viommu_create			1
#define XEN_SYSCTL_viommu_destroy			2
#define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3

Definition of VIOMMU capabilities
#define XEN_VIOMMU_CAPABILITY_1nd_level_translation	(1 << 0)
#define XEN_VIOMMU_CAPABILITY_2nd_level_translation	(1 << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)


2) Design for subops
- XEN_SYSCTL_viommu_query_capability
       Query vIOMMU capabilities (1st/2nd level translation and interrupt
remapping).

- XEN_SYSCTL_viommu_create
      Create the vIOMMU in the Xen hypervisor with dom_id, capabilities and
register base address as parameters.

- XEN_SYSCTL_viommu_destroy
      Destroy the vIOMMU in the Xen hypervisor with dom_id as parameter.

- XEN_SYSCTL_viommu_dma_translation_for_vpdev
      Translate an IOVA to a GPA for the specified virtual PCI device. The
caller passes dom_id, the device's BDF and the IOVA; Xen returns the
translated GPA, address mask and access permission.
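As an illustration, a caller could fill the proposed parameter structure as
sketched below. This is a minimal model, not the final interface: the union
member is named l2_translation here (a C identifier cannot start with a
digit), a named union member `u` is used, and the hypercall itself is not
performed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { IOMMU_NONE = 0, IOMMU_RO = 1, IOMMU_WO = 2, IOMMU_RW = 3 } IOMMUAccessFlags;

#define XEN_SYSCTL_viommu_dma_translation_for_vpdev 3

/* Mirror of the proposed hypercall parameter layout (illustrative). */
struct xen_sysctl_viommu_op {
    uint32_t cmd;
    uint32_t domid;
    union {
        struct {
            uint32_t capabilities;
        } query_capabilities;
        struct {
            uint32_t capabilities;
            uint64_t base_address;
        } create_iommu;
        struct {
            uint8_t  bus;
            uint8_t  devfn;
            uint64_t iova;             /* input address to translate */
            uint64_t translated_addr;  /* filled in by Xen */
            uint64_t addr_mask;        /* translation page size */
            IOMMUAccessFlags permission;
        } l2_translation;
    } u;
};

/* Build an l2 translation request for one virtual PCI device. */
static struct xen_sysctl_viommu_op make_l2_query(uint32_t domid, uint8_t bus,
                                                 uint8_t devfn, uint64_t iova)
{
    struct xen_sysctl_viommu_op op;
    memset(&op, 0, sizeof(op));
    op.cmd = XEN_SYSCTL_viommu_dma_translation_for_vpdev;
    op.domid = domid;
    op.u.l2_translation.bus = bus;
    op.u.l2_translation.devfn = devfn;
    op.u.l2_translation.iova = iova;
    return op;
}
```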


3.2 2nd level translation
1) For virtual PCI devices
The dummy xen-vIOMMU in Qemu translates an IOVA to the target GPA via the
new hypercall when a DMA operation happens.

2) For physical PCI devices
DMA operations go through the physical IOMMU directly, so an IO page table
for IOVA->HPA must be loaded into the physical IOMMU. When the guest
updates the Second-level Page-table Pointer field, it provides an IO page
table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level translation
table, translate GPA->HPA, and point the Second-level Page-table Pointer
in the physical IOMMU's context entry at the shadow page table (IOVA->HPA).

Currently, all PCI devices in the same hvm domain share one IO page table
(GPA->HPA) in Xen's physical IOMMU driver. To support the vIOMMU's 2nd
level translation, the IOMMU driver needs to support multiple address
spaces per device entry: use the existing IO page table (GPA->HPA) by
default and switch to the shadow IO page table (IOVA->HPA) when 2nd level
translation is enabled. These changes will not affect the current P2M
logic.
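The GPA->HPA rewriting step at the heart of shadowing can be sketched as
follows. This is a toy model: `p2m_lookup` stands in for Xen's real p2m
lookup, and a page-table entry is reduced to a bare guest physical address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12
#define INVALID_ADDR (~0ULL)

/* Toy p2m: maps a guest frame number to a host frame number. */
static uint64_t p2m_lookup(const uint64_t *p2m, size_t nr, uint64_t gfn)
{
    return gfn < nr ? p2m[gfn] : INVALID_ADDR;
}

/* Shadow one guest 2nd-level entry: the guest table maps IOVA->GPA, and
 * the shadow table loaded into the physical IOMMU must map IOVA->HPA, so
 * each GPA the guest wrote is rewritten to the corresponding HPA. */
static uint64_t shadow_l2_entry(const uint64_t *p2m, size_t nr,
                                uint64_t guest_entry_gpa)
{
    uint64_t gfn = guest_entry_gpa >> PAGE_SHIFT;
    uint64_t hfn = p2m_lookup(p2m, nr, gfn);
    if (hfn == INVALID_ADDR)
        return INVALID_ADDR;
    /* Keep the low (in-page) bits; swap the frame number. */
    return (hfn << PAGE_SHIFT) | (guest_entry_gpa & ((1ULL << PAGE_SHIFT) - 1));
}
```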

3.3 Interrupt remapping
Interrupts from virtual and physical devices will be delivered to the
vlapic via vIOAPIC and vMSI. Interrupt remapping hooks need to be added in
vmsi_deliver() and ioapic_deliver() to find the target vlapic according to
the interrupt remapping table.
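The lookup those hooks would perform can be sketched as below. This is
illustrative only: `struct irte` is a simplified stand-in for the VT-d
interrupt remapping table entry, not the real bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy interrupt remapping table entry (simplified, not the VT-d format). */
struct irte {
    uint32_t dest_id; /* target (v)APIC ID */
    uint8_t  vector;
    uint8_t  present;
};

/* The hook sketched for vmsi_deliver()/ioapic_deliver(): instead of using
 * the destination and vector encoded in the MSI data/address, resolve them
 * through the remapping table by index. */
static int remap_interrupt(const struct irte *irt, size_t nr, uint16_t index,
                           uint32_t *dest_id, uint8_t *vector)
{
    if (index >= nr || !irt[index].present)
        return -1; /* would raise a remapping fault */
    *dest_id = irt[index].dest_id;
    *vector  = irt[index].vector;
    return 0;
}
```

Note that dest_id is 32 bits wide, which is what allows addressing vcpus
with APIC IDs above 255.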


3.4 1st level translation
When nested translation is enabled, any address generated by first-level
translation is used as the input address for nesting with second-level
translation. The physical IOMMU needs to enable both 1st and 2nd level
translation in nested translation mode (GVA->GPA->HPA) for the passthrough
device.

The VT-d context entry points to the guest's 1st level translation table,
which will be nest-translated by the 2nd level translation table, so it
can be linked directly into the context entry of the physical IOMMU.

To enable 1st level translation in a VM:
1) The Xen IOMMU driver enables nested translation mode.
2) The GPA root of the guest's 1st level translation table is written into
the context entry of the physical IOMMU.

All handling is in the hypervisor; no interaction with Qemu is needed.
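The nested lookup described above can be expressed as a composition of the
two stages. The translation functions below are toy stand-ins, not real
page-table walks; only the data flow (GVA->GPA->HPA) is the point.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 1st-level walk: GVA -> GPA (here, a fixed linear offset). */
static uint64_t l1_translate(uint64_t gva) { return gva + 0x10000; }

/* Toy 2nd-level walk: GPA -> HPA (here, another fixed offset). */
static uint64_t l2_translate(uint64_t gpa) { return gpa + 0x200000; }

/* In nested mode, the output of the 1st-level walk is itself the input of
 * the 2nd-level walk, including the addresses of the 1st-level tables. */
static uint64_t nested_translate(uint64_t gva)
{
    return l2_translate(l1_translate(gva)); /* GVA -> GPA -> HPA */
}
```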


3.5 Implementation consideration
The Linux Intel IOMMU driver fails to load without 2nd level translation
support, even if interrupt remapping and 1st level translation are
available. This means 2nd level translation must be enabled before the
other functions.


4 Qemu
==============================================================================
4.1 Qemu vIOMMU framework
Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d or
AMD IOMMU) and report it in the guest ACPI table. On the Xen side, a dummy
xen-vIOMMU wrapper is required to connect to the actual vIOMMU in Xen,
especially for 2nd level translation of virtual PCI devices, because
virtual PCI devices are emulated in Qemu. Qemu's vIOMMU framework provides
a callback to handle 2nd level translation when DMA operations of virtual
PCI devices happen.


4.2 Dummy xen-vIOMMU driver
1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
and Shared Virtual Memory) via hypercall.

2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with the
DRHD register base address and desired capabilities as parameters. Destroy
the vIOMMU when the VM is shut down.

3) Virtual PCI device 2nd level translation
Qemu already provides a DMA translation hook, which is called when DMA
translation for a virtual PCI device happens. The dummy xen-vIOMMU passes
the device BDF and IOVA to the Xen hypervisor via the new iommu hypercall
and gets back the translated GPA.
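A sketch of that hook is below. `xc_viommu_translate` is a hypothetical
stub for the new hypercall (here it pretends the vIOMMU identity-maps IOVAs
inside a 1MiB window); the real QEMU callback would fill in an
IOMMUTLBEntry from the hypercall's outputs instead.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified result record; fields follow the l2_translation parameters. */
struct viommu_xlat { uint64_t translated_addr, addr_mask; int perm; };

/* Stub for XEN_SYSCTL_viommu_dma_translation_for_vpdev (assumption: the
 * real call would go through a new libxc wrapper). */
static int xc_viommu_translate(uint8_t bus, uint8_t devfn, uint64_t iova,
                               struct viommu_xlat *out)
{
    (void)bus; (void)devfn;
    if (iova >= (1ULL << 20))
        return -1;                  /* no mapping: DMA fault */
    out->translated_addr = iova;    /* IOVA -> GPA */
    out->addr_mask = 0xfff;         /* 4KiB translation granule */
    out->perm = 3;                  /* IOMMU_RW */
    return 0;
}

/* The dummy xen-vIOMMU's DMA translation hook: query Xen, then combine
 * the translated frame with the in-page offset from the original IOVA. */
static uint64_t xen_viommu_translate(uint8_t bus, uint8_t devfn, uint64_t iova)
{
    struct viommu_xlat x;
    if (xc_viommu_translate(bus, devfn, iova, &x))
        return ~0ULL;
    return (x.translated_addr & ~x.addr_mask) | (iova & x.addr_mask);
}
```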


4.3 Q35 vs i440x
VT-d was introduced with the Q35 chipset. The previous concern was that
IOMMU drivers assume VT-d only exists on Q35 and newer chipsets, so we
would have to enable Q35 first.

We consulted Linux/Windows IOMMU driver experts and learned that these
drivers make no such assumption. So we may skip the Q35 implementation and
emulate the vIOMMU on the i440x chipset. KVM already has vIOMMU support
with virtual PCI device DMA translation and interrupt remapping. We are
using KVM to experiment with adding a vIOMMU on i440x and to test
Linux/Windows guests. We will report back when we have results.


4.4 Report vIOMMU to hvmloader
Hvmloader is in charge of building ACPI tables for the guest OS, and the
OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to know
whether the vIOMMU is enabled and what its capabilities are, in order to
prepare the ACPI DMAR table for the guest OS.

There are three ways to do that.
1) Extend struct hvm_info_table with new fields to pass vIOMMU information
to hvmloader. But this requires adding a new xc interface for Qemu to use
struct hvm_info_table.

2) Pass vIOMMU information to hvmloader via Xenstore

3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
This solution is already present in the vNVDIMM design(4.3.1
Building Guest ACPI Tables
http://www.gossamer-threads.com/lists/xen/devel/439766).

The third option seems clearest: hvmloader doesn't need to deal with
vIOMMU details and just passes the DMAR table through to the guest OS. All
vIOMMU-specific handling stays in the dummy xen-vIOMMU driver.




_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel


* Re: Xen virtual IOMMU high level design doc
  2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
@ 2016-08-17 12:42                               ` Paul Durrant
  2016-08-18  2:57                                 ` Lan, Tianyu
  2016-08-25 11:11                               ` Jan Beulich
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 86+ messages in thread
From: Paul Durrant @ 2016-08-17 12:42 UTC (permalink / raw)
  To: Lan, Tianyu, Jan Beulich, Kevin Tian, Andrew Cooper,
	yang.zhang.wz, Jun Nakajima, Stefano Stabellini
  Cc: Anthony Perard, Ian Jackson, xuquan8, xen-devel, Roger Pau Monne

> -----Original Message-----
> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
> Lan, Tianyu
> Sent: 17 August 2016 13:06
> To: Jan Beulich; Kevin Tian; Andrew Cooper; yang.zhang.wz@gmail.com; Jun
> Nakajima; Stefano Stabellini
> Cc: Anthony Perard; xuquan8@huawei.com; xen-
> devel@lists.xensource.com; Ian Jackson; Roger Pau Monne
> Subject: [Xen-devel] Xen virtual IOMMU high level design doc
> 
> Hi All:
>       The following is our Xen vIOMMU high level design for detail
> discussion. Please have a look. Very appreciate for your comments.
> This design doesn't cover changes when root port is moved to hypervisor.
> We may design it later.
> 
> 
> Content:
> ==========================================================
> =====================
> 1. Motivation of vIOMMU
> 	1.1 Enable more than 255 vcpus
> 	1.2 Support VFIO-based user space driver
> 	1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 	2.1 2th level translation overview
> 	2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 	3.1 New vIOMMU hypercall interface

Would it not have been better to build on the previously discussed (and mostly agreed) PV IOMMU interface? (See https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01428.html). An RFC implementation series was also posted (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01441.html).

  Paul

* Re: Xen virtual IOMMU high level design doc
  2016-08-17 12:42                               ` Paul Durrant
@ 2016-08-18  2:57                                 ` Lan, Tianyu
  0 siblings, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-08-18  2:57 UTC (permalink / raw)
  To: Paul Durrant, Jan Beulich, Kevin Tian, Andrew Cooper,
	yang.zhang.wz, Jun Nakajima, Stefano Stabellini
  Cc: Anthony Perard, Ian Jackson, xuquan8, xen-devel, Roger Pau Monne



On 8/17/2016 8:42 PM, Paul Durrant wrote:
>> -----Original Message-----
>> From: Xen-devel [mailto:xen-devel-bounces@lists.xen.org] On Behalf Of
>> Lan, Tianyu
>> Sent: 17 August 2016 13:06
>> To: Jan Beulich; Kevin Tian; Andrew Cooper; yang.zhang.wz@gmail.com; Jun
>> Nakajima; Stefano Stabellini
>> Cc: Anthony Perard; xuquan8@huawei.com; xen-
>> devel@lists.xensource.com; Ian Jackson; Roger Pau Monne
>> Subject: [Xen-devel] Xen virtual IOMMU high level design doc
>>
>> Hi All:
>>       The following is our Xen vIOMMU high level design for detail
>> discussion. Please have a look. Very appreciate for your comments.
>> This design doesn't cover changes when root port is moved to hypervisor.
>> We may design it later.
>>
>>
>> Content:
>> ==========================================================
>> =====================
>> 1. Motivation of vIOMMU
>> 	1.1 Enable more than 255 vcpus
>> 	1.2 Support VFIO-based user space driver
>> 	1.3 Support guest Shared Virtual Memory (SVM)
>> 2. Xen vIOMMU Architecture
>> 	2.1 2th level translation overview
>> 	2.2 Interrupt remapping overview
>> 3. Xen hypervisor
>> 	3.1 New vIOMMU hypercall interface
>
> Would it not have been better to build on the previously discussed (and mostly agreed) PV IOMMU interface? (See https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01428.html). An RFC implementation series was also posted (https://lists.xenproject.org/archives/html/xen-devel/2016-02/msg01441.html).
>
>   Paul
>

Hi Paul:
Thanks for your input. I glanced at the patchset; it introduces the
hypercall "HYPERVISOR_iommu_op", which currently only works for the PV
IOMMU. We may abstract it so it works for both the PV and virtual IOMMU.




* Re: Xen virtual IOMMU high level design doc
  2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
  2016-08-17 12:42                               ` Paul Durrant
@ 2016-08-25 11:11                               ` Jan Beulich
  2016-08-31  8:39                                 ` Lan Tianyu
  2016-09-15 14:22                               ` Lan, Tianyu
  2016-11-23 18:19                               ` Edgar E. Iglesias
  3 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2016-08-25 11:11 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, Jun Nakajima, anthony.perard, xen-devel,
	Roger Pau Monne

>>> On 17.08.16 at 14:05, <tianyu.lan@intel.com> wrote:
> 1 Motivation for Xen vIOMMU
> ============================================================================
> ===
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires more than 255 vcpus support in a single VM
> to meet parallel computing requirement. More than 255 vcpus support
> requires interrupt remapping capability present on vIOMMU to deliver
> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
> vcpus if interrupt remapping is absent.

I continue to question this as a valid motivation at this point in
time, for the reasons Andrew has been explaining.

> 2. Xen vIOMMU Architecture
> ============================================================================
> ====
> 
> * vIOMMU will be inside Xen hypervisor for following factors
> 	1) Avoid round trips between Qemu and Xen hypervisor
> 	2) Ease of integration with the rest of the hypervisor
> 	3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
> level translation.

How does the create/destroy part of this match up with 3) right
ahead of it?

> 3 Xen hypervisor
> ==========================================================================
> 
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> 
> struct xen_sysctl_viommu_op {
> 	u32 cmd;
> 	u32 domid;
> 	union {
> 		struct {
> 			u32 capabilities;
> 		} query_capabilities;
> 		struct {
> 			u32 capabilities;
> 			u64 base_address;
> 		} create_iommu;
> 		struct {
> 			u8  bus;
> 			u8  devfn;

Please can we avoid introducing any new interfaces without segment/
domain value, even if for now it'll be always zero?

> 			u64 iova;
> 			u64 translated_addr;
> 			u64 addr_mask; /* Translation page size */
> 			IOMMUAccessFlags permisson;		
> 		} 2th_level_translation;

I suppose "translated_addr" is an output here, but for the following
fields this already isn't clear. Please add IN and OUT annotations for
clarity.

Also, may I suggest to name this "l2_translation"? (But there are
other implementation specific things to be considered here, which
I guess don't belong into a design doc discussion.)

> };
> 
> typedef enum {
> 	IOMMU_NONE = 0,
> 	IOMMU_RO   = 1,
> 	IOMMU_WO   = 2,
> 	IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> 
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability		0
> #define XEN_SYSCTL_viommu_create			1
> #define XEN_SYSCTL_viommu_destroy			2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_1nd_level_translation	(1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation	(1 << 1)

l1 and l2 respectively again, please.

> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> according interrupt remapping table. The following diagram shows the logic.

Missing diagram or stale sentence?

> 3.5 Implementation consideration
> Linux Intel IOMMU driver will fail to be loaded without 2th level
> translation support even if interrupt remapping and 1th level
> translation are available. This means it's needed to enable 2th level
> translation first before other functions.

Is there a reason for this? I.e. do they unconditionally need that
functionality?

Jan



* Re: Xen virtual IOMMU high level design doc
  2016-08-25 11:11                               ` Jan Beulich
@ 2016-08-31  8:39                                 ` Lan Tianyu
  2016-08-31 12:02                                   ` Jan Beulich
  0 siblings, 1 reply; 86+ messages in thread
From: Lan Tianyu @ 2016-08-31  8:39 UTC (permalink / raw)
  To: Jan Beulich
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, Jun Nakajima, anthony.perard, xen-devel,
	Roger Pau Monne

Hi Jan:
	Sorry for the late response. Thanks a lot for your comments.

On 2016年08月25日 19:11, Jan Beulich wrote:
>>>> On 17.08.16 at 14:05, <tianyu.lan@intel.com> wrote:
>> 1 Motivation for Xen vIOMMU
>> ============================================================================
>> ===
>> 1.1 Enable more than 255 vcpu support
>> HPC virtualization requires more than 255 vcpus support in a single VM
>> to meet parallel computing requirement. More than 255 vcpus support
>> requires interrupt remapping capability present on vIOMMU to deliver
>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
>> vcpus if interrupt remapping is absent.
> 
> I continue to question this as a valid motivation at this point in
> time, for the reasons Andrew has been explaining.

If we want to support Linux guests with >255 vcpus, interrupt remapping
is necessary.

The Linux commit introducing x2apic and IR mode says that IR is a
pre-requisite for enabling x2apic mode in the CPU:
https://lwn.net/Articles/289881/

We are not yet sure about the behavior of other OSes. We may watch
Windows guest behavior on KVM later; there is still a bug when running a
Windows guest with the IR function on KVM.


> 
>> 2. Xen vIOMMU Architecture
>> ============================================================================
>> ====
>>
>> * vIOMMU will be inside Xen hypervisor for following factors
>> 	1) Avoid round trips between Qemu and Xen hypervisor
>> 	2) Ease of integration with the rest of the hypervisor
>> 	3) HVMlite/PVH doesn't use Qemu
>> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
>> /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
>> level translation.
> 
> How does the create/destroy part of this match up with 3) right
> ahead of it?

The create/destroy hypercalls will work for both hvm and hvmlite. Hvmlite
is assumed to have a toolstack (e.g. libxl) which can call the new
hypercalls to create or destroy the virtual IOMMU in the hypervisor.

> 
>> 3 Xen hypervisor
>> ==========================================================================
>>
>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
>>
>> struct xen_sysctl_viommu_op {
>> 	u32 cmd;
>> 	u32 domid;
>> 	union {
>> 		struct {
>> 			u32 capabilities;
>> 		} query_capabilities;
>> 		struct {
>> 			u32 capabilities;
>> 			u64 base_address;
>> 		} create_iommu;
>> 		struct {
>> 			u8  bus;
>> 			u8  devfn;
> 
> Please can we avoid introducing any new interfaces without segment/
> domain value, even if for now it'll be always zero?

Sure. Will add segment field.

> 
>> 			u64 iova;
>> 			u64 translated_addr;
>> 			u64 addr_mask; /* Translation page size */
>> 			IOMMUAccessFlags permisson;		
>> 		} 2th_level_translation;
> 
> I suppose "translated_addr" is an output here, but for the following
> fields this already isn't clear. Please add IN and OUT annotations for
> clarity.
> 
> Also, may I suggest to name this "l2_translation"? (But there are
> other implementation specific things to be considered here, which
> I guess don't belong into a design doc discussion.)

How about this?
        struct {
	    /* IN parameters. */
	    u8  segment;
            u8  bus;
            u8  devfn;
            u64 iova;
	    /* Out parameters. */
            u64 translated_addr;
            u64 addr_mask; /* Translation page size */
            IOMMUAccessFlags permisson;
        } l2_translation;

> 
>> };
>>
>> typedef enum {
>> 	IOMMU_NONE = 0,
>> 	IOMMU_RO   = 1,
>> 	IOMMU_WO   = 2,
>> 	IOMMU_RW   = 3,
>> } IOMMUAccessFlags;
>>
>>
>> Definition of VIOMMU subops:
>> #define XEN_SYSCTL_viommu_query_capability		0
>> #define XEN_SYSCTL_viommu_create			1
>> #define XEN_SYSCTL_viommu_destroy			2
>> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3
>>
>> Definition of VIOMMU capabilities
>> #define XEN_VIOMMU_CAPABILITY_1nd_level_translation	(1 << 0)
>> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation	(1 << 1)
> 
> l1 and l2 respectively again, please.

Will update.

> 
>> 3.3 Interrupt remapping
>> Interrupts from virtual devices and physical devices will be delivered
>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>> according interrupt remapping table. The following diagram shows the logic.
> 
> Missing diagram or stale sentence?

Sorry, that is a stale sentence; the diagram has been moved to 2.2
Interrupt remapping overview.

> 
>> 3.5 Implementation consideration
>> Linux Intel IOMMU driver will fail to be loaded without 2th level
>> translation support even if interrupt remapping and 1th level
>> translation are available. This means it's needed to enable 2th level
>> translation first before other functions.
> 
> Is there a reason for this? I.e. do they unconditionally need that
> functionality?

Yes, the Linux Intel IOMMU driver unconditionally needs l2 translation.
The driver checks whether there is a valid SAGAW (Supported Adjusted
Guest Address Widths) value while initializing its IOMMU data structures
and returns an error if not.

-- 
Best regards
Tianyu Lan


* Re: Xen virtual IOMMU high level design doc
  2016-08-31  8:39                                 ` Lan Tianyu
@ 2016-08-31 12:02                                   ` Jan Beulich
  2016-09-01  1:26                                     ` Tian, Kevin
  2016-09-01  2:35                                     ` Lan Tianyu
  0 siblings, 2 replies; 86+ messages in thread
From: Jan Beulich @ 2016-08-31 12:02 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, Jun Nakajima, anthony.perard, xen-devel,
	Roger Pau Monne

>>> On 31.08.16 at 10:39, <tianyu.lan@intel.com> wrote:
> On 2016年08月25日 19:11, Jan Beulich wrote:
>>>>> On 17.08.16 at 14:05, <tianyu.lan@intel.com> wrote:
>>> 1 Motivation for Xen vIOMMU
>>> ============================================================================
>>> ===
>>> 1.1 Enable more than 255 vcpu support
>>> HPC virtualization requires more than 255 vcpus support in a single VM
>>> to meet parallel computing requirement. More than 255 vcpus support
>>> requires interrupt remapping capability present on vIOMMU to deliver
>>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
>>> vcpus if interrupt remapping is absent.
>> 
>> I continue to question this as a valid motivation at this point in
>> time, for the reasons Andrew has been explaining.
> 
> If we want to support Linux guest with >255 vcpus, interrupt remapping
> is necessary.

I don't understand why you keep repeating this, without adding
_why_ you think there is a demand for such guests and _what_
your plans are to eliminate Andrew's concerns.

>>> 3 Xen hypervisor
>>> ==========================================================================
>>>
>>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
>>>
>>> struct xen_sysctl_viommu_op {
>>> 	u32 cmd;
>>> 	u32 domid;
>>> 	union {
>>> 		struct {
>>> 			u32 capabilities;
>>> 		} query_capabilities;
>>> 		struct {
>>> 			u32 capabilities;
>>> 			u64 base_address;
>>> 		} create_iommu;
>>> 		struct {
>>> 			u8  bus;
>>> 			u8  devfn;
>> 
>> Please can we avoid introducing any new interfaces without segment/
>> domain value, even if for now it'll be always zero?
> 
> Sure. Will add segment field.
> 
>> 
>>> 			u64 iova;
>>> 			u64 translated_addr;
>>> 			u64 addr_mask; /* Translation page size */
>>> 			IOMMUAccessFlags permisson;		
>>> 		} 2th_level_translation;
>> 
>> I suppose "translated_addr" is an output here, but for the following
>> fields this already isn't clear. Please add IN and OUT annotations for
>> clarity.
>> 
>> Also, may I suggest to name this "l2_translation"? (But there are
>> other implementation specific things to be considered here, which
>> I guess don't belong into a design doc discussion.)
> 
> How about this?
>         struct {
> 	    /* IN parameters. */
> 	    u8  segment;
>             u8  bus;
>             u8  devfn;
>             u64 iova;
> 	    /* Out parameters. */
>             u64 translated_addr;
>             u64 addr_mask; /* Translation page size */
>             IOMMUAccessFlags permisson;
>         } l2_translation;

"segment" clearly needs to be a 16-bit value, but apart from that
(and missing padding fields) this looks okay.

>>> 3.5 Implementation consideration
>>> Linux Intel IOMMU driver will fail to be loaded without 2th level
>>> translation support even if interrupt remapping and 1th level
>>> translation are available. This means it's needed to enable 2th level
>>> translation first before other functions.
>> 
>> Is there a reason for this? I.e. do they unconditionally need that
>> functionality?
> 
> Yes, Linux intel IOMMU driver unconditionally needs l2 translation.
> Driver checks whether there is a valid sagaw(supported Adjusted Guest
> Address Widths) during initializing IOMMU data struct and return error
> if not.

How about my first question then?

Jan


* Re: Xen virtual IOMMU high level design doc
  2016-08-31 12:02                                   ` Jan Beulich
@ 2016-09-01  1:26                                     ` Tian, Kevin
  2016-09-01  2:35                                     ` Lan Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-09-01  1:26 UTC (permalink / raw)
  To: Jan Beulich, Lan, Tianyu
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Nakajima, Jun, anthony.perard, xen-devel,
	Roger Pau Monne

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, August 31, 2016 8:03 PM
> >>> 3.5 Implementation consideration
> >>> Linux Intel IOMMU driver will fail to be loaded without 2th level
> >>> translation support even if interrupt remapping and 1th level
> >>> translation are available. This means it's needed to enable 2th level
> >>> translation first before other functions.
> >>
> >> Is there a reason for this? I.e. do they unconditionally need that
> >> functionality?
> >
> > Yes, Linux intel IOMMU driver unconditionally needs l2 translation.
> > Driver checks whether there is a valid sagaw(supported Adjusted Guest
> > Address Widths) during initializing IOMMU data struct and return error
> > if not.
> 
> How about my first question then?
> 
> Jan

The VT-d spec doesn't define a capability bit for 2nd level translation
(for 1st level and interrupt remapping there are such capability bits to
report). So architecturally there is no way to tell the guest that 2nd
level translation is unavailable, and the existing Linux behavior is
simply correct.

Thanks
Kevin

* Re: Xen virtual IOMMU high level design doc
  2016-08-31 12:02                                   ` Jan Beulich
  2016-09-01  1:26                                     ` Tian, Kevin
@ 2016-09-01  2:35                                     ` Lan Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-09-01  2:35 UTC (permalink / raw)
  To: Jan Beulich
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, Jun Nakajima, anthony.perard, xen-devel,
	Roger Pau Monne

On 2016年08月31日 20:02, Jan Beulich wrote:
>>>> On 31.08.16 at 10:39, <tianyu.lan@intel.com> wrote:
>> > On 2016年08月25日 19:11, Jan Beulich wrote:
>>>>>> >>>>> On 17.08.16 at 14:05, <tianyu.lan@intel.com> wrote:
>>>> >>> 1 Motivation for Xen vIOMMU
>>>> >>> ============================================================================
>>>> >>> ===
>>>> >>> 1.1 Enable more than 255 vcpu support
>>>> >>> HPC virtualization requires more than 255 vcpus support in a single VM
>>>> >>> to meet parallel computing requirement. More than 255 vcpus support
>>>> >>> requires interrupt remapping capability present on vIOMMU to deliver
>>>> >>> interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
>>>> >>> vcpus if interrupt remapping is absent.
>>> >> 
>>> >> I continue to question this as a valid motivation at this point in
>>> >> time, for the reasons Andrew has been explaining.
>> > 
>> > If we want to support Linux guest with >255 vcpus, interrupt remapping
>> > is necessary.
> I don't understand why you keep repeating this, without adding
> _why_ you think there is a demand for such guests and _what_
> your plans are to eliminate Andrew's concerns.
> 

The motivation for such huge VMs is HPC (High-Performance Computing)
cloud service, which requires high-performance parallel computing.
We create a single VM on one machine and expose more than 255 pcpus
to the VM to guarantee high-performance parallel computing in the VM.
Each vcpu is pinned to a pcpu.

For performance, we achieved good results (>95% of native performance in
the stream, dgemm and sgemm benchmarks in the VM) after some tuning and
optimizations. We presented these at this year's Xen summit.

For stability, Andrew found some issues with huge VMs where the enabled
watchdog caused hypervisor reboots. We will reproduce and fix them.

-- 
Best regards
Tianyu Lan


* Re: Xen virtual IOMMU high level design doc
  2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
  2016-08-17 12:42                               ` Paul Durrant
  2016-08-25 11:11                               ` Jan Beulich
@ 2016-09-15 14:22                               ` Lan, Tianyu
  2016-10-05 18:36                                 ` Konrad Rzeszutek Wilk
  2016-11-23 18:19                               ` Edgar E. Iglesias
  3 siblings, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-09-15 14:22 UTC (permalink / raw)
  To: Andrew Cooper, Stefano Stabellini
  Cc: yang.zhang.wz, Kevin Tian, xen-devel, Jan Beulich, ian.jackson,
	xuquan8, Jun Nakajima, anthony.perard, Roger Pau Monne

Hi Andrew:
Sorry to bother you. To make sure we are heading in the right direction,
it's better to get your feedback before we take further steps. Could you
have a look? Thanks.

On 8/17/2016 8:05 PM, Lan, Tianyu wrote:
> Hi All:
>      The following is our Xen vIOMMU high level design for detail
> discussion. Please have a look. Very appreciate for your comments.
> This design doesn't cover changes when root port is moved to hypervisor.
> We may design it later.
>
>
> Content:
> ===============================================================================
>
> 1. Motivation of vIOMMU
>     1.1 Enable more than 255 vcpus
>     1.2 Support VFIO-based user space driver
>     1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>     2.1 2th level translation overview
>     2.2 Interrupt remapping overview
> 3. Xen hypervisor
>     3.1 New vIOMMU hypercall interface
>     3.2 2nd level translation
>     3.3 Interrupt remapping
>     3.4 1st level translation
>     3.5 Implementation consideration
> 4. Qemu
>     4.1 Qemu vIOMMU framework
>     4.2 Dummy xen-vIOMMU driver
>     4.3 Q35 vs. i440x
>     4.4 Report vIOMMU to hvmloader
>
>
> 1 Motivation for Xen vIOMMU
> ===============================================================================
>
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires support for more than 255 vcpus in a single
> VM to meet parallel computing requirements. That in turn requires an
> interrupt remapping capability on the vIOMMU, since without it
> interrupts cannot be delivered to vcpus with APIC IDs above 255, and a
> Linux guest fails to boot with more than 255 vcpus.
>
>
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the 2nd level translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
>
>
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the 1st level translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
>
> 2. Xen vIOMMU Architecture
> ================================================================================
>
>
> * vIOMMU will be inside Xen hypervisor for following factors
>     1) Avoid round trips between Qemu and Xen hypervisor
>     2) Ease of integration with the rest of the hypervisor
>     3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper around the new hypercall to
> create/destroy the vIOMMU in the hypervisor and to handle a virtual
> PCI device's 2nd level translation.
>
> 2.1 2nd level translation overview
> For a virtual PCI device, the dummy xen-vIOMMU does the translation in
> Qemu via the new hypercall.
>
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads that page table into
> the physical IOMMU.
>
> The following diagram shows the 2nd level translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
>
> 2.2 Interrupt remapping overview
> Interrupts from virtual and physical devices are delivered to the
> vLAPIC via the vIOAPIC and vMSI. The vIOMMU remaps interrupts during
> this procedure.
>
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                               +-------------------+
>                               |   PCI Device      |
>                               +-------------------+
>
>
>
>
>
> 3 Xen hypervisor
> ==========================================================================
>
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
>
> struct xen_sysctl_viommu_op {
>     u32 cmd;
>     u32 domid;
>     union {
>         struct {
>             u32 capabilities;
>         } query_capabilities;
>         struct {
>             u32 capabilities;
>             u64 base_address;
>         } create_iommu;
>         struct {
>             u8  bus;
>             u8  devfn;
>             u64 iova;
>             u64 translated_addr;
>             u64 addr_mask; /* Translation page size */
>             IOMMUAccessFlags permission;
>         } l2_translation;
>     } u;
> };
>
> typedef enum {
>     IOMMU_NONE = 0,
>     IOMMU_RO   = 1,
>     IOMMU_WO   = 2,
>     IOMMU_RW   = 3,
> } IOMMUAccessFlags;
>
>
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability        0
> #define XEN_SYSCTL_viommu_create            1
> #define XEN_SYSCTL_viommu_destroy            2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
>
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_1st_level_translation    (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation    (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
>
>
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>       Get vIOMMU capabilities (1st/2nd level translation and interrupt
> remapping).
>
> - XEN_SYSCTL_viommu_create
>      Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
> base address.
>
> - XEN_SYSCTL_viommu_destroy
>      Destroy the vIOMMU in the Xen hypervisor, with dom_id as parameter.
>
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>      Translate IOVA to GPA for a specified virtual PCI device, given
> the dom id, the PCI device's bdf and the IOVA; the Xen hypervisor
> returns the translated GPA, address mask and access permission.
>
>
> 3.2 2nd level translation
> 1) For virtual PCI devices
> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the
> new hypercall when a DMA operation happens.
>
> 2) For physical PCI devices
> DMA operations go through the physical IOMMU directly, so an IO page
> table for IOVA->HPA must be loaded into the physical IOMMU. When the
> guest updates the Second-level Page-table Pointer field, it provides an
> IO page table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level
> translation table, translate GPA->HPA, and write the shadow page
> table's (IOVA->HPA) pointer into the Second-level Page-table Pointer of
> the physical IOMMU's context entry.
>
> Currently all PCI devices in the same hvm domain share one IO page
> table (GPA->HPA) in Xen's physical IOMMU driver. To support the
> vIOMMU's 2nd level translation, the IOMMU driver needs to support
> multiple address spaces per device entry: use the existing IO page
> table (GPA->HPA) by default, and switch to the shadow IO page table
> (IOVA->HPA) when the 2nd level translation function is enabled. These
> changes will not affect the current P2M logic.
>
> 3.3 Interrupt remapping
> Interrupts from virtual and physical devices are delivered to the
> vLAPIC via the vIOAPIC and vMSI. Interrupt remapping hooks need to be
> added in vmsi_deliver() and ioapic_deliver() to find the target vLAPIC
> according to the interrupt remapping table (see the diagram in 2.2).
>
>
> 3.4 1st level translation
> When nested translation is enabled, any address generated by first-level
> translation is used as the input address for nesting with second-level
> translation. Physical IOMMU needs to enable both 1st level and 2nd level
> translation in nested translation mode(GVA->GPA->HPA) for passthrough
> device.
>
> VT-d context entry points to guest 1st level translation table which
> will be nest-translated by 2nd level translation table and so it
> can be directly linked to context entry of physical IOMMU.
>
> To enable 1st level translation in a VM:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) Write the GPA root of the guest's 1st level translation table into
> the context entry of the physical IOMMU.
>
> All handling is in the hypervisor; no interaction with Qemu is needed.
>
>
> 3.5 Implementation consideration
> The Linux Intel IOMMU driver will fail to load without 2nd level
> translation support, even if interrupt remapping and 1st level
> translation are available. This means 2nd level translation must be
> enabled before the other functions.
>
>
> 4 Qemu
> ==============================================================================
>
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> or AMD IOMMU) and report it in the guest ACPI tables. So on the Xen
> side, a dummy xen-vIOMMU wrapper is required to connect to the actual
> vIOMMU in Xen, especially for the 2nd level translation of virtual PCI
> devices, because virtual PCI devices are emulated in Qemu. Qemu's
> vIOMMU framework provides a callback to handle 2nd level translation
> when DMA operations of virtual PCI devices happen.
>
>
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> and Shared Virtual Memory) via hypercall.
>
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with
> the DRHD register base address and the desired capabilities as
> parameters. Destroy the vIOMMU when the VM is shut down.
>
> 3) Virtual PCI device's 2nd level translation
> Qemu already provides a DMA translation hook, called when DMA
> translation for a virtual PCI device happens. The dummy xen-vIOMMU
> passes the device bdf and IOVA into the Xen hypervisor via the new
> iommu hypercall and gets back the translated GPA.
>
>
> 4.3 Q35 vs i440x
> VT-d was introduced with the Q35 chipset. The previous concern was that
> IOMMU drivers assume VT-d only exists on Q35 and newer chipsets, and
> that we would therefore have to enable Q35 first.
>
> We consulted Linux/Windows IOMMU driver experts and learned that these
> drivers have no such assumption. So we may skip the Q35 implementation
> and emulate the vIOMMU on the i440x chipset. KVM already has vIOMMU
> support with virtual PCI device DMA translation and interrupt
> remapping. We are using KVM to experiment with adding a vIOMMU on
> i440x and to test Linux/Windows guests, and will report back when we
> have results.
>
>
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and
> the OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to
> know whether the vIOMMU is enabled and what its capabilities are in
> order to prepare the ACPI DMAR table for the guest OS.
>
> There are three ways to do that:
> 1) Extend struct hvm_info_table, adding variables to it to pass vIOMMU
> information to hvmloader. But this requires a new xc interface to use
> struct hvm_info_table in Qemu.
>
> 2) Pass vIOMMU information to hvmloader via Xenstore.
>
> 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via
> Xenstore. This solution is already present in the vNVDIMM design (4.3.1
> Building Guest ACPI Tables,
> http://www.gossamer-threads.com/lists/xen/devel/439766).
>
> The third option seems cleaner: hvmloader doesn't need to deal with any
> vIOMMU details and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling happens in the dummy xen-vIOMMU driver.
>
>
>



* Re: Xen virtual IOMMU high level design doc
  2016-09-15 14:22                               ` Lan, Tianyu
@ 2016-10-05 18:36                                 ` Konrad Rzeszutek Wilk
  2016-10-11  1:52                                   ` Lan Tianyu
  0 siblings, 1 reply; 86+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-10-05 18:36 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: yang.zhang.wz, Kevin Tian, Stefano Stabellini, Jun Nakajima,
	Andrew Cooper, ian.jackson, xuquan8, xen-devel, Jan Beulich,
	anthony.perard, Roger Pau Monne

On Thu, Sep 15, 2016 at 10:22:36PM +0800, Lan, Tianyu wrote:
> Hi Andrew:
> Sorry to bother you. To make sure we are on the right direction, it's
> better to get feedback from you before we go further step. Could you
> have a look? Thanks.
> 
> On 8/17/2016 8:05 PM, Lan, Tianyu wrote:
> > Hi All:
> >      The following is our Xen vIOMMU high level design for detail
> > discussion. Please have a look. Very appreciate for your comments.
> > This design doesn't cover changes when root port is moved to hypervisor.
> > We may design it later.
> > 
> > 
> > Content:
> > ===============================================================================
> > 
> > 1. Motivation of vIOMMU
> >     1.1 Enable more than 255 vcpus
> >     1.2 Support VFIO-based user space driver
> >     1.3 Support guest Shared Virtual Memory (SVM)
> > 2. Xen vIOMMU Architecture
> >     2.1 2th level translation overview
> >     2.2 Interrupt remapping overview
> > 3. Xen hypervisor
> >     3.1 New vIOMMU hypercall interface
> >     3.2 2nd level translation
> >     3.3 Interrupt remapping
> >     3.4 1st level translation
> >     3.5 Implementation consideration
> > 4. Qemu
> >     4.1 Qemu vIOMMU framework
> >     4.2 Dummy xen-vIOMMU driver
> >     4.3 Q35 vs. i440x
> >     4.4 Report vIOMMU to hvmloader
> > 
> > 
> > 1 Motivation for Xen vIOMMU
> > ===============================================================================
> > 
> > 1.1 Enable more than 255 vcpu support
> > HPC virtualization requires more than 255 vcpus support in a single VM
> > to meet parallel computing requirement. More than 255 vcpus support
> > requires interrupt remapping capability present on vIOMMU to deliver
> > interrupt to #vcpu >255 Otherwise Linux guest fails to boot up with >255
> > vcpus if interrupt remapping is absent.
> > 
> > 
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the 2nd level translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU 2nd level becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> > 
> > 
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the 1st level translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both 1st level and 2nd level translation
> > in nested mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> > 
> > 2. Xen vIOMMU Architecture
> > ================================================================================
> > 
> > 
> > * vIOMMU will be inside Xen hypervisor for following factors
> >     1) Avoid round trips between Qemu and Xen hypervisor
> >     2) Ease of integration with the rest of the hypervisor
> >     3) HVMlite/PVH doesn't use Qemu
> > * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> > /destory vIOMMU in hypervisor and deal with virtual PCI device's 2th
> > level translation.
> > 
> > 2.1 2th level translation overview
> > For Virtual PCI device, dummy xen-vIOMMU does translation in the
> > Qemu via new hypercall.
> > 
> > For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> > IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
> > 
> > The following diagram shows 2th level translation architecture.
> > +---------------------------------------------------------+
> > |Qemu                                +----------------+   |
> > |                                    |     Virtual    |   |
> > |                                    |   PCI device   |   |
> > |                                    |                |   |
> > |                                    +----------------+   |
> > |                                            |DMA         |
> > |                                            V            |
> > |  +--------------------+   Request  +----------------+   |
> > |  |                    +<-----------+                |   |
> > |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> > |  |                    +----------->+                |   |
> > |  +---------+----------+            +-------+--------+   |
> > |            |                               |            |
> > |            |Hypercall                      |            |
> > +--------------------------------------------+------------+
> > |Hypervisor  |                               |            |
> > |            |                               |            |
> > |            v                               |            |
> > |     +------+------+                        |            |
> > |     |   vIOMMU    |                        |            |
> > |     +------+------+                        |            |
> > |            |                               |            |
> > |            v                               |            |
> > |     +------+------+                        |            |
> > |     | IOMMU driver|                        |            |
> > |     +------+------+                        |            |
> > |            |                               |            |
> > +--------------------------------------------+------------+
> > |HW          v                               V            |
> > |     +------+------+                 +-------------+     |
> > |     |   IOMMU     +---------------->+  Memory     |     |
> > |     +------+------+                 +-------------+     |
> > |            ^                                            |
> > |            |                                            |
> > |     +------+------+                                     |
> > |     | PCI Device  |                                     |
> > |     +-------------+                                     |
> > +---------------------------------------------------------+
> > 
> > 2.2 Interrupt remapping overview.
> > Interrupts from virtual devices and physical devices will be delivered
> > to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this
> > procedure.
> > 
> > +---------------------------------------------------+
> > |Qemu                       |VM                     |
> > |                           | +----------------+    |
> > |                           | |  Device driver |    |
> > |                           | +--------+-------+    |
> > |                           |          ^            |
> > |       +----------------+  | +--------+-------+    |
> > |       | Virtual device |  | |  IRQ subsystem |    |
> > |       +-------+--------+  | +--------+-------+    |
> > |               |           |          ^            |
> > |               |           |          |            |
> > +---------------------------+-----------------------+
> > |hyperviosr     |                      | VIRQ       |
> > |               |            +---------+--------+   |
> > |               |            |      vLAPIC      |   |
> > |               |            +---------+--------+   |
> > |               |                      ^            |
> > |               |                      |            |
> > |               |            +---------+--------+   |
> > |               |            |      vIOMMU      |   |
> > |               |            +---------+--------+   |
> > |               |                      ^            |
> > |               |                      |            |
> > |               |            +---------+--------+   |
> > |               |            |   vIOAPIC/vMSI   |   |
> > |               |            +----+----+--------+   |
> > |               |                 ^    ^            |
> > |               +-----------------+    |            |
> > |                                      |            |
> > +---------------------------------------------------+
> > HW                                     |IRQ
> >                               +-------------------+
> >                               |   PCI Device      |
> >                               +-------------------+
> > 
> > 
> > 
> > 
> > 
> > 3 Xen hypervisor
> > ==========================================================================
> > 
> > 3.1 New hypercall XEN_SYSCTL_viommu_op
> > 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> > 
> > struct xen_sysctl_viommu_op {
> >     u32 cmd;
> >     u32 domid;
> >     union {
> >         struct {
> >             u32 capabilities;
> >         } query_capabilities;
> >         struct {
> >             u32 capabilities;
> >             u64 base_address;
> >         } create_iommu;
> >         struct {
> >             u8  bus;
> >             u8  devfn;
> >             u64 iova;
> >             u64 translated_addr;
> >             u64 addr_mask; /* Translation page size */
> >             IOMMUAccessFlags permisson;
> >         } 2th_level_translation;
> > };
> > 
> > typedef enum {
> >     IOMMU_NONE = 0,
> >     IOMMU_RO   = 1,
> >     IOMMU_WO   = 2,
> >     IOMMU_RW   = 3,
> > } IOMMUAccessFlags;
> > 
> > 
> > Definition of VIOMMU subops:
> > #define XEN_SYSCTL_viommu_query_capability        0
> > #define XEN_SYSCTL_viommu_create            1
> > #define XEN_SYSCTL_viommu_destroy            2
> > #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
> > 
> > Definition of VIOMMU capabilities
> > #define XEN_VIOMMU_CAPABILITY_1nd_level_translation    (1 << 0)
> > #define XEN_VIOMMU_CAPABILITY_2nd_level_translation    (1 << 1)
> > #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
> > 
> > 
> > 2) Design for subops
> > - XEN_SYSCTL_viommu_query_capability
> >       Get vIOMMU capabilities(1st/2th level translation and interrupt
> > remapping).
> > 
> > - XEN_SYSCTL_viommu_create
> >      Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
> > base address.
> > 
> > - XEN_SYSCTL_viommu_destroy
> >      Destory vIOMMU in Xen hypervisor with dom_id as parameters.
> > 
> > - XEN_SYSCTL_viommu_dma_translation_for_vpdev
> >      Translate IOVA to GPA for specified virtual PCI device with dom id,
> > PCI device's bdf and IOVA and xen hypervisor returns translated GPA,
> > address mask and access permission.
> > 
> > 
> > 3.2 2nd level translation
> > 1) For virtual PCI device
> > Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
> > hypercall when DMA operation happens.
> > 
> > 2) For physical PCI device
> > DMA operations go though physical IOMMU directly and IO page table for
> > IOVA->HPA should be loaded into physical IOMMU. When guest updates
> > Second-level Page-table pointer field, it provides IO page table for
> > IOVA->GPA. vIOMMU needs to shadow 2nd level translation table, translate
> > GPA->HPA and update shadow page table(IOVA->HPA) pointer to Second-level
> > Page-table pointer to context entry of physical IOMMU.
> > 
> > Now all PCI devices in same hvm domain share one IO page table
> > (GPA->HPA) in physical IOMMU driver of Xen. To support 2nd level
> > translation of vIOMMU, IOMMU driver need to support multiple address
> > spaces per device entry. Using existing IO page table(GPA->HPA)
> > defaultly and switch to shadow IO page table(IOVA->HPA) when 2th level
> > translation function is enabled. These change will not affect current
> > P2M logic.
> > 
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> > according interrupt remapping table. The following diagram shows the logic.
> > 

Uh? Missing diagram?

> > 
> > 3.4 1st level translation
> > When nested translation is enabled, any address generated by first-level
> > translation is used as the input address for nesting with second-level
> > translation. Physical IOMMU needs to enable both 1st level and 2nd level
> > translation in nested translation mode(GVA->GPA->HPA) for passthrough
> > device.
> > 
> > VT-d context entry points to guest 1st level translation table which
> > will be nest-translated by 2nd level translation table and so it
> > can be directly linked to context entry of physical IOMMU.
> > 
> > To enable 1st level translation in VM
> > 1) Xen IOMMU driver enables nested translation mode
> > 2) Update GPA root of guest 1st level translation table to context entry
> > of physical IOMMU.
> > 
> > All handles are in hypervisor and no interaction with Qemu.
> > 
> > 
> > 3.5 Implementation consideration
> > Linux Intel IOMMU driver will fail to be loaded without 2th level
> > translation support even if interrupt remapping and 1th level
> > translation are available. This means it's needed to enable 2th level
> > translation first before other functions.
> > 
> > 
> > 4 Qemu
> > ==============================================================================
> > 
> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create virtual IOMMU(e.g. virtual intel VTD and
> > AMD IOMMU) and report in guest ACPI table. So for xen side, a dummy
> > xen-vIOMMU wrapper is required to connect with actual vIOMMU in Xen.
> > Especially for 2th level translation of virtual PCI device because
> > emulations of virtual PCI devices are in the Qemu. Qemu's vIOMMU
> > framework provides callback to deal with 2th level translation when
> > DMA operations of virtual PCI devices happen.
> > 
> > 
> > 4.2 Dummy xen-vIOMMU driver
> > 1) Query vIOMMU capability(E,G DMA translation, Interrupt remapping and
> > Share Virtual Memory) via hypercall.
> > 
> > 2) Create vIOMMU in Xen hypervisor via new hypercall with DRHU register
> > address and desired capability as parameters. Destroy vIOMMU when VM is
> > closed.
> > 
> > 3) Virtual PCI device's 2th level translation
> > Qemu already provides DMA translation hook. It's called when DMA
> > translation of virtual PCI device happens. The dummy xen-vIOMMU passes
> > device bdf and IOVA into Xen hypervisor via new iommu hypercall and
> > return back translated GPA.
> > 
> > 
> > 4.3 Q35 vs i440x
> > VT-D is introduced since Q35 chipset. Previous concern was that IOMMU

s/since/with/
> > driver has assumption that VTD only exists on Q35 and newer chipset and
> > we have to enable Q35 first.
> > 
> > Consulted with Linux/Windows IOMMU driver experts and get that these
> > drivers doesn't have such assumption. So we may skip Q35 implementation
> > and can emulate vIOMMU on I440x chipset. KVM already have vIOMMU support
> > with virtual PCI device's DMA translation and interrupt remapping. We
> > are using KVM to do experiment of adding vIOMMU on the I440x and test
> > Linux/Windows guest. Will report back when have some results.

Any results?
> > 
> > 
> > 4.4 Report vIOMMU to hvmloader
> > Hvmloader is in charge of building ACPI tables for Guest OS and OS
> > probes IOMMU via ACPI DMAR table. So hvmloder needs to know whether
> > vIOMMU is enabled or not and its capability to prepare ACPI DMAR table
> > for Guest OS.
> > 
> > There are three ways to do that.
> > 1) Extend struct hvm_info_table and add variables in the struct
> > hvm_info_table to pass vIOMMU information to hvmloader. But this
> > requires to add new xc interface to use struct hvm_info_table in the Qemu.
> > 
> > 2) Pass vIOMMU information to hvmloader via Xenstore
> > 
> > 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> > This solution is already present in the vNVDIMM design(4.3.1
> > Building Guest ACPI Tables
> > http://www.gossamer-threads.com/lists/xen/devel/439766).
> > 
> > The third option seems more clear and hvmloader doesn't need to deal
> > with vIOMMU stuffs and just pass through DMAR table to Guest OS. All
> > vIOMMU specific stuffs will be processed in the dummy xen-vIOMMU driver.

/me nods. That does seem the best option.
> > 
> > 
> > 
> 



* Re: Xen virtual IOMMU high level design doc
  2016-10-05 18:36                                 ` Konrad Rzeszutek Wilk
@ 2016-10-11  1:52                                   ` Lan Tianyu
  0 siblings, 0 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-10-11  1:52 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: yang.zhang.wz, Kevin Tian, Stefano Stabellini, Jun Nakajima,
	Andrew Cooper, ian.jackson, xuquan8, xen-devel, Jan Beulich,
	anthony.perard, Roger Pau Monne

On 2016-10-06 02:36, Konrad Rzeszutek Wilk wrote:
>>> 3.3 Interrupt remapping
>>> > > Interrupts from virtual devices and physical devices will be delivered
>>> > > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>>> > > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>>> > > according interrupt remapping table. The following diagram shows the logic.
>>> > > 
> Uh? Missing diagram?

Sorry, that is a stale statement. The diagram was moved to "2.2
Interrupt remapping overview".

> 
>>> 4.3 Q35 vs i440x
>>> > > VT-D is introduced since Q35 chipset. Previous concern was that IOMMU
> s/since/with/
>>> > > driver has assumption that VTD only exists on Q35 and newer chipset and
>>> > > we have to enable Q35 first.
>>> > > 
>>> > > Consulted with Linux/Windows IOMMU driver experts and get that these
>>> > > drivers doesn't have such assumption. So we may skip Q35 implementation
>>> > > and can emulate vIOMMU on I440x chipset. KVM already have vIOMMU support
>>> > > with virtual PCI device's DMA translation and interrupt remapping. We
>>> > > are using KVM to do experiment of adding vIOMMU on the I440x and test
>>> > > Linux/Windows guest. Will report back when have some results.
> Any results?

We have booted up a Win8 guest with the virtual VT-d and emulated I440x
platform on Xen, and the guest uses the virtual VT-d to enable the
interrupt remapping function.

-- 
Best regards
Tianyu Lan



* Xen virtual IOMMU high level design doc V2
  2016-07-05 13:57                           ` Jan Beulich
  2016-07-05 14:19                             ` Lan, Tianyu
  2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
@ 2016-10-18 14:14                             ` Lan Tianyu
  2016-10-18 19:17                               ` Andrew Cooper
                                                 ` (2 more replies)
  2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
  3 siblings, 3 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-10-18 14:14 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian, Andrew Cooper, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

Change since V1:
	1) Update motivation for Xen vIOMMU - 288 vcpus support part
	2) Change definition of struct xen_sysctl_viommu_op
	3) Update "3.5 Implementation consideration" to explain why we need to
enable l2 translation first.
	4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on 
the emulated I440 chipset.
	5) Remove stale statement in the "3.3 Interrupt remapping"

Content:
===============================================================================
1. Motivation of vIOMMU
	1.1 Enable more than 255 vcpus
	1.2 Support VFIO-based user space driver
	1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
	2.1 l2 translation overview
	2.2 Interrupt remapping overview
3. Xen hypervisor
	3.1 New vIOMMU hypercall interface
	3.2 l2 translation
	3.3 Interrupt remapping
	3.4 l1 translation
	3.5 Implementation consideration
4. Qemu
	4.1 Qemu vIOMMU framework
	4.2 Dummy xen-vIOMMU driver
	4.3 Q35 vs. i440x
	4.4 Report vIOMMU to hvmloader


1 Motivation for Xen vIOMMU
===============================================================================
1.1 Enable more than 255 vcpu support
HPC cloud service requires a VM to provide high performance parallel
computing, and we hope to create a huge VM with >255 vcpus on one
machine to meet such a requirement, pinning each vcpu to a separate
pcpu. Support for more than 255 vcpus requires x2APIC, and Linux
disables x2APIC mode if there is no interrupt remapping function, which
is provided by the vIOMMU. The interrupt remapping function helps to
deliver interrupts to vcpus with APIC ID > 255. So we need to add a
vIOMMU before enabling >255 vcpus.

1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
It relies on the l2 translation capability (IOVA->GPA) of the
vIOMMU. The pIOMMU l2 becomes a shadowing structure of the
vIOMMU to isolate DMA requests initiated by the user space driver.


1.3 Support guest SVM (Shared Virtual Memory)
It relies on the l1 translation table capability (GVA->GPA) of the
vIOMMU. The pIOMMU needs to enable both l1 and l2 translation in nested
mode (GVA->GPA->HPA) for the passthrough device. IGD passthrough
is the main usage today (to support the OpenCL 2.0 SVM feature). In the
future SVM might be used by other I/O devices too.

2. Xen vIOMMU Architecture
================================================================================

* The vIOMMU will be inside the Xen hypervisor for the following reasons:
	1) Avoid round trips between Qemu and the Xen hypervisor
	2) Ease of integration with the rest of the hypervisor
	3) HVMlite/PVH doesn't use Qemu
* A dummy xen-vIOMMU in Qemu acts as a wrapper of the new hypercall to
create/destroy the vIOMMU in the hypervisor and deal with the virtual
PCI device's l2 translation.

2.1 l2 translation overview
For a virtual PCI device, the dummy xen-vIOMMU does translation in
Qemu via the new hypercall.

For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
page table from IOVA->GPA to IOVA->HPA and loads the page table into
the physical IOMMU.

The following diagram shows l2 translation architecture.
+---------------------------------------------------------+
|Qemu                                +----------------+   |
|                                    |     Virtual    |   |
|                                    |   PCI device   |   |
|                                    |                |   |
|                                    +----------------+   |
|                                            |DMA         |
|                                            V            |
|  +--------------------+   Request  +----------------+   |
|  |                    +<-----------+                |   |
|  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
|  |                    +----------->+                |   |
|  +---------+----------+            +-------+--------+   |
|            |                               |            |
|            |Hypercall                      |            |
+--------------------------------------------+------------+
|Hypervisor  |                               |            |
|            |                               |            |
|            v                               |            |
|     +------+------+                        |            |
|     |   vIOMMU    |                        |            |
|     +------+------+                        |            |
|            |                               |            |
|            v                               |            |
|     +------+------+                        |            |
|     | IOMMU driver|                        |            |
|     +------+------+                        |            |
|            |                               |            |
+--------------------------------------------+------------+
|HW          v                               V            |
|     +------+------+                 +-------------+     |
|     |   IOMMU     +---------------->+  Memory     |     |
|     +------+------+                 +-------------+     |
|            ^                                            |
|            |                                            |
|     +------+------+                                     |
|     | PCI Device  |                                     |
|     +-------------+                                     |
+---------------------------------------------------------+

2.2 Interrupt remapping overview
Interrupts from virtual devices and physical devices will be delivered
to the vLAPIC from the vIOAPIC and vMSI. The vIOMMU will remap
interrupts during this procedure.

+---------------------------------------------------+
|Qemu                       |VM                     |
|                           | +----------------+    |
|                           | |  Device driver |    |
|                           | +--------+-------+    |
|                           |          ^            |
|       +----------------+  | +--------+-------+    |
|       | Virtual device |  | |  IRQ subsystem |    |
|       +-------+--------+  | +--------+-------+    |
|               |           |          ^            |
|               |           |          |            |
+---------------------------+-----------------------+
|hypervisor     |                      | VIRQ       |
|               |            +---------+--------+   |
|               |            |      vLAPIC      |   |
|               |            +---------+--------+   |
|               |                      ^            |
|               |                      |            |
|               |            +---------+--------+   |
|               |            |      vIOMMU      |   |
|               |            +---------+--------+   |
|               |                      ^            |
|               |                      |            |
|               |            +---------+--------+   |
|               |            |   vIOAPIC/vMSI   |   |
|               |            +----+----+--------+   |
|               |                 ^    ^            |
|               +-----------------+    |            |
|                                      |            |
+---------------------------------------------------+
HW                                     |IRQ
                                +-------------------+
                                |   PCI Device      |
                                +-------------------+




3 Xen hypervisor
==========================================================================
3.1 New hypercall XEN_SYSCTL_viommu_op
This hypercall should also support the pv IOMMU, which is still under
RFC review. Here we only cover the non-pv part.

1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.

struct xen_sysctl_viommu_op {
	u32 cmd;
	u32 domid;
	union {
		struct {
			u32 capabilities;
		} query_capabilities;
		struct {
			u32 capabilities;
			u64 base_address;
		} create_iommu;
		struct {
			/* IN parameters. */
			u16 segment;
			u8  bus;
			u8  devfn;
			u64 iova;
			/* OUT parameters. */
			u64 translated_addr;
			u64 addr_mask; /* Translation page size */
			IOMMUAccessFlags permission;
		} l2_translation;
	} u;
};

typedef enum {
	IOMMU_NONE = 0,
	IOMMU_RO   = 1,
	IOMMU_WO   = 2,
	IOMMU_RW   = 3,
} IOMMUAccessFlags;


Definition of VIOMMU subops:
#define XEN_SYSCTL_viommu_query_capability		0
#define XEN_SYSCTL_viommu_create			1
#define XEN_SYSCTL_viommu_destroy			2
#define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3

Definition of VIOMMU capabilities
#define XEN_VIOMMU_CAPABILITY_l1_translation	(1 << 0)
#define XEN_VIOMMU_CAPABILITY_l2_translation	(1 << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)


2) Design for subops
- XEN_SYSCTL_viommu_query_capability
        Get vIOMMU capabilities (l1/l2 translation and interrupt
remapping).

- XEN_SYSCTL_viommu_create
       Create the vIOMMU in the Xen hypervisor with dom_id,
capabilities and the reg base address as parameters.

- XEN_SYSCTL_viommu_destroy
       Destroy the vIOMMU in the Xen hypervisor with dom_id as the
parameter.

- XEN_SYSCTL_viommu_dma_translation_for_vpdev
       Translate an IOVA to a GPA for the specified virtual PCI device,
given the dom id, the PCI device's bdf and the IOVA; the Xen hypervisor
returns the translated GPA, address mask and access permission.


3.2 l2 translation
1) For virtual PCI devices
The dummy xen-vIOMMU in Qemu translates the IOVA to the target GPA via
the new hypercall when a DMA operation happens.

2) For physical PCI devices
DMA operations go through the physical IOMMU directly, and an IO page
table for IOVA->HPA should be loaded into the physical IOMMU. When the
guest updates the l2 page-table pointer field, it provides an IO page
table for IOVA->GPA. The vIOMMU needs to shadow the l2 translation
table, translate GPA->HPA and write the shadow page table (IOVA->HPA)
pointer into the l2 page-table pointer of the physical IOMMU's context
entry.

Currently all PCI devices in the same hvm domain share one IO page
table (GPA->HPA) in the physical IOMMU driver of Xen. To support l2
translation in the vIOMMU, the IOMMU driver needs to support multiple
address spaces per device entry: use the existing IO page table
(GPA->HPA) by default and switch to the shadow IO page table
(IOVA->HPA) when the l2 translation function is enabled. These changes
will not affect the current P2M logic.

3.3 Interrupt remapping
Interrupts from virtual devices and physical devices will be delivered
to the vlapic from the vIOAPIC and vMSI. Interrupt remapping hooks need
to be added in vmsi_deliver() and ioapic_deliver() to find the target
vlapic according to the interrupt remapping table.


3.4 l1 translation
When nested translation is enabled, any address generated by l1
translation is used as the input address for nesting with l2
translation. The physical IOMMU needs to enable both l1 and l2
translation in nested translation mode (GVA->GPA->HPA) for the
passthrough device.

The VT-d context entry points to the guest l1 translation table, which
will be nest-translated by the l2 translation table, and so it can be
directly linked to the context entry of the physical IOMMU.

To enable l1 translation in a VM:
1) The Xen IOMMU driver enables nested translation mode
2) Update the GPA root of the guest l1 translation table to the context
entry of the physical IOMMU.

All handling is in the hypervisor with no interaction with Qemu.


3.5 Implementation consideration
The VT-d spec doesn't define a capability bit for l2 translation.
Architecturally there is no way to tell the guest that the l2
translation capability is not available. The Linux Intel IOMMU driver
assumes l2 translation is always available when VT-d exists, and fails
to load without l2 translation support even if interrupt remapping and
l1 translation are available. So l2 translation needs to be enabled
before the other functions.


4 Qemu
==============================================================================
4.1 Qemu vIOMMU framework
Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
and AMD IOMMU) and report it in the guest ACPI table. So on the Xen
side, a dummy xen-vIOMMU wrapper is required to connect with the actual
vIOMMU in Xen, especially for l2 translation of virtual PCI devices,
because emulation of virtual PCI devices is in Qemu. Qemu's vIOMMU
framework provides a callback to deal with l2 translation when DMA
operations of virtual PCI devices happen.


4.2 Dummy xen-vIOMMU driver
1) Query vIOMMU capability (e.g. DMA translation, interrupt remapping
and Shared Virtual Memory) via the hypercall.

2) Create the vIOMMU in the Xen hypervisor via the new hypercall with
the DRHD register base address and desired capabilities as parameters.
Destroy the vIOMMU when the VM is shut down.

3) Virtual PCI device's l2 translation
Qemu already provides a DMA translation hook, which is called when DMA
translation of a virtual PCI device happens. The dummy xen-vIOMMU
passes the device bdf and IOVA into the Xen hypervisor via the new
iommu hypercall and returns the translated GPA.


4.3 Q35 vs I440x
VT-d was introduced with the Q35 chipset. The previous concern was that
the VT-d driver assumes VT-d only exists on Q35 and newer chipsets, so
we would have to enable Q35 first. After experiments, Linux/Windows
guests can boot up on the emulated I440x chipset with VT-d, and the
VT-d driver enables the interrupt remapping function. So we can skip
Q35 support and implement the vIOMMU directly.

4.4 Report vIOMMU to hvmloader
Hvmloader is in charge of building ACPI tables for the guest OS, and
the OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to
know whether the vIOMMU is enabled or not, and its capabilities, to
prepare the ACPI DMAR table for the guest OS.

There are three ways to do that.
1) Extend struct hvm_info_table and add variables in it to pass vIOMMU
information to hvmloader. But this requires adding a new xc interface
to use struct hvm_info_table in Qemu.

2) Pass vIOMMU information to hvmloader via Xenstore

3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
This solution is already present in the vNVDIMM design(4.3.1
Building Guest ACPI Tables
http://www.gossamer-threads.com/lists/xen/devel/439766).

The third option seems cleaner: hvmloader doesn't need to deal with
vIOMMU details and just passes the DMAR table through to the guest OS.
All vIOMMU-specific handling happens in the dummy xen-vIOMMU driver.






* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 14:14                             ` Xen virtual IOMMU high level design doc V2 Lan Tianyu
@ 2016-10-18 19:17                               ` Andrew Cooper
  2016-10-20  9:53                                 ` Tian, Kevin
  2016-10-20 14:17                                 ` Lan Tianyu
  2016-10-18 20:26                               ` Konrad Rzeszutek Wilk
  2016-10-26  9:36                               ` Jan Beulich
  2 siblings, 2 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-10-18 19:17 UTC (permalink / raw)
  To: Lan Tianyu, Jan Beulich, Kevin Tian, yang.zhang.wz, Jun Nakajima,
	Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 18/10/16 15:14, Lan Tianyu wrote:
> Change since V1:
>     1) Update motivation for Xen vIOMMU - 288 vcpus support part
>     2) Change definition of struct xen_sysctl_viommu_op
>     3) Update "3.5 Implementation consideration" to explain why we
> needs to enable l2 translation first.
>     4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
> on the emulated I440 chipset.
>     5) Remove stale statement in the "3.3 Interrupt remapping"
>
> Content:
> ===============================================================================
>
> 1. Motivation of vIOMMU
>     1.1 Enable more than 255 vcpus
>     1.2 Support VFIO-based user space driver
>     1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
>     2.1 l2 translation overview
>     2.2 Interrupt remapping overview
> 3. Xen hypervisor
>     3.1 New vIOMMU hypercall interface
>     3.2 l2 translation
>     3.3 Interrupt remapping
>     3.4 l1 translation
>     3.5 Implementation consideration
> 4. Qemu
>     4.1 Qemu vIOMMU framework
>     4.2 Dummy xen-vIOMMU driver
>     4.3 Q35 vs. i440x
>     4.4 Report vIOMMU to hvmloader
>
>
> 1 Motivation for Xen vIOMMU
> ===============================================================================
>
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than

Pin ?

Also, grammatically speaking, I think you mean "each vcpu to separate
pcpus".

> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.

This is only a requirement for xapic interrupt sources.  x2apic
interrupt sources already deliver correctly.

> So we need to add vIOMMU before enabling >255 vcpus.
>
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.

How is userspace supposed to drive this interface?  I can't picture how
it would function.

>
>
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.

As an aside, how is IGD intending to support SVM?  Will it be with PCIe
ATS/PASID, or something rather more magic as IGD is on the same piece of
silicon?

>
> 2. Xen vIOMMU Architecture
> ================================================================================
>
>
> * vIOMMU will be inside Xen hypervisor for following factors
>     1) Avoid round trips between Qemu and Xen hypervisor
>     2) Ease of integration with the rest of the hypervisor
>     3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2
> translation.
>
> 2.1 l2 translation overview
> For Virtual PCI device, dummy xen-vIOMMU does translation in the
> Qemu via new hypercall.
>
> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
>
> The following diagram shows l2 translation architecture.

Which scenario is this?  Is this the passthrough case where the Qemu
Virtual PCI device is a shadow of the real PCI device in hardware?

> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
>
> 2.2 Interrupt remapping overview.
> Interrupts from virtual devices and physical devices will be delivered
> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this
> procedure.
>
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hyperviosr     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                                +-------------------+
>                                |   PCI Device      |
>                                +-------------------+
>
>
>
>
> 3 Xen hypervisor
> ==========================================================================
>
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> This hypercall should also support pv IOMMU which is still under RFC
> review. Here only covers non-pv part.
>
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
> parameter.

Why did you choose sysctl?  As these are per-domain, domctl would be a
more logical choice.  However, neither of these should be usable by
Qemu, and we are trying to split out "normal qemu operations" into dmops
which can be safely deprivileged.

This functionality seems like it lives logically beside the ioreq server
hypercalls, wherever they eventually end up.

>
> struct xen_sysctl_viommu_op {
>     u32 cmd;
>     u32 domid;
>     union {
>         struct {
>             u32 capabilities;
>         } query_capabilities;
>         struct {
>             u32 capabilities;
>             u64 base_address;
>         } create_iommu;
>             struct {
>             /* IN parameters. */
>             u16 segment;
>                     u8  bus;
>                     u8  devfn;

I think this would be cleaner as u32 vsbdf, which makes it clear which
address space to look for sbdf in.

>                     u64 iova;
>                 /* Out parameters. */
>                     u64 translated_addr;
>                     u64 addr_mask; /* Translation page size */
>                     IOMMUAccessFlags permisson;

How is this translation intended to be used?  How do you plan to avoid
race conditions where qemu requests a translation, receives one, the
guest invalidated the mapping, and then qemu tries to use its translated
address?

There are only two ways I can see of doing this race-free.  One is to
implement a "memcpy with translation" hypercall, and the other is to
require the use of ATS in the vIOMMU, where the guest OS is required to
wait for a positive response from the vIOMMU before it can safely reuse
the mapping.

The former behaves like real hardware in that an intermediate entity
performs the translation without interacting with the DMA source.  The
latter explicitly exposing the fact that caching is going on at the
endpoint to the OS.

>             } l2_translation;       
> };
>
> typedef enum {
>     IOMMU_NONE = 0,
>     IOMMU_RO   = 1,
>     IOMMU_WO   = 2,
>     IOMMU_RW   = 3,
> } IOMMUAccessFlags;

No enumerations in an ABI please.  They are not stable in C.  Please use
a u32 and more #define's

>
>
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability        0
> #define XEN_SYSCTL_viommu_create            1
> #define XEN_SYSCTL_viommu_destroy            2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
>
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_l1_translation    (1 << 0)
> #define XEN_VIOMMU_CAPABILITY_l2_translation    (1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)

How are vIOMMUs going to be modelled to guests?  On real hardware, they
all seem to end associated with a PCI device of some sort, even if it is
just the LPC bridge.

How do we deal with multiple vIOMMUs in a single guest?

>
>
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>        Get vIOMMU capabilities(l1/l2 translation and interrupt
> remapping).
>
> - XEN_SYSCTL_viommu_create
>       Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
> base address.
>
> - XEN_SYSCTL_viommu_destroy
>       Destory vIOMMU in Xen hypervisor with dom_id as parameters.
>
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>       Translate IOVA to GPA for specified virtual PCI device with dom id,
> PCI device's bdf and IOVA and xen hypervisor returns translated GPA,
> address mask and access permission.
>
>
> 3.2 l2 translation
> 1) For virtual PCI device
> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
> hypercall when DMA operation happens.
>
> 2) For physical PCI device
> DMA operations go though physical IOMMU directly and IO page table for
> IOVA->HPA should be loaded into physical IOMMU. When guest updates
> l2 Page-table pointer field, it provides IO page table for
> IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
> GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
> Page-table pointer to context entry of physical IOMMU.

How are you proposing to do this shadowing?  Do we need to trap and
emulate all writes to the vIOMMU pagetables, or is there a better way to
know when the mappings need invalidating?

>
> Now all PCI devices in same hvm domain share one IO page table
> (GPA->HPA) in physical IOMMU driver of Xen. To support l2
> translation of vIOMMU, IOMMU driver need to support multiple address
> spaces per device entry. Using existing IO page table(GPA->HPA)
> defaultly and switch to shadow IO page table(IOVA->HPA) when l2
> translation function is enabled. These change will not affect current
> P2M logic.
>
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> according interrupt remapping table.
>
>
> 3.4 l1 translation
> When nested translation is enabled, any address generated by l1
> translation is used as the input address for nesting with l2
> translation. Physical IOMMU needs to enable both l1 and l2 translation
> in nested translation mode(GVA->GPA->HPA) for passthrough
> device.

All these l1 and l2 translations are getting confusing.  Could we
perhaps call them guest translation and host translation, or is that
likely to cause other problems?

>
> VT-d context entry points to guest l1 translation table which
> will be nest-translated by l2 translation table and so it
> can be directly linked to context entry of physical IOMMU.
>
> To enable l1 translation in VM
> 1) Xen IOMMU driver enables nested translation mode
> 2) Update GPA root of guest l1 translation table to context entry
> of physical IOMMU.
>
> All handles are in hypervisor and no interaction with Qemu.
>
>
> 3.5 Implementation consideration
> VT-d spec doesn't define a capability bit for the l2 translation.
> Architecturally there is no way to tell guest that l2 translation
> capability is not available. Linux Intel IOMMU driver thinks l2
> translation is always available when VTD exits and fail to be loaded
> without l2 translation support even if interrupt remapping and l1
> translation are available. So it needs to enable l2 translation first
> before other functions.

What then is the purpose of the nested translation support bit in the
extended capability register?

>
>
> 4 Qemu
> ==============================================================================
>
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> and AMD IOMMU) and report it in the guest ACPI table. So on the Xen side,
> a dummy xen-vIOMMU wrapper is required to connect with the actual vIOMMU
> in Xen. This matters especially for l2 translation of virtual PCI
> devices, because virtual PCI devices are emulated in Qemu. Qemu's vIOMMU
> framework provides a callback to handle l2 translation when DMA
> operations of virtual PCI devices happen.
>
>
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> and Shared Virtual Memory) via hypercall.
>
> 2) Create vIOMMU in Xen hypervisor via new hypercall with DRHD register
> address and desired capability as parameters. Destroy vIOMMU when VM is
> closed.
>
> 3) Virtual PCI device's l2 translation
> Qemu already provides a DMA translation hook, which is called when DMA
> translation of a virtual PCI device happens. The dummy xen-vIOMMU passes
> the device bdf and IOVA into the Xen hypervisor via the new iommu
> hypercall and gets back the translated GPA.
>
>
> 4.3 Q35 vs I440x
> VT-d was introduced with the Q35 chipset. The previous concern was that
> VT-d drivers assume VT-d only exists on Q35 and newer chipsets, so we
> would have to enable Q35 first. After experiments, Linux/Windows guests
> can boot on the emulated I440x chipset with VT-d, and the VT-d driver
> enables the interrupt remapping function. So we can skip Q35 support and
> implement vIOMMU directly.

This is good to know.

>
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and the
> OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to know
> whether vIOMMU is enabled and its capabilities in order to prepare the
> ACPI DMAR table for the guest OS.
>
> There are three ways to do that.
> 1) Extend struct hvm_info_table and add variables in the struct
> hvm_info_table to pass vIOMMU information to hvmloader. But this
> requires adding a new xc interface so Qemu can use struct
> hvm_info_table.
>
> 2) Pass vIOMMU information to hvmloader via Xenstore
>
> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design(4.3.1
> Building Guest ACPI Tables
> http://www.gossamer-threads.com/lists/xen/devel/439766).
>
> The third option seems clearest: hvmloader doesn't need to deal with
> vIOMMU details and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling is done in the dummy xen-vIOMMU driver.

Part of ACPI table building has now moved into the toolstack.  Unless
the table needs creating dynamically (which doesn't appear to be the
case), it can be done without any further communication.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 14:14                             ` Xen virtual IOMMU high level design doc V2 Lan Tianyu
  2016-10-18 19:17                               ` Andrew Cooper
@ 2016-10-18 20:26                               ` Konrad Rzeszutek Wilk
  2016-10-20 10:11                                 ` Tian, Kevin
  2016-10-20 14:56                                 ` Lan, Tianyu
  2016-10-26  9:36                               ` Jan Beulich
  2 siblings, 2 replies; 86+ messages in thread
From: Konrad Rzeszutek Wilk @ 2016-10-18 20:26 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: yang.zhang.wz, Kevin Tian, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, xuquan8, xen-devel, Jun Nakajima,
	anthony.perard, Roger Pau Monne

On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:
> Change since V1:
> 	1) Update motivation for Xen vIOMMU - 288 vcpus support part
> 	2) Change definition of struct xen_sysctl_viommu_op
> 	3) Update "3.5 Implementation consideration" to explain why we need to
> enable l2 translation first.
> 	4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on the
> emulated I440 chipset.
> 	5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> ===============================================================================
> 1. Motivation of vIOMMU
> 	1.1 Enable more than 255 vcpus
> 	1.2 Support VFIO-based user space driver
> 	1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 	2.1 l2 translation overview
> 	2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 	3.1 New vIOMMU hypercall interface
> 	3.2 l2 translation
> 	3.3 Interrupt remapping
> 	3.4 l1 translation
> 	3.5 Implementation consideration
> 4. Qemu
> 	4.1 Qemu vIOMMU framework
> 	4.2 Dummy xen-vIOMMU driver
> 	4.3 Q35 vs. i440x
> 	4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===============================================================================
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VMs to provide high-performance parallel
> computing, and we hope to create a huge VM with >255 vcpus on one machine
> to meet such a requirement, pinning each vcpu to a separate pcpu. Support
> for more than 255 vcpus requires x2APIC, and Linux disables x2APIC mode
> if there is no interrupt remapping function, which is provided by vIOMMU.
> The interrupt remapping function helps deliver interrupts to vcpus >255.
> So we need to add vIOMMU before enabling >255 vcpus.

What about Windows? Does it care about this?

> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on
> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> ================================================================================
> 
> * vIOMMU will be inside Xen hypervisor for following factors
> 	1) Avoid round trips between Qemu and Xen hypervisor
> 	2) Ease of integration with the rest of the hypervisor
> 	3) HVMlite/PVH doesn't use Qemu
> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
> /destory vIOMMU in hypervisor and deal with virtual PCI device's l2

destroy
> translation.
> 
> 2.1 l2 translation overview
> For a virtual PCI device, the dummy xen-vIOMMU does translation in Qemu
> via the new hypercall.
> 
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads the page table into the
> physical IOMMU.
> 
> The following diagram shows l2 translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview.
> Interrupts from virtual devices and physical devices will be delivered
> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this
> procedure.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                                +-------------------+
>                                |   PCI Device      |
>                                +-------------------+
> 
> 
> 
> 
> 3 Xen hypervisor
> ==========================================================================
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> This hypercall should also support pv IOMMU, which is still under RFC
> review. Only the non-pv part is covered here.
> 
> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> 
> struct xen_sysctl_viommu_op {
> 	u32 cmd;
> 	u32 domid;
> 	union {
> 		struct {
> 			u32 capabilities;
> 		} query_capabilities;
> 		struct {
> 			u32 capabilities;
> 			u64 base_address;
> 		} create_iommu;
> 		struct {
> 			/* IN parameters. */
> 			u16 segment;
> 			u8  bus;
> 			u8  devfn;
> 			u64 iova;
> 			/* OUT parameters. */
> 			u64 translated_addr;
> 			u64 addr_mask; /* Translation page size */
> 			IOMMUAccessFlags permission;
> 		} l2_translation;
> 	};
> };
> 
> typedef enum {
> 	IOMMU_NONE = 0,
> 	IOMMU_RO   = 1,
> 	IOMMU_WO   = 2,
> 	IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> 
> Definition of VIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability		0
> #define XEN_SYSCTL_viommu_create			1
> #define XEN_SYSCTL_viommu_destroy			2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_l1_translation	(1 << 0)
> #define XEN_VIOMMU_CAPABILITY_l2_translation	(1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)
> 
> 
> 2) Design for subops
> - XEN_SYSCTL_viommu_query_capability
>        Get vIOMMU capabilities(l1/l2 translation and interrupt
> remapping).
> 
> - XEN_SYSCTL_viommu_create
>       Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
> base address.
> 
> - XEN_SYSCTL_viommu_destroy
>       Destroy vIOMMU in Xen hypervisor with dom_id as parameter.
> 
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>       Translate IOVA to GPA for a specified virtual PCI device, given the
> dom_id, the PCI device's bdf and the IOVA; the Xen hypervisor returns the
> translated GPA, address mask and access permission.
> 
> 
> 3.2 l2 translation
> 1) For virtual PCI device
> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the
> new hypercall when a DMA operation happens.
> 
> 2) For physical PCI device
> DMA operations go through the physical IOMMU directly, so an IO page
> table for IOVA->HPA should be loaded into the physical IOMMU. When the
> guest updates the l2 Page-table pointer field, it provides an IO page
> table for IOVA->GPA. vIOMMU needs to shadow the l2 translation table,
> translate GPA->HPA, and write the shadow page table (IOVA->HPA) pointer
> into the l2 Page-table pointer of the physical IOMMU's context entry.
> 
> Now all PCI devices in the same hvm domain share one IO page table
> (GPA->HPA) in the physical IOMMU driver of Xen. To support l2
> translation of vIOMMU, the IOMMU driver needs to support multiple address
> spaces per device entry. Using existing IO page table(GPA->HPA)
> defaultly and switch to shadow IO page table(IOVA->HPA) when l2

defaultly?

> translation function is enabled. These changes will not affect current
> P2M logic.

What happens if the guest's IO page tables have incorrect values?

For example the guest sets up the pagetables to cover some section
of HPA ranges (which are all good and permitted). But then during execution
the guest kernel decides to muck around with the pagetables and adds an HPA
range that is outside what the guest has been allocated.

What then?
> 
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> > hooks in the vmsi_deliver() and ioapic_deliver() to find the target vlapic
> > according to the interrupt remapping table.
> 
> 
> 3.4 l1 translation
> When nested translation is enabled, any address generated by l1
> translation is used as the input address for nesting with l2
> translation. Physical IOMMU needs to enable both l1 and l2 translation
> in nested translation mode(GVA->GPA->HPA) for passthrough
> device.
> 
> > The VT-d context entry points to the guest l1 translation table, which
> > will be nest-translated by the l2 translation table, so it can be
> > linked directly to the context entry of the physical IOMMU.

I think this means that the shared_ept will be disabled?
>
What about different versions of contexts? Say the V1 is exposed
to guest but the hardware supports V2? Are there any flags that have
swapped positions? Or is it pretty backwards compatible?
 
> To enable l1 translation in VM
> 1) Xen IOMMU driver enables nested translation mode
> 2) Update GPA root of guest l1 translation table to context entry
> of physical IOMMU.
> 
> All handles are in hypervisor and no interaction with Qemu.

All is handled in hypervisor.
> 
> 
> 3.5 Implementation consideration
> VT-d spec doesn't define a capability bit for the l2 translation.
> Architecturally there is no way to tell guest that l2 translation
> capability is not available. Linux Intel IOMMU driver thinks l2
> translation is always available when VTD exists and fails to be loaded
> without l2 translation support even if interrupt remapping and l1
> translation are available. So it needs to enable l2 translation first

I am lost on that sentence. Are you saying that it tries to load
the IOVA and if they fail.. then it keeps on going? What is the result
of this? That you can't do IOVA (so can't use vfio ?)

> before other functions.
> 
> 
> 4 Qemu
> ==============================================================================
> 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> > and AMD IOMMU) and report it in the guest ACPI table. So on the Xen side,
> > a dummy xen-vIOMMU wrapper is required to connect with the actual vIOMMU
> > in Xen. This matters especially for l2 translation of virtual PCI
> > devices, because virtual PCI devices are emulated in Qemu. Qemu's vIOMMU
> > framework provides a callback to handle l2 translation when DMA
> > operations of virtual PCI devices happen.

You say AMD and Intel. This sounds quite OS agnostic. Does it mean you
could expose a vIOMMU to a guest and actually use the AMD IOMMU
in the hypervisor?
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> and Shared Virtual Memory) via hypercall.
> 
> 2) Create vIOMMU in Xen hypervisor via new hypercall with DRHD register
> address and desired capability as parameters. Destroy vIOMMU when VM is
> closed.
> 
> 3) Virtual PCI device's l2 translation
> Qemu already provides a DMA translation hook, which is called when DMA
> translation of a virtual PCI device happens. The dummy xen-vIOMMU passes
> the device bdf and IOVA into the Xen hypervisor via the new iommu
> hypercall and gets back the translated GPA.
> 
> 
> 4.3 Q35 vs I440x
> VT-d was introduced with the Q35 chipset. The previous concern was that
> VT-d drivers assume VT-d only exists on Q35 and newer chipsets, so we
> would have to enable Q35 first. After experiments, Linux/Windows guests
> can boot on the emulated I440x chipset with VT-d, and the VT-d driver
> enables the interrupt remapping function. So we can skip Q35 support and
> implement vIOMMU directly.
> 
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and the
> OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs to know
> whether vIOMMU is enabled and its capabilities in order to prepare the
> ACPI DMAR table for the guest OS.
> 
> There are three ways to do that.
> 1) Extend struct hvm_info_table and add variables in the struct
> hvm_info_table to pass vIOMMU information to hvmloader. But this
> requires adding a new xc interface so Qemu can use struct hvm_info_table.
> 
> 2) Pass vIOMMU information to hvmloader via Xenstore
> 
> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design(4.3.1
> Building Guest ACPI Tables
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> The third option seems clearest: hvmloader doesn't need to deal with
> vIOMMU details and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling is done in the dummy xen-vIOMMU driver.
> 
> 
> 
> 


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 19:17                               ` Andrew Cooper
@ 2016-10-20  9:53                                 ` Tian, Kevin
  2016-10-20 18:10                                   ` Andrew Cooper
  2016-10-20 14:17                                 ` Lan Tianyu
  1 sibling, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-10-20  9:53 UTC (permalink / raw)
  To: Andrew Cooper, Lan, Tianyu, Jan Beulich, yang.zhang.wz, Nakajima,
	Jun, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Wednesday, October 19, 2016 3:18 AM
> 
> >
> > 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> > It relies on the l2 translation capability (IOVA->GPA) on
> > vIOMMU. pIOMMU l2 becomes a shadowing structure of
> > vIOMMU to isolate DMA requests initiated by user space driver.
> 
> How is userspace supposed to drive this interface?  I can't picture how
> it would function.

Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space
driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will 
export a "caching mode" capability to indicate all guest PTE changes 
requiring explicit vIOMMU cache invalidations. Through trapping of those
invalidation requests, Xen can update corresponding shadow PTEs (gIOVA
->HPA). When the DMA mapping is established, the user space driver programs
gIOVA addresses as the DMA destination into the assigned device; upstream
DMA requests from the device then carry gIOVAs, which are translated to HPA
by the pIOMMU shadow page table.

> 
> >
> >
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > It relies on the l1 translation table capability (GVA->GPA) on
> > vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> > mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> > is the main usage today (to support OpenCL 2.0 SVM feature). In the
> > future SVM might be used by other I/O devices too.
> 
> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
> ATS/PASID, or something rather more magic as IGD is on the same piece of
> silicon?

Although integrated, IGD conforms to standard PCIe PASID convention.

> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exists and fails to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> > before other functions.
> 
> What then is the purpose of the nested translation support bit in the
> extended capability register?
> 

Nested translation is for SVM virtualization. Given a DMA transaction
containing a PASID, the VT-d engine first finds the 1st-level translation
table through the PASID to translate from GVA to GPA; once the nested
translation capability is enabled, it further translates GPA to HPA using
the 2nd-level translation table. Bare-metal usage is not expected to turn
on this nested bit.

Thanks
Kevin


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 20:26                               ` Konrad Rzeszutek Wilk
@ 2016-10-20 10:11                                 ` Tian, Kevin
  2016-10-20 14:56                                 ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-10-20 10:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk, Lan, Tianyu
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, xen-devel, Nakajima, Jun,
	anthony.perard, Roger Pau Monne

> From: Konrad Rzeszutek Wilk [mailto:konrad.wilk@oracle.com]
> Sent: Wednesday, October 19, 2016 4:27 AM
> >
> > 2) For physical PCI device
> > DMA operations go through physical IOMMU directly and IO page table for
> > IOVA->HPA should be loaded into physical IOMMU. When guest updates
> > l2 Page-table pointer field, it provides IO page table for
> > IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
> > GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
> > Page-table pointer to context entry of physical IOMMU.
> >
> > Now all PCI devices in same hvm domain share one IO page table
> > (GPA->HPA) in physical IOMMU driver of Xen. To support l2
> > translation of vIOMMU, IOMMU driver need to support multiple address
> > spaces per device entry. Using existing IO page table(GPA->HPA)
> > defaultly and switch to shadow IO page table(IOVA->HPA) when l2
> 
> defaultly?
> 
> > translation function is enabled. These change will not affect current
> > P2M logic.
> 
> What happens if the guests IO page tables have incorrect values?
> 
> For example the guest sets up the pagetables to cover some section
> of HPA ranges (which are all good and permitted). But then during execution
> the guest kernel decides to muck around with the pagetables and adds an HPA
> range that is outside what the guest has been allocated.
> 
> What then?

The shadow PTE is controlled by the hypervisor. Whatever IOVA->GPA mapping
is in the guest PTE must be validated (IOVA->GPA->HPA) before being written
into the shadow PTE. So regardless of when the guest mucks with its PTEs,
the operation is always trapped and validated. Why do you think there is a
problem?

Also, the guest only sees GPAs. All it can operate on is GPA ranges.

> >
> > 3.3 Interrupt remapping
> > Interrupts from virtual devices and physical devices will be delivered
> > to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
> > hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
> > according interrupt remapping table.
> >
> >
> > 3.4 l1 translation
> > When nested translation is enabled, any address generated by l1
> > translation is used as the input address for nesting with l2
> > translation. Physical IOMMU needs to enable both l1 and l2 translation
> > in nested translation mode(GVA->GPA->HPA) for passthrough
> > device.
> >
> > VT-d context entry points to guest l1 translation table which
> > will be nest-translated by l2 translation table and so it
> > can be directly linked to context entry of physical IOMMU.
> 
> I think this means that the shared_ept will be disabled?
> >
> What about different versions of contexts? Say the V1 is exposed
> to guest but the hardware supports V2? Are there any flags that have
> swapped positions? Or is it pretty backwards compatible?

yes, backward compatible.

> >
> >
> > 3.5 Implementation consideration
> > VT-d spec doesn't define a capability bit for the l2 translation.
> > Architecturally there is no way to tell guest that l2 translation
> > capability is not available. Linux Intel IOMMU driver thinks l2
> > translation is always available when VTD exists and fails to be loaded
> > without l2 translation support even if interrupt remapping and l1
> > translation are available. So it needs to enable l2 translation first
> 
> I am lost on that sentence. Are you saying that it tries to load
> the IOVA and if they fail.. then it keeps on going? What is the result
> of this? That you can't do IOVA (so can't use vfio ?)

It's about VT-d capability. VT-d supports both 1st-level and 2nd-level 
translation, however only the 1st-level translation can be optionally
reported through a capability bit. There is no capability bit to say
a version doesn't support 2nd-level translation. The implication is
that, as long as a vIOMMU is exposed, the guest IOMMU driver always
assumes IOVA capability is available through 2nd-level translation.

So we can first emulate a vIOMMU w/ only 2nd-level capability, and
then extend it to support 1st-level and interrupt remapping, but cannot 
do the reverse direction. I think Tianyu's point is more to describe 
enabling sequence based on this fact. :-)

> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create virtual IOMMU(e.g. virtual intel VTD and
> > AMD IOMMU) and report in guest ACPI table. So for Xen side, a dummy
> > xen-vIOMMU wrapper is required to connect with actual vIOMMU in Xen.
> > Especially for l2 translation of virtual PCI device because
> > emulations of virtual PCI devices are in the Qemu. Qemu's vIOMMU
> > framework provides callback to deal with l2 translation when
> > DMA operations of virtual PCI devices happen.
> 
> You say AMD and Intel. This sounds quite OS agnostic. Does it mean you
> could expose an vIOMMU to a guest and actually use the AMD IOMMU
> in the hypervisor?

Did you mean "expose an Intel vIOMMU to guest and then use physical
AMD IOMMU in hypervisor"? I didn't think about this, but what's the value
of doing so? :-)
 
Thanks
Kevin


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 19:17                               ` Andrew Cooper
  2016-10-20  9:53                                 ` Tian, Kevin
@ 2016-10-20 14:17                                 ` Lan Tianyu
  2016-10-20 20:36                                   ` Andrew Cooper
  1 sibling, 1 reply; 86+ messages in thread
From: Lan Tianyu @ 2016-10-20 14:17 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Kevin Tian, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

Hi Andrew:
	Thanks for your review.

On 2016年10月19日 03:17, Andrew Cooper wrote:
> On 18/10/16 15:14, Lan Tianyu wrote:
>> Change since V1:
>>     1) Update motivation for Xen vIOMMU - 288 vcpus support part
>>     2) Change definition of struct xen_sysctl_viommu_op
>>     3) Update "3.5 Implementation consideration" to explain why we
>> need to enable l2 translation first.
>>     4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work
>> on the emulated I440 chipset.
>>     5) Remove stale statement in the "3.3 Interrupt remapping"
>>
>> Content:
>> ===============================================================================
>>
>> 1. Motivation of vIOMMU
>>     1.1 Enable more than 255 vcpus
>>     1.2 Support VFIO-based user space driver
>>     1.3 Support guest Shared Virtual Memory (SVM)
>> 2. Xen vIOMMU Architecture
>>     2.1 l2 translation overview
>>     2.2 Interrupt remapping overview
>> 3. Xen hypervisor
>>     3.1 New vIOMMU hypercall interface
>>     3.2 l2 translation
>>     3.3 Interrupt remapping
>>     3.4 l1 translation
>>     3.5 Implementation consideration
>> 4. Qemu
>>     4.1 Qemu vIOMMU framework
>>     4.2 Dummy xen-vIOMMU driver
>>     4.3 Q35 vs. i440x
>>     4.4 Report vIOMMU to hvmloader
>>
>>
>> 1 Motivation for Xen vIOMMU
>> ===============================================================================
>>
>> 1.1 Enable more than 255 vcpu support
>> HPC cloud service requires VM provides high performance parallel
>> computing and we hope to create a huge VM with >255 vcpu on one machine
>> to meet such requirement.Ping each vcpus on separated pcpus. More than
>
> Pin ?
>

Sorry, it's a typo.

> Also, grammatically speaking, I think you mean "each vcpu to separate
> pcpus".


Yes.

>
>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>> there is no interrupt remapping function which is present by vIOMMU.
>> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
>
> This is only a requirement for xapic interrupt sources.  x2apic
> interrupt sources already deliver correctly.

The key is the APIC ID. There is no modification to the existing PCI MSI
and IOAPIC formats with the introduction of x2apic. PCI MSI/IOAPIC can
only send an interrupt message containing an 8-bit APIC ID, which cannot
address >255 cpus. Interrupt remapping supports 32-bit APIC IDs, so it is
necessary for enabling >255 cpus with x2apic mode.

If the LAPIC is in x2apic mode while interrupt remapping is disabled, the
IOAPIC cannot deliver interrupts to all cpus in the system if #cpu > 255.


>
>>
>>
>> 1.3 Support guest SVM (Shared Virtual Memory)
>> It relies on the l1 translation table capability (GVA->GPA) on
>> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
>> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
>> is the main usage today (to support OpenCL 2.0 SVM feature). In the
>> future SVM might be used by other I/O devices too.
>
> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
> ATS/PASID, or something rather more magic as IGD is on the same piece of
> silicon?

IGD on Skylake supports PCIe PASID.


>
>>
>> 2. Xen vIOMMU Architecture
>> ================================================================================
>>
>>
>> * vIOMMU will be inside Xen hypervisor for following factors
>>     1) Avoid round trips between Qemu and Xen hypervisor
>>     2) Ease of integration with the rest of the hypervisor
>>     3) HVMlite/PVH doesn't use Qemu
>> * Dummy xen-vIOMMU in Qemu as a wrapper of new hypercall to create
>> /destroy vIOMMU in hypervisor and deal with virtual PCI device's l2
>> translation.
>>
>> 2.1 l2 translation overview
>> For Virtual PCI device, dummy xen-vIOMMU does translation in the
>> Qemu via new hypercall.
>>
>> For physical PCI device, vIOMMU in hypervisor shadows IO page table from
>> IOVA->GPA to IOVA->HPA and load page table to physical IOMMU.
>>
>> The following diagram shows l2 translation architecture.
>
> Which scenario is this?  Is this the passthrough case where the Qemu
> Virtual PCI device is a shadow of the real PCI device in hardware?
>

No, this covers both the traditional virtual PCI devices emulated by
Qemu and passthrough PCI devices.


>> +---------------------------------------------------------+
>> |Qemu                                +----------------+   |
>> |                                    |     Virtual    |   |
>> |                                    |   PCI device   |   |
>> |                                    |                |   |
>> |                                    +----------------+   |
>> |                                            |DMA         |
>> |                                            V            |
>> |  +--------------------+   Request  +----------------+   |
>> |  |                    +<-----------+                |   |
>> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
>> |  |                    +----------->+                |   |
>> |  +---------+----------+            +-------+--------+   |
>> |            |                               |            |
>> |            |Hypercall                      |            |
>> +--------------------------------------------+------------+
>> |Hypervisor  |                               |            |
>> |            |                               |            |
>> |            v                               |            |
>> |     +------+------+                        |            |
>> |     |   vIOMMU    |                        |            |
>> |     +------+------+                        |            |
>> |            |                               |            |
>> |            v                               |            |
>> |     +------+------+                        |            |
>> |     | IOMMU driver|                        |            |
>> |     +------+------+                        |            |
>> |            |                               |            |
>> +--------------------------------------------+------------+
>> |HW          v                               V            |
>> |     +------+------+                 +-------------+     |
>> |     |   IOMMU     +---------------->+  Memory     |     |
>> |     +------+------+                 +-------------+     |
>> |            ^                                            |
>> |            |                                            |
>> |     +------+------+                                     |
>> |     | PCI Device  |                                     |
>> |     +-------------+                                     |
>> +---------------------------------------------------------+
>>
>> 2.2 Interrupt remapping overview.
>> Interrupts from virtual devices and physical devices will be delivered
>> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this
>> procedure.
>>
>> +---------------------------------------------------+
>> |Qemu                       |VM                     |
>> |                           | +----------------+    |
>> |                           | |  Device driver |    |
>> |                           | +--------+-------+    |
>> |                           |          ^            |
>> |       +----------------+  | +--------+-------+    |
>> |       | Virtual device |  | |  IRQ subsystem |    |
>> |       +-------+--------+  | +--------+-------+    |
>> |               |           |          ^            |
>> |               |           |          |            |
>> +---------------------------+-----------------------+
>> |hypervisor     |                      | VIRQ       |
>> |               |            +---------+--------+   |
>> |               |            |      vLAPIC      |   |
>> |               |            +---------+--------+   |
>> |               |                      ^            |
>> |               |                      |            |
>> |               |            +---------+--------+   |
>> |               |            |      vIOMMU      |   |
>> |               |            +---------+--------+   |
>> |               |                      ^            |
>> |               |                      |            |
>> |               |            +---------+--------+   |
>> |               |            |   vIOAPIC/vMSI   |   |
>> |               |            +----+----+--------+   |
>> |               |                 ^    ^            |
>> |               +-----------------+    |            |
>> |                                      |            |
>> +---------------------------------------------------+
>> HW                                     |IRQ
>>                                +-------------------+
>>                                |   PCI Device      |
>>                                +-------------------+
>>
>>
>>
>>
>> 3 Xen hypervisor
>> ==========================================================================
>>
>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>> This hypercall should also support the pv IOMMU, which is still under
>> RFC review. Here we only cover the non-pv part.
>>
>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
>> parameter.
>
> Why did you choose sysctl?  As these are per-domain, domctl would be a
> more logical choice.  However, neither of these should be usable by
> Qemu, and we are trying to split out "normal qemu operations" into dmops
> which can be safely deprivileged.
>

Do you know the current status of dmop? I only found some design
discussions on the mailing list. Could we use domctl first and move to
dmop when it's ready?

>
>>
>> struct xen_sysctl_viommu_op {
>>     u32 cmd;
>>     u32 domid;
>>     union {
>>         struct {
>>             u32 capabilities;
>>         } query_capabilities;
>>         struct {
>>             u32 capabilities;
>>             u64 base_address;
>>         } create_iommu;
>>         struct {
>>             /* IN parameters. */
>>             u16 segment;
>>             u8  bus;
>>             u8  devfn;
>
> I think this would be cleaner as u32 vsbdf, which makes it clear which
> address space to look for sbdf in.

Ok. Will update.

>
>>             u64 iova;
>>             /* Out parameters. */
>>             u64 translated_addr;
>>             u64 addr_mask; /* Translation page size */
>>             IOMMUAccessFlags permission;
>
> How is this translation intended to be used?  How do you plan to avoid
> race conditions where qemu requests a translation, receives one, the
> guest invalidated the mapping, and then qemu tries to use its translated
> address?
>
> There are only two ways I can see of doing this race-free.  One is to
> implement a "memcpy with translation" hypercall, and the other is to
> require the use of ATS in the vIOMMU, where the guest OS is required to
> wait for a positive response from the vIOMMU before it can safely reuse
> the mapping.
>
> The former behaves like real hardware in that an intermediate entity
> performs the translation without interacting with the DMA source.  The
> latter explicitly exposing the fact that caching is going on at the
> endpoint to the OS.

The former seems to move the DMA operation into the hypervisor, but the
Qemu vIOMMU framework only passes the IOVA to the dummy xen-vIOMMU,
without the input data or access length. I will dig into this more to
figure out a solution.

>
>>             } l2_translation;
>> };
>>
>> typedef enum {
>>     IOMMU_NONE = 0,
>>     IOMMU_RO   = 1,
>>     IOMMU_WO   = 2,
>>     IOMMU_RW   = 3,
>> } IOMMUAccessFlags;
>
> No enumerations in an ABI please.  They are not stable in C.  Please use
> a u32 and more #define's.


Ok. Will update.
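For reference, a possible shape for the revised definitions (purely my sketch of what the update might look like, folding in the u32 vsbdf suggested earlier; names are illustrative, not final):

```c
#include <stdint.h>

/* Access permissions as plain #define'd bits in a u32, so the field
 * width is fixed in the ABI (enum widths are not stable in C). */
#define XEN_VIOMMU_ACCESS_R   (1u << 0)
#define XEN_VIOMMU_ACCESS_W   (1u << 1)
#define XEN_VIOMMU_ACCESS_RW  (XEN_VIOMMU_ACCESS_R | XEN_VIOMMU_ACCESS_W)

struct l2_translation {
    /* IN parameters. */
    uint32_t vsbdf;            /* virtual segment:bus:dev.fn, packed */
    uint64_t iova;
    /* OUT parameters. */
    uint64_t translated_addr;
    uint64_t addr_mask;        /* translation page size */
    uint32_t permission;       /* XEN_VIOMMU_ACCESS_* bits */
};
```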

>
>>
>>
>> Definition of VIOMMU subops:
>> #define XEN_SYSCTL_viommu_query_capability        0
>> #define XEN_SYSCTL_viommu_create            1
>> #define XEN_SYSCTL_viommu_destroy            2
>> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
>>
>> Definition of VIOMMU capabilities
>> #define XEN_VIOMMU_CAPABILITY_l1_translation    (1 << 0)
>> #define XEN_VIOMMU_CAPABILITY_l2_translation    (1 << 1)
>> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
>
> How are vIOMMUs going to be modelled to guests?  On real hardware, they
> all seem to end associated with a PCI device of some sort, even if it is
> just the LPC bridge.


This design just assumes one vIOMMU covers all PCI devices under its
specified PCI segment. The "INCLUDE_PCI_ALL" bit of the DRHD struct is
set for the vIOMMU.

> 	
> How do we deal with multiple vIOMMUs in a single guest?

For multi-vIOMMU, we need to add a new field in struct iommu_op to
designate the device scope of each vIOMMU when they are under the same
PCI segment. This also requires changing the DMAR table.

>
>>
>>
>> 2) Design for subops
>> - XEN_SYSCTL_viommu_query_capability
>>        Get vIOMMU capabilities(l1/l2 translation and interrupt
>> remapping).
>>
>> - XEN_SYSCTL_viommu_create
>>       Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
>> base address.
>>
>> - XEN_SYSCTL_viommu_destroy
>>       Destroy vIOMMU in Xen hypervisor with dom_id as parameter.
>>
>> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>>       Translate IOVA to GPA for specified virtual PCI device with dom id,
>> PCI device's bdf and IOVA and xen hypervisor returns translated GPA,
>> address mask and access permission.
>>
>>
>> 3.2 l2 translation
>> 1) For virtual PCI device
>> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
>> hypercall when DMA operation happens.
>>
>> 2) For physical PCI device
>> DMA operations go through physical IOMMU directly and IO page table for
>> IOVA->HPA should be loaded into physical IOMMU. When guest updates
>> l2 Page-table pointer field, it provides IO page table for
>> IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
>> GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
>> Page-table pointer to context entry of physical IOMMU.
>
> How are you proposing to do this shadowing?  Do we need to trap and
> emulate all writes to the vIOMMU pagetables, or is there a better way to
> know when the mappings need invalidating?

No, we don't need to trap all writes to the IO page table.
 From VT-d spec 6.1: "Reporting the Caching Mode as Set for the
virtual hardware requires the guest software to explicitly issue
invalidation operations on the virtual hardware for any/all updates to
the guest remapping structures. The virtualizing software may trap these
guest invalidation operations to keep the shadow translation structures
consistent to guest translation structure modifications, without
resorting to other less efficient techniques."
So every update of the IO page table is followed by an invalidation
operation, and we use those invalidations to drive the shadowing.
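The resulting flow can be sketched with toy flat arrays standing in for the real structures (all names and the single-level array layout are illustrative assumptions, not Xen code):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define NPAGES  16
#define INVALID UINT64_MAX

/* guest_io_pt:  the guest's IOVA->GPA table (pfn granularity)
 * p2m:          GPA->HPA
 * shadow_io_pt: IOVA->HPA, what the physical IOMMU would consume */
static uint64_t guest_io_pt[NPAGES], p2m[NPAGES], shadow_io_pt[NPAGES];

/* Called when a trapped guest invalidation covers
 * [iova_pfn, iova_pfn + count): re-read the guest table, fold in the
 * P2M, and refresh the shadow entries for that range only. */
static void shadow_invalidate(uint64_t iova_pfn, uint64_t count)
{
    for (uint64_t i = iova_pfn; i < iova_pfn + count; i++) {
        uint64_t gpa_pfn = guest_io_pt[i];
        shadow_io_pt[i] = (gpa_pfn == INVALID) ? INVALID : p2m[gpa_pfn];
    }
}
```

The point of the sketch is that only invalidated ranges are re-walked; untouched shadow entries stay valid, matching the caching-mode contract quoted above.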

>
>>
>> Now all PCI devices in same hvm domain share one IO page table
>> (GPA->HPA) in physical IOMMU driver of Xen. To support l2
>> translation of vIOMMU, IOMMU driver need to support multiple address
>> spaces per device entry. Using existing IO page table(GPA->HPA)
>> defaultly and switch to shadow IO page table(IOVA->HPA) when l2
>> translation function is enabled. These change will not affect current
>> P2M logic.
>>
>> 3.3 Interrupt remapping
>> Interrupts from virtual devices and physical devices will be delivered
>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>> according to the interrupt remapping table.
>>
>>
>> 3.4 l1 translation
>> When nested translation is enabled, any address generated by l1
>> translation is used as the input address for nesting with l2
>> translation. Physical IOMMU needs to enable both l1 and l2 translation
>> in nested translation mode(GVA->GPA->HPA) for passthrough
>> device.
>
> All these l1 and l2 translations are getting confusing.  Could we
> perhaps call them guest translation and host translation, or is that
> likely to cause other problems?

Definitions of l1 and l2 translation from the VT-d spec:
first-level translation remaps a virtual address to an intermediate
(guest) physical address;
second-level translation remaps an intermediate physical address to a
machine (host) physical address.
So maybe "guest" and "host" translation are not suitable names for them?

>
>>
>> VT-d context entry points to guest l1 translation table which
>> will be nest-translated by l2 translation table and so it
>> can be directly linked to context entry of physical IOMMU.
>>
>> To enable l1 translation in VM
>> 1) Xen IOMMU driver enables nested translation mode
>> 2) Update GPA root of guest l1 translation table to context entry
>> of physical IOMMU.
>>
>> All handles are in hypervisor and no interaction with Qemu.
>>
>>
>> 3.5 Implementation consideration
>> VT-d spec doesn't define a capability bit for the l2 translation.
>> Architecturally there is no way to tell guest that l2 translation
>> capability is not available. Linux Intel IOMMU driver thinks l2
>> translation is always available when VT-d exists, and fails to load
>> without l2 translation support even if interrupt remapping and l1
>> translation are available. So it needs to enable l2 translation first
>> before other functions.
>
> What then is the purpose of the nested translation support bit in the
> extended capability register?

It's to translate the output GPA from first-level translation (GVA->GPA) to HPA.

Detail please see VTD spec - 3.8 Nested Translation
"When Nesting Enable (NESTE) field is 1 in extended-context-entries,
requests-with-PASID translated through first-level translation are also
subjected to nested second-level translation. Such
extended-context-entries contain both the pointer to the PASID-table
(which contains the pointer to the first-level translation structures),
and the pointer to the second-level translation structures."


>> 4.4 Report vIOMMU to hvmloader
>> Hvmloader is in charge of building ACPI tables for Guest OS and OS
>> probes IOMMU via ACPI DMAR table. So hvmloder needs to know whether
>> vIOMMU is enabled or not and its capability to prepare ACPI DMAR table
>> for Guest OS.
>>
>> There are three ways to do that.
>> 1) Extend struct hvm_info_table and add variables in the struct
>> hvm_info_table to pass vIOMMU information to hvmloader. But this
>> requires to add new xc interface to use struct hvm_info_table in the
>> Qemu.
>>
>> 2) Pass vIOMMU information to hvmloader via Xenstore
>>
>> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
>> This solution is already present in the vNVDIMM design(4.3.1
>> Building Guest ACPI Tables
>> http://www.gossamer-threads.com/lists/xen/devel/439766).
>>
>> The third option seems clearer: hvmloader doesn't need to deal with
>> vIOMMU details and just passes the DMAR table through to the guest OS.
>> All vIOMMU-specific work is done in the dummy xen-vIOMMU driver.
>
> Part of ACPI table building has now moved into the toolstack.  Unless
> the table needs creating dynamically (which doesn't appear to be the
> case), it can be done without any further communication.
>

The DMAR table needs to be created according to input parameters.
E.g. when interrupt remapping is enabled, the INTR_REMAP bit in the
DMAR structure needs to be set. So we need to create the table
dynamically during VM creation.
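To illustrate why the table is parameter-dependent: the DMAR header's flags byte carries the INTR_REMAP bit, so it can only be filled in once the domain's vIOMMU configuration is known. A sketch with only the leading ACPI DMAR fields (OEM identification fields and the trailing DRHD structures omitted; the builder name is mine):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DMAR_INTR_REMAP (1u << 0)   /* ACPI DMAR flags, bit 0 */

/* Leading fields of the ACPI DMAR table; the real table is packed,
 * carries OEM identification fields, and is followed by remapping
 * structures such as DRHD, all omitted here. */
struct dmar_header {
    char     signature[4];          /* "DMAR" */
    uint32_t length;
    uint8_t  revision;
    uint8_t  checksum;
    uint8_t  host_addr_width;       /* HAW - 1 */
    uint8_t  flags;                 /* DMAR_INTR_REMAP, ... */
};

/* Fill in the configuration-dependent parts at VM-creation time. */
static void build_dmar(struct dmar_header *h,
                       unsigned int haw, int intr_remap)
{
    memset(h, 0, sizeof *h);
    memcpy(h->signature, "DMAR", 4);
    h->revision = 1;
    h->host_addr_width = (uint8_t)(haw - 1);
    h->flags = intr_remap ? DMAR_INTR_REMAP : 0;
}
```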

-- 
Best regards
Tianyu Lan

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 20:26                               ` Konrad Rzeszutek Wilk
  2016-10-20 10:11                                 ` Tian, Kevin
@ 2016-10-20 14:56                                 ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-10-20 14:56 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: yang.zhang.wz, Kevin Tian, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, xuquan8, xen-devel, Jun Nakajima,
	anthony.perard, Roger Pau Monne


On 10/19/2016 4:26 AM, Konrad Rzeszutek Wilk wrote:
> On Tue, Oct 18, 2016 at 10:14:16PM +0800, Lan Tianyu wrote:

>> 1 Motivation for Xen vIOMMU
>> ===============================================================================
>> 1.1 Enable more than 255 vcpu support
>> HPC cloud service requires VM provides high performance parallel
>> computing and we hope to create a huge VM with >255 vcpu on one machine
>> to meet such requirement.Ping each vcpus on separated pcpus. More than
>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>> there is no interrupt remapping function, which is provided by vIOMMU.
>> Interrupt remapping helps deliver interrupts when #vcpu > 255.
>> So we need to add vIOMMU before enabling >255 vcpus.
>
> What about Windows? Does it care about this?

 From our testing, a win8 guest crashes when booting with 288 vcpus
without IR, and it boots up successfully with IR.

>> 3.2 l2 translation
>> 1) For virtual PCI device
>> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
>> hypercall when DMA operation happens.
>>
>> 2) For physical PCI device
>> DMA operations go through physical IOMMU directly and IO page table for
>> IOVA->HPA should be loaded into physical IOMMU. When guest updates
>> l2 Page-table pointer field, it provides IO page table for
>> IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
>> GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
>> Page-table pointer to context entry of physical IOMMU.
>>
>> Now all PCI devices in same hvm domain share one IO page table
>> (GPA->HPA) in physical IOMMU driver of Xen. To support l2
>> translation of vIOMMU, IOMMU driver need to support multiple address
>> spaces per device entry. Using existing IO page table(GPA->HPA)
>> defaultly and switch to shadow IO page table(IOVA->HPA) when l2
>
> defaultly?

I mean the GPA->HPA mapping will be set in the assigned device's context
entry of the pIOMMU when the VM is created, just as the current code works.

>
>>
>> 3.3 Interrupt remapping
>> Interrupts from virtual devices and physical devices will be delivered
>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>> according to the interrupt remapping table.
>>
>>
>> 3.4 l1 translation
>> When nested translation is enabled, any address generated by l1
>> translation is used as the input address for nesting with l2
>> translation. Physical IOMMU needs to enable both l1 and l2 translation
>> in nested translation mode(GVA->GPA->HPA) for passthrough
>> device.
>>
>> VT-d context entry points to guest l1 translation table which
>> will be nest-translated by l2 translation table and so it
>> can be directly linked to context entry of physical IOMMU.
>
> I think this means that the shared_ept will be disabled?

The shared_ept (GPA->HPA mapping) is still used, to do nested translation
on any output from l1 translation (GVA->GPA).





* Re: Xen virtual IOMMU high level design doc V2
  2016-10-20  9:53                                 ` Tian, Kevin
@ 2016-10-20 18:10                                   ` Andrew Cooper
  0 siblings, 0 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-10-20 18:10 UTC (permalink / raw)
  To: Tian, Kevin, Lan, Tianyu, Jan Beulich, yang.zhang.wz, Nakajima,
	Jun, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 20/10/16 10:53, Tian, Kevin wrote:
>> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
>> Sent: Wednesday, October 19, 2016 3:18 AM
>>
>>> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
>>> It relies on the l2 translation capability (IOVA->GPA) on
>>> vIOMMU. pIOMMU l2 becomes a shadowing structure of
>>> vIOMMU to isolate DMA requests initiated by user space driver.
>> How is userspace supposed to drive this interface?  I can't picture how
>> it would function.
> Inside a Linux VM, VFIO provides DMA MAP/UNMAP interface to user space
> driver so gIOVA->GPA mapping can be setup on vIOMMU. vIOMMU will 
> export a "caching mode" capability to indicate all guest PTE changes 
> requiring explicit vIOMMU cache invalidations. Through trapping of those
> invalidation requests, Xen can update corresponding shadow PTEs (gIOVA
> ->HPA). When DMA mapping is established, user space driver programs 
> gIOVA addresses as DMA destination to assigned device, and then upstreaming
> DMA request out of this device contains gIOVA which is translated to HPA
> by pIOMMU shadow page table.

Ok.  So in this mode, the userspace driver owns the device, and can
choose any arbitrary gIOVA layout it chooses?  If it also programs the
DMA addresses, I guess this setup is fine.

>
>>>
>>> 1.3 Support guest SVM (Shared Virtual Memory)
>>> It relies on the l1 translation table capability (GVA->GPA) on
>>> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
>>> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
>>> is the main usage today (to support OpenCL 2.0 SVM feature). In the
>>> future SVM might be used by other I/O devices too.
>> As an aside, how is IGD intending to support SVM?  Will it be with PCIe
>> ATS/PASID, or something rather more magic as IGD is on the same piece of
>> silicon?
> Although integrated, IGD conforms to standard PCIe PASID convention.

Ok.  Any idea when hardware with SVM will be available?

>
>>> 3.5 Implementation consideration
>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>> Architecturally there is no way to tell guest that l2 translation
>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>> translation is always available when VT-d exists, and fails to load
>>> without l2 translation support even if interrupt remapping and l1
>>> translation are available. So it needs to enable l2 translation first
>>> before other functions.
>> What then is the purpose of the nested translation support bit in the
>> extended capability register?
>>
> Nested translation is for SVM virtualization. Given a DMA transaction 
> containing a PASID, VT-d engine first finds the 1st translation table 
> through PASID to translate from GVA to GPA, then once nested
> translation capability is enabled, further translate GPA to HPA using the
> 2nd level translation table. Bare-metal usage is not expected to turn
> on this nested bit.

Ok, but what happens if a guest sees a PASID-capable vIOMMU and itself
tries to turn on nesting?  E.g. nesting KVM inside Xen and trying to use
SVM from the L2 guest?

If there is no way to indicate to the L1 guest that nesting isn't
available (as it is already actually in use), and we can't shadow
entries on faults, what is supposed to happen?

~Andrew


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-20 14:17                                 ` Lan Tianyu
@ 2016-10-20 20:36                                   ` Andrew Cooper
  2016-10-22  7:32                                     ` Lan, Tianyu
  2016-10-28 15:36                                     ` Lan Tianyu
  0 siblings, 2 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-10-20 20:36 UTC (permalink / raw)
  To: Lan Tianyu, Jan Beulich, Kevin Tian, yang.zhang.wz, Jun Nakajima,
	Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne


>
>>
>>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>>> there is no interrupt remapping function, which is provided by vIOMMU.
>>> Interrupt remapping helps deliver interrupts when #vcpu > 255.
>>
>> This is only a requirement for xapic interrupt sources.  x2apic
>> interrupt sources already deliver correctly.
>
> The key is the APIC ID width. x2apic introduced no modification to the
> existing PCI MSI and IOAPIC formats: a PCI MSI/IOAPIC interrupt message
> can only carry an 8-bit APIC ID, which cannot address >255 cpus.
> Interrupt remapping supports 32-bit APIC IDs, so it's necessary for
> enabling >255 cpus in x2apic mode.
>
> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
> cannot deliver interrupts to all cpus in the system if #cpu > 255.

After spending a long time reading up on this, my first observation is
that it is very difficult to find consistent information concerning the
expected content of MSI address/data fields for x86 hardware.  Having
said that, this has been very educational.

It is now clear that any MSI message can either specify an 8 bit APIC ID
directly, or request for the message to be remapped.  Apologies for my
earlier confusion.
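For reference, the remappable alternative encodes an IRTE index instead of an APIC ID. A sketch of the remappable-format MSI address (bit positions per my reading of the VT-d spec; the helper name is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Remappable-format MSI address: bit 4 set marks the message as
 * remappable; the 16-bit interrupt index selecting the IRTE is split
 * across bits 19:5 (index[14:0]) and bit 2 (index[15]); bit 3 is
 * SubHandle Valid (SHV). */
static inline uint32_t msi_remap_addr(uint16_t index, int shv)
{
    return 0xFEE00000u
         | ((uint32_t)(index & 0x7FFFu) << 5)   /* index[14:0] */
         | ((uint32_t)(index >> 15) << 2)       /* index[15]   */
         | (1u << 4)                            /* remappable  */
         | ((uint32_t)(!!shv) << 3);            /* SHV         */
}
```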

>
>>> +---------------------------------------------------------+
>>> |Qemu                                +----------------+   |
>>> |                                    |     Virtual    |   |
>>> |                                    |   PCI device   |   |
>>> |                                    |                |   |
>>> |                                    +----------------+   |
>>> |                                            |DMA         |
>>> |                                            V            |
>>> |  +--------------------+   Request  +----------------+   |
>>> |  |                    +<-----------+                |   |
>>> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
>>> |  |                    +----------->+                |   |
>>> |  +---------+----------+            +-------+--------+   |
>>> |            |                               |            |
>>> |            |Hypercall                      |            |
>>> +--------------------------------------------+------------+
>>> |Hypervisor  |                               |            |
>>> |            |                               |            |
>>> |            v                               |            |
>>> |     +------+------+                        |            |
>>> |     |   vIOMMU    |                        |            |
>>> |     +------+------+                        |            |
>>> |            |                               |            |
>>> |            v                               |            |
>>> |     +------+------+                        |            |
>>> |     | IOMMU driver|                        |            |
>>> |     +------+------+                        |            |
>>> |            |                               |            |
>>> +--------------------------------------------+------------+
>>> |HW          v                               V            |
>>> |     +------+------+                 +-------------+     |
>>> |     |   IOMMU     +---------------->+  Memory     |     |
>>> |     +------+------+                 +-------------+     |
>>> |            ^                                            |
>>> |            |                                            |
>>> |     +------+------+                                     |
>>> |     | PCI Device  |                                     |
>>> |     +-------------+                                     |
>>> +---------------------------------------------------------+
>>>
>>> 2.2 Interrupt remapping overview.
>>> Interrupts from virtual devices and physical devices will be delivered
>>> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during
>>> this
>>> procedure.
>>>
>>> +---------------------------------------------------+
>>> |Qemu                       |VM                     |
>>> |                           | +----------------+    |
>>> |                           | |  Device driver |    |
>>> |                           | +--------+-------+    |
>>> |                           |          ^            |
>>> |       +----------------+  | +--------+-------+    |
>>> |       | Virtual device |  | |  IRQ subsystem |    |
>>> |       +-------+--------+  | +--------+-------+    |
>>> |               |           |          ^            |
>>> |               |           |          |            |
>>> +---------------------------+-----------------------+
>>> |hypervisor     |                      | VIRQ       |
>>> |               |            +---------+--------+   |
>>> |               |            |      vLAPIC      |   |
>>> |               |            +---------+--------+   |
>>> |               |                      ^            |
>>> |               |                      |            |
>>> |               |            +---------+--------+   |
>>> |               |            |      vIOMMU      |   |
>>> |               |            +---------+--------+   |
>>> |               |                      ^            |
>>> |               |                      |            |
>>> |               |            +---------+--------+   |
>>> |               |            |   vIOAPIC/vMSI   |   |
>>> |               |            +----+----+--------+   |
>>> |               |                 ^    ^            |
>>> |               +-----------------+    |            |
>>> |                                      |            |
>>> +---------------------------------------------------+
>>> HW                                     |IRQ
>>>                                +-------------------+
>>>                                |   PCI Device      |
>>>                                +-------------------+
>>>
>>>
>>>
>>>
>>> 3 Xen hypervisor
>>> ==========================================================================
>>>
>>>
>>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>>> This hypercall should also support the pv IOMMU, which is still under
>>> RFC review. Here we only cover the non-pv part.
>>>
>>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
>>> parameter.
>>
>> Why did you choose sysctl?  As these are per-domain, domctl would be a
>> more logical choice.  However, neither of these should be usable by
>> Qemu, and we are trying to split out "normal qemu operations" into dmops
>> which can be safely deprivileged.
>>
>
> Do you know the current status of dmop? I only found some design
> discussions on the mailing list. Could we use domctl first and move to
> dmop when it's ready?

I believe Paul is looking into respinning the series early in the 4.9 dev
cycle.  I expect it won't take long until they are submitted.

>
>>
>>>                     u64 iova;
>>>                 /* Out parameters. */
>>>                     u64 translated_addr;
>>>                     u64 addr_mask; /* Translation page size */
>>>                     IOMMUAccessFlags permission;
>>
>> How is this translation intended to be used?  How do you plan to avoid
>> race conditions where qemu requests a translation, receives one, the
>> guest invalidated the mapping, and then qemu tries to use its translated
>> address?
>>
>> There are only two ways I can see of doing this race-free.  One is to
>> implement a "memcpy with translation" hypercall, and the other is to
>> require the use of ATS in the vIOMMU, where the guest OS is required to
>> wait for a positive response from the vIOMMU before it can safely reuse
>> the mapping.
>>
>> The former behaves like real hardware in that an intermediate entity
>> performs the translation without interacting with the DMA source.  The
>> latter explicitly exposing the fact that caching is going on at the
>> endpoint to the OS.
>
> The former one seems to move DMA operation into hypervisor but Qemu
> vIOMMU framework just passes IOVA to dummy xen-vIOMMU without input
> data and access length. I will dig more to figure out solution.

Yes - that does in principle actually move the DMA out of Qemu.

>
>>
>>>
>>>
>>> Definition of VIOMMU subops:
>>> #define XEN_SYSCTL_viommu_query_capability        0
>>> #define XEN_SYSCTL_viommu_create            1
>>> #define XEN_SYSCTL_viommu_destroy            2
>>> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
>>>
>>> Definition of VIOMMU capabilities
>>> #define XEN_VIOMMU_CAPABILITY_l1_translation    (1 << 0)
>>> #define XEN_VIOMMU_CAPABILITY_l2_translation    (1 << 1)
>>> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
>>
>> How are vIOMMUs going to be modelled to guests?  On real hardware, they
>> all seem to end associated with a PCI device of some sort, even if it is
>> just the LPC bridge.
>
>
> This design just considers one vIOMMU has all PCI device under its
> specified PCI Segment. "INCLUDE_PCI_ALL" bit of DRHD struct is set for
> vIOMMU.

Even if the first implementation only supports a single vIOMMU, please
design the interface to cope with multiple.  It will save someone having
to go and break the API/ABI in the future when support for multiple
vIOMMUs is needed.

>
>>     
>> How do we deal with multiple vIOMMUs in a single guest?
>
> For multi-vIOMMU, we need to add new field in the struct iommu_op to
> designate device scope of vIOMMUs if they are under same PCI
> segment. This also needs to change DMAR table.
>
>>
>>>
>>>
>>> 2) Design for subops
>>> - XEN_SYSCTL_viommu_query_capability
>>>        Get vIOMMU capabilities(l1/l2 translation and interrupt
>>> remapping).
>>>
>>> - XEN_SYSCTL_viommu_create
>>>       Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
>>> base address.
>>>
>>> - XEN_SYSCTL_viommu_destroy
>>>       Destroy vIOMMU in Xen hypervisor with dom_id as parameter.
>>>
>>> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>>>       Translate IOVA to GPA for specified virtual PCI device with
>>> dom id,
>>> PCI device's bdf and IOVA and xen hypervisor returns translated GPA,
>>> address mask and access permission.
>>>
>>>
>>> 3.2 l2 translation
>>> 1) For virtual PCI device
>>> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
>>> hypercall when DMA operation happens.
>>>
>>> 2) For physical PCI device
>>> DMA operations go through physical IOMMU directly and IO page table for
>>> IOVA->HPA should be loaded into physical IOMMU. When guest updates
>>> l2 Page-table pointer field, it provides IO page table for
>>> IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
>>> GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
>>> Page-table pointer to context entry of physical IOMMU.
>>
>> How are you proposing to do this shadowing?  Do we need to trap and
>> emulate all writes to the vIOMMU pagetables, or is there a better way to
>> know when the mappings need invalidating?
>
> No, we don't need to trap all write to IO page table.
> From VTD spec 6.1, "Reporting the Caching Mode as Set for the
> virtual hardware requires the guest software to explicitly issue
> invalidation operations on the virtual hardware for any/all updates to
> the guest remapping structures.The virtualizing software may trap these
> guest invalidation operations to keep the shadow translation structures
> consistent to guest translation structure modifications, without
> resorting to other less efficient techniques."
> So any updates of IO page table will follow invalidation operation and
> we use them to do shadowing.

Ok.  That is helpful.

So, the guest makes some updates, and requests an invalidation.  This
traps into Xen, and we presumably re-shadow all state from fresh?  We
can then presumably send a synchronous invalidation request to Qemu at
this point?

How long is this likely to take?  Reshadowing all DMA and Interrupt
remapping tables sounds very expensive.

>
>>
>>>
>>> Now all PCI devices in same hvm domain share one IO page table
>>> (GPA->HPA) in physical IOMMU driver of Xen. To support l2
>>> translation of vIOMMU, IOMMU driver need to support multiple address
>>> spaces per device entry. Using existing IO page table(GPA->HPA)
>>> defaultly and switch to shadow IO page table(IOVA->HPA) when l2
>>> translation function is enabled. These change will not affect current
>>> P2M logic.
>>>
>>> 3.3 Interrupt remapping
>>> Interrupts from virtual devices and physical devices will be delivered
>>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>>> according interrupt remapping table.
>>>
>>>
>>> 3.4 l1 translation
>>> When nested translation is enabled, any address generated by l1
>>> translation is used as the input address for nesting with l2
>>> translation. Physical IOMMU needs to enable both l1 and l2 translation
>>> in nested translation mode(GVA->GPA->HPA) for passthrough
>>> device.
>>
>> All these l1 and l2 translations are getting confusing.  Could we
>> perhaps call them guest translation and host translation, or is that
>> likely to cause other problems?
>
> Definitions of l1 and l2 translation from VTD spec.
> first-level translation to remap a virtual address to intermediate
> (guest) physical address.
> second-level translations to remap a intermediate physical address to
> machine (host) physical address.
> guest and host translation maybe not suitable for them?

True, but what is also confusing is that what was previously the only
level of translation is now l2.

So long as it is clearly stated somewhere in the code and/or feature doc
which address spaces each level translates between (i.e. l1 translates
linear addresses into gfns, and l2 translates gfns into mfns), it will
probably be ok.

Can I recommend that you make use of the TYPE_SAFE() infrastructure to
make concrete, disparate types for any new translation functions, to
make it harder to accidentally get them wrong.

>
>>
>>>
>>> VT-d context entry points to guest l1 translation table which
>>> will be nest-translated by l2 translation table and so it
>>> can be directly linked to context entry of physical IOMMU.
>>>
>>> To enable l1 translation in VM
>>> 1) Xen IOMMU driver enables nested translation mode
>>> 2) Update GPA root of guest l1 translation table to context entry
>>> of physical IOMMU.
>>>
>>> All handles are in hypervisor and no interaction with Qemu.
>>>
>>>
>>> 3.5 Implementation consideration
>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>> Architecturally there is no way to tell guest that l2 translation
>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>> translation is always available when VTD exits and fail to be loaded
>>> without l2 translation support even if interrupt remapping and l1
>>> translation are available. So it needs to enable l2 translation first
>>> before other functions.
>>
>> What then is the purpose of the nested translation support bit in the
>> extended capability register?
>
> It's to translate output GPA from first level translation(IOVA->GPA)
> to HPA.
>
> Detail please see VTD spec - 3.8 Nested Translation
> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
> requests-with-PASID translated through first-level translation are also
> subjected to nested second-level translation. Such extendedcontext-
> entries contain both the pointer to the PASID-table (which contains the
> pointer to the firstlevel translation structures), and the pointer to
> the second-level translation structures."

I didn't phrase my question very well.  I understand what the nested
translation bit means, but I don't understand why we have a problem
signalling the presence or lack of nested translations to the guest.

In other words, why can't we hide l2 translation from the guest by
simply clearing the nested translation capability?

>
>
>>> 4.4 Report vIOMMU to hvmloader
>>> Hvmloader is in charge of building ACPI tables for Guest OS and OS
>>> probes IOMMU via ACPI DMAR table. So hvmloder needs to know whether
>>> vIOMMU is enabled or not and its capability to prepare ACPI DMAR table
>>> for Guest OS.
>>>
>>> There are three ways to do that.
>>> 1) Extend struct hvm_info_table and add variables in the struct
>>> hvm_info_table to pass vIOMMU information to hvmloader. But this
>>> requires to add new xc interface to use struct hvm_info_table in the
>>> Qemu.
>>>
>>> 2) Pass vIOMMU information to hvmloader via Xenstore
>>>
>>> 3) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
>>> This solution is already present in the vNVDIMM design(4.3.1
>>> Building Guest ACPI Tables
>>> http://www.gossamer-threads.com/lists/xen/devel/439766).
>>>
>>> The third option seems more clear and hvmloader doesn't need to deal
>>> with vIOMMU stuffs and just pass through DMAR table to Guest OS. All
>>> vIOMMU specific stuffs will be processed in the dummy xen-vIOMMU
>>> driver.
>>
>> Part of ACPI table building has now moved into the toolstack.  Unless
>> the table needs creating dynamically (which doesn't appear to be the
>> case), it can be done without any further communication.
>>
>
> The DMAR table needs to be created according input parameters.
> .E,G When interrupt remapping is enabled, INTR_REMAP bit in the dmar
> structure needs to be set. So we need to create table dynamically during
> creating VM.

This is fine.  The toolstack knows the configuration of the VM, as well
as the capabilities of Xen, and should be able to construct a suitable
DMAR table at the same time as all the other ACPI tables.

~Andrew

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Xen virtual IOMMU high level design doc V2
  2016-10-20 20:36                                   ` Andrew Cooper
@ 2016-10-22  7:32                                     ` Lan, Tianyu
  2016-10-26  9:39                                       ` Jan Beulich
  2016-10-28 15:36                                     ` Lan Tianyu
  1 sibling, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-10-22  7:32 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Kevin Tian, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 10/21/2016 4:36 AM, Andrew Cooper wrote:
>
>>
>>>
>>>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>>>> there is no interrupt remapping function which is present by vIOMMU.
>>>> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
>>>
>>> This is only a requirement for xapic interrupt sources.  x2apic
>>> interrupt sources already deliver correctly.
>>
>> The key is the APIC ID. There is no modification to existing PCI MSI and
>> IOAPIC with the introduction of x2apic. PCI MSI/IOAPIC can only send
>> interrupt message containing 8bit APIC ID, which cannot address >255
>> cpus. Interrupt remapping supports 32bit APIC ID so it's necessary to
>> enable >255 cpus with x2apic mode.
>>
>> If LAPIC is in x2apic while interrupt remapping is disabled, IOAPIC
>> cannot deliver interrupts to all cpus in the system if #cpu > 255.
>
> After spending a long time reading up on this, my first observation is
> that it is very difficult to find consistent information concerning the
> expected content of MSI address/data fields for x86 hardware.  Having
> said that, this has been very educational.
>
> It is now clear that any MSI message can either specify an 8 bit APIC ID
> directly, or request for the message to be remapped.  Apologies for my
> earlier confusion.

Never mind. I will describe this in more detail in the following version.

>>>>
>>>>
>>>>
>>>> 3 Xen hypervisor
>>>> ==========================================================================
>>>>
>>>>
>>>> 3.1 New hypercall XEN_SYSCTL_viommu_op
>>>> This hypercall should also support pv IOMMU which is still under RFC
>>>> review. Here only covers non-pv part.
>>>>
>>>> 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall
>>>> parameter.
>>>
>>> Why did you choose sysctl?  As these are per-domain, domctl would be a
>>> more logical choice.  However, neither of these should be usable by
>>> Qemu, and we are trying to split out "normal qemu operations" into dmops
>>> which can be safely deprivileged.
>>>
>>
>> Do you know what's the status of dmop now? I just found some discussions
>> about design in the maillist. We may use domctl first and move to dmop
>> when it's ready?
>
> I believe Paul is looking into respin the series early in the 4.9 dev
> cycle.  I expect it won't take long until they are submitted.

Ok, I got it. Thanks for the information.

>>
>>>
>>>>
>>>>
>>>> Definition of VIOMMU subops:
>>>> #define XEN_SYSCTL_viommu_query_capability        0
>>>> #define XEN_SYSCTL_viommu_create            1
>>>> #define XEN_SYSCTL_viommu_destroy            2
>>>> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev     3
>>>>
>>>> Definition of VIOMMU capabilities
>>>> #define XEN_VIOMMU_CAPABILITY_l1_translation    (1 << 0)
>>>> #define XEN_VIOMMU_CAPABILITY_l2_translation    (1 << 1)
>>>> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping    (1 << 2)
>>>
>>> How are vIOMMUs going to be modelled to guests?  On real hardware, they
>>> all seem to end associated with a PCI device of some sort, even if it is
>>> just the LPC bridge.
>>
>>
>> This design just considers one vIOMMU has all PCI device under its
>> specified PCI Segment. "INCLUDE_PCI_ALL" bit of DRHD struct is set for
>> vIOMMU.
>
> Even if the first implementation only supports a single vIOMMU, please
> design the interface to cope with multiple.  It will save someone having
> to go and break the API/ABI in the future when support for multiple
> vIOMMUs is needed.

OK, got it.

>
>>
>>>
>>> How do we deal with multiple vIOMMUs in a single guest?
>>
>> For multi-vIOMMU, we need to add new field in the struct iommu_op to
>> designate device scope of vIOMMUs if they are under same PCI
>> segment. This also needs to change DMAR table.
>>
>>>
>>>>
>>>>
>>>> 2) Design for subops
>>>> - XEN_SYSCTL_viommu_query_capability
>>>>        Get vIOMMU capabilities(l1/l2 translation and interrupt
>>>> remapping).
>>>>
>>>> - XEN_SYSCTL_viommu_create
>>>>       Create vIOMMU in Xen hypervisor with dom_id, capabilities and reg
>>>> base address.
>>>>
>>>> - XEN_SYSCTL_viommu_destroy
>>>>       Destroy vIOMMU in Xen hypervisor with dom_id as parameter.
>>>>
>>>> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>>>>       Translate IOVA to GPA for specified virtual PCI device with
>>>> dom id,
>>>> PCI device's bdf and IOVA and xen hypervisor returns translated GPA,
>>>> address mask and access permission.
>>>>
>>>>
>>>> 3.2 l2 translation
>>>> 1) For virtual PCI device
>>>> Xen dummy xen-vIOMMU in Qemu translates IOVA to target GPA via new
>>>> hypercall when DMA operation happens.
>>>>
>>>> 2) For physical PCI device
>>>> DMA operations go through physical IOMMU directly and IO page table for
>>>> IOVA->HPA should be loaded into physical IOMMU. When guest updates
>>>> l2 Page-table pointer field, it provides IO page table for
>>>> IOVA->GPA. vIOMMU needs to shadow l2 translation table, translate
>>>> GPA->HPA and update shadow page table(IOVA->HPA) pointer to l2
>>>> Page-table pointer to context entry of physical IOMMU.
>>>
>>> How are you proposing to do this shadowing?  Do we need to trap and
>>> emulate all writes to the vIOMMU pagetables, or is there a better way to
>>> know when the mappings need invalidating?
>>
>> No, we don't need to trap all write to IO page table.
>> From VTD spec 6.1, "Reporting the Caching Mode as Set for the
>> virtual hardware requires the guest software to explicitly issue
>> invalidation operations on the virtual hardware for any/all updates to
>> the guest remapping structures.The virtualizing software may trap these
>> guest invalidation operations to keep the shadow translation structures
>> consistent to guest translation structure modifications, without
>> resorting to other less efficient techniques."
>> So any updates of IO page table will follow invalidation operation and
>> we use them to do shadowing.
>
> Ok.  That is helpful.
>
> So, the guest makes some updates, and requests an invalidation.  This
> traps into Xen, and we presumably re-shadow all state from fresh?

We may expose the PSI (Page Selective Invalidation) capability, so the
guest will just invalidate the associated page entries rather than
everything.

>  We can then presumably send a synchronous invalidation request to Qemu at
> this point?

In our design, Qemu will not cache the IOTLB, so no invalidation request
is sent to Qemu. But I am still considering how to deal with in-flight
DMA when there is an invalidation request, just as you mentioned.

>
> How long is this likely to take?  Reshadowing all DMA and Interrupt
> remapping tables sounds very expensive.
>
>>
>>>
>>>>
>>>> Now all PCI devices in same hvm domain share one IO page table
>>>> (GPA->HPA) in physical IOMMU driver of Xen. To support l2
>>>> translation of vIOMMU, IOMMU driver need to support multiple address
>>>> spaces per device entry. Using existing IO page table(GPA->HPA)
>>>> defaultly and switch to shadow IO page table(IOVA->HPA) when l2
>>>> translation function is enabled. These change will not affect current
>>>> P2M logic.
>>>>
>>>> 3.3 Interrupt remapping
>>>> Interrupts from virtual devices and physical devices will be delivered
>>>> to vlapic from vIOAPIC and vMSI. It needs to add interrupt remapping
>>>> hooks in the vmsi_deliver() and ioapic_deliver() to find target vlapic
>>>> according interrupt remapping table.
>>>>
>>>>
>>>> 3.4 l1 translation
>>>> When nested translation is enabled, any address generated by l1
>>>> translation is used as the input address for nesting with l2
>>>> translation. Physical IOMMU needs to enable both l1 and l2 translation
>>>> in nested translation mode(GVA->GPA->HPA) for passthrough
>>>> device.
>>>
>>> All these l1 and l2 translations are getting confusing.  Could we
>>> perhaps call them guest translation and host translation, or is that
>>> likely to cause other problems?
>>
>> Definitions of l1 and l2 translation from VTD spec.
>> first-level translation to remap a virtual address to intermediate
>> (guest) physical address.
>> second-level translations to remap a intermediate physical address to
>> machine (host) physical address.
>> guest and host translation maybe not suitable for them?
>
> True, but what is also confusing is that what was previously the only
> level of translation is now l2.
>
> So long as it is clearly stated somewhere in the code and/or feature doc
> which address spaces each level translates between (i.e. l1 translates
> linear addresses into gfns, and l2 translates gfns into mfns), it will
> probably be ok.

Sure, we should rename the current level of translation to l2
translation when introducing l1 translation, to align with the VT-d spec.

>
> Can I recommend that you make use of the TYPE_SAFE() infrastructure to
> make concrete, disparate types for any new translation functions, to
> make it harder to accidentally get them wrong.

Ok. I got it.

>
>>
>>>
>>>>
>>>> VT-d context entry points to guest l1 translation table which
>>>> will be nest-translated by l2 translation table and so it
>>>> can be directly linked to context entry of physical IOMMU.
>>>>
>>>> To enable l1 translation in VM
>>>> 1) Xen IOMMU driver enables nested translation mode
>>>> 2) Update GPA root of guest l1 translation table to context entry
>>>> of physical IOMMU.
>>>>
>>>> All handles are in hypervisor and no interaction with Qemu.
>>>>
>>>>
>>>> 3.5 Implementation consideration
>>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>>> Architecturally there is no way to tell guest that l2 translation
>>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>>> translation is always available when VTD exits and fail to be loaded
>>>> without l2 translation support even if interrupt remapping and l1
>>>> translation are available. So it needs to enable l2 translation first
>>>> before other functions.
>>>
>>> What then is the purpose of the nested translation support bit in the
>>> extended capability register?
>>
>> It's to translate output GPA from first level translation(IOVA->GPA)
>> to HPA.
>>
>> Detail please see VTD spec - 3.8 Nested Translation
>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
>> requests-with-PASID translated through first-level translation are also
>> subjected to nested second-level translation. Such extendedcontext-
>> entries contain both the pointer to the PASID-table (which contains the
>> pointer to the firstlevel translation structures), and the pointer to
>> the second-level translation structures."
>
> I didn't phrase my question very well.  I understand what the nested
> translation bit means, but I don't understand why we have a problem
> signalling the presence or lack of nested translations to the guest.
>
> In other words, why can't we hide l2 translation from the guest by
> simply clearing the nested translation capability?

Do you mean to report no support for l2 translation via the nested
translation bit? But nested translation is a different function from l2
translation, even from the guest's view, and nested translation only
applies to requests with a PASID (l1 translation).

The Linux Intel IOMMU driver enables l2 translation unconditionally and
frees the iommu instance when it fails to enable l2 translation.


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-18 14:14                             ` Xen virtual IOMMU high level design doc V2 Lan Tianyu
  2016-10-18 19:17                               ` Andrew Cooper
  2016-10-18 20:26                               ` Konrad Rzeszutek Wilk
@ 2016-10-26  9:36                               ` Jan Beulich
  2016-10-26 14:53                                 ` Lan, Tianyu
  2 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2016-10-26  9:36 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne

>>> On 18.10.16 at 16:14, <tianyu.lan@intel.com> wrote:
> 1.1 Enable more than 255 vcpu support
> HPC cloud service requires VM provides high performance parallel
> computing and we hope to create a huge VM with >255 vcpu on one machine
> to meet such requirement.Ping each vcpus on separated pcpus. More than
> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
> there is no interrupt remapping function which is present by vIOMMU.
> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
> So we need to add vIOMMU before enabling >255 vcpus.

I continue to dislike this completely neglecting that we can't even
have >128 vCPU-s at present. Once again - there's other work to
be done prior to lack of vIOMMU becoming the limiting factor.

Jan


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-22  7:32                                     ` Lan, Tianyu
@ 2016-10-26  9:39                                       ` Jan Beulich
  2016-10-26 15:03                                         ` Lan, Tianyu
  2016-11-03 15:41                                         ` Lan, Tianyu
  0 siblings, 2 replies; 86+ messages in thread
From: Jan Beulich @ 2016-10-26  9:39 UTC (permalink / raw)
  To: Tianyu Lan
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne

>>> On 22.10.16 at 09:32, <tianyu.lan@intel.com> wrote:
> On 10/21/2016 4:36 AM, Andrew Cooper wrote:
>>>>> 3.5 Implementation consideration
>>>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>>>> Architecturally there is no way to tell guest that l2 translation
>>>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>>>> translation is always available when VTD exits and fail to be loaded
>>>>> without l2 translation support even if interrupt remapping and l1
>>>>> translation are available. So it needs to enable l2 translation first
>>>>> before other functions.
>>>>
>>>> What then is the purpose of the nested translation support bit in the
>>>> extended capability register?
>>>
>>> It's to translate output GPA from first level translation(IOVA->GPA)
>>> to HPA.
>>>
>>> Detail please see VTD spec - 3.8 Nested Translation
>>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
>>> requests-with-PASID translated through first-level translation are also
>>> subjected to nested second-level translation. Such extendedcontext-
>>> entries contain both the pointer to the PASID-table (which contains the
>>> pointer to the firstlevel translation structures), and the pointer to
>>> the second-level translation structures."
>>
>> I didn't phrase my question very well.  I understand what the nested
>> translation bit means, but I don't understand why we have a problem
>> signalling the presence or lack of nested translations to the guest.
>>
>> In other words, why can't we hide l2 translation from the guest by
>> simply clearing the nested translation capability?
> 
> You mean to tell no support of l2 translation via nest translation bit?
> But the nested translation is a different function with l2 translation
> even from guest view and nested translation only works requests with
> PASID (l1 translation).
> 
> Linux intel iommu driver enables l2 translation unconditionally and free 
> iommu instance when failed to enable l2 translation.

In which case the wording of your description is confusing: instead of
"Linux Intel IOMMU driver thinks l2 translation is always available when
VTD exits and fail to be loaded without l2 translation support ...", how
about using something closer to what you replied with last?

Jan


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-26  9:36                               ` Jan Beulich
@ 2016-10-26 14:53                                 ` Lan, Tianyu
  0 siblings, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-10-26 14:53 UTC (permalink / raw)
  To: Jan Beulich
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne

On 10/26/2016 5:36 PM, Jan Beulich wrote:
>>>> On 18.10.16 at 16:14, <tianyu.lan@intel.com> wrote:
>> 1.1 Enable more than 255 vcpu support
>> HPC cloud service requires VM provides high performance parallel
>> computing and we hope to create a huge VM with >255 vcpu on one machine
>> to meet such requirement.Ping each vcpus on separated pcpus. More than
>> 255 vcpus support requires X2APIC and Linux disables X2APIC mode if
>> there is no interrupt remapping function which is present by vIOMMU.
>> Interrupt remapping function helps to deliver interrupt to #vcpu >255.
>> So we need to add vIOMMU before enabling >255 vcpus.
>
> I continue to dislike this completely neglecting that we can't even
> have >128 vCPU-s at present. Once again - there's other work to
> be done prior to lack of vIOMMU becoming the limiting factor.
>

Yes, we can increase the vcpu count from 128 to 255 first without vIOMMU
support. We have some draft patches to enable this. Andrew will also
rework the CPUID policy and change the rule for allocating vcpus' APIC
IDs, so we will base the vcpu number increase on that. The VLAPIC also
needs changes to support >255 APIC IDs. These jobs can proceed in
parallel with the vIOMMU work.


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-26  9:39                                       ` Jan Beulich
@ 2016-10-26 15:03                                         ` Lan, Tianyu
  2016-11-03 15:41                                         ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-10-26 15:03 UTC (permalink / raw)
  To: Jan Beulich
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne



On 10/26/2016 5:39 PM, Jan Beulich wrote:
>>>> On 22.10.16 at 09:32, <tianyu.lan@intel.com> wrote:
>> On 10/21/2016 4:36 AM, Andrew Cooper wrote:
>>>>>> 3.5 Implementation consideration
>>>>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>>>>> Architecturally there is no way to tell guest that l2 translation
>>>>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>>>>> translation is always available when VTD exits and fail to be loaded
>>>>>> without l2 translation support even if interrupt remapping and l1
>>>>>> translation are available. So it needs to enable l2 translation first
>>>>>> before other functions.
>>>>>
>>>>> What then is the purpose of the nested translation support bit in the
>>>>> extended capability register?
>>>>
>>>> It's to translate output GPA from first level translation(IOVA->GPA)
>>>> to HPA.
>>>>
>>>> Detail please see VTD spec - 3.8 Nested Translation
>>>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
>>>> requests-with-PASID translated through first-level translation are also
>>>> subjected to nested second-level translation. Such extendedcontext-
>>>> entries contain both the pointer to the PASID-table (which contains the
>>>> pointer to the firstlevel translation structures), and the pointer to
>>>> the second-level translation structures."
>>>
>>> I didn't phrase my question very well.  I understand what the nested
>>> translation bit means, but I don't understand why we have a problem
>>> signalling the presence or lack of nested translations to the guest.
>>>
>>> In other words, why can't we hide l2 translation from the guest by
>>> simply clearing the nested translation capability?
>>
>> You mean to tell no support of l2 translation via nest translation bit?
>> But the nested translation is a different function with l2 translation
>> even from guest view and nested translation only works requests with
>> PASID (l1 translation).
>>
>> Linux intel iommu driver enables l2 translation unconditionally and free
>> iommu instance when failed to enable l2 translation.
>
> In which cases the wording of your description is confusing: Instead of
> "Linux Intel IOMMU driver thinks l2 translation is always available when
> VTD exits and fail to be loaded without l2 translation support ..." how
> about using something closer to what you've replied with last?
>

Sorry for my poor English. Will update.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Xen virtual IOMMU high level design doc V2
  2016-10-20 20:36                                   ` Andrew Cooper
  2016-10-22  7:32                                     ` Lan, Tianyu
@ 2016-10-28 15:36                                     ` Lan Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-10-28 15:36 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Kevin Tian, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 10/21/2016 4:36 AM, Andrew Cooper wrote:
>>> >>
>>>> >>>                     u64 iova;
>>>> >>>                 /* Out parameters. */
>>>> >>>                     u64 translated_addr;
>>>> >>>                     u64 addr_mask; /* Translation page size */
>>>> >>>                     IOMMUAccessFlags permisson;
>>> >>
>>> >> How is this translation intended to be used?  How do you plan to avoid
>>> >> race conditions where qemu requests a translation, receives one, the
>>> >> guest invalidated the mapping, and then qemu tries to use its translated
>>> >> address?
>>> >>
>>> >> There are only two ways I can see of doing this race-free.  One is to
>>> >> implement a "memcpy with translation" hypercall, and the other is to
>>> >> require the use of ATS in the vIOMMU, where the guest OS is required to
>>> >> wait for a positive response from the vIOMMU before it can safely reuse
>>> >> the mapping.
>>> >>
>>> >> The former behaves like real hardware in that an intermediate entity
>>> >> performs the translation without interacting with the DMA source.  The
>>> >> latter explicitly exposing the fact that caching is going on at the
>>> >> endpoint to the OS.
>> >
>> > The former one seems to move DMA operation into hypervisor but Qemu
>> > vIOMMU framework just passes IOVA to dummy xen-vIOMMU without input
>> > data and access length. I will dig more to figure out solution.
> Yes - that does in principle actually move the DMA out of Qemu.

Hi Andrew:

The first solution, "move the DMA out of Qemu": the Qemu vIOMMU
framework only gives the dummy xen-vIOMMU device model a chance to do
the DMA translation; the DMA access operation itself lives in the
vIOMMU core code, which is hard to move out. There are many places
that call the translation callback, and some of them are not for DMA
access (e.g. mapping guest memory in Qemu).

The second solution, "use ATS to sync the invalidation operation":
this requires enabling ATS for all virtual PCI devices, which is not
easy to do.

The following is my proposal:
When the IOMMU driver invalidates the IOTLB, it also waits for the
invalidation to complete. We may use this to drain in-flight DMA
operations.

The guest triggers an invalidation operation and traps into the vIOMMU
in the hypervisor to flush cached data. After this, control should go
to Qemu to drain in-flight DMA translations.

To do that, the dummy vIOMMU in Qemu registers the same MMIO region as
the vIOMMU's, and the emulation part of the invalidation operation
returns X86EMUL_UNHANDLEABLE after flushing the cache. The MMIO
emulation code is then supposed to send an event to Qemu, so the dummy
vIOMMU gets a chance to start a thread that drains in-flight DMA and
reports the emulation done.

The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate
register until it is cleared. The dummy vIOMMU notifies the vIOMMU via
hypercall that the drain operation has completed, the vIOMMU clears
the IVT bit, and the guest finishes the invalidation operation.

-- 
Best regards
Tianyu Lan


* Re: Xen virtual IOMMU high level design doc V2
  2016-10-26  9:39                                       ` Jan Beulich
  2016-10-26 15:03                                         ` Lan, Tianyu
@ 2016-11-03 15:41                                         ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-11-03 15:41 UTC (permalink / raw)
  To: Jan Beulich, Andrew Cooper
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, ian.jackson,
	Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne



On 10/26/2016 5:39 PM, Jan Beulich wrote:
>>>> On 22.10.16 at 09:32, <tianyu.lan@intel.com> wrote:
>> On 10/21/2016 4:36 AM, Andrew Cooper wrote:
>>>>>> 3.5 Implementation consideration
>>>>>> VT-d spec doesn't define a capability bit for the l2 translation.
>>>>>> Architecturally there is no way to tell guest that l2 translation
>>>>>> capability is not available. Linux Intel IOMMU driver thinks l2
>>>>>> translation is always available when VTD exits and fail to be loaded
>>>>>> without l2 translation support even if interrupt remapping and l1
>>>>>> translation are available. So it needs to enable l2 translation first
>>>>>> before other functions.
>>>>>
>>>>> What then is the purpose of the nested translation support bit in the
>>>>> extended capability register?
>>>>
>>>> It's to translate output GPA from first level translation(IOVA->GPA)
>>>> to HPA.
>>>>
>>>> Detail please see VTD spec - 3.8 Nested Translation
>>>> "When Nesting Enable (NESTE) field is 1 in extended-context-entries,
>>>> requests-with-PASID translated through first-level translation are also
>>>> subjected to nested second-level translation. Such extendedcontext-
>>>> entries contain both the pointer to the PASID-table (which contains the
>>>> pointer to the firstlevel translation structures), and the pointer to
>>>> the second-level translation structures."
>>>
>>> I didn't phrase my question very well.  I understand what the nested
>>> translation bit means, but I don't understand why we have a problem
>>> signalling the presence or lack of nested translations to the guest.
>>>
>>> In other words, why can't we hide l2 translation from the guest by
>>> simply clearing the nested translation capability?
>>
>> You mean to tell no support of l2 translation via nest translation bit?
>> But the nested translation is a different function with l2 translation
>> even from guest view and nested translation only works requests with
>> PASID (l1 translation).
>>
>> Linux intel iommu driver enables l2 translation unconditionally and free
>> iommu instance when failed to enable l2 translation.
>
> In which cases the wording of your description is confusing: Instead of
> "Linux Intel IOMMU driver thinks l2 translation is always available when
> VTD exits and fail to be loaded without l2 translation support ..." how
> about using something closer to what you've replied with last?
>
> Jan
>

Hi All:
I have some updates about the implementation dependency between l2
translation (DMA translation) and interrupt remapping.

There are a kernel parameter, "intel_iommu=on", and a Kconfig option,
CONFIG_INTEL_IOMMU_DEFAULT_ON, which control the DMA translation
function. When neither is set, DMA translation will not be enabled by
the IOMMU driver even if some vIOMMU registers show the l2 translation
function as available. In the meantime, the interrupt remapping
function can still work to support >255 vcpus.

I checked that the RHEL, SLES, Oracle and Ubuntu distributions set
neither the kernel parameter nor the Kconfig option. So we can emulate
interrupt remapping first, together with some capability bits of l2
translation (e.g. SAGAW in the Capability Register), to support >255
vcpus without emulating l2 translation.

Showing the l2 capability bits makes sure the IOMMU driver probes the
ACPI DMAR tables successfully, because the driver accesses these bits
while parsing the ACPI tables.

If someone adds the "intel_iommu=on" kernel parameter manually, the
IOMMU driver will panic the guest because it cannot enable the DMA
remapping function via the gcmd register: the "Translation Enable
Status" bit in the gsts register is never set by the vIOMMU. This
reflects the actual vIOMMU status (no l2 translation emulation) and
warns the user not to enable l2 translation.





* Xen virtual IOMMU high level design doc V3
  2016-07-05 13:57                           ` Jan Beulich
                                               ` (2 preceding siblings ...)
  2016-10-18 14:14                             ` Xen virtual IOMMU high level design doc V2 Lan Tianyu
@ 2016-11-17 15:36                             ` Lan Tianyu
  2016-11-18 19:43                               ` Julien Grall
                                                 ` (3 more replies)
  3 siblings, 4 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-11-17 15:36 UTC (permalink / raw)
  To: Jan Beulich, Kevin Tian, Andrew Cooper, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

Changes since V2:
	1) Update motivation for Xen vIOMMU - 288 vcpus support part
	Add description of the plan to increase vcpus from 128 to 255 and
the dependency between X2APIC and interrupt remapping.
	2) Update 3.1 New vIOMMU hypercall interface
	Change the vIOMMU hypercall from sysctl to dmop, add multi-vIOMMU
consideration and an in-flight DMA drain subcommand.
	3) Update 3.5 Implementation consideration
	We found it's still safe to enable the interrupt remapping function
before adding l2 translation (DMA translation) to increase the vcpu
number beyond 255.
	4) Update 3.2 l2 translation - virtual device part
	Add a proposal to deal with the race between in-flight DMA and the
invalidation operation in the hypervisor.
	5) Update 4.4 Report vIOMMU to hvmloader
	Add the option of building the ACPI DMAR table in the toolstack for
discussion.

Changes since V1:
	1) Update motivation for Xen vIOMMU - 288 vcpus support part
	2) Change definition of struct xen_sysctl_viommu_op
	3) Update "3.5 Implementation consideration" to explain why we need
to enable l2 translation first.
	4) Update "4.3 Q35 vs I440x" - Linux/Windows VT-d drivers can work
on the emulated I440 chipset.
	5) Remove stale statement in "3.3 Interrupt remapping"

Content:
===============================================================================
1. Motivation of vIOMMU
	1.1 Enable more than 255 vcpus
	1.2 Support VFIO-based user space driver
	1.3 Support guest Shared Virtual Memory (SVM)
2. Xen vIOMMU Architecture
	2.1 l2 translation overview
	2.2 Interrupt remapping overview
3. Xen hypervisor
	3.1 New vIOMMU hypercall interface
	3.2 l2 translation
	3.3 Interrupt remapping
	3.4 l1 translation
	3.5 Implementation consideration
4. Qemu
	4.1 Qemu vIOMMU framework
	4.2 Dummy xen-vIOMMU driver
	4.3 Q35 vs. i440x
	4.4 Report vIOMMU to hvmloader


Glossary:
================================================================================
l1 translation - first-level translation, remapping a virtual address
to an intermediate (guest) physical address. (GVA->GPA)
l2 translation - second-level translation, remapping an intermediate
physical address to a machine (host) physical address. (GPA->HPA)

1 Motivation for Xen vIOMMU
================================================================================
1.1 Enable more than 255 vcpus
HPC cloud services require VMs that provide high-performance parallel
computing, and we hope to create a huge VM with >255 vcpus on one
machine to meet such a requirement, pinning each vcpu to a separate
pcpu.

Now an HVM guest can support at most 128 vcpus. We can increase the
vcpu number from 128 to 255 by relaxing some limitations and extending
vcpu-related data structures. This also requires changing the rule for
allocating a vcpu's APIC ID. The current rule is "(APIC ID) = (vcpu
index) * 2"; we need to change it to "(APIC ID) = (vcpu index)".
Andrew Cooper's CPUID improvement work will cover this to improve the
guest's cpu topology, and we will build on it to increase the vcpu
number from 128 to 255.
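The effect of the allocation rule change can be sketched with two toy
helpers (names invented for illustration): the current rule exhausts
the 8-bit XAPIC ID space at 128 vcpus, while the dense rule fits 255.

```c
#include <stdint.h>

/* Current rule: even APIC IDs only, so 128 vcpus use IDs 0..254. */
static uint32_t apic_id_current(uint32_t vcpu_index)
{
    return vcpu_index * 2;
}

/* Proposed rule: dense IDs, so 255 vcpus fit in the 8-bit ID space. */
static uint32_t apic_id_proposed(uint32_t vcpu_index)
{
    return vcpu_index;
}
```

With the current rule, vcpu 127 already occupies APIC ID 254; with the
dense rule, the same ID is not reached until vcpu 254.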

To support >255 vcpus, X2APIC mode in the guest is necessary, because
the legacy APIC (XAPIC) supports only 8-bit APIC IDs and can therefore
address at most 255 vcpus. X2APIC mode supports 32-bit APIC IDs, and
it requires the interrupt remapping function of the vIOMMU.

The reason is that existing PCI MSI and IOAPIC are not modified by the
introduction of X2APIC. PCI MSI/IOAPIC can only send interrupt
messages containing an 8-bit APIC ID, which cannot address >255 cpus.
Interrupt remapping supports 32-bit APIC IDs, so it is necessary for
enabling >255 cpus in x2apic mode.

Both Linux and Windows require interrupt remapping when the cpu number
is >255.
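The 8-bit limit can be made concrete: in the compatibility MSI address
format the destination ID occupies address bits 19:12, so any APIC ID
above 255 is simply truncated. A small sketch (the IRTE struct is a
simplified illustration, not the exact hardware layout):

```c
#include <stdint.h>

/* Compatibility-format MSI address: destination ID in bits 19:12,
 * i.e. only 8 bits of APIC ID can be expressed. */
static uint32_t msi_compat_dest_id(uint64_t msi_addr)
{
    return (uint32_t)((msi_addr >> 12) & 0xff);
}

/* With interrupt remapping, the IRTE carries a full 32-bit
 * destination, which is what x2APIC mode needs for IDs above 255. */
struct irte_sketch {            /* simplified, illustrative layout */
    uint32_t dest_id;           /* 32-bit destination in x2APIC mode */
    uint8_t  vector;
};
```

For example, trying to target APIC ID 0x1A3 (419) through the
compatibility format silently degenerates to ID 0xA3.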


1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
This relies on the l2 translation capability (IOVA->GPA) of the
vIOMMU. The pIOMMU l2 becomes a shadow structure of the vIOMMU to
isolate DMA requests initiated by the user space driver.



1.3 Support guest SVM (Shared Virtual Memory)
This relies on the l1 translation table capability (GVA->GPA) of the
vIOMMU. The pIOMMU needs to enable both l1 and l2 translation in
nested mode (GVA->GPA->HPA) for the passthrough device. IGD
passthrough is the main usage today (to support the OpenCL 2.0 SVM
feature). In the future SVM might be used by other I/O devices too.



2. Xen vIOMMU Architecture
================================================================================

* The vIOMMU will live inside the Xen hypervisor for the following
reasons:
	1) Avoid round trips between Qemu and the Xen hypervisor
	2) Ease of integration with the rest of the hypervisor
	3) HVMlite/PVH doesn't use Qemu
* A dummy xen-vIOMMU in Qemu acts as a wrapper around the new
hypercalls to create/destroy the vIOMMU in the hypervisor and to deal
with virtual PCI devices' l2 translation.

2.1 l2 translation overview
For virtual PCI devices, the dummy xen-vIOMMU does the translation in
Qemu via a new hypercall.

For physical PCI devices, the vIOMMU in the hypervisor shadows the IO
page table from IOVA->GPA to IOVA->HPA and loads the shadow page table
into the physical IOMMU.

The following diagram shows l2 translation architecture.
+---------------------------------------------------------+
|Qemu                                +----------------+   |
|                                    |     Virtual    |   |
|                                    |   PCI device   |   |
|                                    |                |   |
|                                    +----------------+   |
|                                            |DMA         |
|                                            V            |
|  +--------------------+   Request  +----------------+   |
|  |                    +<-----------+                |   |
|  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
|  |                    +----------->+                |   |
|  +---------+----------+            +-------+--------+   |
|            |                               |            |
|            |Hypercall                      |            |
+--------------------------------------------+------------+
|Hypervisor  |                               |            |
|            |                               |            |
|            v                               |            |
|     +------+------+                        |            |
|     |   vIOMMU    |                        |            |
|     +------+------+                        |            |
|            |                               |            |
|            v                               |            |
|     +------+------+                        |            |
|     | IOMMU driver|                        |            |
|     +------+------+                        |            |
|            |                               |            |
+--------------------------------------------+------------+
|HW          v                               V            |
|     +------+------+                 +-------------+     |
|     |   IOMMU     +---------------->+  Memory     |     |
|     +------+------+                 +-------------+     |
|            ^                                            |
|            |                                            |
|     +------+------+                                     |
|     | PCI Device  |                                     |
|     +-------------+                                     |
+---------------------------------------------------------+

2.2 Interrupt remapping overview
Interrupts from virtual devices and physical devices are delivered to
the vLAPIC from the vIOAPIC and vMSI. The vIOMMU remaps interrupts
during this procedure.

+---------------------------------------------------+
|Qemu                       |VM                     |
|                           | +----------------+    |
|                           | |  Device driver |    |
|                           | +--------+-------+    |
|                           |          ^            |
|       +----------------+  | +--------+-------+    |
|       | Virtual device |  | |  IRQ subsystem |    |
|       +-------+--------+  | +--------+-------+    |
|               |           |          ^            |
|               |           |          |            |
+---------------------------+-----------------------+
|hypervisor     |                      | VIRQ       |
|               |            +---------+--------+   |
|               |            |      vLAPIC      |   |
|               |            +---------+--------+   |
|               |                      ^            |
|               |                      |            |
|               |            +---------+--------+   |
|               |            |      vIOMMU      |   |
|               |            +---------+--------+   |
|               |                      ^            |
|               |                      |            |
|               |            +---------+--------+   |
|               |            |   vIOAPIC/vMSI   |   |
|               |            +----+----+--------+   |
|               |                 ^    ^            |
|               +-----------------+    |            |
|                                      |            |
+---------------------------------------------------+
HW                                     |IRQ
                                 +-------------------+
                                 |   PCI Device      |
                                 +-------------------+




3 Xen hypervisor
==========================================================================

3.1 New hypercall XEN_dmop_viommu_op
Create a new dmop (device model operation hypercall) for the vIOMMU,
since it will be called by Qemu at runtime. This hypercall should also
support PV IOMMU, which is still under RFC review; only the non-PV
part is covered here.

1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.

struct xen_dmop_viommu_op {
	u32 cmd;
	u32 domid;
	u32 viommu_id;
	union {
		struct {
			u32 capabilities;
		} query_capabilities;
		struct {
			/* IN parameters. */
			u32 capabilities;
			u64 base_address;
			struct {
				u32 size;
				XEN_GUEST_HANDLE_64(uint32) dev_list;
			} dev_scope;
			/* OUT parameters. */
			u32 viommu_id;
		} create_iommu;
		struct {
			/* IN parameters. */
			u32 vsbdf;
			u64 iova;
			/* OUT parameters. */
			u64 translated_addr;
			u64 addr_mask; /* Translation page size */
			u32 permission;
		} l2_translation;
	};
};


Definition of vIOMMU access permissions:
#define VIOMMU_NONE	0
#define VIOMMU_RO	1
#define VIOMMU_WO	2
#define VIOMMU_RW	3


Definition of VIOMMU subops:
#define XEN_DMOP_viommu_query_capability		0
#define XEN_DMOP_viommu_create				1
#define XEN_DMOP_viommu_destroy				2
#define XEN_DMOP_viommu_dma_translation_for_vpdev 	3
#define XEN_DMOP_viommu_dma_drain_completed		4

Definition of VIOMMU capabilities
#define XEN_VIOMMU_CAPABILITY_l1_translation		(1 << 0)
#define XEN_VIOMMU_CAPABILITY_l2_translation		(1 << 1)
#define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)


2) Design for subops
- XEN_DMOP_viommu_query_capability
	Get vIOMMU capabilities (l1/l2 translation and interrupt
remapping).

- XEN_DMOP_viommu_create
	Create a vIOMMU in the Xen hypervisor with dom_id, capabilities,
register base address and device scope. If the size of the device list
is 0, all PCI devices are under this vIOMMU except PCI devices
assigned to other vIOMMUs. The hypervisor returns the vIOMMU id.

- XEN_DMOP_viommu_destroy
	Destroy the vIOMMU in the Xen hypervisor, with dom_id as parameter.

- XEN_DMOP_viommu_dma_translation_for_vpdev
	Translate IOVA to GPA for the specified virtual PCI device, given
the dom id, the PCI device's bdf and the IOVA; the hypervisor returns
the translated GPA, address mask and access permission.

- XEN_DMOP_viommu_dma_drain_completed
	Notify the hypervisor that the dummy vIOMMU has drained in-flight
DMA after an invalidation operation, so the vIOMMU can mark the
invalidation as completed in the invalidation register.

3.2 l2 translation
1) For virtual PCI devices
The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the
new hypercall when a DMA operation happens.

When the guest triggers an invalidation operation, there may be
in-flight DMA requests for a virtual device that have already been
translated by the vIOMMU and returned to Qemu. Before the vIOMMU
reports the invalidation as completed, it is necessary to make sure
those in-flight DMA operations have completed.

When the IOMMU driver invalidates the IOTLB, it also waits for the
invalidation to complete. We may use this to drain in-flight DMA
operations for virtual devices.

The guest triggers an invalidation operation and traps into the vIOMMU
in the hypervisor to flush cached data. After this, control should go
to Qemu to drain in-flight DMA translations.

To do that, the dummy vIOMMU in Qemu registers the same MMIO region as
the vIOMMU's, and the emulation part of the invalidation operation in
the Xen hypervisor returns X86EMUL_UNHANDLEABLE after flushing the
cache. The MMIO emulation code is then supposed to send an event to
Qemu, so the dummy vIOMMU gets a chance to start a thread that drains
in-flight DMA and reports the emulation done.

The guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate
register until it is cleared after triggering the invalidation. The
dummy vIOMMU in Qemu notifies the hypervisor via hypercall that the
drain operation has completed, the vIOMMU clears the IVT bit, and the
guest finishes the invalidation operation.
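The handshake above can be modeled as a tiny state machine. This is
only a toy sketch with invented names: the real flow is asynchronous
and spans guest, Xen and Qemu, while here the three steps are plain
function calls.

```c
#include <stdbool.h>

static bool ivt_bit;        /* IVT bit in the IOTLB invalidate register */
static bool dma_drained;

/* Guest writes the invalidate register; IVT goes high. */
static void guest_trigger_invalidation(void)
{
    ivt_bit = true;
}

/* Qemu's dummy vIOMMU: drain in-flight DMA, then notify Xen via the
 * drain-completed hypercall, after which Xen clears IVT. */
static void qemu_drain_and_notify(void)
{
    dma_drained = true;      /* stands in for draining the DMA thread */
    ivt_bit = false;         /* hypervisor clears IVT on notification */
}

/* Guest side: IVT must be clear and DMA drained when polling ends. */
static bool guest_invalidate(void)
{
    guest_trigger_invalidation();
    qemu_drain_and_notify(); /* in reality: async, via MMIO trap/event */
    return !ivt_bit && dma_drained;
}
```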


2) For physical PCI devices
DMA operations go through the physical IOMMU directly, and an IO page
table for IOVA->HPA must be loaded into the physical IOMMU. When the
guest updates the l2 IO page table pointer field in a context entry,
it provides an IO page table for IOVA->GPA. The vIOMMU needs to shadow
the l2 IO page table, translate GPA->HPA, and write the shadow page
table (IOVA->HPA) pointer into the l2 page-table pointer of the
context entry in the physical IOMMU. The IOMMU driver invalidates the
associated pages after changing the l2 IO page table when the caching
mode bit is set in the capability register; we can use this to keep
the shadow IO page table in sync.

Now all PCI devices in the same hvm domain share one IO page table
(GPA->HPA) in the physical IOMMU driver of Xen. To support vIOMMU l2
translation, the IOMMU driver needs to support multiple address spaces
per device entry: use the existing IO page table (GPA->HPA) by
default, and switch to the shadow IO page table (IOVA->HPA) when the
vIOMMU l2 translation function is enabled. These changes will not
affect the current P2M logic.
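Shadowing a single leaf entry can be sketched as follows. p2m_lookup()
is a mock standing in for the real P2M query, and the flag layout is
simplified; the point is only that the shadow entry keeps the guest's
permission bits while swapping the GPA frame for the HPA frame.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Toy p2m: maps a guest frame number to a host frame number.
 * Mock with a fixed offset just for the sketch. */
static uint64_t p2m_lookup(uint64_t gfn)
{
    return gfn + 0x1000;
}

/* The guest wrote a GPA-based l2 IO page table entry; the shadow
 * table must hold the HPA instead, preserving the low flag bits. */
static uint64_t shadow_iopte(uint64_t guest_pte)
{
    uint64_t gfn   = guest_pte >> PAGE_SHIFT;
    uint64_t flags = guest_pte & ((1ULL << PAGE_SHIFT) - 1);

    return (p2m_lookup(gfn) << PAGE_SHIFT) | flags;
}
```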

3.3 Interrupt remapping
Interrupts from virtual devices and physical devices are delivered to
the vlapic from the vIOAPIC and vMSI. Interrupt remapping hooks need
to be added in vmsi_deliver() and ioapic_deliver() to find the target
vlapic according to the interrupt remapping table.
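A remapping hook would first recover the IRTE index from the MSI
message before the table lookup. The sketch below follows my reading
of the VT-d remappable MSI address layout (interrupt index bits 14:0
in address bits 19:5, bit 15 in address bit 2, format flag in bit 4);
treat the exact bit positions as an assumption to verify against the
spec.

```c
#include <stdint.h>

/* Address bit 4 distinguishes remappable from compatibility format. */
static int msi_is_remappable(uint64_t addr)
{
    return (int)((addr >> 4) & 1);
}

/* Reassemble the 16-bit IRTE index from its two address fields. */
static uint16_t msi_irte_index(uint64_t addr)
{
    return (uint16_t)(((addr >> 5) & 0x7fff) |
                      (((addr >> 2) & 1) << 15));
}
```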


3.4 l1 translation
To enable l1 translation in the guest:
1) The Xen IOMMU driver enables nested translation mode.
2) The guest's l1 translation root table (PASID table pointer) is
shadowed into the pIOMMU's context entry.

When pIOMMU nested translation is enabled, any address generated by l1
translation is used as the input address for nesting with l2
translation. That means the pIOMMU translates GPA->HPA during the
guest's l1 translation, so the pIOMMU needs to enable both l1 and l2
translation in nested translation mode (GVA->GPA->HPA) for the
passthrough device. The guest's l1 translation root table can be
written directly into the pIOMMU context entry.

All of this is handled in the hypervisor; no interaction with Qemu is
required.

3.5 Implementation consideration
The VT-d spec doesn't define a capability bit for l2 translation, so
architecturally there is no way to tell the guest that the l2
translation capability is not available. The Linux Intel IOMMU driver
panics if it tries to enable l2 translation and the enable fails.

There are a kernel parameter, "intel_iommu=on", and a Kconfig option,
CONFIG_INTEL_IOMMU_DEFAULT_ON, which control the l2 translation
function. When neither is set, l2 translation will not be enabled by
the IOMMU driver even if some vIOMMU registers show the l2 translation
function as available. In the meantime, the interrupt remapping
function can still work to support >255 vcpus.

The RHEL, SLES, Oracle and Ubuntu distributions were checked: none
sets the kernel parameter or selects the Kconfig option. So it is
still safe to emulate interrupt remapping first, together with some
capability bits of l2 translation (e.g. SAGAW in the Capability
Register), to support >255 vcpus without l2 translation emulation.

Showing the l2 capability bits makes sure the IOMMU driver parses the
ACPI DMAR tables successfully, because the driver accesses these bits
while reading the ACPI tables. Otherwise, the IOMMU instance is freed
on failure.
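Concretely, the Linux driver extracts SAGAW (supported adjusted guest
address widths) from bits 12:8 of the VT-d Capability Register during
the DMAR probe, so a vIOMMU that only implements interrupt remapping
still has to advertise a plausible value there. A sketch (the helper
name is invented; the field position matches the VT-d spec):

```c
#include <stdint.h>

/* SAGAW sits in bits 12:8 of the Capability Register. */
#define cap_sagaw(c)  (((c) >> 8) & 0x1f)

/* Bit 2 of the SAGAW field means 4-level (48-bit) page tables are
 * "supported" -- what a vIOMMU would advertise to satisfy the probe. */
static int sagaw_supports_4level(uint64_t cap)
{
    return (int)((cap_sagaw(cap) >> 2) & 1);
}
```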

If someone adds the "intel_iommu=on" kernel parameter manually, the
IOMMU driver will panic the guest because it cannot enable the DMA
remapping function via the gcmd register: the "Translation Enable
Status" bit in the gsts register is never set by the vIOMMU. This
reflects the actual vIOMMU status (no l2 translation support) and
warns the user not to enable l2 translation.



4 Qemu
==============================================================================
4.1 Qemu vIOMMU framework
Qemu has a framework to create virtual IOMMUs (e.g. virtual Intel VT-d
and AMD IOMMU) and report them in the guest ACPI tables. So on the Xen
side a dummy xen-vIOMMU wrapper is required to connect with the actual
vIOMMU in Xen, especially for l2 translation of virtual PCI devices,
because virtual PCI devices are emulated in Qemu. Qemu's vIOMMU
framework provides a callback to deal with l2 translation when DMA
operations of virtual PCI devices happen.


4.2 Dummy xen-vIOMMU driver
1) Query vIOMMU capabilities (e.g. DMA translation, interrupt
remapping and shared virtual memory) via hypercall.

2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with
the DRHD register base address and the desired capabilities as
parameters. Destroy the vIOMMU when the VM is shut down.

3) Virtual PCI device l2 translation
Qemu already provides a DMA translation hook, called when DMA
translation for a virtual PCI device happens. The dummy xen-vIOMMU
passes the device bdf and IOVA into the Xen hypervisor via the new
iommu hypercall and gets back the translated GPA.
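That hook could be wired up roughly as below. The TLB-entry structure
and both function names are mocks invented for this sketch, not the
real Qemu API, and the hypercall stub simply identity-maps at 4K
granularity with rw permission.

```c
#include <stdint.h>

/* Mock of the translation result the Qemu hook has to fill in. */
typedef struct {
    uint64_t translated_addr;
    uint64_t addr_mask;      /* e.g. 0xfff for a 4K translation */
    int      perm;           /* 0 none, 1 ro, 2 wo, 3 rw */
} TlbEntrySketch;

/* Stub for XEN_DMOP_viommu_dma_translation_for_vpdev. */
static int xen_dmop_translate(uint32_t vsbdf, uint64_t iova,
                              uint64_t *gpa, uint64_t *mask, int *perm)
{
    (void)vsbdf;
    *gpa  = iova;            /* identity map, just for the sketch */
    *mask = 0xfff;
    *perm = 3;
    return 0;
}

/* The dummy xen-vIOMMU translate callback: pass bdf and IOVA down,
 * report a fault (no access) if the hypercall fails. */
static TlbEntrySketch xen_viommu_translate(uint32_t vsbdf, uint64_t iova)
{
    TlbEntrySketch e = { 0, 0, 0 };

    if (xen_dmop_translate(vsbdf, iova, &e.translated_addr,
                           &e.addr_mask, &e.perm))
        e.perm = 0;
    return e;
}
```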


4.3 Q35 vs I440x
VT-d was introduced with the Q35 chipset. The previous concern was
that the VT-d driver assumes VT-d only exists on Q35 and newer
chipsets, so we would have to enable Q35 first. After experiments,
Linux/Windows guests can boot on the emulated I440x chipset with VT-d,
and the VT-d driver enables the interrupt remapping function. So we
can skip Q35 support and implement the vIOMMU directly.

4.4 Report vIOMMU to hvmloader
Hvmloader is in charge of building the ACPI tables for the guest OS,
and the OS probes the IOMMU via the ACPI DMAR table. There are two
ways to pass the DMAR table to hvmloader; either way, hvmloader needs
to know whether the vIOMMU is enabled and what its capabilities are in
order to prepare the ACPI DMAR table for the guest OS.

1) Build the ACPI DMAR table in Qemu and pass it to hvmloader via
Xenstore. This solution is already present in the vNVDIMM design
(4.3.1 Building Guest ACPI Tables,
http://www.gossamer-threads.com/lists/xen/devel/439766).

2) Build the ACPI DMAR table in the toolstack
The toolstack can build the ACPI DMAR table according to the VM
configuration and pass it to hvmloader via the xenstore ACPI PT
channel. But the vIOMMU MMIO region is managed by Qemu and needs to be
written into the DMAR table. We may hardcode an address in both Qemu
and the toolstack and use the same address to create the vIOMMU and
build the DMAR table.




* Re: Xen virtual IOMMU high level design doc V3
  2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
@ 2016-11-18 19:43                               ` Julien Grall
  2016-11-21  2:21                                 ` Lan, Tianyu
  2016-11-21  7:05                               ` Tian, Kevin
                                                 ` (2 subsequent siblings)
  3 siblings, 1 reply; 86+ messages in thread
From: Julien Grall @ 2016-11-18 19:43 UTC (permalink / raw)
  To: Lan Tianyu, Jan Beulich, Andrew Cooper, Stefano Stabellini
  Cc: yang.zhang.wz, Kevin Tian, xen-devel, ian.jackson, xuquan8,
	Jun Nakajima, anthony.perard, Roger Pau Monne

Hi Lan,

On 17/11/2016 09:36, Lan Tianyu wrote:

> 1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.
>
> struct xen_dmop_viommu_op {
>     u32 cmd;
>     u32 domid;
>     u32 viommu_id;
>     union {
>         struct {
>             u32 capabilities;
>         } query_capabilities;
>         struct {
>             /* IN parameters. */
>             u32 capabilities;
>             u64 base_address;
>             struct {
>                 u32 size;
>                 XEN_GUEST_HANDLE_64(uint32) dev_list;
>             } dev_scope;
>             /* Out parameters. */
>             u32 viommu_id;
>         } create_iommu;
>             struct {
>             /* IN parameters. */
>             u32 vsbdf;

I only gave a quick look through this design document. The new 
hypercalls looks arch/device agnostic except this part.

Having a virtual IOMMU on Xen ARM is something we might consider in the 
future.

In the case of ARM, a device can either be a PCI device or integrated 
device. The latter does not have a sbdf. The IOMMU will usually be 
configured with a stream ID (SID) that can be deduced from the sbdf and 
hardcoded for integrated device.

So I would rather not tie the interface to PCI and use a more generic 
name for this field. Maybe vdevid, which then can be architecture specific.

Regards,

-- 
Julien Grall


* Re: Xen virtual IOMMU high level design doc V3
  2016-11-18 19:43                               ` Julien Grall
@ 2016-11-21  2:21                                 ` Lan, Tianyu
  2016-11-21 13:17                                   ` Julien Grall
  0 siblings, 1 reply; 86+ messages in thread
From: Lan, Tianyu @ 2016-11-21  2:21 UTC (permalink / raw)
  To: Julien Grall, Jan Beulich, Andrew Cooper, Stefano Stabellini
  Cc: yang.zhang.wz, Kevin Tian, xen-devel, ian.jackson, xuquan8,
	Jun Nakajima, anthony.perard, Roger Pau Monne



On 11/19/2016 3:43 AM, Julien Grall wrote:
> Hi Lan,
>
> On 17/11/2016 09:36, Lan Tianyu wrote:
>
>> 1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.
>>
>> struct xen_dmop_viommu_op {
>>     u32 cmd;
>>     u32 domid;
>>     u32 viommu_id;
>>     union {
>>         struct {
>>             u32 capabilities;
>>         } query_capabilities;
>>         struct {
>>             /* IN parameters. */
>>             u32 capabilities;
>>             u64 base_address;
>>             struct {
>>                 u32 size;
>>                 XEN_GUEST_HANDLE_64(uint32) dev_list;
>>             } dev_scope;
>>             /* Out parameters. */
>>             u32 viommu_id;
>>         } create_iommu;
>>             struct {
>>             /* IN parameters. */
>>             u32 vsbdf;
>
> I only gave a quick look through this design document. The new
> hypercalls look arch/device agnostic except for this part.
>
> Having a virtual IOMMU on Xen ARM is something we might consider in the
> future.
>
> In the case of ARM, a device can either be a PCI device or an integrated
> device. The latter does not have an sbdf. The IOMMU will usually be
> configured with a stream ID (SID) that can be deduced from the sbdf, and
> that is hardcoded for integrated devices.
>
> So I would rather not tie the interface to PCI and use a more generic
> name for this field. Maybe vdevid, which then can be architecture specific.

Hi Julien:
	Thanks for your input. This interface is just for virtual PCI devices 
and is called by Qemu. I am not familiar with ARM. Are there any 
non-PCI emulated devices for ARM in Qemu which need to be covered by the vIOMMU?


* Re: Xen virtual IOMMU high level design doc V3
  2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
  2016-11-18 19:43                               ` Julien Grall
@ 2016-11-21  7:05                               ` Tian, Kevin
  2016-11-23  1:36                                 ` Lan Tianyu
  2016-11-21 13:41                               ` Andrew Cooper
  2016-11-22 10:24                               ` Jan Beulich
  3 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-11-21  7:05 UTC (permalink / raw)
  To: Lan, Tianyu, Jan Beulich, Andrew Cooper, yang.zhang.wz, Nakajima,
	Jun, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

> From: Lan, Tianyu
> Sent: Thursday, November 17, 2016 11:37 PM
> 
> Change since V2:
> 	1) Update motivation for Xen vIOMMU - 288 vcpus support part
> 	Add a description of the plan to increase vcpus from 128 to 255 and the
> dependency between x2APIC and interrupt remapping.
> 	2) Update 3.1 New vIOMMU hypercall interface
> 	Change vIOMMU hypercall from sysctl to dmop, add multi vIOMMU
> consideration and a drain in-flight DMA subcommand
> 	3) Update 3.5 implementation consideration
> 	We found it's still safe to enable the interrupt remapping function
> before adding l2 translation (DMA translation) to increase the vcpu number >255.
> 	4) Update 3.2 l2 translation - virtual device part
> 	Add a proposal to deal with the race between in-flight DMA and
> invalidation operations in the hypervisor.
> 	5) Update 4.4 Report vIOMMU to hvmloader
> 	Add option of building ACPI DMAR table in the toolstack for discussion.
> 
> Change since V1:
> 	1) Update motivation for Xen vIOMMU - 288 vcpus support part
> 	2) Change definition of struct xen_sysctl_viommu_op
> 	3) Update "3.5 Implementation consideration" to explain why we need to
> enable l2 translation first.
> 	4) Update "4.3 Q35 vs I440x" - Linux/Windows VTD drivers can work on
> the emulated I440 chipset.
> 	5) Remove stale statement in the "3.3 Interrupt remapping"
> 
> Content:
> =====================================================
> ==========================
> 1. Motivation of vIOMMU
> 	1.1 Enable more than 255 vcpus
> 	1.2 Support VFIO-based user space driver
> 	1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 	2.1 l2 translation overview

L2/L1 might be more readable than l2/l1. :-)

> 	2.2 Interrupt remapping overview

to be complete, need an overview of l1 translation here

> 3. Xen hypervisor
> 	3.1 New vIOMMU hypercall interface
> 	3.2 l2 translation
> 	3.3 Interrupt remapping
> 	3.4 l1 translation
> 	3.5 Implementation consideration
> 4. Qemu
> 	4.1 Qemu vIOMMU framework
> 	4.2 Dummy xen-vIOMMU driver
> 	4.3 Q35 vs. i440x
> 	4.4 Report vIOMMU to hvmloader
> 
> 
> Glossary:
> =====================================================
> ===========================
> l1 translation - first-level translation to remap a virtual address to
> an intermediate (guest) physical address. (GVA->GPA)
> l2 translation - second-level translation to remap an intermediate
> physical address to a machine (host) physical address. (GPA->HPA)

If a glossary section is required, please make it complete (interrupt remapping, 
DMAR, etc.)

Also please stick to what spec says. I don't think 'intermediate' physical
address is a widely-used term, and GVA->GPA/GPA->HPA are only partial
usages of those structures. You may make them an example, but be
careful with the definition.

> 
> 1 Motivation for Xen vIOMMU
> =====================================================
> ===========================
> 1.1 Enable more than 255 vcpu support

vcpu->vcpus

> HPC cloud service requires VMs to provide high-performance parallel
> computing, and we hope to create a huge VM with >255 vcpus on one machine
> to meet such a requirement, pinning each vcpu to a separate pcpu.
> 
> Now an HVM guest can support 128 vcpus at most. We can increase the vcpu
> number from 128 to 255 by changing some limitations and extending vcpu-related
> data structures. This also needs a change to the rule for allocating a vcpu's
> APIC ID. The current rule is "(APIC ID) = (vcpu index) * 2". We need to
> change it to "(APIC ID) = (vcpu index)". Andrew Cooper's CPUID
> improvement work will cover this to improve the guest's cpu topology. We
> will build on this to increase the vcpu number from 128 to 255.
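For clarity, the allocation-rule change described above amounts to the following (a sketch only; function names are illustrative, not from the Xen source):

```c
#include <assert.h>
#include <stdint.h>

/* Current rule: APIC IDs are spaced by 2, so the 8-bit xAPIC ID space
 * is exhausted at 128 vcpus (vcpu 128 would need ID 256). The proposed
 * identity rule allows up to 255 vcpus under xAPIC (ID 255 is broadcast).
 */
static uint32_t apic_id_current(uint32_t vcpu_index)
{
    return vcpu_index * 2;
}

static uint32_t apic_id_proposed(uint32_t vcpu_index)
{
    return vcpu_index;
}
```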
> 
> To support >255 vcpus, x2APIC mode in the guest is necessary, because the
> legacy APIC (xAPIC) supports only an 8-bit APIC ID and so can address 255
> vcpus at most. x2APIC mode supports a 32-bit APIC ID, and it requires the
> interrupt remapping function of the vIOMMU.
> 
> The reason for this is that there is no modification to the existing PCI MSI
> and IOAPIC with the introduction of x2APIC. PCI MSI/IOAPIC can only send an
> interrupt message containing an 8-bit APIC ID, which cannot address >255
> cpus. Interrupt remapping supports 32-bit APIC IDs, so it is necessary
> for enabling >255 cpus with x2APIC mode.
> 
> Both Linux and Windows require interrupt remapping when the cpu number is >255.
> 
> 
> 1.2 Support VFIO-based user space driver (e.g. DPDK) in the guest
> It relies on the l2 translation capability (IOVA->GPA) on

GIOVA->GPA to be consistent

> vIOMMU. pIOMMU l2 becomes a shadowing structure of
> vIOMMU to isolate DMA requests initiated by user space driver.
> 

You may give more background on how VFIO manages user space drivers
to make the whole picture clearer, like what you did for >255 vcpus support.

> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> It relies on the l1 translation table capability (GVA->GPA) on
> vIOMMU. pIOMMU needs to enable both l1 and l2 translation in nested
> mode (GVA->GPA->HPA) for passthrough device. IGD passthrough
> is the main usage today (to support OpenCL 2.0 SVM feature). In the
> future SVM might be used by other I/O devices too.

At least make it clear that SVM is about sharing a virtual address space between
the CPU and the device side, so a CPU virtual address can be programmed as a
DMA destination for the device. The format of the L1 structure is compatible
with the CPU page table.

> 
> 
> 
> 2. Xen vIOMMU Architecture
> =====================================================
> ===========================
> 
> * vIOMMU will be inside the Xen hypervisor for the following reasons:
> 	1) Avoid round trips between Qemu and Xen hypervisor
> 	2) Ease of integration with the rest of the hypervisor
> 	3) HVMlite/PVH doesn't use Qemu

3) Maximum code reuse for HVMlite/PVH which doesn't use Qemu at all

> * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercalls to
> create/destroy the vIOMMU in the hypervisor and to deal with virtual PCI
> devices' l2 translation.
> 
> 2.1 l2 translation overview
> For a virtual PCI device, the dummy xen-vIOMMU does translation in
> Qemu via the new hypercall.
> 
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO page table from

what's "IO page table"? L2 translation?

> IOVA->GPA to IOVA->HPA and loads the page table into the physical IOMMU.
> 
> The following diagram shows l2 translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview.
> Interrupts from virtual devices and physical devices will be delivered
> to vLAPIC from vIOAPIC and vMSI. vIOMMU will remap interrupt during this

from vIOAPIC and vMSI to vLAPIC

> procedure.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                                  +-------------------+
>                                  |   PCI Device      |
>                                  +-------------------+
> 
> 

You introduced SVM usage in the earlier motivation, but there is no information
about it in this design section. Please make it clear whether that's a mistake.
If the intention is not to cover SVM virtualization in this design, please say
so at the start of the design.

> 
> 
> 3 Xen hypervisor
> =====================================================
> =====================
> 
> 3.1 New hypercall XEN_dmop_viommu_op
> Create a new dmop (device model operation hypercall) for the vIOMMU, since it
> will be called by Qemu at runtime. This hypercall should also
> support PV IOMMU, which is still under RFC review. Here we only cover the
> non-PV part.

Not sure dmop is a good terminology. Is it possible to build it on top of
the category used for HVMlite (suppose it needs some device model
related hypercalls from toolstack)?

> 
> 1) Definition of "struct xen_dmop_viommu_op" as new hypercall parameter.
> 
> struct xen_dmop_viommu_op {
> 	u32 cmd;
> 	u32 domid;
> 	u32 viommu_id;
> 	union {
> 		struct {
> 			u32 capabilities;

OUT parameter?

> 		} query_capabilities;
> 		struct {
> 			/* IN parameters. */
> 			u32 capabilities;
> 			u64 base_address;
> 			struct {
> 				u32 size;
> 				XEN_GUEST_HANDLE_64(uint32) dev_list;
> 			} dev_scope;
> 			/* Out parameters. */
> 			u32 viommu_id;

duplicated with earlier viommu_id?

> 		} create_iommu;
> 		struct {
> 			/* IN parameters. */
> 			u32 vsbdf;
> 			u64 iova;
> 			/* Out parameters. */
> 			u64 translated_addr;
> 			u64 addr_mask; /* Translation page size */
> 			u32 permission;
> 		} l2_translation;
> 	}
> };
> 
> 
> Definition of VIOMMU access permission:

VIOMMU 'memory' access permission?

> #define VIOMMU_NONE 	0
> #define	VIOMMU_RO   	1
> #define	VIOMMU_WO   	2
> #define	VIOMMU_RW    	3
> 
> 
> Definition of VIOMMU subops:
> #define XEN_DMOP_viommu_query_capability		0
> #define XEN_DMOP_viommu_create				1
> #define XEN_DMOP_viommu_destroy				2
> #define XEN_DMOP_viommu_dma_translation_for_vpdev 	3

what's vpdev? virtual device in Qemu?

> #define XEN_DMOP_viommu_dma_drain_completed		4
> 
> Definition of VIOMMU capabilities
> #define XEN_VIOMMU_CAPABILITY_l1_translation		(1 << 0)
> #define XEN_VIOMMU_CAPABILITY_l2_translation		(1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)
> 
> 
> 2) Design for subops
> - XEN_DMOP_viommu_query_capability
> 	Get vIOMMU capabilities(l1/l2 translation and interrupt
> remapping).
> 
> - XEN_DMOP_viommu_create
> 	Create vIOMMU in the Xen hypervisor with dom_id, capabilities, reg
> base address and device scope. If the size of the device list is 0, all PCI
> devices are under this vIOMMU except PCI devices assigned to another
> vIOMMU. The hypervisor returns the vIOMMU id.

Is it clearer to follow VT-d spec by using a INCLUDE_ALL flag
for this purpose?

> 
> - XEN_DMOP_viommu_destroy
> 	Destroy the vIOMMU in the Xen hypervisor with dom_id as the parameter.

don't you require a viommu_id?

> 
> - XEN_DMOP_viommu_dma_translation_for_vpdev
>         Translate IOVA to GPA for a specified virtual PCI device, given the
> dom id, the PCI device's bdf and the IOVA; the Xen hypervisor returns the
> translated GPA, address mask and access permission.
> 
> - XEN_DMOP_viommu_dma_drain_completed
> 	Notify the hypervisor that the dummy vIOMMU has drained in-flight DMA
> after an invalidation operation, so that the vIOMMU can mark the invalidation
> as completed in the invalidation register.
> 
> 
> 3.2 l2 translation
> 1) For virtual PCI device
> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the new
> hypercall when a DMA operation happens.
> 
> When the guest triggers an invalidation operation, there may be in-flight DMA
> requests for virtual devices that have already been translated by the vIOMMU
> and returned to Qemu. Before the vIOMMU reports the invalidation as completed,
> it's necessary to make sure these in-flight DMA operations are completed.

Please be clear that above is required only when read/write draining is
implied. Not all invalidations require it.

> 
> When the IOMMU driver invalidates the IOTLB, it will also wait until the
> invalidation completes. We may use this to drain in-flight DMA operations
> for the virtual device.

Host or guest IOMMU driver? We may use 'what' to drain?

> 
> Guest triggers invalidation operation and trip into vIOMMU in

trip->trap

> hypervisor to flush cache data. After this, it should go to Qemu to
> drain in-fly DMA translation.

After what? Qemu drain should happen as part of trap-emulation of
guest invalidation operation. Also who is 'it'?

> 
> To do that, dummy vIOMMU in Qemu registers the same MMIO region as
> vIOMMU's and emulation part of invalidation operation in Xen hypervisor

emulation part -> emulation handler

> returns X86EMUL_UNHANDLEABLE after flush cache. MMIO emulation part is

after flush 'physical' cache?

> supposed to send an event to Qemu so the dummy vIOMMU gets a chance to start a
> thread to drain in-flight DMA and return emulation done.

suppose guest vcpu is blocked in this phase, right?

> 
> Guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate register
> until it's cleared after triggering an invalidation. The dummy vIOMMU in Qemu
> notifies the hypervisor that the drain operation has completed via hypercall,
> the vIOMMU clears the IVT bit and the guest finishes the invalidation operation.

So the basic idea is to have the vIOMMU implement invalidation requests
as unhandled emulation when memory draining is specified, which
results in an io request sent to Qemu to enable specific draining for
virtual devices. Then why do you need a separate thread and a
hypercall to notify the hypervisor? Once the Qemu xen-viommu
wrapper completes emulation of the invalidation requests, the standard
io completion flow will resume back into the Xen hypervisor to unblock the
vcpu...

> 
> 
> 2) For physical PCI device
> DMA operations go through the physical IOMMU directly, and an IO page table
> for IOVA->HPA should be loaded into the physical IOMMU. When the guest updates
> the l2 IO page table pointer field in the context entry, it provides an IO page
> table for IOVA->GPA. The vIOMMU needs to shadow the l2 IO page table, translate
> GPA->HPA and write the shadow page table (IOVA->HPA) pointer into the l2
> page-table pointer in the context entry of the physical IOMMU. The IOMMU driver
> invalidates the associated page after changing the l2 IO page table when the
> cache mode bit is set in the capability register. We can use this to shadow the
> IO page table.
> 
> Now all PCI devices in the same HVM domain share one IO page table
> (GPA->HPA) in the physical IOMMU driver of Xen. To support vIOMMU l2
> translation, the IOMMU driver needs to support multiple address
> spaces per device entry, using the existing IO page table (GPA->HPA)
> by default and switching to the shadow IO page table (IOVA->HPA) when the
> vIOMMU l2 translation function is enabled. These changes will not affect the
> current P2M logic.
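A rough sketch of the shadowing step described above (purely illustrative; the real VT-d context-entry and page-table formats, and the real P2M lookup, are more involved):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 512

/* Toy P2M: guest frame number -> host frame number. A stand-in for the
 * real P2M lookup; here just a fixed offset for demonstration.
 */
static uint64_t p2m_translate(uint64_t gfn)
{
    return gfn + 0x100000;
}

/* Shadow one level of the guest's l2 IO page table: every present guest
 * entry holds a GPA, but the shadow entry must hold the corresponding
 * HPA so the physical IOMMU can walk it directly.
 */
static void shadow_io_pgtable(const uint64_t *guest, uint64_t *shadow)
{
    for (size_t i = 0; i < ENTRIES; i++) {
        if (guest[i] & 1)   /* present bit, as in VT-d entries */
            shadow[i] = (p2m_translate(guest[i] >> 12) << 12) | 1;
        else
            shadow[i] = 0;  /* non-present entries stay blocked */
    }
}
```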
> 
> 3.3 Interrupt remapping
> Interrupts from virtual devices and physical devices will be delivered
> to the vlapic from the vIOAPIC and vMSI. It needs interrupt remapping
> hooks added in vmsi_deliver() and ioapic_deliver() to find the target vlapic
> according to the interrupt remapping table.
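A minimal sketch of such a hook (apart from the vmsi_deliver()/ioapic_deliver() names above, every identifier here is hypothetical, and the real IRTE layout follows the VT-d spec):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified interrupt remapping table entry: the guest
 * programs its devices with remappable-format interrupts whose handle
 * indexes this table, and the vIOMMU looks up the real target there.
 */
struct virte {
    uint8_t  present;
    uint32_t dest_apic_id;  /* 32-bit APIC ID, usable with >255 vcpus */
    uint8_t  vector;
};

static struct virte irt[256];  /* vIOMMU interrupt remapping table */

/* Hook for the vmsi/vioapic delivery path: translate a remapping handle
 * into (destination APIC ID, vector), or fail if the entry is invalid.
 */
static int viommu_remap_irq(uint16_t handle,
                            uint32_t *dest, uint8_t *vector)
{
    if (handle >= 256 || !irt[handle].present)
        return -1;          /* blocked: no valid IRTE */
    *dest = irt[handle].dest_apic_id;
    *vector = irt[handle].vector;
    return 0;
}
```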
> 
> 
> 3.4 l1 translation
> To enable l1 translation in the guest:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) Shadow the guest's l1 translation root table (PASID table pointer) into the
> pIOMMU's context entry.
> 
> When pIOMMU nested translation is enabled, any address generated by l1
> translation is used as the input address for nesting with l2
> translation. That means pIOMMU will translate GPA->HPA during l1
> translation in guest and so pIOMMU needs to enable both l1 and l2
> translation in nested translation mode(GVA->GPA->HPA) for passthrough
> device. The guest's l1 translation root table can be directly written
> into pIOMMU context entry.
> 
> All are handled in hypervisor and no interactions with Qemu are required.

Looks like you do cover SVM virtualization here... then please include
a design in the earlier section.

> 
> 3.5 Implementation consideration
> The VT-d spec doesn't define a capability bit for l2 translation, so
> architecturally there is no way to tell the guest that the l2 translation
> capability is not available. When the Linux Intel IOMMU driver enables l2
> translation, it panics if the enable fails.
> 
> There is a kernel parameter "intel_iommu=on" and a Kconfig option
> CONFIG_INTEL_IOMMU_DEFAULT_ON which control the l2 translation function.
> When they aren't set, the l2 translation function will not be enabled by the
> IOMMU driver even if some vIOMMU registers show the l2 translation function
> as available. In the meantime, the irq remapping function can still work to
> support >255 vcpus.
> 
> We checked that the RHEL, SLES, Oracle and Ubuntu distributions don't set the
> kernel parameter or select the Kconfig option. So it's still safe to emulate
> interrupt remapping first, with some capability bits (e.g. SAGAW of the
> Capability Register) of l2 translation, for >255 vcpus support without l2
> translation emulation.
> 
> Showing the l2 capability bits is to make sure the IOMMU driver parses the
> ACPI DMAR tables successfully, because the IOMMU driver accesses these bits
> while reading the ACPI tables. Otherwise, the IOMMU instance will be freed on
> failure.

You said no capability bit for L2 translation. But here you say
"showing l2 capability bits"...

> 
> If someone adds the "intel_iommu=on" kernel parameter manually, the IOMMU
> driver will panic the guest because it can't enable the DMA remapping function
> via the gcmd register and the "Translation Enable Status" bit in the gsts
> register is never set by the vIOMMU. This reflects the actual vIOMMU status,
> that there is no l2 translation support, and warns the user not to enable l2
> translation.

The rationale of section 3.5 is confusing. Do you mean sth. like below?

- We can first do IRQ remapping, because DMA remapping (l1/l2) and 
IRQ remapping can be enabled separately according to VT-d spec. Enabling 
of DMA remapping will be first emulated as a failure, which may lead
to guest kernel panic if intel_iommu is turned on in the guest. But it's
not a big problem because major distributions have DMA remapping
disabled by default while IRQ remapping is enabled.

- For DMA remapping, likely you'll enable L2 translation first (there is
no capability bit) with L1 translation disabled (there is a SVM capability 
bit). 

If yes, maybe we can break this design into 3 parts too, so both
design review and implementation side can move forward step by
step?

> 
> 
> 
> 4 Qemu
> =====================================================
> =========================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d and
> AMD IOMMU) and report it in the guest ACPI table. So on the Xen side, a dummy
> xen-vIOMMU wrapper is required to connect with the actual vIOMMU in Xen,
> especially for l2 translation of virtual PCI devices, because the
> emulations of virtual PCI devices are in Qemu. Qemu's vIOMMU
> framework provides a callback to deal with l2 translation when
> DMA operations of virtual PCI devices happen.
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping and
> Shared Virtual Memory) via hypercall.
> 
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall, with the
> DRHD register address and the desired capabilities as parameters. Destroy the
> vIOMMU when the VM is shut down.
> 
> 3) Virtual PCI device's l2 translation
> Qemu already provides a DMA translation hook. It's called when DMA
> translation of a virtual PCI device happens. The dummy xen-vIOMMU passes the
> device bdf and IOVA into the Xen hypervisor via the new iommu hypercall and
> gets back the translated GPA.
> 
> 
> 4.3 Q35 vs I440x
> VT-d is introduced with the Q35 chipset. The previous concern was that the
> VT-d driver assumes VT-d only exists on Q35 and newer chipsets, and that
> we would have to enable Q35 first. After experiments, Linux/Windows guests
> can boot on the emulated I440x chipset with VT-d, and the VT-d driver enables
> the interrupt remapping function. So we can skip Q35 support and implement
> the vIOMMU directly.
> 
> 4.4 Report vIOMMU to hvmloader
> Hvmloader is in charge of building ACPI tables for the guest OS, and the OS
> probes the IOMMU via the ACPI DMAR table. There are two ways to pass the DMAR
> table to hvmloader. Either way, hvmloader needs to know whether the vIOMMU is
> enabled and its capabilities, to prepare the ACPI DMAR table for the guest OS.
> 
> 1) Build ACPI DMAR table in Qemu and pass it to hvmloader via Xenstore.
> This solution is already present in the vNVDIMM design(4.3.1
> Building Guest ACPI Tables
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> 
> 2) Build ACPI DMAR table in the toolstack
> The toolstack can now build the ACPI DMAR table according to the VM
> configuration and pass it through to hvmloader via the xenstore ACPI PT
> channel. But the vIOMMU MMIO region is managed by Qemu, and it needs to be
> populated into the DMAR table. We may hardcode an address in both Qemu and
> the toolstack and use the same address to create the vIOMMU and build the
> DMAR table.
> 


* Re: Xen virtual IOMMU high level design doc V3
  2016-11-21  2:21                                 ` Lan, Tianyu
@ 2016-11-21 13:17                                   ` Julien Grall
  2016-11-21 18:24                                     ` Stefano Stabellini
  0 siblings, 1 reply; 86+ messages in thread
From: Julien Grall @ 2016-11-21 13:17 UTC (permalink / raw)
  To: Lan, Tianyu, Jan Beulich, Andrew Cooper, Stefano Stabellini
  Cc: yang.zhang.wz, Kevin Tian, xen-devel, ian.jackson, xuquan8,
	Jun Nakajima, anthony.perard, Roger Pau Monne



On 21/11/2016 02:21, Lan, Tianyu wrote:
> On 11/19/2016 3:43 AM, Julien Grall wrote:
>> On 17/11/2016 09:36, Lan Tianyu wrote:
> Hi Julien:

Hello Lan,

>     Thanks for your input. This interface is just for virtual PCI devices
> and is called by Qemu. I am not familiar with ARM. Are there any
> non-PCI emulated devices for ARM in Qemu which need to be covered by
> the vIOMMU?

We don't use QEMU on ARM so far, so I guess it should be ok for now. ARM 
guests are very similar to hvmlite/pvh. I got confused and thought this 
design document was targeting pvh too.

BTW, in the design document you mention hvmlite/pvh. Does it mean you 
plan to bring support of vIOMMU for those guests later on?

Regards,


-- 
Julien Grall


* Re: Xen virtual IOMMU high level design doc V3
  2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
  2016-11-18 19:43                               ` Julien Grall
  2016-11-21  7:05                               ` Tian, Kevin
@ 2016-11-21 13:41                               ` Andrew Cooper
  2016-11-22  6:02                                 ` Tian, Kevin
  2016-11-22  8:32                                 ` Lan Tianyu
  2016-11-22 10:24                               ` Jan Beulich
  3 siblings, 2 replies; 86+ messages in thread
From: Andrew Cooper @ 2016-11-21 13:41 UTC (permalink / raw)
  To: Lan Tianyu, Jan Beulich, Kevin Tian, yang.zhang.wz, Jun Nakajima,
	Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 17/11/16 15:36, Lan Tianyu wrote:
> 3.2 l2 translation
> 1) For virtual PCI device
> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the new
> hypercall when a DMA operation happens.
>
> When the guest triggers an invalidation operation, there may be in-flight DMA
> requests for virtual devices that have already been translated by the vIOMMU
> and returned to Qemu. Before the vIOMMU reports the invalidation as completed,
> it's necessary to make sure these in-flight DMA operations are completed.
>
> When the IOMMU driver invalidates the IOTLB, it will also wait until the
> invalidation completes. We may use this to drain in-flight DMA operations
> for the virtual device.
>
> Guest triggers invalidation operation and trip into vIOMMU in
> hypervisor to flush cache data. After this, it should go to Qemu to
> drain in-fly DMA translation.
>
> To do that, dummy vIOMMU in Qemu registers the same MMIO region as
> vIOMMU's and emulation part of invalidation operation in Xen hypervisor
> returns X86EMUL_UNHANDLEABLE after flush cache. MMIO emulation part is
> supposed to send an event to Qemu so the dummy vIOMMU gets a chance to start a
> thread to drain in-flight DMA and return emulation done.
>
> Guest polls the IVT (invalidate IOTLB) bit in the IOTLB invalidate register
> until it's cleared after triggering an invalidation. The dummy vIOMMU in Qemu
> notifies the hypervisor that the drain operation has completed via hypercall,
> the vIOMMU clears the IVT bit and the guest finishes the invalidation operation.

Having the guest poll will be very inefficient.  If the invalidation
does need to reach qemu, it will be a very long time until it
completes.  Is there no interrupt based mechanism which can be used? 
That way the guest can either handle it asynchronous itself, or block
waiting on an interrupt, both of which are better than having it just
spinning.

~Andrew


* Re: Xen virtual IOMMU high level design doc V3
  2016-11-21 13:17                                   ` Julien Grall
@ 2016-11-21 18:24                                     ` Stefano Stabellini
  0 siblings, 0 replies; 86+ messages in thread
From: Stefano Stabellini @ 2016-11-21 18:24 UTC (permalink / raw)
  To: Julien Grall
  Cc: Lan, Tianyu, yang.zhang.wz, Kevin Tian, Stefano Stabellini,
	Jun Nakajima, Andrew Cooper, ian.jackson, xuquan8, xen-devel,
	Jan Beulich, anthony.perard, Roger Pau Monne

On Mon, 21 Nov 2016, Julien Grall wrote:
> On 21/11/2016 02:21, Lan, Tianyu wrote:
> > On 11/19/2016 3:43 AM, Julien Grall wrote:
> > > On 17/11/2016 09:36, Lan Tianyu wrote:
> > Hi Julien:
> 
> Hello Lan,
> 
> >     Thanks for your input. This interface is just for virtual PCI devices
> > and is called by Qemu. I am not familiar with ARM. Are there any
> > non-PCI emulated devices for ARM in Qemu which need to be covered by
> > the vIOMMU?
> 
> We don't use QEMU on ARM so far, so I guess it should be ok for now. ARM
> guests are very similar to hvmlite/pvh. I got confused and thought this design
> document was targeting pvh too.
> 
> BTW, in the design document you mention hvmlite/pvh. Does it mean you plan to
> bring support of vIOMMU for those guests later on?

I quickly went through the document. I don't think we should restrict
the design to only one caller: QEMU. In fact it looks like those
hypercalls, without any modifications, could be called from the
toolstack (xl/libxl) in the case of PVH guests. In other words,
PVH guests might work without any additional effort on the hypervisor
side.

And they might even work on ARM. I have a couple of suggestions to
make the hypercalls a bit more "future proof" and architecture agnostic.

Imagine a future where two vIOMMU versions are supported. We could have
a uint32_t iommu_version field to identify what version of IOMMU we are
creating (create_iommu and query_capabilities commands). This could be
useful even on Intel platforms.

Given that in the future we might support a vIOMMU that takes ids other
than an sbdf as input, I would change "u32 vsbdf" into the following:

  #define XENVIOMMUSPACE_vsbdf  0
  uint16_t space;
  uint64_t id;

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Xen virtual IOMMU high level design doc V3
  2016-11-21 13:41                               ` Andrew Cooper
@ 2016-11-22  6:02                                 ` Tian, Kevin
  2016-11-22  8:32                                 ` Lan Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Tian, Kevin @ 2016-11-22  6:02 UTC (permalink / raw)
  To: Andrew Cooper, Lan, Tianyu, Jan Beulich, yang.zhang.wz, Nakajima,
	Jun, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

> From: Andrew Cooper [mailto:andrew.cooper3@citrix.com]
> Sent: Monday, November 21, 2016 9:41 PM
> 
> On 17/11/16 15:36, Lan Tianyu wrote:
> > 3.2 l2 translation
> > 1) For virtual PCI devices
> > The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via a
> > new hypercall when a DMA operation happens.
> >
> > When the guest triggers an invalidation operation, there may be
> > in-flight DMA requests for a virtual device that have already been
> > translated by the vIOMMU and returned to Qemu. Before the vIOMMU
> > reports the invalidation as complete, it must make sure those
> > in-flight DMA operations have finished.
> >
> > When the IOMMU driver invalidates the IOTLB, it also waits for the
> > invalidation to complete. We may use this to drain in-flight DMA
> > operations for virtual devices.
> >
> > The guest triggers an invalidation operation and traps into the
> > vIOMMU in the hypervisor to flush cached data. After this, control
> > should go to Qemu to drain in-flight DMA translations.
> >
> > To do that, the dummy vIOMMU in Qemu registers the same MMIO region
> > as the vIOMMU's, and the emulation part of the invalidation operation
> > in the Xen hypervisor returns X86EMUL_UNHANDLEABLE after flushing the
> > cache. The MMIO emulation part is supposed to send an event to Qemu,
> > and the dummy vIOMMU gets a chance to start a thread to drain
> > in-flight DMA and then report the emulation as done.
> >
> > The guest polls the IVT (invalidate IOTLB) bit in the IOTLB
> > invalidate register until it's cleared after triggering the
> > invalidation. The dummy vIOMMU in Qemu notifies the hypervisor via
> > hypercall that the drain operation has completed, the vIOMMU clears
> > the IVT bit, and the guest finishes the invalidation operation.
> 
> Having the guest poll will be very inefficient.  If the invalidation
> does need to reach qemu, it will be a very long time until it
> completes.  Is there no interrupt-based mechanism which can be used?
> That way the guest can either handle it asynchronously itself, or block
> waiting on an interrupt, both of which are better than having it just
> spinning.
> 

The VT-d spec supports both poll and interrupt modes; the choice is made
by the guest IOMMU driver. That's not to say this design requires the
guest to use poll mode; I guess Tianyu used it as an example flow.

Thanks
Kevin

* Re: Xen virtual IOMMU high level design doc V3
  2016-11-21 13:41                               ` Andrew Cooper
  2016-11-22  6:02                                 ` Tian, Kevin
@ 2016-11-22  8:32                                 ` Lan Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-11-22  8:32 UTC (permalink / raw)
  To: Andrew Cooper, Jan Beulich, Kevin Tian, yang.zhang.wz,
	Jun Nakajima, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 2016年11月21日 21:41, Andrew Cooper wrote:
> On 17/11/16 15:36, Lan Tianyu wrote:
>> 3.2 l2 translation
>> 1) For virtual PCI devices
>> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via a
>> new hypercall when a DMA operation happens.
>>
>> When the guest triggers an invalidation operation, there may be
>> in-flight DMA requests for a virtual device that have already been
>> translated by the vIOMMU and returned to Qemu. Before the vIOMMU
>> reports the invalidation as complete, it must make sure those
>> in-flight DMA operations have finished.
>>
>> When the IOMMU driver invalidates the IOTLB, it also waits for the
>> invalidation to complete. We may use this to drain in-flight DMA
>> operations for virtual devices.
>>
>> The guest triggers an invalidation operation and traps into the
>> vIOMMU in the hypervisor to flush cached data. After this, control
>> should go to Qemu to drain in-flight DMA translations.
>>
>> To do that, the dummy vIOMMU in Qemu registers the same MMIO region
>> as the vIOMMU's, and the emulation part of the invalidation operation
>> in the Xen hypervisor returns X86EMUL_UNHANDLEABLE after flushing the
>> cache. The MMIO emulation part is supposed to send an event to Qemu,
>> and the dummy vIOMMU gets a chance to start a thread to drain
>> in-flight DMA and then report the emulation as done.
>>
>> The guest polls the IVT (invalidate IOTLB) bit in the IOTLB
>> invalidate register until it's cleared after triggering the
>> invalidation. The dummy vIOMMU in Qemu notifies the hypervisor via
>> hypercall that the drain operation has completed, the vIOMMU clears
>> the IVT bit, and the guest finishes the invalidation operation.
> 
> Having the guest poll will be very inefficient.  If the invalidation
> does need to reach qemu, it will be a very long time until it
> completes.  Is there no interrupt-based mechanism which can be used?
> That way the guest can either handle it asynchronously itself, or block
> waiting on an interrupt, both of which are better than having it just
> spinning.
> 

Hi Andrew:
VT-d provides an interrupt event for queued invalidation completion, so
the guest can select either poll or interrupt mode to wait for the
invalidation to complete. I found that the Linux Intel IOMMU driver just
uses poll mode, so I used that as the example. Regardless of poll or
interrupt mode, the guest will wait for invalidation completion; we just
need to make sure the draining of in-flight DMA finishes before clearing
the invalidation completion bit.

-- 
Best regards
Tianyu Lan


* Re: Xen virtual IOMMU high level design doc V3
  2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
                                                 ` (2 preceding siblings ...)
  2016-11-21 13:41                               ` Andrew Cooper
@ 2016-11-22 10:24                               ` Jan Beulich
  2016-11-24  2:34                                 ` Lan Tianyu
  3 siblings, 1 reply; 86+ messages in thread
From: Jan Beulich @ 2016-11-22 10:24 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne

>>> On 17.11.16 at 16:36, <tianyu.lan@intel.com> wrote:
> 2) Build the ACPI DMAR table in the toolstack
> Now the toolstack can build the ACPI DMAR table according to the VM
> configuration and pass it through to hvmloader via the xenstore ACPI PT
> channel. But the vIOMMU MMIO region is managed by Qemu, and it needs to
> be populated into the DMAR table. We may hardcode an address in both
> Qemu and the toolstack and use the same address to create the vIOMMU
> and build the DMAR table.

Let's try to avoid any new hardcoding of values. Both the tool stack
and qemu ought to be able to retrieve a suitable address range
from the hypervisor. Or, if the tool stack were to allocate it, it could
tell qemu.

Jan



* Re: Xen virtual IOMMU high level design doc V3
  2016-11-21  7:05                               ` Tian, Kevin
@ 2016-11-23  1:36                                 ` Lan Tianyu
  0 siblings, 0 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-11-23  1:36 UTC (permalink / raw)
  To: Tian, Kevin, Jan Beulich, Andrew Cooper, yang.zhang.wz, Nakajima,
	Jun, Stefano Stabellini
  Cc: anthony.perard, xuquan8, xen-devel, ian.jackson, Roger Pau Monne

On 2016年11月21日 15:05, Tian, Kevin wrote:
>> > If someone adds the "intel_iommu=on" kernel parameter manually, the
>> > IOMMU driver will panic the guest because it can't enable the DMA
>> > remapping function via the gcmd register: the "Translation Enable
>> > Status" bit in the gsts register is never set by the vIOMMU. This
>> > reflects the actual vIOMMU status (no l2 translation support) and
>> > warns the user not to enable l2 translation.
> The rationale of section 3.5 is confusing. Do you mean sth. like below?
> 
> - We can first do IRQ remapping, because DMA remapping (l1/l2) and
> IRQ remapping can be enabled separately according to the VT-d spec.
> Enabling DMA remapping will at first be emulated as a failure, which
> may lead to a guest kernel panic if intel_iommu is turned on in the
> guest. But that's not a big problem because major distributions have
> DMA remapping disabled by default while IRQ remapping is enabled.
> 
> - For DMA remapping, likely you'll enable L2 translation first (there is
> no capability bit) with L1 translation disabled (there is a SVM capability 
> bit). 
> 
> If yes, maybe we can break this design into 3 parts too, so both
> design review and implementation side can move forward step by
> step?
> 

Yes, we may implement IRQ remapping first. I will break this design into
3 parts (interrupt remapping, L2 translation and L1 translation). IRQ
remapping will be the first one sent out for detailed discussion.

-- 
Best regards
Tianyu Lan


* Re: Xen virtual IOMMU high level design doc
  2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
                                                 ` (2 preceding siblings ...)
  2016-09-15 14:22                               ` Lan, Tianyu
@ 2016-11-23 18:19                               ` Edgar E. Iglesias
  2016-11-23 19:09                                 ` Stefano Stabellini
  3 siblings, 1 reply; 86+ messages in thread
From: Edgar E. Iglesias @ 2016-11-23 18:19 UTC (permalink / raw)
  To: Lan, Tianyu
  Cc: yang.zhang.wz, Kevin Tian, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, xuquan8, Julien Grall, xen-devel,
	Jun Nakajima, anthony.perard, Roger Pau Monne

On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> Hi All:
>      The following is our Xen vIOMMU high level design for detailed
> discussion. Please have a look; your comments would be very much
> appreciated. This design doesn't cover the changes needed when the root
> port is moved to the hypervisor. We may design that later.

Hi,

I have a few questions.

If I understand correctly, you'll be emulating an Intel IOMMU in Xen,
so guests will essentially create Intel IOMMU style page-tables.

If we were to use this on Xen/ARM, we would likely be modelling an ARM
SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
hypervisor OPs for QEMU's xen dummy IOMMU queries would not really be
used. Do I understand this correctly?

Has a platform-agnostic PV-IOMMU been considered to support 2-stage
translation (i.e. VFIO in the guest)? Perhaps that would hurt map/unmap
performance too much?

Best regards,
Edgar




> 
> 
> Content:
> ===============================================================================
> 1. Motivation of vIOMMU
> 	1.1 Enable more than 255 vcpus
> 	1.2 Support VFIO-based user space driver
> 	1.3 Support guest Shared Virtual Memory (SVM)
> 2. Xen vIOMMU Architecture
> 	2.1 2nd level translation overview
> 	2.2 Interrupt remapping overview
> 3. Xen hypervisor
> 	3.1 New vIOMMU hypercall interface
> 	3.2 2nd level translation
> 	3.3 Interrupt remapping
> 	3.4 1st level translation
> 	3.5 Implementation consideration
> 4. Qemu
> 	4.1 Qemu vIOMMU framework
> 	4.2 Dummy xen-vIOMMU driver
> 	4.3 Q35 vs. i440x
> 	4.4 Report vIOMMU to hvmloader
> 
> 
> 1 Motivation for Xen vIOMMU
> ===============================================================================
> 1.1 Enable more than 255 vcpu support
> HPC virtualization requires support for more than 255 vcpus in a single
> VM to meet parallel computing requirements. Supporting more than 255
> vcpus requires an interrupt remapping capability on the vIOMMU to
> deliver interrupts to vcpus with APIC ID > 255; otherwise a Linux guest
> fails to boot with more than 255 vcpus.
> 
> 
> 1.2 Support VFIO-based user space drivers (e.g. DPDK) in the guest
> This relies on the 2nd level translation capability (IOVA->GPA) on the
> vIOMMU. The pIOMMU's 2nd level becomes a shadowing structure of the
> vIOMMU to isolate DMA requests initiated by the user space driver.
> 
> 
> 1.3 Support guest SVM (Shared Virtual Memory)
> This relies on the 1st level translation table capability (GVA->GPA) on
> the vIOMMU. The pIOMMU needs to enable both 1st and 2nd level
> translation in nested mode (GVA->GPA->HPA) for the passthrough device.
> IGD passthrough is the main usage today (to support the OpenCL 2.0 SVM
> feature). In the future SVM might be used by other I/O devices too.
> 
> 2. Xen vIOMMU Architecture
> ================================================================================
> 
> * vIOMMU will be inside the Xen hypervisor for the following reasons:
> 	1) Avoid round trips between Qemu and the Xen hypervisor
> 	2) Ease of integration with the rest of the hypervisor
> 	3) HVMlite/PVH doesn't use Qemu
> * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new hypercall
> to create/destroy the vIOMMU in the hypervisor and deal with the
> virtual PCI device's 2nd level translation.
> 
> 2.1 2nd level translation overview
> For a virtual PCI device, the dummy xen-vIOMMU does the translation in
> Qemu via the new hypercall.
> 
> For a physical PCI device, the vIOMMU in the hypervisor shadows the IO
> page table from IOVA->GPA to IOVA->HPA and loads the page table into
> the physical IOMMU.
> 
> The following diagram shows the 2nd level translation architecture.
> +---------------------------------------------------------+
> |Qemu                                +----------------+   |
> |                                    |     Virtual    |   |
> |                                    |   PCI device   |   |
> |                                    |                |   |
> |                                    +----------------+   |
> |                                            |DMA         |
> |                                            V            |
> |  +--------------------+   Request  +----------------+   |
> |  |                    +<-----------+                |   |
> |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> |  |                    +----------->+                |   |
> |  +---------+----------+            +-------+--------+   |
> |            |                               |            |
> |            |Hypercall                      |            |
> +--------------------------------------------+------------+
> |Hypervisor  |                               |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     |   vIOMMU    |                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> |            v                               |            |
> |     +------+------+                        |            |
> |     | IOMMU driver|                        |            |
> |     +------+------+                        |            |
> |            |                               |            |
> +--------------------------------------------+------------+
> |HW          v                               V            |
> |     +------+------+                 +-------------+     |
> |     |   IOMMU     +---------------->+  Memory     |     |
> |     +------+------+                 +-------------+     |
> |            ^                                            |
> |            |                                            |
> |     +------+------+                                     |
> |     | PCI Device  |                                     |
> |     +-------------+                                     |
> +---------------------------------------------------------+
> 
> 2.2 Interrupt remapping overview
> Interrupts from virtual and physical devices will be delivered to the
> vLAPIC from the vIOAPIC and vMSI. The vIOMMU will remap interrupts
> during this procedure.
> 
> +---------------------------------------------------+
> |Qemu                       |VM                     |
> |                           | +----------------+    |
> |                           | |  Device driver |    |
> |                           | +--------+-------+    |
> |                           |          ^            |
> |       +----------------+  | +--------+-------+    |
> |       | Virtual device |  | |  IRQ subsystem |    |
> |       +-------+--------+  | +--------+-------+    |
> |               |           |          ^            |
> |               |           |          |            |
> +---------------------------+-----------------------+
> |hypervisor     |                      | VIRQ       |
> |               |            +---------+--------+   |
> |               |            |      vLAPIC      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |      vIOMMU      |   |
> |               |            +---------+--------+   |
> |               |                      ^            |
> |               |                      |            |
> |               |            +---------+--------+   |
> |               |            |   vIOAPIC/vMSI   |   |
> |               |            +----+----+--------+   |
> |               |                 ^    ^            |
> |               +-----------------+    |            |
> |                                      |            |
> +---------------------------------------------------+
> HW                                     |IRQ
>                               +-------------------+
>                               |   PCI Device      |
>                               +-------------------+
> 
> 
> 
> 
> 
> 3 Xen hypervisor
> ==========================================================================
> 
> 3.1 New hypercall XEN_SYSCTL_viommu_op
> 1) Definition of "struct xen_sysctl_viommu_op" as the new hypercall
> parameter.
> 
> struct xen_sysctl_viommu_op {
> 	u32 cmd;
> 	u32 domid;
> 	union {
> 		struct {
> 			u32 capabilities;
> 		} query_capabilities;
> 		struct {
> 			u32 capabilities;
> 			u64 base_address;
> 		} create_iommu;
> 		struct {
> 			u8  bus;
> 			u8  devfn;
> 			u64 iova;
> 			u64 translated_addr;
> 			u64 addr_mask; /* Translation page size */
> 			IOMMUAccessFlags permission;
> 		} l2_translation;
> 	} u;
> };
> 
> typedef enum {
> 	IOMMU_NONE = 0,
> 	IOMMU_RO   = 1,
> 	IOMMU_WO   = 2,
> 	IOMMU_RW   = 3,
> } IOMMUAccessFlags;
> 
> 
> Definition of vIOMMU subops:
> #define XEN_SYSCTL_viommu_query_capability		0
> #define XEN_SYSCTL_viommu_create			1
> #define XEN_SYSCTL_viommu_destroy			2
> #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3
> 
> Definition of vIOMMU capabilities:
> #define XEN_VIOMMU_CAPABILITY_1st_level_translation	(1 << 0)
> #define XEN_VIOMMU_CAPABILITY_2nd_level_translation	(1 << 1)
> #define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)
> 
> 
> 2) Design of the subops
> - XEN_SYSCTL_viommu_query_capability
>       Get the vIOMMU capabilities (1st/2nd level translation and
> interrupt remapping).
> 
> - XEN_SYSCTL_viommu_create
>      Create a vIOMMU in the Xen hypervisor with the dom_id,
> capabilities and register base address as parameters.
> 
> - XEN_SYSCTL_viommu_destroy
>      Destroy the vIOMMU in the Xen hypervisor with the dom_id as
> parameter.
> 
> - XEN_SYSCTL_viommu_dma_translation_for_vpdev
>      Translate an IOVA to a GPA for the specified virtual PCI device,
> given the dom id, the PCI device's bdf and the IOVA; the Xen hypervisor
> returns the translated GPA, address mask and access permission.
> 
> 
> 3.2 2nd level translation
> 1) For virtual PCI devices
> The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the
> new hypercall when a DMA operation happens.
> 
> 2) For physical PCI devices
> DMA operations go through the physical IOMMU directly, so an IO page
> table for IOVA->HPA must be loaded into the physical IOMMU. When the
> guest updates the Second-level Page-table Pointer field, it provides an
> IO page table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level
> translation table, translate GPA->HPA and write the shadow page table
> (IOVA->HPA) pointer into the Second-level Page-table Pointer of the
> physical IOMMU's context entry.
> 
> Currently all PCI devices in the same hvm domain share one IO page
> table (GPA->HPA) in the physical IOMMU driver of Xen. To support the
> vIOMMU's 2nd level translation, the IOMMU driver needs to support
> multiple address spaces per device entry: use the existing IO page
> table (GPA->HPA) by default, and switch to the shadow IO page table
> (IOVA->HPA) when the 2nd level translation function is enabled. These
> changes will not affect the current P2M logic.
> 
> 3.3 Interrupt remapping
> Interrupts from virtual and physical devices will be delivered to the
> vLAPIC from the vIOAPIC and vMSI. Interrupt remapping hooks need to be
> added in vmsi_deliver() and ioapic_deliver() to find the target vLAPIC
> according to the interrupt remapping table.
> 
> 
> 3.4 1st level translation
> When nested translation is enabled, any address generated by
> first-level translation is used as the input address for nesting with
> second-level translation. The physical IOMMU needs to enable both 1st
> and 2nd level translation in nested translation mode (GVA->GPA->HPA)
> for the passthrough device.
> 
> The VT-d context entry points to the guest's 1st level translation
> table, which will be nest-translated by the 2nd level translation
> table, so it can be directly linked into the context entry of the
> physical IOMMU.
> 
> To enable 1st level translation in a VM:
> 1) The Xen IOMMU driver enables nested translation mode.
> 2) Update the GPA root of the guest's 1st level translation table into
> the context entry of the physical IOMMU.
> 
> All handling is in the hypervisor; there is no interaction with Qemu.
> 
> 
> 3.5 Implementation consideration
> The Linux Intel IOMMU driver will fail to load without 2nd level
> translation support, even if interrupt remapping and 1st level
> translation are available. This means 2nd level translation needs to be
> enabled before the other functions.
> 
> 
> 4 Qemu
> ==============================================================================
> 4.1 Qemu vIOMMU framework
> Qemu has a framework to create virtual IOMMUs (e.g. virtual Intel VT-d
> and AMD IOMMU) and report them in the guest ACPI table. So on the Xen
> side, a dummy xen-vIOMMU wrapper is required to connect with the actual
> vIOMMU in Xen, especially for 2nd level translation of virtual PCI
> devices, because the emulation of virtual PCI devices lives in Qemu.
> Qemu's vIOMMU framework provides a callback to deal with 2nd level
> translation when DMA operations of virtual PCI devices happen.
> 
> 
> 4.2 Dummy xen-vIOMMU driver
> 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> and Shared Virtual Memory) via hypercall.
> 
> 2) Create the vIOMMU in the Xen hypervisor via the new hypercall with
> the DRHD register base address and the desired capabilities as
> parameters. Destroy the vIOMMU when the VM is shut down.
> 
> 3) Virtual PCI device's 2nd level translation
> Qemu already provides a DMA translation hook, called when a DMA
> translation of a virtual PCI device happens. The dummy xen-vIOMMU
> passes the device bdf and IOVA into the Xen hypervisor via the new
> iommu hypercall and gets back the translated GPA.
> 
> 
> 4.3 Q35 vs i440x
> VT-d has been present since the Q35 chipset. The previous concern was
> that IOMMU drivers assume VT-d only exists on Q35 and newer chipsets,
> which would force us to enable Q35 first.
> 
> We consulted Linux/Windows IOMMU driver experts and learned that these
> drivers don't make such an assumption. So we may skip the Q35
> implementation and emulate the vIOMMU on the i440x chipset. KVM already
> has vIOMMU support with virtual PCI device DMA translation and
> interrupt remapping. We are using KVM to experiment with adding a
> vIOMMU on the i440x and testing Linux/Windows guests. We will report
> back when we have some results.
> 
> 
> 4.4 Report the vIOMMU to hvmloader
> Hvmloader is in charge of building the ACPI tables for the guest OS,
> and the OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs
> to know whether the vIOMMU is enabled or not, and its capabilities, in
> order to prepare the ACPI DMAR table for the guest OS.
> 
> There are three ways to do that:
> 1) Extend struct hvm_info_table and add variables in it to pass vIOMMU
> information to hvmloader. But this requires a new xc interface to use
> struct hvm_info_table in Qemu.
> 
> 2) Pass vIOMMU information to hvmloader via Xenstore.
> 
> 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via
> Xenstore. This solution is already present in the vNVDIMM design (4.3.1
> Building Guest ACPI Tables,
> http://www.gossamer-threads.com/lists/xen/devel/439766).
> 
> The third option seems cleanest: hvmloader doesn't need to deal with
> vIOMMU details and just passes the DMAR table through to the guest OS.
> All vIOMMU-specific handling will be done in the dummy xen-vIOMMU
> driver.
> 
> 
> 
> 


* Re: Xen virtual IOMMU high level design doc
  2016-11-23 18:19                               ` Edgar E. Iglesias
@ 2016-11-23 19:09                                 ` Stefano Stabellini
  2016-11-24  2:00                                   ` Tian, Kevin
  0 siblings, 1 reply; 86+ messages in thread
From: Stefano Stabellini @ 2016-11-23 19:09 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Lan, Tianyu, yang.zhang.wz, Kevin Tian, Stefano Stabellini,
	Jan Beulich, Andrew Cooper, ian.jackson, xuquan8, Julien Grall,
	xen-devel, Jun Nakajima, anthony.perard, Roger Pau Monne

On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > Hi All:
> >      The following is our Xen vIOMMU high level design for detailed
> > discussion. Please have a look; your comments would be very much
> > appreciated. This design doesn't cover the changes needed when the
> > root port is moved to the hypervisor. We may design that later.
> 
> Hi,
> 
> I have a few questions.
> 
> If I understand correctly, you'll be emulating an Intel IOMMU in Xen,
> so guests will essentially create Intel IOMMU style page-tables.
> 
> If we were to use this on Xen/ARM, we would likely be modelling an ARM
> SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> hypervisor OPs for QEMU's xen dummy IOMMU queries would not really be
> used. Do I understand this correctly?

I think they could be called from the toolstack. This is why I was
saying in the other thread that the hypercalls should be general enough
that QEMU is not the only caller.

For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
on behalf of the guest without QEMU intervention.


> Has a platform-agnostic PV-IOMMU been considered to support 2-stage
> translation (i.e. VFIO in the guest)? Perhaps that would hurt map/unmap
> performance too much?
 
That's an interesting idea. I don't know if that's feasible, but if it
is not, then we need to be able to specify the PV-IOMMU type in the
hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM.


> > 
> > 
> > Content:
> > ===============================================================================
> > 1. Motivation of vIOMMU
> > 	1.1 Enable more than 255 vcpus
> > 	1.2 Support VFIO-based user space driver
> > 	1.3 Support guest Shared Virtual Memory (SVM)
> > 2. Xen vIOMMU Architecture
> > 	2.1 2nd level translation overview
> > 	2.2 Interrupt remapping overview
> > 3. Xen hypervisor
> > 	3.1 New vIOMMU hypercall interface
> > 	3.2 2nd level translation
> > 	3.3 Interrupt remapping
> > 	3.4 1st level translation
> > 	3.5 Implementation consideration
> > 4. Qemu
> > 	4.1 Qemu vIOMMU framework
> > 	4.2 Dummy xen-vIOMMU driver
> > 	4.3 Q35 vs. i440x
> > 	4.4 Report vIOMMU to hvmloader
> > 
> > 
> > 1 Motivation for Xen vIOMMU
> > ===============================================================================
> > 1.1 Enable more than 255 vcpu support
> > HPC virtualization requires support for more than 255 vcpus in a
> > single VM to meet parallel computing requirements. Supporting more
> > than 255 vcpus requires an interrupt remapping capability on the
> > vIOMMU to deliver interrupts to vcpus with APIC ID > 255; otherwise
> > a Linux guest fails to boot with more than 255 vcpus.
> > 
> > 
> > 1.2 Support VFIO-based user space drivers (e.g. DPDK) in the guest
> > This relies on the 2nd level translation capability (IOVA->GPA) on
> > the vIOMMU. The pIOMMU's 2nd level becomes a shadowing structure of
> > the vIOMMU to isolate DMA requests initiated by the user space
> > driver.
> > 
> > 
> > 1.3 Support guest SVM (Shared Virtual Memory)
> > This relies on the 1st level translation table capability (GVA->GPA)
> > on the vIOMMU. The pIOMMU needs to enable both 1st and 2nd level
> > translation in nested mode (GVA->GPA->HPA) for the passthrough
> > device. IGD passthrough is the main usage today (to support the
> > OpenCL 2.0 SVM feature). In the future SVM might be used by other
> > I/O devices too.
> > 
> > 2. Xen vIOMMU Architecture
> > ================================================================================
> > 
> > * vIOMMU will be inside the Xen hypervisor for the following reasons:
> > 	1) Avoid round trips between Qemu and the Xen hypervisor
> > 	2) Ease of integration with the rest of the hypervisor
> > 	3) HVMlite/PVH doesn't use Qemu
> > * A dummy xen-vIOMMU in Qemu acts as a wrapper around the new
> > hypercall to create/destroy the vIOMMU in the hypervisor and deal
> > with the virtual PCI device's 2nd level translation.
> > 
> > 2.1 2nd level translation overview
> > For a virtual PCI device, the dummy xen-vIOMMU does the translation
> > in Qemu via the new hypercall.
> > 
> > For a physical PCI device, the vIOMMU in the hypervisor shadows the
> > IO page table from IOVA->GPA to IOVA->HPA and loads the page table
> > into the physical IOMMU.
> > 
> > The following diagram shows the 2nd level translation architecture.
> > +---------------------------------------------------------+
> > |Qemu                                +----------------+   |
> > |                                    |     Virtual    |   |
> > |                                    |   PCI device   |   |
> > |                                    |                |   |
> > |                                    +----------------+   |
> > |                                            |DMA         |
> > |                                            V            |
> > |  +--------------------+   Request  +----------------+   |
> > |  |                    +<-----------+                |   |
> > |  |  Dummy xen vIOMMU  | Target GPA |  Memory region |   |
> > |  |                    +----------->+                |   |
> > |  +---------+----------+            +-------+--------+   |
> > |            |                               |            |
> > |            |Hypercall                      |            |
> > +--------------------------------------------+------------+
> > |Hypervisor  |                               |            |
> > |            |                               |            |
> > |            v                               |            |
> > |     +------+------+                        |            |
> > |     |   vIOMMU    |                        |            |
> > |     +------+------+                        |            |
> > |            |                               |            |
> > |            v                               |            |
> > |     +------+------+                        |            |
> > |     | IOMMU driver|                        |            |
> > |     +------+------+                        |            |
> > |            |                               |            |
> > +--------------------------------------------+------------+
> > |HW          v                               V            |
> > |     +------+------+                 +-------------+     |
> > |     |   IOMMU     +---------------->+  Memory     |     |
> > |     +------+------+                 +-------------+     |
> > |            ^                                            |
> > |            |                                            |
> > |     +------+------+                                     |
> > |     | PCI Device  |                                     |
> > |     +-------------+                                     |
> > +---------------------------------------------------------+
> > 
> > 2.2 Interrupt remapping overview
> > Interrupts from virtual and physical devices are delivered to the
> > vLAPIC from the vIOAPIC and vMSI. The vIOMMU remaps interrupts during
> > this procedure.
> > 
> > +---------------------------------------------------+
> > |Qemu                       |VM                     |
> > |                           | +----------------+    |
> > |                           | |  Device driver |    |
> > |                           | +--------+-------+    |
> > |                           |          ^            |
> > |       +----------------+  | +--------+-------+    |
> > |       | Virtual device |  | |  IRQ subsystem |    |
> > |       +-------+--------+  | +--------+-------+    |
> > |               |           |          ^            |
> > |               |           |          |            |
> > +---------------------------+-----------------------+
> > |hypervisor     |                      | VIRQ       |
> > |               |            +---------+--------+   |
> > |               |            |      vLAPIC      |   |
> > |               |            +---------+--------+   |
> > |               |                      ^            |
> > |               |                      |            |
> > |               |            +---------+--------+   |
> > |               |            |      vIOMMU      |   |
> > |               |            +---------+--------+   |
> > |               |                      ^            |
> > |               |                      |            |
> > |               |            +---------+--------+   |
> > |               |            |   vIOAPIC/vMSI   |   |
> > |               |            +----+----+--------+   |
> > |               |                 ^    ^            |
> > |               +-----------------+    |            |
> > |                                      |            |
> > +---------------------------------------------------+
> > HW                                     |IRQ
> >                               +-------------------+
> >                               |   PCI Device      |
> >                               +-------------------+
> > 
> > 
> > 
> > 
> > 
> > 3 Xen hypervisor
> > ==========================================================================
> > 
> > 3.1 New hypercall XEN_SYSCTL_viommu_op
> > 1) Definition of "struct xen_sysctl_viommu_op" as new hypercall parameter.
> > 
> > typedef enum {
> > 	IOMMU_NONE = 0,
> > 	IOMMU_RO   = 1,
> > 	IOMMU_WO   = 2,
> > 	IOMMU_RW   = 3,
> > } IOMMUAccessFlags;
> > 
> > struct xen_sysctl_viommu_op {
> > 	u32 cmd;
> > 	u32 domid;
> > 	union {
> > 		struct {
> > 			u32 capabilities;
> > 		} query_capabilities;
> > 		struct {
> > 			u32 capabilities;
> > 			u64 base_address;
> > 		} create_iommu;
> > 		struct {
> > 			u8  bus;
> > 			u8  devfn;
> > 			u64 iova;
> > 			u64 translated_addr;
> > 			u64 addr_mask; /* Translation page size */
> > 			IOMMUAccessFlags permission;
> > 		} l2_translation;
> > 	} u;
> > };
> > 
> > 
> > Definition of VIOMMU subops:
> > #define XEN_SYSCTL_viommu_query_capability		0
> > #define XEN_SYSCTL_viommu_create			1
> > #define XEN_SYSCTL_viommu_destroy			2
> > #define XEN_SYSCTL_viommu_dma_translation_for_vpdev 	3
> > 
> > Definition of VIOMMU capabilities
> > #define XEN_VIOMMU_CAPABILITY_1st_level_translation	(1 << 0)
> > #define XEN_VIOMMU_CAPABILITY_2nd_level_translation	(1 << 1)
> > #define XEN_VIOMMU_CAPABILITY_interrupt_remapping	(1 << 2)
> > 
> > 
> > 2) Design for subops
> > - XEN_SYSCTL_viommu_query_capability
> >       Get vIOMMU capabilities (1st/2nd level translation and interrupt
> > remapping).
> > 
> > - XEN_SYSCTL_viommu_create
> >      Create a vIOMMU in the Xen hypervisor with dom_id, capabilities and
> > register base address as parameters.
> > 
> > - XEN_SYSCTL_viommu_destroy
> >      Destroy the vIOMMU in the Xen hypervisor with dom_id as parameter.
> > 
> > - XEN_SYSCTL_viommu_dma_translation_for_vpdev
> >      Translate an IOVA to a GPA for the specified virtual PCI device,
> > taking dom_id, the PCI device's BDF and the IOVA; the hypervisor
> > returns the translated GPA, address mask and access permission.
> > 
> > 
> > 3.2 2nd level translation
> > 1) For virtual PCI devices
> > The dummy xen-vIOMMU in Qemu translates IOVA to the target GPA via the
> > new hypercall when a DMA operation happens.
> > 
> > 2) For physical PCI devices
> > DMA operations go through the physical IOMMU directly, so an IO page
> > table for IOVA->HPA must be loaded into the physical IOMMU. When the
> > guest updates the Second-level Page-table Pointer field, it provides an
> > IO page table for IOVA->GPA. The vIOMMU needs to shadow the 2nd level
> > translation table, translate GPA->HPA, and write the shadow page table
> > (IOVA->HPA) pointer into the Second-level Page-table Pointer of the
> > physical IOMMU's context entry.
> > 
> > Currently all PCI devices in the same hvm domain share one IO page
> > table (GPA->HPA) in the physical IOMMU driver of Xen. To support the
> > vIOMMU's 2nd level translation, the IOMMU driver needs to support
> > multiple address spaces per device entry: use the existing IO page
> > table (GPA->HPA) by default, and switch to the shadow IO page table
> > (IOVA->HPA) when 2nd level translation is enabled. These changes will
> > not affect the current P2M logic.
> > 
> > 3.3 Interrupt remapping
> > Interrupts from virtual and physical devices are delivered to the
> > vLAPIC from the vIOAPIC and vMSI. Interrupt remapping hooks need to be
> > added in vmsi_deliver() and ioapic_deliver() to find the target vLAPIC
> > according to the interrupt remapping table. The diagram in section 2.2
> > shows the logic.
> > 
> > 
> > 3.4 1st level translation
> > When nested translation is enabled, any address generated by first-level
> > translation is used as the input address for nesting with second-level
> > translation. Physical IOMMU needs to enable both 1st level and 2nd level
> > translation in nested translation mode(GVA->GPA->HPA) for passthrough
> > device.
> > 
> > The VT-d context entry points to the guest's 1st level translation
> > table, which will be nest-translated through the 2nd level translation
> > table, so it can be linked directly into the context entry of the
> > physical IOMMU.
> > 
> > To enable 1st level translation in a VM:
> > 1) The Xen IOMMU driver enables nested translation mode.
> > 2) The GPA root of the guest's 1st level translation table is written
> > into the context entry of the physical IOMMU.
> > 
> > All handling is in the hypervisor; no interaction with Qemu is needed.
> > 
> > 
> > 3.5 Implementation consideration
> > The Linux Intel IOMMU driver will fail to load without 2nd level
> > translation support, even if interrupt remapping and 1st level
> > translation are available. This means 2nd level translation must be
> > enabled before the other functions.
> > 
> > 
> > 4 Qemu
> > ==============================================================================
> > 4.1 Qemu vIOMMU framework
> > Qemu has a framework to create a virtual IOMMU (e.g. virtual Intel VT-d
> > or AMD IOMMU) and report it in the guest ACPI tables. On the Xen side,
> > a dummy xen-vIOMMU wrapper is required to connect with the actual
> > vIOMMU in Xen, especially for the 2nd level translation of virtual PCI
> > devices, because virtual PCI devices are emulated in Qemu. Qemu's
> > vIOMMU framework provides a callback to handle 2nd level translation
> > when DMA operations of virtual PCI devices happen.
> > 
> > 
> > 4.2 Dummy xen-vIOMMU driver
> > 1) Query vIOMMU capabilities (e.g. DMA translation, interrupt remapping
> > and Shared Virtual Memory) via hypercall.
> > 
> > 2) Create the vIOMMU in the Xen hypervisor via the new hypercall with
> > the DRHD register base address and desired capabilities as parameters.
> > Destroy the vIOMMU when the VM is shut down.
> > 
> > 3) Virtual PCI devices' 2nd level translation
> > Qemu already provides a DMA translation hook, called when DMA
> > translation of a virtual PCI device happens. The dummy xen-vIOMMU
> > passes the device BDF and IOVA into the Xen hypervisor via the new
> > iommu hypercall and gets back the translated GPA.
> > 
> > 
> > 4.3 Q35 vs i440x
> > VT-d was introduced with the Q35 chipset. The previous concern was that
> > IOMMU drivers assume VT-d only exists on Q35 and newer chipsets, which
> > would force us to enable Q35 first.
> > 
> > We consulted Linux/Windows IOMMU driver experts and learned that these
> > drivers make no such assumption. So we may skip the Q35 implementation
> > and emulate the vIOMMU on the i440x chipset. KVM already has vIOMMU
> > support with virtual PCI device DMA translation and interrupt
> > remapping. We are using KVM to experiment with adding a vIOMMU on the
> > i440x and testing Linux/Windows guests. We will report back when we
> > have results.
> > 
> > 
> > 4.4 Report vIOMMU to hvmloader
> > Hvmloader is in charge of building the ACPI tables for the guest OS,
> > and the OS probes the IOMMU via the ACPI DMAR table. So hvmloader needs
> > to know whether the vIOMMU is enabled and its capabilities in order to
> > prepare the ACPI DMAR table for the guest OS.
> > 
> > There are three ways to do that:
> > 1) Extend struct hvm_info_table, adding variables to pass vIOMMU
> > information to hvmloader. But this requires adding a new xc interface
> > to use struct hvm_info_table in Qemu.
> > 
> > 2) Pass vIOMMU information to hvmloader via Xenstore.
> > 
> > 3) Build the ACPI DMAR table in Qemu and pass it to hvmloader via
> > Xenstore. This solution is already present in the vNVDIMM design (4.3.1
> > Building Guest ACPI Tables,
> > http://www.gossamer-threads.com/lists/xen/devel/439766).
> > 
> > The third option seems cleanest: hvmloader doesn't need to deal with
> > vIOMMU details and just passes the DMAR table through to the guest OS.
> > All vIOMMU-specific work is done in the dummy xen-vIOMMU driver.
> > 
> > 
> > 
> > 
> > _______________________________________________
> > Xen-devel mailing list
> > Xen-devel@lists.xen.org
> > https://lists.xen.org/xen-devel
> 


^ permalink raw reply	[flat|nested] 86+ messages in thread

* Re: Xen virtual IOMMU high level design doc
  2016-11-23 19:09                                 ` Stefano Stabellini
@ 2016-11-24  2:00                                   ` Tian, Kevin
  2016-11-24  4:09                                     ` Edgar E. Iglesias
  0 siblings, 1 reply; 86+ messages in thread
From: Tian, Kevin @ 2016-11-24  2:00 UTC (permalink / raw)
  To: Stefano Stabellini, Edgar E. Iglesias
  Cc: Lan, Tianyu, yang.zhang.wz, xuquan8, xen-devel, Jan Beulich,
	Andrew Cooper, ian.jackson, Julien Grall, Nakajima, Jun,
	anthony.perard, Roger Pau Monne

> From: Stefano Stabellini [mailto:sstabellini@kernel.org]
> Sent: Thursday, November 24, 2016 3:09 AM
> 
> On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > > Hi All:
> > >      The following is our Xen vIOMMU high level design for detail
> > > discussion. Please have a look. Very appreciate for your comments.
> > > This design doesn't cover changes when root port is moved to hypervisor.
> > > We may design it later.
> >
> > Hi,
> >
> > I have a few questions.
> >
> > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > So guests will essentially create intel iommu style page-tables.
> >
> > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > Do I understand this correctly?
> 
> I think they could be called from the toolstack. This is why I was
> saying in the other thread that the hypercalls should be general enough
> that QEMU is not the only caller.
> 
> For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> on behalf of the guest without QEMU intervention.
> 
> 
> > Has a platform agnostic PV-IOMMU been considered to support 2-stage
> > translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap
> > performance too much?
> 
> That's an interesting idea. I don't know if that's feasible, but if it
> is not, then we need to be able to specify the PV-IOMMU type in the
> hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM.
> 
> 

Not considered yet. PV is always possible as we've done for other I/O
devices. Ideally it could be designed being more efficient than full
emulation of vendor specific IOMMU, but also means requirement of
maintaining a new guest IOMMU driver and limitation of supporting
only newer version guest OSes. It's a tradeoff... at least not compelling 
now (may consider when we see a real need in the future).

Thanks
Kevin



* Re: Xen virtual IOMMU high level design doc V3
  2016-11-22 10:24                               ` Jan Beulich
@ 2016-11-24  2:34                                 ` Lan Tianyu
  0 siblings, 0 replies; 86+ messages in thread
From: Lan Tianyu @ 2016-11-24  2:34 UTC (permalink / raw)
  To: Jan Beulich
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Andrew Cooper,
	ian.jackson, Kevin Tian, xen-devel, Jun Nakajima, anthony.perard,
	Roger Pau Monne

On 2016年11月22日 18:24, Jan Beulich wrote:
>>>> On 17.11.16 at 16:36, <tianyu.lan@intel.com> wrote:
>> 2) Build ACPI DMAR table in toolstack
>> Now the tool stack can build the ACPI DMAR table according to the VM
>> configuration and pass it through to hvmloader via the xenstore ACPI PT
>> channel. But the vIOMMU MMIO region is managed by Qemu, and it needs to
>> be populated into the DMAR table. We may hardcode an address in both
>> Qemu and the toolstack and use the same address to create the vIOMMU
>> and build the DMAR table.
> 
> Let's try to avoid any new hard coding of values. Both tool stack
> and qemu ought to be able to retrieve a suitable address range
> from the hypervisor. Or if the tool stack was to allocate it, it could
> tell qemu.
> 
> Jan
> 

Hi Jan:
The address range is allocated by Qemu or the toolstack and passed to
the hypervisor when the vIOMMU is created. The vIOMMU's address range
should be inside the PCI address space, so we need to reserve a piece of
the PCI region for the vIOMMU in the toolstack. Then we populate the
base address in the vDMAR table and tell Qemu about the region via a new
xenstore interface, if we want to create the vIOMMU in the Qemu dummy
hypercall wrapper.

Another point: I am not sure whether we can create/destroy the vIOMMU
directly from the toolstack, because virtual device models are usually
handled by Qemu. If we can, we don't need a new Xenstore interface. In
that case, the dummy vIOMMU in Qemu will just cover L2 translation for
virtual devices.

-- 
Best regards
Tianyu Lan



* Re: Xen virtual IOMMU high level design doc
  2016-11-24  2:00                                   ` Tian, Kevin
@ 2016-11-24  4:09                                     ` Edgar E. Iglesias
  2016-11-24  6:49                                       ` Lan Tianyu
  0 siblings, 1 reply; 86+ messages in thread
From: Edgar E. Iglesias @ 2016-11-24  4:09 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Lan, Tianyu, yang.zhang.wz, xuquan8, Stefano Stabellini,
	Jan Beulich, Andrew Cooper, ian.jackson, Julien Grall, xen-devel,
	Nakajima, Jun, anthony.perard, Roger Pau Monne

On Thu, Nov 24, 2016 at 02:00:21AM +0000, Tian, Kevin wrote:
> > From: Stefano Stabellini [mailto:sstabellini@kernel.org]
> > Sent: Thursday, November 24, 2016 3:09 AM
> > 
> > On Wed, 23 Nov 2016, Edgar E. Iglesias wrote:
> > > On Wed, Aug 17, 2016 at 08:05:51PM +0800, Lan, Tianyu wrote:
> > > > Hi All:
> > > >      The following is our Xen vIOMMU high level design for detail
> > > > discussion. Please have a look. Very appreciate for your comments.
> > > > This design doesn't cover changes when root port is moved to hypervisor.
> > > > We may design it later.
> > >
> > > Hi,
> > >
> > > I have a few questions.
> > >
> > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> > > So guests will essentially create intel iommu style page-tables.
> > >
> > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> > > Do I understand this correctly?
> > 
> > I think they could be called from the toolstack. This is why I was
> > saying in the other thread that the hypercalls should be general enough
> > that QEMU is not the only caller.
> > 
> > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> > on behalf of the guest without QEMU intervention.

OK, I see. Or, I think I understand, not sure :-)

In QEMU, when someone changes mappings in an IOMMU there will be a
notifier to tell caches upstream that mappings have changed. I think we
will need to prepare for that, i.e. when TCG CPUs sit behind an IOMMU.

Another area that may need change is that on ARM we need the map-query
to return the memory attributes for the given mapping. Today QEMU or any
emulator doesn't use this much, but in the future things may change.

For SVM, we will also need to deal with page-table faults by the IOMMU.
So I think there will need to be a channel from Xen to the guest to
report these.

For example, what happens when a guest-assigned DMA unit page-faults?
Xen needs to know how to forward this fault back to the guest for fixup,
and the guest needs to be able to fix it and tell the device that it's
OK to continue, e.g. PCI PRI or similar.


> > > Has a platform agnostic PV-IOMMU been considered to support 2-stage
> > > translation (i.e VFIO in the guest)? Perhaps that would hurt map/unmap
> > > performance too much?
> > 
> > That's an interesting idea. I don't know if that's feasible, but if it
> > is not, then we need to be able to specify the PV-IOMMU type in the
> > hypercalls, so that you would get Intel IOMMU on x86 and SMMU on ARM.
> > 
> > 
> 
> Not considered yet. PV is always possible as we've done for other I/O
> devices. Ideally it could be designed being more efficient than full
> emulation of vendor specific IOMMU, but also means requirement of
> maintaining a new guest IOMMU driver and limitation of supporting
> only newer version guest OSes. It's a tradeoff... at least not compelling 
> now (may consider when we see a real need in the future).

Agreed. Thanks.

Best regards,
Edgar



* Re: Xen virtual IOMMU high level design doc
  2016-11-24  4:09                                     ` Edgar E. Iglesias
@ 2016-11-24  6:49                                       ` Lan Tianyu
  2016-11-24 13:37                                         ` Edgar E. Iglesias
  0 siblings, 1 reply; 86+ messages in thread
From: Lan Tianyu @ 2016-11-24  6:49 UTC (permalink / raw)
  To: Edgar E. Iglesias, Tian, Kevin
  Cc: yang.zhang.wz, xuquan8, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, Julien Grall, xen-devel, Nakajima,
	Jun, anthony.perard, Roger Pau Monne

On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>>>> Hi,
>>>> > > >
>>>> > > > I have a few questions.
>>>> > > >
>>>> > > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
>>>> > > > So guests will essentially create intel iommu style page-tables.
>>>> > > >
>>>> > > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
>>>> > > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
>>>> > > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
>>>> > > > Do I understand this correctly?
>>> > > 
>>> > > I think they could be called from the toolstack. This is why I was
>>> > > saying in the other thread that the hypercalls should be general enough
>>> > > that QEMU is not the only caller.
>>> > > 
>>> > > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
>>> > > on behalf of the guest without QEMU intervention.
> OK, I see. Or, I think I understand, not sure :-)
> 
> In QEMU when someone changes mappings in an IOMMU there will be a notifier
> to tell caches upstream that mappings have changed. I think we will need to
> prepare for that. I.e when TCG CPUs sit behind an IOMMU.

On the Xen side, the vIOMMU may notify the pIOMMU driver about mapping
changes by calling the pIOMMU driver's API.

> 
> Another area that may need change is that on ARM we need the map-query to return
> the memory attributes for the given mapping. Today QEMU or any emulator 
> doesn't use it much but in the future things may change.
> 
> For SVM, whe will also need to deal with page-table faults by the IOMMU.
> So I think there will need to be a channel from Xen to Guesrt to report these.

Yes, the vIOMMU should forward the page-fault event to the guest. On the
VT-d side, we will trigger VT-d's fault interrupt to notify the guest of
the event.

> 
> For example, what happens when a guest assigned DMA unit page-faults?
> Xen needs to know how to forward this fault back to guest for fixup and the
> guest needs to be able to fix it and tell the device that it's OK to contine.
> E.g PCI PRI or similar.
> 
> 


-- 
Best regards
Tianyu Lan



* Re: Xen virtual IOMMU high level design doc
  2016-11-24  6:49                                       ` Lan Tianyu
@ 2016-11-24 13:37                                         ` Edgar E. Iglesias
  2016-11-25  2:01                                           ` Xuquan (Quan Xu)
  2016-11-25  5:53                                           ` Lan, Tianyu
  0 siblings, 2 replies; 86+ messages in thread
From: Edgar E. Iglesias @ 2016-11-24 13:37 UTC (permalink / raw)
  To: Lan Tianyu
  Cc: yang.zhang.wz, Tian, Kevin, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, xuquan8, Julien Grall, xen-devel,
	Nakajima, Jun, anthony.perard, Roger Pau Monne

On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
> >>>> Hi,
> >>>> > > >
> >>>> > > > I have a few questions.
> >>>> > > >
> >>>> > > > If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
> >>>> > > > So guests will essentially create intel iommu style page-tables.
> >>>> > > >
> >>>> > > > If we were to use this on Xen/ARM, we would likely be modelling an ARM
> >>>> > > > SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
> >>>> > > > hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
> >>>> > > > Do I understand this correctly?
> >>> > > 
> >>> > > I think they could be called from the toolstack. This is why I was
> >>> > > saying in the other thread that the hypercalls should be general enough
> >>> > > that QEMU is not the only caller.
> >>> > > 
> >>> > > For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
> >>> > > on behalf of the guest without QEMU intervention.
> > OK, I see. Or, I think I understand, not sure :-)
> > 
> > In QEMU when someone changes mappings in an IOMMU there will be a notifier
> > to tell caches upstream that mappings have changed. I think we will need to
> > prepare for that. I.e when TCG CPUs sit behind an IOMMU.
> 
> For Xen side, we may notify pIOMMU driver about mapping change via
> calling pIOMMU driver's API in vIOMMU.

I was referring to the other way around: when a guest modifies the
mappings for a vIOMMU, the driver domain with QEMU and vDevices needs to
be notified.

I couldn't find any mention of this in the document...


> 
> > 
> > Another area that may need change is that on ARM we need the map-query to return
> > the memory attributes for the given mapping. Today QEMU or any emulator 
> > doesn't use it much but in the future things may change.

What about the mem attributes?
It's very likely we'll add support for memory attributes for IOMMUs in
QEMU at some point.
Emulated IOMMUs will thus have the ability to modify attributes (e.g.
SourceIDs, cacheability, etc). Perhaps we could allocate or reserve a
uint64_t for attributes, TBD later, in the query struct.



> > 
> > For SVM, whe will also need to deal with page-table faults by the IOMMU.
> > So I think there will need to be a channel from Xen to Guesrt to report these.
> 
> Yes, vIOMMU should forward the page-fault event to guest. For VTD side,
> we will trigger VTD's interrupt to notify guest about the event.

OK, Cool.

Perhaps you should document how this (and the map/unmap notifiers) will work?

I also think it would be a good idea to add a little more introduction so that
some of the questions we've been asking regarding the general design are easier
to grasp.

Best regards,
Edgar



* Re: Xen virtual IOMMU high level design doc
  2016-11-24 13:37                                         ` Edgar E. Iglesias
@ 2016-11-25  2:01                                           ` Xuquan (Quan Xu)
  2016-11-25  5:53                                           ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Xuquan (Quan Xu) @ 2016-11-25  2:01 UTC (permalink / raw)
  To: Edgar E. Iglesias, Lan Tianyu
  Cc: yang.zhang.wz, Tian, Kevin, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, Julien Grall, xen-devel, Nakajima,
	Jun, anthony.perard, Roger Pau Monne

On November 24, 2016 9:38 PM, <edgar.iglesias@gmail.com>
>On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
>> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>> >>>> Hi,
>> >>>> > > >
>> >>>> > > > I have a few questions.
>> >>>> > > >
>> >>>> > > > If I understand correctly, you'll be emulating an Intel IOMMU in
>Xen.
>> >>>> > > > So guests will essentially create intel iommu style page-tables.
>> >>>> > > >
>> >>>> > > > If we were to use this on Xen/ARM, we would likely be
>> >>>> > > > modelling an ARM SMMU as a vIOMMU. Since Xen on ARM
>does
>> >>>> > > > not use QEMU for emulation, the hypervisor OPs for QEMUs
>xen dummy IOMMU queries would not really be used.
>> >>>> > > > Do I understand this correctly?
>> >>> > >
>> >>> > > I think they could be called from the toolstack. This is why I
>> >>> > > was saying in the other thread that the hypercalls should be
>> >>> > > general enough that QEMU is not the only caller.
>> >>> > >
>> >>> > > For PVH and ARM guests, the toolstack should be able to setup
>> >>> > > the vIOMMU on behalf of the guest without QEMU intervention.
>> > OK, I see. Or, I think I understand, not sure :-)
>> >
>> > In QEMU when someone changes mappings in an IOMMU there will be
>a
>> > notifier to tell caches upstream that mappings have changed. I think
>> > we will need to prepare for that. I.e when TCG CPUs sit behind an
>IOMMU.
>>
>> For Xen side, we may notify pIOMMU driver about mapping change via
>> calling pIOMMU driver's API in vIOMMU.
>
>I was refering to the other way around. When a guest modifies the
>mappings for a vIOMMU, the driver domain with QEMU and vDevices needs
>to be notified.
>
>I couldn't find any mention of this in the document...
>
>

Edgar,
As mentioned, it supports VFIO-based user space drivers (e.g. DPDK) in
the guest. I am afraid all of the guest's memory is pinned... Lan, right?

Quan




* Re: Xen virtual IOMMU high level design doc
  2016-11-24 13:37                                         ` Edgar E. Iglesias
  2016-11-25  2:01                                           ` Xuquan (Quan Xu)
@ 2016-11-25  5:53                                           ` Lan, Tianyu
  1 sibling, 0 replies; 86+ messages in thread
From: Lan, Tianyu @ 2016-11-25  5:53 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: yang.zhang.wz, Tian, Kevin, Stefano Stabellini, Jan Beulich,
	Andrew Cooper, ian.jackson, xuquan8, Julien Grall, xen-devel,
	Nakajima, Jun, anthony.perard, Roger Pau Monne



On 11/24/2016 9:37 PM, Edgar E. Iglesias wrote:
> On Thu, Nov 24, 2016 at 02:49:41PM +0800, Lan Tianyu wrote:
>> On 2016年11月24日 12:09, Edgar E. Iglesias wrote:
>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a few questions.
>>>>>>>>>
>>>>>>>>> If I understand correctly, you'll be emulating an Intel IOMMU in Xen.
>>>>>>>>> So guests will essentially create intel iommu style page-tables.
>>>>>>>>>
>>>>>>>>> If we were to use this on Xen/ARM, we would likely be modelling an ARM
>>>>>>>>> SMMU as a vIOMMU. Since Xen on ARM does not use QEMU for emulation, the
>>>>>>>>> hypervisor OPs for QEMUs xen dummy IOMMU queries would not really be used.
>>>>>>>>> Do I understand this correctly?
>>>>>>>
>>>>>>> I think they could be called from the toolstack. This is why I was
>>>>>>> saying in the other thread that the hypercalls should be general enough
>>>>>>> that QEMU is not the only caller.
>>>>>>>
>>>>>>> For PVH and ARM guests, the toolstack should be able to setup the vIOMMU
>>>>>>> on behalf of the guest without QEMU intervention.
>>> OK, I see. Or, I think I understand, not sure :-)
>>>
>>> In QEMU when someone changes mappings in an IOMMU there will be a notifier
>>> to tell caches upstream that mappings have changed. I think we will need to
>>> prepare for that. I.e when TCG CPUs sit behind an IOMMU.
>>
>> For Xen side, we may notify pIOMMU driver about mapping change via
>> calling pIOMMU driver's API in vIOMMU.
>
> I was refering to the other way around. When a guest modifies the mappings
> for a vIOMMU, the driver domain with QEMU and vDevices needs to be notified.
>
> I couldn't find any mention of this in the document...

Qemu side won't have iotlb cache and all DMA translation info are in the 
hypervisor. All vDevice's DMA requests are passed to hypervisor, 
hypervisor returns back translated address and then Qemu finish the DMA 
operation finally.

There is a race condition between an IOTLB invalidation operation and a 
vDevice's in-flight DMA. We proposed a solution in "3.2 l2 translation - 
For virtual PCI device". We hope to take advantage of the current ioreq 
mechanism to achieve something like a notifier.

Both the vIOMMU in the hypervisor and the dummy vIOMMU in Qemu register 
the same MMIO region. When there is an invalidation MMIO access and the 
hypervisor wants to notify Qemu, the vIOMMU's MMIO handler returns 
X86EMUL_UNHANDLEABLE, and the io emulation handler then sends an IO 
request to Qemu. The dummy vIOMMU in Qemu receives the event and starts 
draining the in-flight DMA operations.


>
>
>>
>>>
>>> Another area that may need change is that on ARM we need the map-query to return
>>> the memory attributes for the given mapping. Today QEMU or any emulator
>>> doesn't use it much but in the future things may change.
>
> What about the mem attributes?
> It's very likely we'll add support for memory attributes for IOMMU's in QEMU
> at some point.
> Emulated IOMMU's will thus have the ability to modify attributes (i.e SourceID's,
> cacheability, etc). Perhaps we could allocate or reserve an uint64_t
> for attributes TBD later in the query struct.

Sounds like you hope to extend the capability variable in the query struct 
to uint64_t to support more future features, right?

I have added a "permission" variable in struct l2_translation to return 
the vIOMMU's memory access permission for a vDevice's DMA request. Not 
sure whether it can meet your requirement.

>
>
>
>>>
>>> For SVM, we will also need to deal with page-table faults by the IOMMU.
>>> So I think there will need to be a channel from Xen to Guest to report these.
>>
>> Yes, vIOMMU should forward the page-fault event to guest. For VTD side,
>> we will trigger VTD's interrupt to notify guest about the event.
>
> OK, Cool.
>
> Perhaps you should document how this (and the map/unmap notifiers) will work?

This is VT-d specific handling of fault events, and the interrupt is 
emulated just as other virtual device models emulate theirs. So I didn't 
put this in the design document.

For mapping changes, please see the first comment.

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
https://lists.xen.org/xen-devel



Thread overview: 86+ messages
2016-05-26  8:29 Discussion about virtual iommu support for Xen guest Lan Tianyu
2016-05-26  8:42 ` Dong, Eddie
2016-05-27  2:26   ` Lan Tianyu
2016-05-27  8:11     ` Tian, Kevin
2016-05-26 11:35 ` Andrew Cooper
2016-05-27  8:19   ` Lan Tianyu
2016-06-02 15:03     ` Lan, Tianyu
2016-06-02 18:58       ` Andrew Cooper
2016-06-03 11:01         ` Current PVH/HVMlite work and planning (was :Re: Discussion about virtual iommu support for Xen guest) Roger Pau Monne
2016-06-03 11:21           ` Tian, Kevin
2016-06-03 11:52             ` Roger Pau Monne
2016-06-03 12:11               ` Tian, Kevin
2016-06-03 16:56                 ` Stefano Stabellini
2016-06-07  5:48                   ` Tian, Kevin
2016-06-03 11:17         ` Discussion about virtual iommu support for Xen guest Tian, Kevin
2016-06-03 13:09           ` Lan, Tianyu
2016-06-03 14:00             ` Andrew Cooper
2016-06-03 13:51           ` Andrew Cooper
2016-06-03 14:31             ` Jan Beulich
2016-06-03 17:14             ` Stefano Stabellini
2016-06-07  5:14               ` Tian, Kevin
2016-06-07  7:26                 ` Jan Beulich
2016-06-07 10:07                 ` Stefano Stabellini
2016-06-08  8:11                   ` Tian, Kevin
2016-06-26 13:42                     ` Lan, Tianyu
2016-06-29  3:04                       ` Tian, Kevin
2016-07-05 13:37                         ` Lan, Tianyu
2016-07-05 13:57                           ` Jan Beulich
2016-07-05 14:19                             ` Lan, Tianyu
2016-08-17 12:05                             ` Xen virtual IOMMU high level design doc Lan, Tianyu
2016-08-17 12:42                               ` Paul Durrant
2016-08-18  2:57                                 ` Lan, Tianyu
2016-08-25 11:11                               ` Jan Beulich
2016-08-31  8:39                                 ` Lan Tianyu
2016-08-31 12:02                                   ` Jan Beulich
2016-09-01  1:26                                     ` Tian, Kevin
2016-09-01  2:35                                     ` Lan Tianyu
2016-09-15 14:22                               ` Lan, Tianyu
2016-10-05 18:36                                 ` Konrad Rzeszutek Wilk
2016-10-11  1:52                                   ` Lan Tianyu
2016-11-23 18:19                               ` Edgar E. Iglesias
2016-11-23 19:09                                 ` Stefano Stabellini
2016-11-24  2:00                                   ` Tian, Kevin
2016-11-24  4:09                                     ` Edgar E. Iglesias
2016-11-24  6:49                                       ` Lan Tianyu
2016-11-24 13:37                                         ` Edgar E. Iglesias
2016-11-25  2:01                                           ` Xuquan (Quan Xu)
2016-11-25  5:53                                           ` Lan, Tianyu
2016-10-18 14:14                             ` Xen virtual IOMMU high level design doc V2 Lan Tianyu
2016-10-18 19:17                               ` Andrew Cooper
2016-10-20  9:53                                 ` Tian, Kevin
2016-10-20 18:10                                   ` Andrew Cooper
2016-10-20 14:17                                 ` Lan Tianyu
2016-10-20 20:36                                   ` Andrew Cooper
2016-10-22  7:32                                     ` Lan, Tianyu
2016-10-26  9:39                                       ` Jan Beulich
2016-10-26 15:03                                         ` Lan, Tianyu
2016-11-03 15:41                                         ` Lan, Tianyu
2016-10-28 15:36                                     ` Lan Tianyu
2016-10-18 20:26                               ` Konrad Rzeszutek Wilk
2016-10-20 10:11                                 ` Tian, Kevin
2016-10-20 14:56                                 ` Lan, Tianyu
2016-10-26  9:36                               ` Jan Beulich
2016-10-26 14:53                                 ` Lan, Tianyu
2016-11-17 15:36                             ` Xen virtual IOMMU high level design doc V3 Lan Tianyu
2016-11-18 19:43                               ` Julien Grall
2016-11-21  2:21                                 ` Lan, Tianyu
2016-11-21 13:17                                   ` Julien Grall
2016-11-21 18:24                                     ` Stefano Stabellini
2016-11-21  7:05                               ` Tian, Kevin
2016-11-23  1:36                                 ` Lan Tianyu
2016-11-21 13:41                               ` Andrew Cooper
2016-11-22  6:02                                 ` Tian, Kevin
2016-11-22  8:32                                 ` Lan Tianyu
2016-11-22 10:24                               ` Jan Beulich
2016-11-24  2:34                                 ` Lan Tianyu
2016-06-03 19:51             ` Is: 'basic pci bridge and root device support. 'Was:Re: Discussion about virtual iommu support for Xen guest Konrad Rzeszutek Wilk
2016-06-06  9:55               ` Jan Beulich
2016-06-06 17:25                 ` Konrad Rzeszutek Wilk
2016-08-02 15:15     ` Lan, Tianyu
2016-05-27  8:35   ` Tian, Kevin
2016-05-27  8:46     ` Paul Durrant
2016-05-27  9:39       ` Tian, Kevin
2016-05-31  9:43   ` George Dunlap
2016-05-27  2:26 ` Yang Zhang
2016-05-27  8:13   ` Tian, Kevin
