* (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
@ 2020-10-12  8:38 Tian, Kevin
  2020-10-13  6:22 ` Jason Wang
                   ` (3 more replies)
  0 siblings, 4 replies; 55+ messages in thread
From: Tian, Kevin @ 2020-10-12  8:38 UTC (permalink / raw)
  To: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jason Wang <jasowang@redhat.com>
> Sent: Monday, September 14, 2020 12:20 PM
>
[...]
 > If it's possible, I would suggest a generic uAPI instead of a VFIO
> specific one.
> 
> Jason suggest something like /dev/sva. There will be a lot of other
> subsystems that could benefit from this (e.g vDPA).
> 
> Have you ever considered this approach?
> 

Hi, Jason,

We did some study on this approach and below is the outcome. It's a
long write-up, but I didn't find a way to abstract it further without
losing necessary context. Sorry about that.

Overall the real purpose of this series is to enable the IOMMU nested
translation capability, with vSVA as one major usage, through the
new uAPIs below:
	1) Report/enable IOMMU nested translation capability;
	2) Allocate/free PASID;
	3) Bind/unbind guest page table;
	4) Invalidate IOMMU cache;
	5) Handle IOMMU page request/response (not in this series);
1/3/4) are the minimal set for using IOMMU nested translation, with
the other two optional. For example, the guest may enable vSVA on
a device without using PASID. Or, it may bind its gIOVA page table,
which doesn't require page fault support. Finally, all operations can
be applied to either a physical device or a subdevice.
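
To make the list above concrete, here is a rough sketch of how such
operations could surface to userspace. Everything below (the SVA_*
ioctl numbers and the struct layouts) is a hypothetical placeholder
for illustration only, not the actual uAPI proposed in this series:

/* Hypothetical illustration only -- not the uAPI from this series. */
#include <linux/ioctl.h>
#include <linux/types.h>

struct sva_pasid_alloc {		/* op 2: allocate/free PASID */
	__u32 flags;
	__u32 min;			/* requested PASID range */
	__u32 max;
	__u32 pasid;			/* out: allocated PASID */
};

struct sva_bind_pgtbl {			/* op 3: bind/unbind guest page table */
	__u32 pasid;
	__u64 gpgd;			/* guest page-table base (GPA) */
	__u64 addr_width;
};

struct sva_cache_inv {			/* op 4: invalidate IOMMU cache */
	__u32 pasid;
	__u64 addr;
	__u64 size;
};

#define SVA_PASID_ALLOC		_IOWR('S', 1, struct sva_pasid_alloc)
#define SVA_BIND_PGTBL		_IOW('S', 2, struct sva_bind_pgtbl)
#define SVA_CACHE_INV		_IOW('S', 3, struct sva_cache_inv)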

Then we evaluated, for each uAPI, whether generalizing it is a good
thing both in concept and in terms of complexity.

First, unlike the other uAPIs, which are all backed by iommu_ops, PASID
allocation/free goes through the IOASID sub-system. From this angle
we feel generalizing PASID management does make some sense.
First, a PASID is just a number and not related to any device before
it's bound to a page table and IOMMU domain. Second, PASID is a
global resource (at least on Intel VT-d), so having separate VFIO/
VDPA allocation interfaces may easily cause confusion in userspace,
e.g. which interface should be used if both VFIO and VDPA devices exist.
Moreover, a unified interface allows centralized control over how
many PASIDs are allowed per process.

One unclear part of this generalization is permission. Do we open
this interface to any process, or only to those which have assigned
devices? If the latter, what would be the mechanism to coordinate
between this new interface and the specific passthrough frameworks?
A trickier case: vSVA support on ARM (Eric/Jean please correct me)
plans to use a per-device PASID namespace, built on a bind_pasid_table
iommu callback, to allow the guest to fully manage its PASIDs on a
given passthrough device. I'm not sure how such a requirement can be
unified without involving the passthrough frameworks, or whether ARM
could also switch to the global PASID style...

Second, IOMMU nested translation is a per-IOMMU-domain capability.
Since IOMMU domains are managed by VFIO/VDPA (alloc/free domain,
attach/detach device, set/get domain attribute, etc.), reporting/
enabling the nesting capability is a natural extension to the domain
uAPI of the existing passthrough frameworks. Actually, VFIO already
includes a nesting enable interface even before this series. So it
doesn't make sense to generalize this uAPI out.
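
For reference, that existing enable path is selected as an IOMMU type
on the VFIO container. A minimal sketch (error handling omitted):

/* Sketch: probing and selecting VFIO's existing nesting-capable backend. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int enable_vfio_nesting(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);

	/* Is a nesting-capable type1 backend available at all? */
	if (ioctl(container, VFIO_CHECK_EXTENSION, VFIO_TYPE1_NESTING_IOMMU) <= 0)
		return -1;

	/*
	 * VFIO_SET_IOMMU only succeeds after at least one group has
	 * been attached to the container (not shown here).
	 */
	return ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
}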

Then the tricky part comes with the remaining operations (3/4/5),
which are all backed by iommu_ops and thus effective only within an
IOMMU domain. To generalize them, the first thing is to find a way
to associate the sva_FD (opened through a generic /dev/sva) with an
IOMMU domain that is created by VFIO/VDPA. The second thing is to
replicate the {domain<->device/subdevice} association in the /dev/sva
path because some operations (e.g. page fault) are triggered/handled
per device/subdevice. Therefore, /dev/sva must provide both per-domain
and per-device uAPIs similar to what VFIO/VDPA already do. Moreover,
mapping a page fault to a subdevice requires pre-registering subdevice
fault data with the IOMMU layer when binding the guest page table,
while such fault data can only be retrieved from the parent driver
through VFIO/VDPA.

However, we failed to find a good way even for the first step, the
domain association. IOMMU domains are not exposed to userspace, and
there is no 1:1 mapping between domain and device. In VFIO, all devices
within the same VFIO container share the address space, but they may
be organized in multiple IOMMU domains based on their bus type. How
userspace should learn the domain information and open an sva_FD for
each domain is the main problem here.

In the end we realized that doing such a generalization doesn't
really lead to a clear design; instead, it requires tight coordination
between /dev/sva and VFIO/VDPA for almost every new uAPI (especially
around synchronization when the domain/device association is changed
or when the device/subdevice is being reset/drained). It may also
become a usability burden for userspace to use the two interfaces
correctly on the assigned device.
 
Based on the above analysis, we feel that generalizing just PASID
management might be a good thing to look at, while the remaining
operations are better kept as VFIO/VDPA-specific uAPIs. In concept,
those are just a subset of the page table management capabilities
that an IOMMU domain affords. Since all other aspects of the IOMMU
domain are managed by VFIO/VDPA already, continuing this path for the
new nesting capability sounds natural. There is another option:
generalizing the entire IOMMU domain management (sort of the entire
vfio_iommu_type1), but it's unclear whether such an intrusive change
is worthwhile (especially when VFIO/VDPA already go different routes
even in the legacy mapping uAPI: map/unmap vs. IOTLB).

Thoughts?

Thanks
Kevin


* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-12  8:38 (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Tian, Kevin
@ 2020-10-13  6:22 ` Jason Wang
  2020-10-14  3:08   ` Tian, Kevin
  2020-10-13 10:27 ` Jean-Philippe Brucker
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-10-13  6:22 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin


On 2020/10/12 4:38 PM, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Monday, September 14, 2020 12:20 PM
>>
> [...]
>   > If it's possible, I would suggest a generic uAPI instead of a VFIO
>> specific one.
>>
>> Jason suggest something like /dev/sva. There will be a lot of other
>> subsystems that could benefit from this (e.g vDPA).
>>
>> Have you ever considered this approach?
>>
> Hi, Jason,
>
> We did some study on this approach and below is the output. It's a
> long writing but I didn't find a way to further abstract w/o losing
> necessary context. Sorry about that.
>
> Overall the real purpose of this series is to enable IOMMU nested
> translation capability with vSVA as one major usage, through
> below new uAPIs:
> 	1) Report/enable IOMMU nested translation capability;
> 	2) Allocate/free PASID;
> 	3) Bind/unbind guest page table;
> 	4) Invalidate IOMMU cache;
> 	5) Handle IOMMU page request/response (not in this series);
> 1/3/4) is the minimal set for using IOMMU nested translation, with
> the other two optional. For example, the guest may enable vSVA on
> a device without using PASID. Or, it may bind its gIOVA page table
> which doesn't require page fault support. Finally, all operations can
> be applied to either physical device or subdevice.
>
> Then we evaluated each uAPI whether generalizing it is a good thing
> both in concept and regarding to complexity.
>
> First, unlike other uAPIs which are all backed by iommu_ops, PASID
> allocation/free is through the IOASID sub-system.


A question here, is IOASID expected to be the single management 
interface for PASID?

(I'm asking since there are already vendor-specific IDA-based PASID 
allocators, e.g. amdgpu_pasid_alloc())


>   From this angle
> we feel generalizing PASID management does make some sense.
> First, PASID is just a number and not related to any device before
> it's bound to a page table and IOMMU domain. Second, PASID is a
> global resource (at least on Intel VT-d),


I think we need a definition of "global" here. It looks to me that for 
VT-d the PASID table is per device.

Another question: is it possible to have two DMAR hardware units (I can 
see at least two even in my laptop)? In that case, is the PASID still a 
global resource?


>   while having separate VFIO/
> VDPA allocation interfaces may easily cause confusion in userspace,
> e.g. which interface to be used if both VFIO/VDPA devices exist.
> Moreover, an unified interface allows centralized control over how
> many PASIDs are allowed per process.


Yes.


>
> One unclear part with this generalization is about the permission.
> Do we open this interface to any process or only to those which
> have assigned devices? If the latter, what would be the mechanism
> to coordinate between this new interface and specific passthrough
> frameworks?


I'm not sure, but if you just want a permission check, you could 
probably introduce a new capability (CAP_XXX) for this.


>   A more tricky case, vSVA support on ARM (Eric/Jean
> please correct me) plans to do per-device PASID namespace which
> is built on a bind_pasid_table iommu callback to allow guest fully
> manage its PASIDs on a given passthrough device.


I see, so I think the answer is to prepare for the namespace support 
from the start. (Btw, I don't see how a namespace is handled in the 
current IOASID module?)


>   I'm not sure
> how such requirement can be unified w/o involving passthrough
> frameworks, or whether ARM could also switch to global PASID
> style...
>
> Second, IOMMU nested translation is a per IOMMU domain
> capability. Since IOMMU domains are managed by VFIO/VDPA
>   (alloc/free domain, attach/detach device, set/get domain attribute,
> etc.), reporting/enabling the nesting capability is an natural
> extension to the domain uAPI of existing passthrough frameworks.
> Actually, VFIO already includes a nesting enable interface even
> before this series. So it doesn't make sense to generalize this uAPI
> out.


So my understanding is that VFIO already:

1) uses multiple fds
2) separates IOMMU ops onto a dedicated container fd (type1 iommu)
3) provides an API to associate devices/groups with a container

And the proposal in this series is to reuse the container fd. It 
should be possible to replace e.g. the type1 IOMMU with a unified module.
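
For readers less familiar with the flow summarized in 1)-3), today's
association looks roughly like the sketch below; the group number and
device address are placeholders and error handling is omitted:

/* Sketch of the existing VFIO container/group/device setup sequence. */
#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int vfio_setup_sketch(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);	/* 2) IOMMU ops live here */
	int group = open("/dev/vfio/26", O_RDWR);	/* one fd per IOMMU group */

	/* 3) associate the group with the container ... */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);

	/* ... then pick the IOMMU backend for the container */
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

	/* 1) and finally obtain a separate per-device fd from the group */
	return ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
}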


>
> Then the tricky part comes with the remaining operations (3/4/5),
> which are all backed by iommu_ops thus effective only within an
> IOMMU domain. To generalize them, the first thing is to find a way
> to associate the sva_FD (opened through generic /dev/sva) with an
> IOMMU domain that is created by VFIO/VDPA. The second thing is
> to replicate {domain<->device/subdevice} association in /dev/sva
> path because some operations (e.g. page fault) is triggered/handled
> per device/subdevice.


Is there any reason that the #PF cannot be handled via the SVA fd?


>   Therefore, /dev/sva must provide both per-
> domain and per-device uAPIs similar to what VFIO/VDPA already
> does. Moreover, mapping page fault to subdevice requires pre-
> registering subdevice fault data to IOMMU layer when binding
> guest page table, while such fault data can be only retrieved from
> parent driver through VFIO/VDPA.
>
> However, we failed to find a good way even at the 1st step about
> domain association. The iommu domains are not exposed to the
> userspace, and there is no 1:1 mapping between domain and device.
> In VFIO, all devices within the same VFIO container share the address
> space but they may be organized in multiple IOMMU domains based
> on their bus type. How (should we let) the userspace know the
> domain information and open an sva_FD for each domain is the main
> problem here.


The SVA fd is not necessarily opened by userspace. It could be obtained 
through subsystem-specific uAPIs.

E.g. for vDPA, if a vDPA device contains several vSVA-capable domains, we
can (see the sketch below):

1) introduce a uAPI for userspace to learn the number of vSVA-capable domains
2) introduce e.g. VDPA_GET_SVA_FD to get the fd for each vSVA-capable domain
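
A minimal sketch of that flow, assuming a hypothetical VDPA_GET_SVA_NUM
for 1) alongside the VDPA_GET_SVA_FD suggested in 2); neither ioctl
(nor the chosen numbers) exists in today's vDPA uAPI:

/* Hypothetical flow only -- these ioctls are not part of the vDPA uAPI. */
#include <sys/ioctl.h>
#include <linux/ioctl.h>

#define VDPA_GET_SVA_NUM	_IOR(0xAF, 0x80, int)	/* made-up number */
#define VDPA_GET_SVA_FD		_IOWR(0xAF, 0x81, int)	/* made-up number */

void get_sva_fds(int vdpa_fd)
{
	int ndoms = 0;

	/* 1) how many vSVA-capable domains does this device span? */
	ioctl(vdpa_fd, VDPA_GET_SVA_NUM, &ndoms);

	for (int i = 0; i < ndoms; i++) {
		/* 2) one sva fd per domain, retrieved by index */
		int sva_fd = ioctl(vdpa_fd, VDPA_GET_SVA_FD, &i);

		/* sva_fd would then carry the bind/invalidate/fault uAPIs */
		(void)sva_fd;
	}
}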


>
> In the end we just realized that doing such generalization doesn't
> really lead to a clear design and instead requires tight coordination
> between /dev/sva and VFIO/VDPA for almost every new uAPI
> (especially about synchronization when the domain/device
> association is changed or when the device/subdevice is being reset/
> drained). Finally it may become a usability burden to the userspace
> on proper use of the two interfaces on the assigned device.
>   
> Based on above analysis we feel that just generalizing PASID mgmt.
> might be a good thing to look at while the remaining operations are
> better being VFIO/VDPA specific uAPIs. anyway in concept those are
> just a subset of the page table management capabilities that an
> IOMMU domain affords. Since all other aspects of the IOMMU domain
> is managed by VFIO/VDPA already, continuing this path for new nesting
> capability sounds natural. There is another option by generalizing the
> entire IOMMU domain management (sort of the entire vfio_iommu_
> type1), but it's unclear whether such intrusive change is worthwhile
> (especially when VFIO/VDPA already goes different route even in legacy
> mapping uAPI: map/unmap vs. IOTLB).
>
> Thoughts?


I'm ok with starting with unified PASID management and considering a 
unified vSVA/vIOMMU uAPI later.

Thanks


>
> Thanks
> Kevin



* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-12  8:38 (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Tian, Kevin
  2020-10-13  6:22 ` Jason Wang
@ 2020-10-13 10:27 ` Jean-Philippe Brucker
  2020-10-14  2:11   ` Tian, Kevin
  2020-10-14  3:16 ` Tian, Kevin
  2020-11-03  9:52 ` joro
  3 siblings, 1 reply; 55+ messages in thread
From: Jean-Philippe Brucker @ 2020-10-13 10:27 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx,
	Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

On Mon, Oct 12, 2020 at 08:38:54AM +0000, Tian, Kevin wrote:
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, September 14, 2020 12:20 PM
> >
> [...]
>  > If it's possible, I would suggest a generic uAPI instead of a VFIO
> > specific one.
> > 
> > Jason suggest something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g vDPA).
> > 
> > Have you ever considered this approach?
> > 
> 
> Hi, Jason,
> 
> We did some study on this approach and below is the output. It's a
> long writing but I didn't find a way to further abstract w/o losing 
> necessary context. Sorry about that.
> 
> Overall the real purpose of this series is to enable IOMMU nested
> translation capability with vSVA as one major usage, through
> below new uAPIs:
> 	1) Report/enable IOMMU nested translation capability;
> 	2) Allocate/free PASID;
> 	3) Bind/unbind guest page table;
> 	4) Invalidate IOMMU cache;
> 	5) Handle IOMMU page request/response (not in this series);
> 1/3/4) is the minimal set for using IOMMU nested translation, with 
> the other two optional. For example, the guest may enable vSVA on 
> a device without using PASID. Or, it may bind its gIOVA page table 
> which doesn't require page fault support. Finally, all operations can 
> be applied to either physical device or subdevice.
> 
> Then we evaluated each uAPI whether generalizing it is a good thing 
> both in concept and regarding to complexity.
> 
> First, unlike other uAPIs which are all backed by iommu_ops, PASID 
> allocation/free is through the IOASID sub-system. From this angle
> we feel generalizing PASID management does make some sense. 
> First, PASID is just a number and not related to any device before 
> it's bound to a page table and IOMMU domain. Second, PASID is a 
> global resource (at least on Intel VT-d), while having separate VFIO/
> VDPA allocation interfaces may easily cause confusion in userspace,
> e.g. which interface to be used if both VFIO/VDPA devices exist. 
> Moreover, an unified interface allows centralized control over how 
> many PASIDs are allowed per process.
> 
> One unclear part with this generalization is about the permission.
> Do we open this interface to any process or only to those which
> have assigned devices? If the latter, what would be the mechanism
> to coordinate between this new interface and specific passthrough 
> frameworks? A more tricky case, vSVA support on ARM (Eric/Jean
> please correct me) plans to do per-device PASID namespace which
> is built on a bind_pasid_table iommu callback to allow guest fully 
> manage its PASIDs on a given passthrough device.

Yes we need a bind_pasid_table. The guest needs to allocate the PASID
tables because they are accessed via guest-physical addresses by the HW
SMMU.

With bind_pasid_table, the invalidation message also requires a scope to
invalidate a whole PASID context, in addition to invalidating mapping
ranges.

> I'm not sure 
> how such requirement can be unified w/o involving passthrough
> frameworks, or whether ARM could also switch to global PASID 
> style...

Not planned at the moment, sorry. It requires a PV IOMMU to do PASID
allocation, which is possible with virtio-iommu but not with a vSMMU
emulation. The VM will manage its own PASID space. The upside is that we
don't need userspace access to IOASID, so I won't pester you with comments
on that part of the API :)

> Second, IOMMU nested translation is a per IOMMU domain
> capability. Since IOMMU domains are managed by VFIO/VDPA
>  (alloc/free domain, attach/detach device, set/get domain attribute,
> etc.), reporting/enabling the nesting capability is an natural 
> extension to the domain uAPI of existing passthrough frameworks. 
> Actually, VFIO already includes a nesting enable interface even 
> before this series. So it doesn't make sense to generalize this uAPI 
> out.

Agree for enabling, but for reporting we did consider adding a sysfs
interface in /sys/class/iommu/ describing an IOMMU's properties. We then
opted for VFIO capabilities to keep the API nice and contained, but if
we're breaking up the API, sysfs might be more convenient to use and
extend.

> Then the tricky part comes with the remaining operations (3/4/5),
> which are all backed by iommu_ops thus effective only within an 
> IOMMU domain. To generalize them, the first thing is to find a way 
> to associate the sva_FD (opened through generic /dev/sva) with an 
> IOMMU domain that is created by VFIO/VDPA. The second thing is 
> to replicate {domain<->device/subdevice} association in /dev/sva 
> path because some operations (e.g. page fault) is triggered/handled 
> per device/subdevice. Therefore, /dev/sva must provide both per-
> domain and per-device uAPIs similar to what VFIO/VDPA already 
> does. Moreover, mapping page fault to subdevice requires pre-
> registering subdevice fault data to IOMMU layer when binding 
> guest page table, while such fault data can be only retrieved from 
> parent driver through VFIO/VDPA. 
> 
> However, we failed to find a good way even at the 1st step about
> domain association. The iommu domains are not exposed to the
> userspace, and there is no 1:1 mapping between domain and device.
> In VFIO, all devices within the same VFIO container share the address
> space but they may be organized in multiple IOMMU domains based
> on their bus type. How (should we let) the userspace know the
> domain information and open an sva_FD for each domain is the main
> problem here.
> 
> In the end we just realized that doing such generalization doesn't
> really lead to a clear design and instead requires tight coordination 
> between /dev/sva and VFIO/VDPA for almost every new uAPI 
> (especially about synchronization when the domain/device 
> association is changed or when the device/subdevice is being reset/
> drained). Finally it may become a usability burden to the userspace
> on proper use of the two interfaces on the assigned device.
>  
> Based on above analysis we feel that just generalizing PASID mgmt.
> might be a good thing to look at while the remaining operations are 
> better being VFIO/VDPA specific uAPIs. anyway in concept those are 
> just a subset of the page table management capabilities that an 
> IOMMU domain affords. Since all other aspects of the IOMMU domain 
> is managed by VFIO/VDPA already, continuing this path for new nesting
> capability sounds natural. There is another option by generalizing the 
> entire IOMMU domain management (sort of the entire vfio_iommu_
> type1), but it's unclear whether such intrusive change is worthwhile 
> (especially when VFIO/VDPA already goes different route even in legacy
> mapping uAPI: map/unmap vs. IOTLB).

I agree with your analysis. A new coarse /dev/sva interface would need to
carry all the VFIO abstractions of container (minus map/unmap) and
group+device, which are not necessarily needed by VDPA and others, while
the original VFIO interface needs to stay for compatibility. To me it
makes more sense to extend each API separately, but have them embed common
structures (bind/inval) and share some resources through external
interfaces (IOASID, nesting properties, IOPF queue).
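
A rough sketch of the "embed common structures" idea: the embedded type
below is the existing one from the uapi <linux/iommu.h>, while the
wrapper structs and their fields are hypothetical:

/* Hypothetical wrappers; only the embedded struct comes from <linux/iommu.h>. */
#include <linux/iommu.h>
#include <linux/types.h>

struct vfio_iommu_bind {			/* hypothetical VFIO ioctl argument */
	__u32 argsz;
	__u32 flags;
	struct iommu_gpasid_bind_data data;	/* common part, shared with vDPA */
};

struct vdpa_domain_bind {			/* hypothetical vDPA ioctl argument */
	__u32 domain_index;
	__u32 flags;
	struct iommu_gpasid_bind_data data;	/* same common structure */
};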

Thanks,
Jean


* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-13 10:27 ` Jean-Philippe Brucker
@ 2020-10-14  2:11   ` Tian, Kevin
  0 siblings, 0 replies; 55+ messages in thread
From: Tian, Kevin @ 2020-10-14  2:11 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, peterx,
	Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jean-Philippe Brucker <jean-philippe@linaro.org>
> Sent: Tuesday, October 13, 2020 6:28 PM
> 
> On Mon, Oct 12, 2020 at 08:38:54AM +0000, Tian, Kevin wrote:
> > > From: Jason Wang <jasowang@redhat.com>
> > > Sent: Monday, September 14, 2020 12:20 PM
> > >
> > [...]
> >  > If it's possible, I would suggest a generic uAPI instead of a VFIO
> > > specific one.
> > >
> > > Jason suggest something like /dev/sva. There will be a lot of other
> > > subsystems that could benefit from this (e.g vDPA).
> > >
> > > Have you ever considered this approach?
> > >
> >
> > Hi, Jason,
> >
> > We did some study on this approach and below is the output. It's a
> > long writing but I didn't find a way to further abstract w/o losing
> > necessary context. Sorry about that.
> >
> > Overall the real purpose of this series is to enable IOMMU nested
> > translation capability with vSVA as one major usage, through
> > below new uAPIs:
> > 	1) Report/enable IOMMU nested translation capability;
> > 	2) Allocate/free PASID;
> > 	3) Bind/unbind guest page table;
> > 	4) Invalidate IOMMU cache;
> > 	5) Handle IOMMU page request/response (not in this series);
> > 1/3/4) is the minimal set for using IOMMU nested translation, with
> > the other two optional. For example, the guest may enable vSVA on
> > a device without using PASID. Or, it may bind its gIOVA page table
> > which doesn't require page fault support. Finally, all operations can
> > be applied to either physical device or subdevice.
> >
> > Then we evaluated each uAPI whether generalizing it is a good thing
> > both in concept and regarding to complexity.
> >
> > First, unlike other uAPIs which are all backed by iommu_ops, PASID
> > allocation/free is through the IOASID sub-system. From this angle
> > we feel generalizing PASID management does make some sense.
> > First, PASID is just a number and not related to any device before
> > it's bound to a page table and IOMMU domain. Second, PASID is a
> > global resource (at least on Intel VT-d), while having separate VFIO/
> > VDPA allocation interfaces may easily cause confusion in userspace,
> > e.g. which interface to be used if both VFIO/VDPA devices exist.
> > Moreover, an unified interface allows centralized control over how
> > many PASIDs are allowed per process.
> >
> > One unclear part with this generalization is about the permission.
> > Do we open this interface to any process or only to those which
> > have assigned devices? If the latter, what would be the mechanism
> > to coordinate between this new interface and specific passthrough
> > frameworks? A more tricky case, vSVA support on ARM (Eric/Jean
> > please correct me) plans to do per-device PASID namespace which
> > is built on a bind_pasid_table iommu callback to allow guest fully
> > manage its PASIDs on a given passthrough device.
> 
> Yes we need a bind_pasid_table. The guest needs to allocate the PASID
> tables because they are accessed via guest-physical addresses by the HW
> SMMU.
> 
> With bind_pasid_table, the invalidation message also requires a scope to
> invalidate a whole PASID context, in addition to invalidating a mappings
> ranges.
> 
> > I'm not sure
> > how such requirement can be unified w/o involving passthrough
> > frameworks, or whether ARM could also switch to global PASID
> > style...
> 
> Not planned at the moment, sorry. It requires a PV IOMMU to do PASID
> allocation, which is possible with virtio-iommu but not with a vSMMU
> emulation. The VM will manage its own PASID space. The upside is that we
> don't need userspace access to IOASID, so I won't pester you with comments
> on that part of the API :)

Makes sense. Possibly in the future, when you plan to support a
SIOV-like capability, you may have to convert the PASID table to use
host physical addresses, and then the same API could be reused. :)

Thanks
Kevin

> 
> > Second, IOMMU nested translation is a per IOMMU domain
> > capability. Since IOMMU domains are managed by VFIO/VDPA
> >  (alloc/free domain, attach/detach device, set/get domain attribute,
> > etc.), reporting/enabling the nesting capability is an natural
> > extension to the domain uAPI of existing passthrough frameworks.
> > Actually, VFIO already includes a nesting enable interface even
> > before this series. So it doesn't make sense to generalize this uAPI
> > out.
> 
> Agree for enabling, but for reporting we did consider adding a sysfs
> interface in /sys/class/iommu/ describing an IOMMU's properties. Then
> opted for VFIO capabilities to keep the API nice and contained, but if
> we're breaking up the API, sysfs might be more convenient to use and
> extend.
> 
> > Then the tricky part comes with the remaining operations (3/4/5),
> > which are all backed by iommu_ops thus effective only within an
> > IOMMU domain. To generalize them, the first thing is to find a way
> > to associate the sva_FD (opened through generic /dev/sva) with an
> > IOMMU domain that is created by VFIO/VDPA. The second thing is
> > to replicate {domain<->device/subdevice} association in /dev/sva
> > path because some operations (e.g. page fault) is triggered/handled
> > per device/subdevice. Therefore, /dev/sva must provide both per-
> > domain and per-device uAPIs similar to what VFIO/VDPA already
> > does. Moreover, mapping page fault to subdevice requires pre-
> > registering subdevice fault data to IOMMU layer when binding
> > guest page table, while such fault data can be only retrieved from
> > parent driver through VFIO/VDPA.
> >
> > However, we failed to find a good way even at the 1st step about
> > domain association. The iommu domains are not exposed to the
> > userspace, and there is no 1:1 mapping between domain and device.
> > In VFIO, all devices within the same VFIO container share the address
> > space but they may be organized in multiple IOMMU domains based
> > on their bus type. How (should we let) the userspace know the
> > domain information and open an sva_FD for each domain is the main
> > problem here.
> >
> > In the end we just realized that doing such generalization doesn't
> > really lead to a clear design and instead requires tight coordination
> > between /dev/sva and VFIO/VDPA for almost every new uAPI
> > (especially about synchronization when the domain/device
> > association is changed or when the device/subdevice is being reset/
> > drained). Finally it may become a usability burden to the userspace
> > on proper use of the two interfaces on the assigned device.
> >
> > Based on above analysis we feel that just generalizing PASID mgmt.
> > might be a good thing to look at while the remaining operations are
> > better being VFIO/VDPA specific uAPIs. anyway in concept those are
> > just a subset of the page table management capabilities that an
> > IOMMU domain affords. Since all other aspects of the IOMMU domain
> > is managed by VFIO/VDPA already, continuing this path for new nesting
> > capability sounds natural. There is another option by generalizing the
> > entire IOMMU domain management (sort of the entire vfio_iommu_
> > type1), but it's unclear whether such intrusive change is worthwhile
> > (especially when VFIO/VDPA already goes different route even in legacy
> > mapping uAPI: map/unmap vs. IOTLB).
> 
> I agree with your analysis. A new coarse /dev/sva interface would need to
> carry all the VFIO abstractions of container (minus map/unmap) and
> group+device, which are not necessarily needed by VDPA and others, while
> the original VFIO interface needs to stay for compatibility. To me it
> makes more sense to extend each API separately, but have them embed
> common
> structures (bind/inval) and share some resources through external
> interfaces (IOASID, nesting properties, IOPF queue).
> 
> Thanks,
> Jean


* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-13  6:22 ` Jason Wang
@ 2020-10-14  3:08   ` Tian, Kevin
  2020-10-14 23:10     ` Alex Williamson
  2020-10-15  6:52     ` Jason Wang
  0 siblings, 2 replies; 55+ messages in thread
From: Tian, Kevin @ 2020-10-14  3:08 UTC (permalink / raw)
  To: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 13, 2020 2:22 PM
> 
> 
> On 2020/10/12 4:38 PM, Tian, Kevin wrote:
> >> From: Jason Wang <jasowang@redhat.com>
> >> Sent: Monday, September 14, 2020 12:20 PM
> >>
> > [...]
> >   > If it's possible, I would suggest a generic uAPI instead of a VFIO
> >> specific one.
> >>
> >> Jason suggest something like /dev/sva. There will be a lot of other
> >> subsystems that could benefit from this (e.g vDPA).
> >>
> >> Have you ever considered this approach?
> >>
> > Hi, Jason,
> >
> > We did some study on this approach and below is the output. It's a
> > long writing but I didn't find a way to further abstract w/o losing
> > necessary context. Sorry about that.
> >
> > Overall the real purpose of this series is to enable IOMMU nested
> > translation capability with vSVA as one major usage, through
> > below new uAPIs:
> > 	1) Report/enable IOMMU nested translation capability;
> > 	2) Allocate/free PASID;
> > 	3) Bind/unbind guest page table;
> > 	4) Invalidate IOMMU cache;
> > 	5) Handle IOMMU page request/response (not in this series);
> > 1/3/4) is the minimal set for using IOMMU nested translation, with
> > the other two optional. For example, the guest may enable vSVA on
> > a device without using PASID. Or, it may bind its gIOVA page table
> > which doesn't require page fault support. Finally, all operations can
> > be applied to either physical device or subdevice.
> >
> > Then we evaluated each uAPI whether generalizing it is a good thing
> > both in concept and regarding to complexity.
> >
> > First, unlike other uAPIs which are all backed by iommu_ops, PASID
> > allocation/free is through the IOASID sub-system.
> 
> 
> A question here, is IOASID expected to be the single management
> interface for PASID?

yes

> 
> (I'm asking since there're already vendor specific IDA based PASID
> allocator e.g amdgpu_pasid_alloc())

That predates the introduction of the IOASID core. I think it should be
changed to use the new generic interface. Jacob/Jean can better comment
on whether another reason exists for this exception.

> 
> 
> >   From this angle
> > we feel generalizing PASID management does make some sense.
> > First, PASID is just a number and not related to any device before
> > it's bound to a page table and IOMMU domain. Second, PASID is a
> > global resource (at least on Intel VT-d),
> 
> 
> I think we need a definition of "global" here. It looks to me for vt-d
> the PASID table is per device.

The PASID table is per device, so VT-d could in concept support
per-device PASIDs. However, on Intel platforms we require PASIDs to be
managed system-wide (across host and guest) when combining vSVA, SIOV,
SR-IOV and ENQCMD together. Thus the host creates only one 'global'
PASID namespace but does use per-device PASID tables to assure isolation
between devices on Intel platforms. ARM does it differently, as Jean
explained: they have a global namespace for host processes on all
host-owned devices (same as Intel), but a per-device namespace once a
device (and its PASID table) is assigned to userspace.

> 
> Another question, is this possible to have two DMAR hardware unit(at
> least I can see two even in my laptop). In this case, is PASID still a
> global resource?

yes

> 
> 
> >   while having separate VFIO/
> > VDPA allocation interfaces may easily cause confusion in userspace,
> > e.g. which interface to be used if both VFIO/VDPA devices exist.
> > Moreover, an unified interface allows centralized control over how
> > many PASIDs are allowed per process.
> 
> 
> Yes.
> 
> 
> >
> > One unclear part with this generalization is about the permission.
> > Do we open this interface to any process or only to those which
> > have assigned devices? If the latter, what would be the mechanism
> > to coordinate between this new interface and specific passthrough
> > frameworks?
> 
> 
> I'm not sure, but if you just want a permission, you probably can
> introduce new capability (CAP_XXX) for this.
> 
> 
> >   A more tricky case, vSVA support on ARM (Eric/Jean
> > please correct me) plans to do per-device PASID namespace which
> > is built on a bind_pasid_table iommu callback to allow guest fully
> > manage its PASIDs on a given passthrough device.
> 
> 
> I see, so I think the answer is to prepare for the namespace support
> from the start. (btw, I don't see how namespace is handled in current
> IOASID module?)

The PASID table is based on GPA when nested translation is enabled on
the ARM SMMU. This design implies that the guest manages the PASID
table, and thus the PASIDs, instead of going through a host-side API on
the assigned device. From this angle we don't need an explicit namespace
in the host API; we just need a way to control how many PASIDs a process
is allowed to allocate in the global namespace. Btw, the IOASID module
already has a per-process 'set' concept and PASIDs are managed per set,
so quota control can easily be introduced at the 'set' level.
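
As a rough illustration of that last point, a quota could sit in front
of the existing in-kernel allocator. ioasid_alloc()/INVALID_IOASID below
are today's IOASID API; the quota fields and the wrapper are hypothetical:

/* Hypothetical per-set quota wrapped around the existing IOASID allocator. */
#include <linux/ioasid.h>

struct sva_pasid_set {
	struct ioasid_set set;	/* the per-process 'set' mentioned above */
	unsigned int quota;	/* max PASIDs this process may hold (hypothetical) */
	unsigned int used;
};

static ioasid_t sva_pasid_alloc(struct sva_pasid_set *s, void *priv)
{
	ioasid_t id;

	if (s->used >= s->quota)
		return INVALID_IOASID;

	/* PASIDs are 20 bits wide; 0 is reserved */
	id = ioasid_alloc(&s->set, 1, (1U << 20) - 1, priv);
	if (id != INVALID_IOASID)
		s->used++;
	return id;
}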

> 
> 
> >   I'm not sure
> > how such requirement can be unified w/o involving passthrough
> > frameworks, or whether ARM could also switch to global PASID
> > style...
> >
> > Second, IOMMU nested translation is a per IOMMU domain
> > capability. Since IOMMU domains are managed by VFIO/VDPA
> >   (alloc/free domain, attach/detach device, set/get domain attribute,
> > etc.), reporting/enabling the nesting capability is an natural
> > extension to the domain uAPI of existing passthrough frameworks.
> > Actually, VFIO already includes a nesting enable interface even
> > before this series. So it doesn't make sense to generalize this uAPI
> > out.
> 
> 
> So my understanding is that VFIO already:
> 
> 1) use multiple fds
> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
> 3) provides API to associated devices/group with a container
> 
> And all the proposal in this series is to reuse the container fd. It
> should be possible to replace e.g type1 IOMMU with a unified module.

yes, this is the alternative option that I raised in the last paragraph.

> 
> 
> >
> > Then the tricky part comes with the remaining operations (3/4/5),
> > which are all backed by iommu_ops thus effective only within an
> > IOMMU domain. To generalize them, the first thing is to find a way
> > to associate the sva_FD (opened through generic /dev/sva) with an
> > IOMMU domain that is created by VFIO/VDPA. The second thing is
> > to replicate {domain<->device/subdevice} association in /dev/sva
> > path because some operations (e.g. page fault) is triggered/handled
> > per device/subdevice.
> 
> 
> Is there any reason that the #PF can not be handled via SVA fd?

Using per-device FDs or multiplexing all fault info through one sva_FD
is just an implementation choice. The key point is that faults must be
marked per device/subdevice, which in any case requires a userspace-
visible handle/tag to represent the device/subdevice, and the domain/
device association must still be constructed in this new path.
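
Purely to illustrate the tagging point, a fault record multiplexed over
one sva_FD might look like the hypothetical layout below; none of these
fields come from the series:

/* Hypothetical fault record; dev_cookie is the userspace-visible tag. */
#include <linux/types.h>

struct sva_fault_event {
	__u32 dev_cookie;	/* tag registered for the device/subdevice at bind time */
	__u32 pasid;
	__u64 fault_addr;
	__u32 reason;		/* e.g. not-present vs. permission */
	__u32 flags;		/* e.g. response required */
};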

> 
> 
> >   Therefore, /dev/sva must provide both per-
> > domain and per-device uAPIs similar to what VFIO/VDPA already
> > does. Moreover, mapping page fault to subdevice requires pre-
> > registering subdevice fault data to IOMMU layer when binding
> > guest page table, while such fault data can be only retrieved from
> > parent driver through VFIO/VDPA.
> >
> > However, we failed to find a good way even at the 1st step about
> > domain association. The iommu domains are not exposed to the
> > userspace, and there is no 1:1 mapping between domain and device.
> > In VFIO, all devices within the same VFIO container share the address
> > space but they may be organized in multiple IOMMU domains based
> > on their bus type. How (should we let) the userspace know the
> > domain information and open an sva_FD for each domain is the main
> > problem here.
> 
> 
> The SVA fd is not necessarily opened by userspace. It could be get
> through subsystem specific uAPIs.
> 
> E.g for vDPA if a vDPA device contains several vSVA-capable domains, we can:
> 
> 1) introduce uAPI for userspace to know the number of vSVA-capable
> domain
> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each vSVA-capable
> domain

...and also a new interface to notify userspace when a domain disappears
or a device is detached? In the end it looks like we are creating a
complete set of new subsystem-specific uAPIs just to generalize another
set of subsystem-specific uAPIs. Remember that after separating PASID
management out, most of the remaining vSVA uAPIs are simple wrappers of
the IOMMU API. Replicating them is much simpler than developing a new
glue mechanism in each subsystem.

> 
> 
> >
> > In the end we just realized that doing such generalization doesn't
> > really lead to a clear design and instead requires tight coordination
> > between /dev/sva and VFIO/VDPA for almost every new uAPI
> > (especially about synchronization when the domain/device
> > association is changed or when the device/subdevice is being reset/
> > drained). Finally it may become a usability burden to the userspace
> > on proper use of the two interfaces on the assigned device.
> >
> > Based on above analysis we feel that just generalizing PASID mgmt.
> > might be a good thing to look at while the remaining operations are
> > better being VFIO/VDPA specific uAPIs. anyway in concept those are
> > just a subset of the page table management capabilities that an
> > IOMMU domain affords. Since all other aspects of the IOMMU domain
> > is managed by VFIO/VDPA already, continuing this path for new nesting
> > capability sounds natural. There is another option by generalizing the
> > entire IOMMU domain management (sort of the entire vfio_iommu_
> > type1), but it's unclear whether such intrusive change is worthwhile
> > (especially when VFIO/VDPA already goes different route even in legacy
> > mapping uAPI: map/unmap vs. IOTLB).
> >
> > Thoughts?
> 
> 
> I'm ok with starting with a unified PASID management and consider the
> unified vSVA/vIOMMU uAPI later.
> 

Glad to see that we have consensus here. :)

Thanks
Kevin


* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-12  8:38 (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Tian, Kevin
  2020-10-13  6:22 ` Jason Wang
  2020-10-13 10:27 ` Jean-Philippe Brucker
@ 2020-10-14  3:16 ` Tian, Kevin
  2020-10-16 15:36   ` Jason Gunthorpe
  2020-11-03  9:52 ` joro
  3 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2020-10-14  3:16 UTC (permalink / raw)
  To: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

Hi, Alex and Jason (G),

What is your opinion on this new proposal? For now it looks like both
Jason (W) and Jean are OK with this direction, and more discussion is
probably required for the new /dev/ioasid interface. Internally we're
doing a quick prototype to check for any unforeseen issues with this
separation.

Please let us know your thoughts.

Thanks
Kevin

> From: Tian, Kevin <kevin.tian@intel.com>
> Sent: Monday, October 12, 2020 4:39 PM
> 
> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Monday, September 14, 2020 12:20 PM
> >
> [...]
>  > If it's possible, I would suggest a generic uAPI instead of a VFIO
> > specific one.
> >
> > Jason suggest something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g vDPA).
> >
> > Have you ever considered this approach?
> >
> 
> Hi, Jason,
> 
> We did some study on this approach and below is the output. It's a
> long writing but I didn't find a way to further abstract w/o losing
> necessary context. Sorry about that.
> 
> Overall the real purpose of this series is to enable IOMMU nested
> translation capability with vSVA as one major usage, through
> below new uAPIs:
> 	1) Report/enable IOMMU nested translation capability;
> 	2) Allocate/free PASID;
> 	3) Bind/unbind guest page table;
> 	4) Invalidate IOMMU cache;
> 	5) Handle IOMMU page request/response (not in this series);
> 1/3/4) is the minimal set for using IOMMU nested translation, with
> the other two optional. For example, the guest may enable vSVA on
> a device without using PASID. Or, it may bind its gIOVA page table
> which doesn't require page fault support. Finally, all operations can
> be applied to either physical device or subdevice.
> 
> Then we evaluated each uAPI whether generalizing it is a good thing
> both in concept and regarding to complexity.
> 
> First, unlike other uAPIs which are all backed by iommu_ops, PASID
> allocation/free is through the IOASID sub-system. From this angle
> we feel generalizing PASID management does make some sense.
> First, PASID is just a number and not related to any device before
> it's bound to a page table and IOMMU domain. Second, PASID is a
> global resource (at least on Intel VT-d), while having separate VFIO/
> VDPA allocation interfaces may easily cause confusion in userspace,
> e.g. which interface to be used if both VFIO/VDPA devices exist.
> Moreover, an unified interface allows centralized control over how
> many PASIDs are allowed per process.
> 
> One unclear part with this generalization is about the permission.
> Do we open this interface to any process or only to those which
> have assigned devices? If the latter, what would be the mechanism
> to coordinate between this new interface and specific passthrough
> frameworks? A more tricky case, vSVA support on ARM (Eric/Jean
> please correct me) plans to do per-device PASID namespace which
> is built on a bind_pasid_table iommu callback to allow guest fully
> manage its PASIDs on a given passthrough device. I'm not sure
> how such requirement can be unified w/o involving passthrough
> frameworks, or whether ARM could also switch to global PASID
> style...
> 
> Second, IOMMU nested translation is a per IOMMU domain
> capability. Since IOMMU domains are managed by VFIO/VDPA
>  (alloc/free domain, attach/detach device, set/get domain attribute,
> etc.), reporting/enabling the nesting capability is an natural
> extension to the domain uAPI of existing passthrough frameworks.
> Actually, VFIO already includes a nesting enable interface even
> before this series. So it doesn't make sense to generalize this uAPI
> out.
> 
> Then the tricky part comes with the remaining operations (3/4/5),
> which are all backed by iommu_ops thus effective only within an
> IOMMU domain. To generalize them, the first thing is to find a way
> to associate the sva_FD (opened through generic /dev/sva) with an
> IOMMU domain that is created by VFIO/VDPA. The second thing is
> to replicate {domain<->device/subdevice} association in /dev/sva
> path because some operations (e.g. page fault) is triggered/handled
> per device/subdevice. Therefore, /dev/sva must provide both per-
> domain and per-device uAPIs similar to what VFIO/VDPA already
> does. Moreover, mapping page fault to subdevice requires pre-
> registering subdevice fault data to IOMMU layer when binding
> guest page table, while such fault data can be only retrieved from
> parent driver through VFIO/VDPA.
> 
> However, we failed to find a good way even at the 1st step about
> domain association. The iommu domains are not exposed to the
> userspace, and there is no 1:1 mapping between domain and device.
> In VFIO, all devices within the same VFIO container share the address
> space but they may be organized in multiple IOMMU domains based
> on their bus type. How (should we let) the userspace know the
> domain information and open an sva_FD for each domain is the main
> problem here.
> 
> In the end we just realized that doing such generalization doesn't
> really lead to a clear design and instead requires tight coordination
> between /dev/sva and VFIO/VDPA for almost every new uAPI
> (especially about synchronization when the domain/device
> association is changed or when the device/subdevice is being reset/
> drained). Finally it may become a usability burden to the userspace
> on proper use of the two interfaces on the assigned device.
> 
> Based on above analysis we feel that just generalizing PASID mgmt.
> might be a good thing to look at while the remaining operations are
> better being VFIO/VDPA specific uAPIs. anyway in concept those are
> just a subset of the page table management capabilities that an
> IOMMU domain affords. Since all other aspects of the IOMMU domain
> is managed by VFIO/VDPA already, continuing this path for new nesting
> capability sounds natural. There is another option by generalizing the
> entire IOMMU domain management (sort of the entire vfio_iommu_
> type1), but it's unclear whether such intrusive change is worthwhile
> (especially when VFIO/VDPA already goes different route even in legacy
> mapping uAPI: map/unmap vs. IOTLB).
> 
> Thoughts?
> 
> Thanks
> Kevin


* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-14  3:08   ` Tian, Kevin
@ 2020-10-14 23:10     ` Alex Williamson
  2020-10-15  7:02       ` Jason Wang
  2020-10-15  6:52     ` Jason Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Alex Williamson @ 2020-10-14 23:10 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Liu, Yi L, eric.auger, baolu.lu, joro, jacob.jun.pan,
	Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, Wu,
	Hao, stefanha, iommu, kvm, Jason Gunthorpe, Michael S. Tsirkin

On Wed, 14 Oct 2020 03:08:31 +0000
"Tian, Kevin" <kevin.tian@intel.com> wrote:

> > From: Jason Wang <jasowang@redhat.com>
> > Sent: Tuesday, October 13, 2020 2:22 PM
> > 
> > 
> > On 2020/10/12 4:38 PM, Tian, Kevin wrote:
> > >> From: Jason Wang <jasowang@redhat.com>
> > >> Sent: Monday, September 14, 2020 12:20 PM
> > >>  
> > > [...]  
> > >   > If it's possible, I would suggest a generic uAPI instead of a VFIO
> > >> specific one.
> > >>
> > >> Jason suggest something like /dev/sva. There will be a lot of other
> > >> subsystems that could benefit from this (e.g vDPA).
> > >>
> > >> Have you ever considered this approach?
> > >>  
> > > Hi, Jason,
> > >
> > > We did some study on this approach and below is the output. It's a
> > > long writing but I didn't find a way to further abstract w/o losing
> > > necessary context. Sorry about that.
> > >
> > > Overall the real purpose of this series is to enable IOMMU nested
> > > translation capability with vSVA as one major usage, through
> > > below new uAPIs:
> > > 	1) Report/enable IOMMU nested translation capability;
> > > 	2) Allocate/free PASID;
> > > 	3) Bind/unbind guest page table;
> > > 	4) Invalidate IOMMU cache;
> > > 	5) Handle IOMMU page request/response (not in this series);
> > > 1/3/4) is the minimal set for using IOMMU nested translation, with
> > > the other two optional. For example, the guest may enable vSVA on
> > > a device without using PASID. Or, it may bind its gIOVA page table
> > > which doesn't require page fault support. Finally, all operations can
> > > be applied to either physical device or subdevice.
> > >
> > > Then we evaluated each uAPI whether generalizing it is a good thing
> > > both in concept and regarding to complexity.
> > >
> > > First, unlike other uAPIs which are all backed by iommu_ops, PASID
> > > allocation/free is through the IOASID sub-system.  
> > 
> > 
> > A question here, is IOASID expected to be the single management
> > interface for PASID?  
> 
> yes
> 
> > 
> > (I'm asking since there're already vendor specific IDA based PASID
> > allocator e.g amdgpu_pasid_alloc())  
> 
> That comes before IOASID core was introduced. I think it should be
> changed to use the new generic interface. Jacob/Jean can better
> comment if other reason exists for this exception.
> 
> > 
> >   
> > >   From this angle
> > > we feel generalizing PASID management does make some sense.
> > > First, PASID is just a number and not related to any device before
> > > it's bound to a page table and IOMMU domain. Second, PASID is a
> > > global resource (at least on Intel VT-d),  
> > 
> > 
> > I think we need a definition of "global" here. It looks to me for vt-d
> > the PASID table is per device.  
> 
> PASID table is per device, thus VT-d could support per-device PASIDs
> in concept. However on Intel platform we require PASIDs to be managed 
> in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV 
> and ENQCMD together. Thus the host creates only one 'global' PASID 
> namespace but do use per-device PASID table to assure isolation between 
> devices on Intel platforms. But ARM does it differently as Jean explained. 
> They have a global namespace for host processes on all host-owned 
> devices (same as Intel), but then per-device namespace when a device 
> (and its PASID table) is assigned to userspace.
> 
> > 
> > Another question, is this possible to have two DMAR hardware unit(at
> > least I can see two even in my laptop). In this case, is PASID still a
> > global resource?  
> 
> yes
> 
> > 
> >   
> > >   while having separate VFIO/
> > > VDPA allocation interfaces may easily cause confusion in userspace,
> > > e.g. which interface to be used if both VFIO/VDPA devices exist.
> > > Moreover, an unified interface allows centralized control over how
> > > many PASIDs are allowed per process.  
> > 
> > 
> > Yes.
> > 
> >   
> > >
> > > One unclear part with this generalization is about the permission.
> > > Do we open this interface to any process or only to those which
> > > have assigned devices? If the latter, what would be the mechanism
> > > to coordinate between this new interface and specific passthrough
> > > frameworks?  
> > 
> > 
> > I'm not sure, but if you just want a permission, you probably can
> > introduce new capability (CAP_XXX) for this.
> > 
> >   
> > >   A more tricky case, vSVA support on ARM (Eric/Jean
> > > please correct me) plans to do per-device PASID namespace which
> > > is built on a bind_pasid_table iommu callback to allow guest fully
> > > manage its PASIDs on a given passthrough device.  
> > 
> > 
> > I see, so I think the answer is to prepare for the namespace support
> > from the start. (btw, I don't see how namespace is handled in current
> > IOASID module?)  
> 
> The PASID table is based on GPA when nested translation is enabled 
> on ARM SMMU. This design implies that the guest manages PASID
> table thus PASIDs instead of going through host-side API on assigned 
> device. From this angle we don't need explicit namespace in the host 
> API. Just need a way to control how many PASIDs a process is allowed 
> to allocate in the global namespace. btw IOASID module already has 
> 'set' concept per-process and PASIDs are managed per-set. Then the 
> quota control can be easily introduced in the 'set' level.
> 
> > 
> >   
> > >   I'm not sure
> > > how such requirement can be unified w/o involving passthrough
> > > frameworks, or whether ARM could also switch to global PASID
> > > style...
> > >
> > > Second, IOMMU nested translation is a per IOMMU domain
> > > capability. Since IOMMU domains are managed by VFIO/VDPA
> > >   (alloc/free domain, attach/detach device, set/get domain attribute,
> > > etc.), reporting/enabling the nesting capability is an natural
> > > extension to the domain uAPI of existing passthrough frameworks.
> > > Actually, VFIO already includes a nesting enable interface even
> > > before this series. So it doesn't make sense to generalize this uAPI
> > > out.  
> > 
> > 
> > So my understanding is that VFIO already:
> > 
> > 1) use multiple fds
> > 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
> > 3) provides API to associated devices/group with a container

This is not really correct, or at least doesn't match my mental model.
A vfio container represents a set of groups (one or more devices per
group), which share an IOMMU model and context.  The user separately
opens a vfio container and group device files.  A group is associated
to the container via ioctl on the group, providing the container fd.
The user then sets the IOMMU model on the container, which selects the
vfio IOMMU uAPI they'll use.  We support multiple IOMMU models where
each vfio IOMMU backend registers a set of callbacks with vfio-core.

> > And all the proposal in this series is to reuse the container fd. It
> > should be possible to replace e.g type1 IOMMU with a unified module.  
> 
> yes, this is the alternative option that I raised in the last paragraph.

"[R]euse the container fd" is where I get lost here.  The container is
a fundamental part of vfio.  Does this instead mean to introduce a new
vfio IOMMU backend model?  The module would need to interact with vfio
via vfio_iommu_driver_ops callbacks, so this "unified module" requires
a vfio interface.  I don't understand how this contributes to something
that vdpa would also make use of.
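
For context, that interaction point looks roughly like the registration
below. The ops structure and vfio_register_iommu_driver() are today's
in-kernel interface (5.9-era layout assumed); the unified_* handlers are
hypothetical stubs:

/* Sketch: even a "unified module" must plug into vfio through these callbacks. */
#include <linux/module.h>
#include <linux/vfio.h>

static void *unified_open(unsigned long arg)
{
	return NULL;		/* would allocate per-container state */
}

static void unified_release(void *iommu_data) { }

static long unified_ioctl(void *iommu_data, unsigned int cmd, unsigned long arg)
{
	return -ENOTTY;		/* map/unmap, nesting, vSVA ops would live here */
}

static int unified_attach_group(void *iommu_data, struct iommu_group *group)
{
	return 0;		/* domain <-> group association */
}

static void unified_detach_group(void *iommu_data, struct iommu_group *group) { }

static const struct vfio_iommu_driver_ops unified_iommu_ops = {
	.name		= "vfio-iommu-unified",
	.owner		= THIS_MODULE,
	.open		= unified_open,
	.release	= unified_release,
	.ioctl		= unified_ioctl,
	.attach_group	= unified_attach_group,
	.detach_group	= unified_detach_group,
};

static int __init unified_init(void)
{
	return vfio_register_iommu_driver(&unified_iommu_ops);
}
module_init(unified_init);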


> > > Then the tricky part comes with the remaining operations (3/4/5),
> > > which are all backed by iommu_ops thus effective only within an
> > > IOMMU domain. To generalize them, the first thing is to find a way
> > > to associate the sva_FD (opened through generic /dev/sva) with an
> > > IOMMU domain that is created by VFIO/VDPA. The second thing is
> > > to replicate {domain<->device/subdevice} association in /dev/sva
> > > path because some operations (e.g. page fault) is triggered/handled
> > > per device/subdevice.  
> > 
> > 
> > Is there any reason that the #PF can not be handled via SVA fd?  
> 
> using per-device FDs or multiplexing all fault info through one sva_FD
> is just an implementation choice. The key is to mark faults per device/
> subdevice thus anyway requires a userspace-visible handle/tag to
> represent device/subdevice and the domain/device association must
> be constructed in this new path.
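(To make that point concrete, a fault record multiplexed over one
sva_FD would need to look something like below. Everything here is
hypothetical except struct iommu_fault, which is the fault data the
IOMMU layer already defines; the interesting part is the dev_tag that
userspace would somehow have to register per (sub)device:)

	#include <linux/types.h>
	#include <linux/iommu.h>	/* struct iommu_fault */

	struct sva_fault_msg {
		__u32 argsz;
		__u32 flags;
		__u64 dev_tag;		/* userspace-visible handle for the
					 * (sub)device that faulted */
		struct iommu_fault fault;	/* PASID, address, perm, ... */
	};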
> 
> > 
> >   
> > >   Therefore, /dev/sva must provide both per-
> > > domain and per-device uAPIs similar to what VFIO/VDPA already
> > > does. Moreover, mapping page fault to subdevice requires pre-
> > > registering subdevice fault data to IOMMU layer when binding
> > > guest page table, while such fault data can be only retrieved from
> > > parent driver through VFIO/VDPA.
> > >
> > > However, we failed to find a good way even at the 1st step about
> > > domain association. The iommu domains are not exposed to the
> > > userspace, and there is no 1:1 mapping between domain and device.
> > > In VFIO, all devices within the same VFIO container share the address
> > > space but they may be organized in multiple IOMMU domains based
> > > on their bus type. How (should we let) the userspace know the
> > > domain information and open an sva_FD for each domain is the main
> > > problem here.  
> > 
> > 
> > The SVA fd is not necessarily opened by userspace. It could be get
> > through subsystem specific uAPIs.
> > 
> > E.g for vDPA if a vDPA device contains several vSVA-capable domains, we can:
> > 
> > 1) introduce uAPI for userspace to know the number of vSVA-capable
> > domain
> > 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each vSVA-capable
> > domain  
> 
> and also new interface to notify userspace when a domain disappears
> or a device is detached? Finally looks we are creating a completely set
> of new subsystem specific uAPIs just for generalizing another set of
> subsystem specific uAPIs. Remember after separating PASID mgmt.
> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU 
> API. Replicating them is much easier logic than developing a new glue 
> mechanism in each subsystem.

Right, I don't see the advantage here; subsystem-specific uAPIs using
common internal interfaces is what was being proposed.

> > > In the end we just realized that doing such generalization doesn't
> > > really lead to a clear design and instead requires tight coordination
> > > between /dev/sva and VFIO/VDPA for almost every new uAPI
> > > (especially about synchronization when the domain/device
> > > association is changed or when the device/subdevice is being reset/
> > > drained). Finally it may become a usability burden to the userspace
> > > on proper use of the two interfaces on the assigned device.
> > >
> > > Based on above analysis we feel that just generalizing PASID mgmt.
> > > might be a good thing to look at while the remaining operations are
> > > better being VFIO/VDPA specific uAPIs. anyway in concept those are
> > > just a subset of the page table management capabilities that an
> > > IOMMU domain affords. Since all other aspects of the IOMMU domain
> > > is managed by VFIO/VDPA already, continuing this path for new nesting
> > > capability sounds natural. There is another option by generalizing the
> > > entire IOMMU domain management (sort of the entire vfio_iommu_
> > > type1), but it's unclear whether such intrusive change is worthwhile
> > > (especially when VFIO/VDPA already goes different route even in legacy
> > > mapping uAPI: map/unmap vs. IOTLB).
> > >
> > > Thoughts?  
> > 
> > 
> > I'm ok with starting with a unified PASID management and consider the
> > unified vSVA/vIOMMU uAPI later.
> >   
> 
> Glad to see that we have consensus here. :)

I see the benefit in a common PASID quota mechanism rather than the
ad-hoc limits introduced for vfio, but vfio integration does have the
benefit of being tied to device access, whereas it seems a
user will need to be granted some CAP_SVA capability separate from the
device to make use of this interface.  It's possible for vfio to honor
shared limits, just as we make use of locked memory limits shared by
the task, so I'm not yet sure of the benefit provided by a separate
userspace interface outside of vfio.  A separate interface also throws
a kink in userspace use of vfio, where we expect the interface is
largely self-contained, i.e. if a user has access to the vfio group and
container device files, they can fully make use of their device, up to
limits imposed by things like locked memory.  I'm concerned that
management tools will actually need to understand the intended usage of
a device in order to grant new capabilities, file access, and limits to
a process making use of these features.  Hopefully your prototype will
clarify some of those aspects.  Thanks,

Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-14  3:08   ` Tian, Kevin
  2020-10-14 23:10     ` Alex Williamson
@ 2020-10-15  6:52     ` Jason Wang
  2020-10-15  7:58       ` Tian, Kevin
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-10-15  6:52 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin


On 2020/10/14 11:08 AM, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Tuesday, October 13, 2020 2:22 PM
>>
>>
>> On 2020/10/12 4:38 PM, Tian, Kevin wrote:
>>>> From: Jason Wang <jasowang@redhat.com>
>>>> Sent: Monday, September 14, 2020 12:20 PM
>>>>
>>> [...]
>>>    > If it's possible, I would suggest a generic uAPI instead of a VFIO
>>>> specific one.
>>>>
>>>> Jason suggest something like /dev/sva. There will be a lot of other
>>>> subsystems that could benefit from this (e.g vDPA).
>>>>
>>>> Have you ever considered this approach?
>>>>
>>> Hi, Jason,
>>>
>>> We did some study on this approach and below is the output. It's a
>>> long writing but I didn't find a way to further abstract w/o losing
>>> necessary context. Sorry about that.
>>>
>>> Overall the real purpose of this series is to enable IOMMU nested
>>> translation capability with vSVA as one major usage, through
>>> below new uAPIs:
>>> 	1) Report/enable IOMMU nested translation capability;
>>> 	2) Allocate/free PASID;
>>> 	3) Bind/unbind guest page table;
>>> 	4) Invalidate IOMMU cache;
>>> 	5) Handle IOMMU page request/response (not in this series);
>>> 1/3/4) is the minimal set for using IOMMU nested translation, with
>>> the other two optional. For example, the guest may enable vSVA on
>>> a device without using PASID. Or, it may bind its gIOVA page table
>>> which doesn't require page fault support. Finally, all operations can
>>> be applied to either physical device or subdevice.
>>>
>>> Then we evaluated each uAPI whether generalizing it is a good thing
>>> both in concept and regarding to complexity.
>>>
>>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
>>> allocation/free is through the IOASID sub-system.
>>
>> A question here, is IOASID expected to be the single management
>> interface for PASID?
> yes
>
>> (I'm asking since there're already vendor specific IDA based PASID
>> allocator e.g amdgpu_pasid_alloc())
> That comes before IOASID core was introduced. I think it should be
> changed to use the new generic interface. Jacob/Jean can better
> comment if other reason exists for this exception.


If there's no reason for an exception, it should be fixed.


>
>>
>>>    From this angle
>>> we feel generalizing PASID management does make some sense.
>>> First, PASID is just a number and not related to any device before
>>> it's bound to a page table and IOMMU domain. Second, PASID is a
>>> global resource (at least on Intel VT-d),
>>
>> I think we need a definition of "global" here. It looks to me for vt-d
>> the PASID table is per device.
> PASID table is per device, thus VT-d could support per-device PASIDs
> in concept.


I think that's a requirement of the PCIe spec, which says that PASID + RID 
identifies the process address space.


>   However on Intel platform we require PASIDs to be managed
> in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
> and ENQCMD together.


Any reason for such a requirement? (I'm not familiar with ENQCMD, but my 
understanding is that vSVA, SIOV, and SR-IOV don't by themselves require 
system-wide PASIDs.)


> Thus the host creates only one 'global' PASID
> namespace but do use per-device PASID table to assure isolation between
> devices on Intel platforms. But ARM does it differently as Jean explained.
> They have a global namespace for host processes on all host-owned
> devices (same as Intel), but then per-device namespace when a device
> (and its PASID table) is assigned to userspace.
>
>> Another question, is this possible to have two DMAR hardware unit(at
>> least I can see two even in my laptop). In this case, is PASID still a
>> global resource?
> yes
>
>>
>>>    while having separate VFIO/
>>> VDPA allocation interfaces may easily cause confusion in userspace,
>>> e.g. which interface to be used if both VFIO/VDPA devices exist.
>>> Moreover, an unified interface allows centralized control over how
>>> many PASIDs are allowed per process.
>>
>> Yes.
>>
>>
>>> One unclear part with this generalization is about the permission.
>>> Do we open this interface to any process or only to those which
>>> have assigned devices? If the latter, what would be the mechanism
>>> to coordinate between this new interface and specific passthrough
>>> frameworks?
>>
>> I'm not sure, but if you just want a permission, you probably can
>> introduce new capability (CAP_XXX) for this.
>>
>>
>>>    A more tricky case, vSVA support on ARM (Eric/Jean
>>> please correct me) plans to do per-device PASID namespace which
>>> is built on a bind_pasid_table iommu callback to allow guest fully
>>> manage its PASIDs on a given passthrough device.
>>
>> I see, so I think the answer is to prepare for the namespace support
>> from the start. (btw, I don't see how namespace is handled in current
>> IOASID module?)
> The PASID table is based on GPA when nested translation is enabled
> on ARM SMMU. This design implies that the guest manages PASID
> table thus PASIDs instead of going through host-side API on assigned
> device. From this angle we don't need explicit namespace in the host
> API. Just need a way to control how many PASIDs a process is allowed
> to allocate in the global namespace. btw IOASID module already has
> 'set' concept per-process and PASIDs are managed per-set. Then the
> quota control can be easily introduced in the 'set' level.
>
>>
>>>    I'm not sure
>>> how such requirement can be unified w/o involving passthrough
>>> frameworks, or whether ARM could also switch to global PASID
>>> style...
>>>
>>> Second, IOMMU nested translation is a per IOMMU domain
>>> capability. Since IOMMU domains are managed by VFIO/VDPA
>>>    (alloc/free domain, attach/detach device, set/get domain attribute,
>>> etc.), reporting/enabling the nesting capability is an natural
>>> extension to the domain uAPI of existing passthrough frameworks.
>>> Actually, VFIO already includes a nesting enable interface even
>>> before this series. So it doesn't make sense to generalize this uAPI
>>> out.
>>
>> So my understanding is that VFIO already:
>>
>> 1) use multiple fds
>> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
>> 3) provides API to associated devices/group with a container
>>
>> And all the proposal in this series is to reuse the container fd. It
>> should be possible to replace e.g type1 IOMMU with a unified module.
> yes, this is the alternative option that I raised in the last paragraph.
>
>>
>>> Then the tricky part comes with the remaining operations (3/4/5),
>>> which are all backed by iommu_ops thus effective only within an
>>> IOMMU domain. To generalize them, the first thing is to find a way
>>> to associate the sva_FD (opened through generic /dev/sva) with an
>>> IOMMU domain that is created by VFIO/VDPA. The second thing is
>>> to replicate {domain<->device/subdevice} association in /dev/sva
>>> path because some operations (e.g. page fault) is triggered/handled
>>> per device/subdevice.
>>
>> Is there any reason that the #PF can not be handled via SVA fd?
> using per-device FDs or multiplexing all fault info through one sva_FD
> is just an implementation choice. The key is to mark faults per device/
> subdevice thus anyway requires a userspace-visible handle/tag to
> represent device/subdevice and the domain/device association must
> be constructed in this new path.


I don't get why it requires a userspace-visible handle/tag. The binding 
between the SVA fd and the device fd could be done either explicitly or 
implicitly, so userspace knows which (sub)device this SVA fd is for.


>
>>
>>>    Therefore, /dev/sva must provide both per-
>>> domain and per-device uAPIs similar to what VFIO/VDPA already
>>> does. Moreover, mapping page fault to subdevice requires pre-
>>> registering subdevice fault data to IOMMU layer when binding
>>> guest page table, while such fault data can be only retrieved from
>>> parent driver through VFIO/VDPA.
>>>
>>> However, we failed to find a good way even at the 1st step about
>>> domain association. The iommu domains are not exposed to the
>>> userspace, and there is no 1:1 mapping between domain and device.
>>> In VFIO, all devices within the same VFIO container share the address
>>> space but they may be organized in multiple IOMMU domains based
>>> on their bus type. How (should we let) the userspace know the
>>> domain information and open an sva_FD for each domain is the main
>>> problem here.
>>
>> The SVA fd is not necessarily opened by userspace. It could be get
>> through subsystem specific uAPIs.
>>
>> E.g for vDPA if a vDPA device contains several vSVA-capable domains, we can:
>>
>> 1) introduce uAPI for userspace to know the number of vSVA-capable
>> domain
>> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each vSVA-capable
>> domain
> and also new interface to notify userspace when a domain disappears
> or a device is detached?


You need to deal with this case even in VFIO, don't you?


>   Finally looks we are creating a completely set
> of new subsystem specific uAPIs just for generalizing another set of
> subsystem specific uAPIs. Remember after separating PASID mgmt.
> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
> API. Replicating them is much easier logic than developing a new glue
> mechanism in each subsystem.


As discussed, the point is more than just simple generalization. It's 
about the limitations of the current uAPI. So I have the following questions:

Do we want a single PASID to be used by more than one device? If yes, 
do we want those devices to share I/O page tables? If yes, which uAPI is 
used to program the shared I/O page tables?

Thanks


>
>>
>>> In the end we just realized that doing such generalization doesn't
>>> really lead to a clear design and instead requires tight coordination
>>> between /dev/sva and VFIO/VDPA for almost every new uAPI
>>> (especially about synchronization when the domain/device
>>> association is changed or when the device/subdevice is being reset/
>>> drained). Finally it may become a usability burden to the userspace
>>> on proper use of the two interfaces on the assigned device.
>>>
>>> Based on above analysis we feel that just generalizing PASID mgmt.
>>> might be a good thing to look at while the remaining operations are
>>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
>>> just a subset of the page table management capabilities that an
>>> IOMMU domain affords. Since all other aspects of the IOMMU domain
>>> is managed by VFIO/VDPA already, continuing this path for new nesting
>>> capability sounds natural. There is another option by generalizing the
>>> entire IOMMU domain management (sort of the entire vfio_iommu_
>>> type1), but it's unclear whether such intrusive change is worthwhile
>>> (especially when VFIO/VDPA already goes different route even in legacy
>>> mapping uAPI: map/unmap vs. IOTLB).
>>>
>>> Thoughts?
>>
>> I'm ok with starting with a unified PASID management and consider the
>> unified vSVA/vIOMMU uAPI later.
>>
> Glad to see that we have consensus here. :)
>
> Thanks
> Kevin


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-14 23:10     ` Alex Williamson
@ 2020-10-15  7:02       ` Jason Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Wang @ 2020-10-15  7:02 UTC (permalink / raw)
  To: Alex Williamson, Tian, Kevin
  Cc: Liu, Yi L, eric.auger, baolu.lu, joro, jacob.jun.pan, Raj, Ashok,
	Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, Wu, Hao, stefanha,
	iommu, kvm, Jason Gunthorpe, Michael S. Tsirkin


On 2020/10/15 7:10 AM, Alex Williamson wrote:
> On Wed, 14 Oct 2020 03:08:31 +0000
> "Tian, Kevin" <kevin.tian@intel.com> wrote:
>
>>> From: Jason Wang <jasowang@redhat.com>
>>> Sent: Tuesday, October 13, 2020 2:22 PM
>>>
>>>
>>> On 2020/10/12 4:38 PM, Tian, Kevin wrote:
>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>> Sent: Monday, September 14, 2020 12:20 PM
>>>>>   
>>>> [...]
>>>>    > If it's possible, I would suggest a generic uAPI instead of a VFIO
>>>>> specific one.
>>>>>
>>>>> Jason suggest something like /dev/sva. There will be a lot of other
>>>>> subsystems that could benefit from this (e.g vDPA).
>>>>>
>>>>> Have you ever considered this approach?
>>>>>   
>>>> Hi, Jason,
>>>>
>>>> We did some study on this approach and below is the output. It's a
>>>> long writing but I didn't find a way to further abstract w/o losing
>>>> necessary context. Sorry about that.
>>>>
>>>> Overall the real purpose of this series is to enable IOMMU nested
>>>> translation capability with vSVA as one major usage, through
>>>> below new uAPIs:
>>>> 	1) Report/enable IOMMU nested translation capability;
>>>> 	2) Allocate/free PASID;
>>>> 	3) Bind/unbind guest page table;
>>>> 	4) Invalidate IOMMU cache;
>>>> 	5) Handle IOMMU page request/response (not in this series);
>>>> 1/3/4) is the minimal set for using IOMMU nested translation, with
>>>> the other two optional. For example, the guest may enable vSVA on
>>>> a device without using PASID. Or, it may bind its gIOVA page table
>>>> which doesn't require page fault support. Finally, all operations can
>>>> be applied to either physical device or subdevice.
>>>>
>>>> Then we evaluated each uAPI whether generalizing it is a good thing
>>>> both in concept and regarding to complexity.
>>>>
>>>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
>>>> allocation/free is through the IOASID sub-system.
>>>
>>> A question here, is IOASID expected to be the single management
>>> interface for PASID?
>> yes
>>
>>> (I'm asking since there're already vendor specific IDA based PASID
>>> allocator e.g amdgpu_pasid_alloc())
>> That comes before IOASID core was introduced. I think it should be
>> changed to use the new generic interface. Jacob/Jean can better
>> comment if other reason exists for this exception.
>>
>>>    
>>>>    From this angle
>>>> we feel generalizing PASID management does make some sense.
>>>> First, PASID is just a number and not related to any device before
>>>> it's bound to a page table and IOMMU domain. Second, PASID is a
>>>> global resource (at least on Intel VT-d),
>>>
>>> I think we need a definition of "global" here. It looks to me for vt-d
>>> the PASID table is per device.
>> PASID table is per device, thus VT-d could support per-device PASIDs
>> in concept. However on Intel platform we require PASIDs to be managed
>> in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
>> and ENQCMD together. Thus the host creates only one 'global' PASID
>> namespace but do use per-device PASID table to assure isolation between
>> devices on Intel platforms. But ARM does it differently as Jean explained.
>> They have a global namespace for host processes on all host-owned
>> devices (same as Intel), but then per-device namespace when a device
>> (and its PASID table) is assigned to userspace.
>>
>>> Another question, is this possible to have two DMAR hardware unit(at
>>> least I can see two even in my laptop). In this case, is PASID still a
>>> global resource?
>> yes
>>
>>>    
>>>>    while having separate VFIO/
>>>> VDPA allocation interfaces may easily cause confusion in userspace,
>>>> e.g. which interface to be used if both VFIO/VDPA devices exist.
>>>> Moreover, an unified interface allows centralized control over how
>>>> many PASIDs are allowed per process.
>>>
>>> Yes.
>>>
>>>    
>>>> One unclear part with this generalization is about the permission.
>>>> Do we open this interface to any process or only to those which
>>>> have assigned devices? If the latter, what would be the mechanism
>>>> to coordinate between this new interface and specific passthrough
>>>> frameworks?
>>>
>>> I'm not sure, but if you just want a permission, you probably can
>>> introduce new capability (CAP_XXX) for this.
>>>
>>>    
>>>>    A more tricky case, vSVA support on ARM (Eric/Jean
>>>> please correct me) plans to do per-device PASID namespace which
>>>> is built on a bind_pasid_table iommu callback to allow guest fully
>>>> manage its PASIDs on a given passthrough device.
>>>
>>> I see, so I think the answer is to prepare for the namespace support
>>> from the start. (btw, I don't see how namespace is handled in current
>>> IOASID module?)
>> The PASID table is based on GPA when nested translation is enabled
>> on ARM SMMU. This design implies that the guest manages PASID
>> table thus PASIDs instead of going through host-side API on assigned
>> device. From this angle we don't need explicit namespace in the host
>> API. Just need a way to control how many PASIDs a process is allowed
>> to allocate in the global namespace. btw IOASID module already has
>> 'set' concept per-process and PASIDs are managed per-set. Then the
>> quota control can be easily introduced in the 'set' level.
>>
>>>    
>>>>    I'm not sure
>>>> how such requirement can be unified w/o involving passthrough
>>>> frameworks, or whether ARM could also switch to global PASID
>>>> style...
>>>>
>>>> Second, IOMMU nested translation is a per IOMMU domain
>>>> capability. Since IOMMU domains are managed by VFIO/VDPA
>>>>    (alloc/free domain, attach/detach device, set/get domain attribute,
>>>> etc.), reporting/enabling the nesting capability is an natural
>>>> extension to the domain uAPI of existing passthrough frameworks.
>>>> Actually, VFIO already includes a nesting enable interface even
>>>> before this series. So it doesn't make sense to generalize this uAPI
>>>> out.
>>>
>>> So my understanding is that VFIO already:
>>>
>>> 1) use multiple fds
>>> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
>>> 3) provides API to associated devices/group with a container
> This is not really correct, or at least doesn't match my mental model.
> A vfio container represents a set of groups (one or more devices per
> group), which share an IOMMU model and context.  The user separately
> opens a vfio container and group device files.  A group is associated
> to the container via ioctl on the group, providing the container fd.
> The user then sets the IOMMU model on the container, which selects the
> vfio IOMMU uAPI they'll use.  We support multiple IOMMU models where
> each vfio IOMMU backend registers a set of callbacks with vfio-core.


Yes.


>
>>> And all the proposal in this series is to reuse the container fd. It
>>> should be possible to replace e.g type1 IOMMU with a unified module.
>> yes, this is the alternative option that I raised in the last paragraph.
> "[R]euse the container fd" is where I get lost here.  The container is
> a fundamental part of vfio.  Does this instead mean to introduce a new
> vfio IOMMU backend model?


Yes, a new backend model, or allowing an external module to be used 
as its IOMMU backend.


>    The module would need to interact with vfio
> via vfio_iommu_driver_ops callbacks, so this "unified module" requires
> a vfio interface.  I don't understand how this contributes to something
> that vdpa would also make use of.


If an external module is allowed, then it could be reused by vDPA and 
any other subsystem that wants to do vSVA.


>
>
>>>> Then the tricky part comes with the remaining operations (3/4/5),
>>>> which are all backed by iommu_ops thus effective only within an
>>>> IOMMU domain. To generalize them, the first thing is to find a way
>>>> to associate the sva_FD (opened through generic /dev/sva) with an
>>>> IOMMU domain that is created by VFIO/VDPA. The second thing is
>>>> to replicate {domain<->device/subdevice} association in /dev/sva
>>>> path because some operations (e.g. page fault) is triggered/handled
>>>> per device/subdevice.
>>>
>>> Is there any reason that the #PF can not be handled via SVA fd?
>> using per-device FDs or multiplexing all fault info through one sva_FD
>> is just an implementation choice. The key is to mark faults per device/
>> subdevice thus anyway requires a userspace-visible handle/tag to
>> represent device/subdevice and the domain/device association must
>> be constructed in this new path.
>>
>>>    
>>>>    Therefore, /dev/sva must provide both per-
>>>> domain and per-device uAPIs similar to what VFIO/VDPA already
>>>> does. Moreover, mapping page fault to subdevice requires pre-
>>>> registering subdevice fault data to IOMMU layer when binding
>>>> guest page table, while such fault data can be only retrieved from
>>>> parent driver through VFIO/VDPA.
>>>>
>>>> However, we failed to find a good way even at the 1st step about
>>>> domain association. The iommu domains are not exposed to the
>>>> userspace, and there is no 1:1 mapping between domain and device.
>>>> In VFIO, all devices within the same VFIO container share the address
>>>> space but they may be organized in multiple IOMMU domains based
>>>> on their bus type. How (should we let) the userspace know the
>>>> domain information and open an sva_FD for each domain is the main
>>>> problem here.
>>>
>>> The SVA fd is not necessarily opened by userspace. It could be get
>>> through subsystem specific uAPIs.
>>>
>>> E.g for vDPA if a vDPA device contains several vSVA-capable domains, we can:
>>>
>>> 1) introduce uAPI for userspace to know the number of vSVA-capable
>>> domain
>>> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each vSVA-capable
>>> domain
>> and also new interface to notify userspace when a domain disappears
>> or a device is detached? Finally looks we are creating a completely set
>> of new subsystem specific uAPIs just for generalizing another set of
>> subsystem specific uAPIs. Remember after separating PASID mgmt.
>> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
>> API. Replicating them is much easier logic than developing a new glue
>> mechanism in each subsystem.
> Right, I don't see the advantage here, subsystem specific uAPIs using
> common internal interfaces is what was being proposed.


The problem is that if PASID were per-device, this could work. But if 
it's not, we will get a conflict if more than one device (subsystem) 
wants to use the same PASID to identify the same process address 
space. If this is true, we need a uAPI beyond a VFIO-specific one.


>
>>>> In the end we just realized that doing such generalization doesn't
>>>> really lead to a clear design and instead requires tight coordination
>>>> between /dev/sva and VFIO/VDPA for almost every new uAPI
>>>> (especially about synchronization when the domain/device
>>>> association is changed or when the device/subdevice is being reset/
>>>> drained). Finally it may become a usability burden to the userspace
>>>> on proper use of the two interfaces on the assigned device.
>>>>
>>>> Based on above analysis we feel that just generalizing PASID mgmt.
>>>> might be a good thing to look at while the remaining operations are
>>>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
>>>> just a subset of the page table management capabilities that an
>>>> IOMMU domain affords. Since all other aspects of the IOMMU domain
>>>> is managed by VFIO/VDPA already, continuing this path for new nesting
>>>> capability sounds natural. There is another option by generalizing the
>>>> entire IOMMU domain management (sort of the entire vfio_iommu_
>>>> type1), but it's unclear whether such intrusive change is worthwhile
>>>> (especially when VFIO/VDPA already goes different route even in legacy
>>>> mapping uAPI: map/unmap vs. IOTLB).
>>>>
>>>> Thoughts?
>>>
>>> I'm ok with starting with a unified PASID management and consider the
>>> unified vSVA/vIOMMU uAPI later.
>>>    
>> Glad to see that we have consensus here. :)
> I see the benefit in a common PASID quota mechanism rather than the
> ad-hoc limits introduced for vfio, but vfio integration does have the
> benefit of being tied to device access, whereas it seems a
> user will need to be granted some CAP_SVA capability separate from the
> device to make use of this interface.  It's possible for vfio to honor
> shared limits, just as we make use of locked memory limits shared by
> the task, so I'm not yet sure of the benefit provided by a separate
> userspace interface outside of vfio.  A separate interface also throws
> a kink in userspace use of vfio, where we expect the interface is
> largely self-contained, i.e. if a user has access to the vfio group and
> container device files, they can fully make use of their device, up to
> limits imposed by things like locked memory.  I'm concerned that
> management tools will actually need to understand the intended usage of
> a device in order to grant new capabilities, file access, and limits to
> a process making use of these features.  Hopefully your prototype will
> clarify some of those aspects.  Thanks,
>
> Alex


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-15  6:52     ` Jason Wang
@ 2020-10-15  7:58       ` Tian, Kevin
  2020-10-15  8:40         ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Tian, Kevin @ 2020-10-15  7:58 UTC (permalink / raw)
  To: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, October 15, 2020 2:52 PM
> 
> 
> On 2020/10/14 11:08 AM, Tian, Kevin wrote:
> >> From: Jason Wang <jasowang@redhat.com>
> >> Sent: Tuesday, October 13, 2020 2:22 PM
> >>
> >>
> >> On 2020/10/12 4:38 PM, Tian, Kevin wrote:
> >>>> From: Jason Wang <jasowang@redhat.com>
> >>>> Sent: Monday, September 14, 2020 12:20 PM
> >>>>
> >>> [...]
> >>>    > If it's possible, I would suggest a generic uAPI instead of a VFIO
> >>>> specific one.
> >>>>
> >>>> Jason suggest something like /dev/sva. There will be a lot of other
> >>>> subsystems that could benefit from this (e.g vDPA).
> >>>>
> >>>> Have you ever considered this approach?
> >>>>
> >>> Hi, Jason,
> >>>
> >>> We did some study on this approach and below is the output. It's a
> >>> long writing but I didn't find a way to further abstract w/o losing
> >>> necessary context. Sorry about that.
> >>>
> >>> Overall the real purpose of this series is to enable IOMMU nested
> >>> translation capability with vSVA as one major usage, through
> >>> below new uAPIs:
> >>> 	1) Report/enable IOMMU nested translation capability;
> >>> 	2) Allocate/free PASID;
> >>> 	3) Bind/unbind guest page table;
> >>> 	4) Invalidate IOMMU cache;
> >>> 	5) Handle IOMMU page request/response (not in this series);
> >>> 1/3/4) is the minimal set for using IOMMU nested translation, with
> >>> the other two optional. For example, the guest may enable vSVA on
> >>> a device without using PASID. Or, it may bind its gIOVA page table
> >>> which doesn't require page fault support. Finally, all operations can
> >>> be applied to either physical device or subdevice.
> >>>
> >>> Then we evaluated each uAPI whether generalizing it is a good thing
> >>> both in concept and regarding to complexity.
> >>>
> >>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
> >>> allocation/free is through the IOASID sub-system.
> >>
> >> A question here, is IOASID expected to be the single management
> >> interface for PASID?
> > yes
> >
> >> (I'm asking since there're already vendor specific IDA based PASID
> >> allocator e.g amdgpu_pasid_alloc())
> > That comes before IOASID core was introduced. I think it should be
> > changed to use the new generic interface. Jacob/Jean can better
> > comment if other reason exists for this exception.
> 
> 
> If there's no exception it should be fixed.
> 
> 
> >
> >>
> >>>    From this angle
> >>> we feel generalizing PASID management does make some sense.
> >>> First, PASID is just a number and not related to any device before
> >>> it's bound to a page table and IOMMU domain. Second, PASID is a
> >>> global resource (at least on Intel VT-d),
> >>
> >> I think we need a definition of "global" here. It looks to me for vt-d
> >> the PASID table is per device.
> > PASID table is per device, thus VT-d could support per-device PASIDs
> > in concept.
> 
> 
> I think that's the requirement of PCIE spec which said PASID + RID
> identifies the process address space ID.
> 
> 
> >   However on Intel platform we require PASIDs to be managed
> > in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
> > and ENQCMD together.
> 
> 
> Any reason for such requirement? (I'm not familiar with ENQCMD, but my
> understanding is that vSVA, SIOV or SR-IOV doesn't have the requirement
> for system-wide PASID).

ENQCMD is a new instruction that allows multiple processes to submit
work to one shared workqueue. Each process has a unique PASID saved
in an MSR, which is included in the ENQCMD payload to indicate the
address space when the CPU sends the command to the device. As one
process might issue ENQCMD to multiple devices, OS-wide PASID
allocation is required on both the host and the guest side.

When executing ENQCMD in the guest to a SIOV device, the guest-
programmed value in the PASID_MSR must be translated to a host PASID
value for proper function/isolation, as the PASID represents the
address space. The translation is done through a new VMCS PASID
translation structure (per-VM, 1:1 mapping). From this angle the host
PASIDs must be allocated 'globally' across all assigned devices;
otherwise it may lead to a 1:N mapping when a guest process issues
ENQCMD to multiple assigned devices/subdevices.

There will be a KVM forum session for this topic btw.
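Conceptually the translation is just a per-VM, 1:1 lookup from guest
PASID to host PASID, something like below (a purely illustrative
sketch, not the actual VMCS structure layout):

	#include <stdint.h>

	#define PASID_MAX	(1u << 20)	/* PASID is a 20-bit ID */

	/* one table per VM: guest PASID -> host PASID (0 = not present) */
	struct vm_pasid_xlate {
		uint32_t host_pasid[PASID_MAX];
	};

	static uint32_t xlate_guest_pasid(const struct vm_pasid_xlate *t,
					  uint32_t guest_pasid)
	{
		/* the ENQCMD payload carries the guest PASID; it is
		 * replaced with the host PASID before it reaches the
		 * device */
		return t->host_pasid[guest_pasid & (PASID_MAX - 1)];
	}

If the same guest PASID could map to different host PASIDs on
different devices, a single 1:1 table like this could not exist,
hence the need for one global host PASID namespace.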

> 
> 
> > Thus the host creates only one 'global' PASID
> > namespace but do use per-device PASID table to assure isolation between
> > devices on Intel platforms. But ARM does it differently as Jean explained.
> > They have a global namespace for host processes on all host-owned
> > devices (same as Intel), but then per-device namespace when a device
> > (and its PASID table) is assigned to userspace.
> >
> >> Another question, is this possible to have two DMAR hardware unit(at
> >> least I can see two even in my laptop). In this case, is PASID still a
> >> global resource?
> > yes
> >
> >>
> >>>    while having separate VFIO/
> >>> VDPA allocation interfaces may easily cause confusion in userspace,
> >>> e.g. which interface to be used if both VFIO/VDPA devices exist.
> >>> Moreover, an unified interface allows centralized control over how
> >>> many PASIDs are allowed per process.
> >>
> >> Yes.
> >>
> >>
> >>> One unclear part with this generalization is about the permission.
> >>> Do we open this interface to any process or only to those which
> >>> have assigned devices? If the latter, what would be the mechanism
> >>> to coordinate between this new interface and specific passthrough
> >>> frameworks?
> >>
> >> I'm not sure, but if you just want a permission, you probably can
> >> introduce new capability (CAP_XXX) for this.
> >>
> >>
> >>>    A more tricky case, vSVA support on ARM (Eric/Jean
> >>> please correct me) plans to do per-device PASID namespace which
> >>> is built on a bind_pasid_table iommu callback to allow guest fully
> >>> manage its PASIDs on a given passthrough device.
> >>
> >> I see, so I think the answer is to prepare for the namespace support
> >> from the start. (btw, I don't see how namespace is handled in current
> >> IOASID module?)
> > The PASID table is based on GPA when nested translation is enabled
> > on ARM SMMU. This design implies that the guest manages PASID
> > table thus PASIDs instead of going through host-side API on assigned
> > device. From this angle we don't need explicit namespace in the host
> > API. Just need a way to control how many PASIDs a process is allowed
> > to allocate in the global namespace. btw IOASID module already has
> > 'set' concept per-process and PASIDs are managed per-set. Then the
> > quota control can be easily introduced in the 'set' level.
> >
> >>
> >>>    I'm not sure
> >>> how such requirement can be unified w/o involving passthrough
> >>> frameworks, or whether ARM could also switch to global PASID
> >>> style...
> >>>
> >>> Second, IOMMU nested translation is a per IOMMU domain
> >>> capability. Since IOMMU domains are managed by VFIO/VDPA
> >>>    (alloc/free domain, attach/detach device, set/get domain attribute,
> >>> etc.), reporting/enabling the nesting capability is an natural
> >>> extension to the domain uAPI of existing passthrough frameworks.
> >>> Actually, VFIO already includes a nesting enable interface even
> >>> before this series. So it doesn't make sense to generalize this uAPI
> >>> out.
> >>
> >> So my understanding is that VFIO already:
> >>
> >> 1) use multiple fds
> >> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
> >> 3) provides API to associated devices/group with a container
> >>
> >> And all the proposal in this series is to reuse the container fd. It
> >> should be possible to replace e.g type1 IOMMU with a unified module.
> > yes, this is the alternative option that I raised in the last paragraph.
> >
> >>
> >>> Then the tricky part comes with the remaining operations (3/4/5),
> >>> which are all backed by iommu_ops thus effective only within an
> >>> IOMMU domain. To generalize them, the first thing is to find a way
> >>> to associate the sva_FD (opened through generic /dev/sva) with an
> >>> IOMMU domain that is created by VFIO/VDPA. The second thing is
> >>> to replicate {domain<->device/subdevice} association in /dev/sva
> >>> path because some operations (e.g. page fault) is triggered/handled
> >>> per device/subdevice.
> >>
> >> Is there any reason that the #PF can not be handled via SVA fd?
> > using per-device FDs or multiplexing all fault info through one sva_FD
> > is just an implementation choice. The key is to mark faults per device/
> > subdevice thus anyway requires a userspace-visible handle/tag to
> > represent device/subdevice and the domain/device association must
> > be constructed in this new path.
> 
> 
> I don't get why it requires a userspace-visible handle/tag. The binding
> between SVA fd and device fd could be done either explicitly or
> implicitly. So userspace know which (sub)device that this SVA fd is for.
> 
> 
> >
> >>
> >>>    Therefore, /dev/sva must provide both per-
> >>> domain and per-device uAPIs similar to what VFIO/VDPA already
> >>> does. Moreover, mapping page fault to subdevice requires pre-
> >>> registering subdevice fault data to IOMMU layer when binding
> >>> guest page table, while such fault data can be only retrieved from
> >>> parent driver through VFIO/VDPA.
> >>>
> >>> However, we failed to find a good way even at the 1st step about
> >>> domain association. The iommu domains are not exposed to the
> >>> userspace, and there is no 1:1 mapping between domain and device.
> >>> In VFIO, all devices within the same VFIO container share the address
> >>> space but they may be organized in multiple IOMMU domains based
> >>> on their bus type. How (should we let) the userspace know the
> >>> domain information and open an sva_FD for each domain is the main
> >>> problem here.
> >>
> >> The SVA fd is not necessarily opened by userspace. It could be get
> >> through subsystem specific uAPIs.
> >>
> >> E.g for vDPA if a vDPA device contains several vSVA-capable domains, we
> can:
> >>
> >> 1) introduce uAPI for userspace to know the number of vSVA-capable
> >> domain
> >> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each vSVA-capable
> >> domain
> > and also new interface to notify userspace when a domain disappears
> > or a device is detached?
> 
> 
> You need to deal with this case even in VFIO, isn't it?

No. VFIO doesn't expose domain knowledge to userspace.

> 
> 
> >   Finally looks we are creating a completely set
> > of new subsystem specific uAPIs just for generalizing another set of
> > subsystem specific uAPIs. Remember after separating PASID mgmt.
> > out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
> > API. Replicating them is much easier logic than developing a new glue
> > mechanism in each subsystem.
> 
> 
> As discussed, the point is more than just simple generalizing. It's
> about the limitation of current uAPI. So I have the following questions:
> 
> Do we want a single PASID to be used by more than one devices? 

Yes.

> If yes, do we want those devices to share I/O page tables? 

Yes.

> If yes, which uAPI is  used to program the shared I/O page tables?
> 

Page table binding needs to be done per-device, so userspace will
use the VFIO uAPI for a VFIO device and the vDPA uAPI for a vDPA
device. The binding request is initiated by the virtual IOMMU when it
captures a guest attempt to bind a page table to a virtual PASID
entry for a given device.
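
In pseudo-code, the dispatch in the virtual IOMMU would be roughly
like below (VFIO_BIND_GUEST_PGTBL/VDPA_BIND_GUEST_PGTBL are
placeholder names, not the actual ioctls of this series; the point is
only that the vIOMMU picks the owning subsystem's uAPI per device):

	#include <stdint.h>
	#include <errno.h>
	#include <sys/ioctl.h>
	#include <linux/ioctl.h>

	struct guest_bind {
		uint32_t pasid;		/* virtual PASID entry being bound */
		uint64_t pgd_gpa;	/* guest page table root (GPA) */
	};

	/* placeholder ioctl numbers -- NOT a real uAPI */
	#define VFIO_BIND_GUEST_PGTBL	_IOW('x', 0, struct guest_bind)
	#define VDPA_BIND_GUEST_PGTBL	_IOW('y', 0, struct guest_bind)

	enum viommu_backend { BACKEND_VFIO, BACKEND_VDPA };

	struct viommu_device {
		enum viommu_backend backend;
		int device_fd;		/* VFIO or vDPA device fd */
	};

	/* called when the vIOMMU captures the guest binding a page
	 * table to a virtual PASID entry for this device */
	static int viommu_bind_pgtbl(struct viommu_device *dev,
				     struct guest_bind *bind)
	{
		switch (dev->backend) {
		case BACKEND_VFIO:
			return ioctl(dev->device_fd,
				     VFIO_BIND_GUEST_PGTBL, bind);
		case BACKEND_VDPA:
			return ioctl(dev->device_fd,
				     VDPA_BIND_GUEST_PGTBL, bind);
		}
		return -ENODEV;
	}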

Thanks
Kevin

> 
> 
> >
> >>
> >>> In the end we just realized that doing such generalization doesn't
> >>> really lead to a clear design and instead requires tight coordination
> >>> between /dev/sva and VFIO/VDPA for almost every new uAPI
> >>> (especially about synchronization when the domain/device
> >>> association is changed or when the device/subdevice is being reset/
> >>> drained). Finally it may become a usability burden to the userspace
> >>> on proper use of the two interfaces on the assigned device.
> >>>
> >>> Based on above analysis we feel that just generalizing PASID mgmt.
> >>> might be a good thing to look at while the remaining operations are
> >>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
> >>> just a subset of the page table management capabilities that an
> >>> IOMMU domain affords. Since all other aspects of the IOMMU domain
> >>> is managed by VFIO/VDPA already, continuing this path for new nesting
> >>> capability sounds natural. There is another option by generalizing the
> >>> entire IOMMU domain management (sort of the entire vfio_iommu_
> >>> type1), but it's unclear whether such intrusive change is worthwhile
> >>> (especially when VFIO/VDPA already goes different route even in legacy
> >>> mapping uAPI: map/unmap vs. IOTLB).
> >>>
> >>> Thoughts?
> >>
> >> I'm ok with starting with a unified PASID management and consider the
> >> unified vSVA/vIOMMU uAPI later.
> >>
> > Glad to see that we have consensus here. :)
> >
> > Thanks
> > Kevin


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-15  7:58       ` Tian, Kevin
@ 2020-10-15  8:40         ` Jason Wang
  2020-10-15 10:14           ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-10-15  8:40 UTC (permalink / raw)
  To: Tian, Kevin, Liu, Yi L, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin


On 2020/10/15 3:58 PM, Tian, Kevin wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, October 15, 2020 2:52 PM
>>
>>
>> On 2020/10/14 11:08 AM, Tian, Kevin wrote:
>>>> From: Jason Wang <jasowang@redhat.com>
>>>> Sent: Tuesday, October 13, 2020 2:22 PM
>>>>
>>>>
>>>> On 2020/10/12 4:38 PM, Tian, Kevin wrote:
>>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>>> Sent: Monday, September 14, 2020 12:20 PM
>>>>>>
>>>>> [...]
>>>>>     > If it's possible, I would suggest a generic uAPI instead of a VFIO
>>>>>> specific one.
>>>>>>
>>>>>> Jason suggest something like /dev/sva. There will be a lot of other
>>>>>> subsystems that could benefit from this (e.g vDPA).
>>>>>>
>>>>>> Have you ever considered this approach?
>>>>>>
>>>>> Hi, Jason,
>>>>>
>>>>> We did some study on this approach and below is the output. It's a
>>>>> long writing but I didn't find a way to further abstract w/o losing
>>>>> necessary context. Sorry about that.
>>>>>
>>>>> Overall the real purpose of this series is to enable IOMMU nested
>>>>> translation capability with vSVA as one major usage, through
>>>>> below new uAPIs:
>>>>> 	1) Report/enable IOMMU nested translation capability;
>>>>> 	2) Allocate/free PASID;
>>>>> 	3) Bind/unbind guest page table;
>>>>> 	4) Invalidate IOMMU cache;
>>>>> 	5) Handle IOMMU page request/response (not in this series);
>>>>> 1/3/4) is the minimal set for using IOMMU nested translation, with
>>>>> the other two optional. For example, the guest may enable vSVA on
>>>>> a device without using PASID. Or, it may bind its gIOVA page table
>>>>> which doesn't require page fault support. Finally, all operations can
>>>>> be applied to either physical device or subdevice.
>>>>>
>>>>> Then we evaluated each uAPI whether generalizing it is a good thing
>>>>> both in concept and regarding to complexity.
>>>>>
>>>>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
>>>>> allocation/free is through the IOASID sub-system.
>>>> A question here, is IOASID expected to be the single management
>>>> interface for PASID?
>>> yes
>>>
>>>> (I'm asking since there're already vendor specific IDA based PASID
>>>> allocator e.g amdgpu_pasid_alloc())
>>> That comes before IOASID core was introduced. I think it should be
>>> changed to use the new generic interface. Jacob/Jean can better
>>> comment if other reason exists for this exception.
>>
>> If there's no exception it should be fixed.
>>
>>
>>>>>     From this angle
>>>>> we feel generalizing PASID management does make some sense.
>>>>> First, PASID is just a number and not related to any device before
>>>>> it's bound to a page table and IOMMU domain. Second, PASID is a
>>>>> global resource (at least on Intel VT-d),
>>>> I think we need a definition of "global" here. It looks to me for vt-d
>>>> the PASID table is per device.
>>> PASID table is per device, thus VT-d could support per-device PASIDs
>>> in concept.
>>
>> I think that's the requirement of PCIE spec which said PASID + RID
>> identifies the process address space ID.
>>
>>
>>>    However on Intel platform we require PASIDs to be managed
>>> in system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
>>> and ENQCMD together.
>>
>> Any reason for such requirement? (I'm not familiar with ENQCMD, but my
>> understanding is that vSVA, SIOV or SR-IOV doesn't have the requirement
>> for system-wide PASID).
> ENQCMD is a new instruction to allow multiple processes submitting
> workload to one shared workqueue. Each process has an unique PASID
> saved in a MSR, which is included in the ENQCMD payload to indicate
> the address space when the CPU sends to the device. As one process
> might issue ENQCMD to multiple devices, OS-wide PASID allocation is
> required both in host and guest side.
>
> When executing ENQCMD in the guest to a SIOV device, the guest
> programmed value in the PASID_MSR must be translated to a host PASID
> value for proper function/isolation as PASID represents the address
> space. The translation is done through a new VMCS PASID translation
> structure (per-VM, and 1:1 mapping). From this angle the host PASIDs
> must be allocated 'globally' cross all assigned devices otherwise it may
> lead to 1:N mapping when a guest process issues ENQCMD to multiple
> assigned devices/subdevices.
>
> There will be a KVM forum session for this topic btw.


Thanks for the background. Now I see the restriction comes from ENQCMD.


>
>>
>>> Thus the host creates only one 'global' PASID
>>> namespace but do use per-device PASID table to assure isolation between
>>> devices on Intel platforms. But ARM does it differently as Jean explained.
>>> They have a global namespace for host processes on all host-owned
>>> devices (same as Intel), but then per-device namespace when a device
>>> (and its PASID table) is assigned to userspace.
>>>
>>>> Another question, is this possible to have two DMAR hardware unit(at
>>>> least I can see two even in my laptop). In this case, is PASID still a
>>>> global resource?
>>> yes
>>>
>>>>>     while having separate VFIO/
>>>>> VDPA allocation interfaces may easily cause confusion in userspace,
>>>>> e.g. which interface to be used if both VFIO/VDPA devices exist.
>>>>> Moreover, an unified interface allows centralized control over how
>>>>> many PASIDs are allowed per process.
>>>> Yes.
>>>>
>>>>
>>>>> One unclear part with this generalization is about the permission.
>>>>> Do we open this interface to any process or only to those which
>>>>> have assigned devices? If the latter, what would be the mechanism
>>>>> to coordinate between this new interface and specific passthrough
>>>>> frameworks?
>>>> I'm not sure, but if you just want a permission, you probably can
>>>> introduce new capability (CAP_XXX) for this.
>>>>
>>>>
>>>>>     A more tricky case, vSVA support on ARM (Eric/Jean
>>>>> please correct me) plans to do per-device PASID namespace which
>>>>> is built on a bind_pasid_table iommu callback to allow guest fully
>>>>> manage its PASIDs on a given passthrough device.
>>>> I see, so I think the answer is to prepare for the namespace support
>>>> from the start. (btw, I don't see how namespace is handled in current
>>>> IOASID module?)
>>> The PASID table is based on GPA when nested translation is enabled
>>> on ARM SMMU. This design implies that the guest manages PASID
>>> table thus PASIDs instead of going through host-side API on assigned
>>> device. From this angle we don't need explicit namespace in the host
>>> API. Just need a way to control how many PASIDs a process is allowed
>>> to allocate in the global namespace. btw IOASID module already has
>>> 'set' concept per-process and PASIDs are managed per-set. Then the
>>> quota control can be easily introduced in the 'set' level.
>>>
>>>>>     I'm not sure
>>>>> how such requirement can be unified w/o involving passthrough
>>>>> frameworks, or whether ARM could also switch to global PASID
>>>>> style...
>>>>>
>>>>> Second, IOMMU nested translation is a per IOMMU domain
>>>>> capability. Since IOMMU domains are managed by VFIO/VDPA
>>>>>     (alloc/free domain, attach/detach device, set/get domain attribute,
>>>>> etc.), reporting/enabling the nesting capability is an natural
>>>>> extension to the domain uAPI of existing passthrough frameworks.
>>>>> Actually, VFIO already includes a nesting enable interface even
>>>>> before this series. So it doesn't make sense to generalize this uAPI
>>>>> out.
>>>> So my understanding is that VFIO already:
>>>>
>>>> 1) use multiple fds
>>>> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
>>>> 3) provides API to associated devices/group with a container
>>>>
>>>> And all the proposal in this series is to reuse the container fd. It
>>>> should be possible to replace e.g type1 IOMMU with a unified module.
>>> yes, this is the alternative option that I raised in the last paragraph.
>>>
>>>>> Then the tricky part comes with the remaining operations (3/4/5),
>>>>> which are all backed by iommu_ops thus effective only within an
>>>>> IOMMU domain. To generalize them, the first thing is to find a way
>>>>> to associate the sva_FD (opened through generic /dev/sva) with an
>>>>> IOMMU domain that is created by VFIO/VDPA. The second thing is
>>>>> to replicate {domain<->device/subdevice} association in /dev/sva
>>>>> path because some operations (e.g. page fault) is triggered/handled
>>>>> per device/subdevice.
>>>> Is there any reason that the #PF can not be handled via SVA fd?
>>> using per-device FDs or multiplexing all fault info through one sva_FD
>>> is just an implementation choice. The key is to mark faults per device/
>>> subdevice thus anyway requires a userspace-visible handle/tag to
>>> represent device/subdevice and the domain/device association must
>>> be constructed in this new path.
>>
>> I don't get why it requires a userspace-visible handle/tag. The binding
>> between SVA fd and device fd could be done either explicitly or
>> implicitly. So userspace know which (sub)device that this SVA fd is for.
>>
>>
>>>>>     Therefore, /dev/sva must provide both per-
>>>>> domain and per-device uAPIs similar to what VFIO/VDPA already
>>>>> does. Moreover, mapping page fault to subdevice requires pre-
>>>>> registering subdevice fault data to IOMMU layer when binding
>>>>> guest page table, while such fault data can be only retrieved from
>>>>> parent driver through VFIO/VDPA.
>>>>>
>>>>> However, we failed to find a good way even at the 1st step about
>>>>> domain association. The iommu domains are not exposed to the
>>>>> userspace, and there is no 1:1 mapping between domain and device.
>>>>> In VFIO, all devices within the same VFIO container share the address
>>>>> space but they may be organized in multiple IOMMU domains based
>>>>> on their bus type. How (should we let) the userspace know the
>>>>> domain information and open an sva_FD for each domain is the main
>>>>> problem here.
>>>> The SVA fd is not necessarily opened by userspace. It could be get
>>>> through subsystem specific uAPIs.
>>>>
>>>> E.g for vDPA if a vDPA device contains several vSVA-capable domains, we
>> can:
>>>> 1) introduce uAPI for userspace to know the number of vSVA-capable
>>>> domain
>>>> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each vSVA-capable
>>>> domain
>>> and also new interface to notify userspace when a domain disappears
>>> or a device is detached?
>>
>> You need to deal with this case even in VFIO, isn't it?
> No. VFIO doesn't expose domain knowledge to userspace.


Neither does the above API, I think.


>
>>
>>>    Finally looks we are creating a completely set
>>> of new subsystem specific uAPIs just for generalizing another set of
>>> subsystem specific uAPIs. Remember after separating PASID mgmt.
>>> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
>>> API. Replicating them is much easier logic than developing a new glue
>>> mechanism in each subsystem.
>>
>> As discussed, the point is more than just simple generalizing. It's
>> about the limitation of current uAPI. So I have the following questions:
>>
>> Do we want a single PASID to be used by more than one devices?
> Yes.
>
>> If yes, do we want those devices to share I/O page tables?
> Yes.
>
>> If yes, which uAPI is  used to program the shared I/O page tables?
>>
> Page table binding needs to be done per-device, so the userspace
> will use VFIO uAPI for VFIO device and vDPA uAPI for vDPA device.


Any design considerations for this? I think it should be done per PASID 
instead (considering that PASID is a global resource).


> The binding request is initiated by the virtual IOMMU, when capturing
> guest attempt of binding page table to a virtual PASID entry for a
> given device.


And for L2 page table programming, if a PASID is used by both e.g. VFIO 
and vDPA, does the user need to choose one of the uAPIs to build the L2 
mappings?

Thanks


>
> Thanks
> Kevin
>
>>
>>>>> In the end we just realized that doing such generalization doesn't
>>>>> really lead to a clear design and instead requires tight coordination
>>>>> between /dev/sva and VFIO/VDPA for almost every new uAPI
>>>>> (especially about synchronization when the domain/device
>>>>> association is changed or when the device/subdevice is being reset/
>>>>> drained). Finally it may become a usability burden to the userspace
>>>>> on proper use of the two interfaces on the assigned device.
>>>>>
>>>>> Based on above analysis we feel that just generalizing PASID mgmt.
>>>>> might be a good thing to look at while the remaining operations are
>>>>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
>>>>> just a subset of the page table management capabilities that an
>>>>> IOMMU domain affords. Since all other aspects of the IOMMU domain
>>>>> is managed by VFIO/VDPA already, continuing this path for new nesting
>>>>> capability sounds natural. There is another option by generalizing the
>>>>> entire IOMMU domain management (sort of the entire vfio_iommu_
>>>>> type1), but it's unclear whether such intrusive change is worthwhile
>>>>> (especially when VFIO/VDPA already goes different route even in legacy
>>>>> mapping uAPI: map/unmap vs. IOTLB).
>>>>>
>>>>> Thoughts?
>>>> I'm ok with starting with a unified PASID management and consider the
>>>> unified vSVA/vIOMMU uAPI later.
>>>>
>>> Glad to see that we have consensus here. :)
>>>
>>> Thanks
>>> Kevin


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-15  8:40         ` Jason Wang
@ 2020-10-15 10:14           ` Liu, Yi L
  2020-10-20  6:18             ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-15 10:14 UTC (permalink / raw)
  To: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, October 15, 2020 4:41 PM
> 
> 
> On 2020/10/15 下午3:58, Tian, Kevin wrote:
> >> From: Jason Wang <jasowang@redhat.com>
> >> Sent: Thursday, October 15, 2020 2:52 PM
> >>
> >>
> >> On 2020/10/14 上午11:08, Tian, Kevin wrote:
> >>>> From: Jason Wang <jasowang@redhat.com>
> >>>> Sent: Tuesday, October 13, 2020 2:22 PM
> >>>>
> >>>>
> >>>> On 2020/10/12 下午4:38, Tian, Kevin wrote:
> >>>>>> From: Jason Wang <jasowang@redhat.com>
> >>>>>> Sent: Monday, September 14, 2020 12:20 PM
> >>>>>>
> >>>>> [...]
> >>>>>     > If it's possible, I would suggest a generic uAPI instead of
> >>>>> a VFIO
> >>>>>> specific one.
> >>>>>>
> >>>>>> Jason suggest something like /dev/sva. There will be a lot of
> >>>>>> other subsystems that could benefit from this (e.g vDPA).
> >>>>>>
> >>>>>> Have you ever considered this approach?
> >>>>>>
> >>>>> Hi, Jason,
> >>>>>
> >>>>> We did some study on this approach and below is the output. It's a
> >>>>> long writing but I didn't find a way to further abstract w/o
> >>>>> losing necessary context. Sorry about that.
> >>>>>
> >>>>> Overall the real purpose of this series is to enable IOMMU nested
> >>>>> translation capability with vSVA as one major usage, through below
> >>>>> new uAPIs:
> >>>>> 	1) Report/enable IOMMU nested translation capability;
> >>>>> 	2) Allocate/free PASID;
> >>>>> 	3) Bind/unbind guest page table;
> >>>>> 	4) Invalidate IOMMU cache;
> >>>>> 	5) Handle IOMMU page request/response (not in this series);
> >>>>> 1/3/4) is the minimal set for using IOMMU nested translation, with
> >>>>> the other two optional. For example, the guest may enable vSVA on
> >>>>> a device without using PASID. Or, it may bind its gIOVA page table
> >>>>> which doesn't require page fault support. Finally, all operations
> >>>>> can be applied to either physical device or subdevice.
> >>>>>
> >>>>> Then we evaluated each uAPI whether generalizing it is a good
> >>>>> thing both in concept and regarding to complexity.
> >>>>>
> >>>>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
> >>>>> allocation/free is through the IOASID sub-system.
> >>>> A question here, is IOASID expected to be the single management
> >>>> interface for PASID?
> >>> yes
> >>>
> >>>> (I'm asking since there're already vendor specific IDA based PASID
> >>>> allocator e.g amdgpu_pasid_alloc())
> >>> That comes before IOASID core was introduced. I think it should be
> >>> changed to use the new generic interface. Jacob/Jean can better
> >>> comment if other reason exists for this exception.
> >>
> >> If there's no exception it should be fixed.
> >>
> >>
> >>>>>     From this angle
> >>>>> we feel generalizing PASID management does make some sense.
> >>>>> First, PASID is just a number and not related to any device before
> >>>>> it's bound to a page table and IOMMU domain. Second, PASID is a
> >>>>> global resource (at least on Intel VT-d),
> >>>> I think we need a definition of "global" here. It looks to me for
> >>>> vt-d the PASID table is per device.
> >>> PASID table is per device, thus VT-d could support per-device PASIDs
> >>> in concept.
> >>
> >> I think that's the requirement of PCIE spec which said PASID + RID
> >> identifies the process address space ID.
> >>
> >>
> >>>    However on Intel platform we require PASIDs to be managed in
> >>> system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
> >>> and ENQCMD together.
> >>
> >> Any reason for such requirement? (I'm not familiar with ENQCMD, but
> >> my understanding is that vSVA, SIOV or SR-IOV doesn't have the
> >> requirement for system-wide PASID).
> > ENQCMD is a new instruction to allow multiple processes submitting
> > workload to one shared workqueue. Each process has an unique PASID
> > saved in a MSR, which is included in the ENQCMD payload to indicate
> > the address space when the CPU sends to the device. As one process
> > might issue ENQCMD to multiple devices, OS-wide PASID allocation is
> > required both in host and guest side.
> >
> > When executing ENQCMD in the guest to a SIOV device, the guest
> > programmed value in the PASID_MSR must be translated to a host PASID
> > value for proper function/isolation as PASID represents the address
> > space. The translation is done through a new VMCS PASID translation
> > structure (per-VM, and 1:1 mapping). From this angle the host PASIDs
> > must be allocated 'globally' cross all assigned devices otherwise it
> > may lead to 1:N mapping when a guest process issues ENQCMD to multiple
> > assigned devices/subdevices.
> >
> > There will be a KVM forum session for this topic btw.
> 
> 
> Thanks for the background. Now I see the restrict comes from ENQCMD.
> 
> 
> >
> >>
> >>> Thus the host creates only one 'global' PASID namespace but do use
> >>> per-device PASID table to assure isolation between devices on Intel
> >>> platforms. But ARM does it differently as Jean explained.
> >>> They have a global namespace for host processes on all host-owned
> >>> devices (same as Intel), but then per-device namespace when a device
> >>> (and its PASID table) is assigned to userspace.
> >>>
> >>>> Another question, is this possible to have two DMAR hardware
> >>>> unit(at least I can see two even in my laptop). In this case, is
> >>>> PASID still a global resource?
> >>> yes
> >>>
> >>>>>     while having separate VFIO/
> >>>>> VDPA allocation interfaces may easily cause confusion in
> >>>>> userspace, e.g. which interface to be used if both VFIO/VDPA devices exist.
> >>>>> Moreover, an unified interface allows centralized control over how
> >>>>> many PASIDs are allowed per process.
> >>>> Yes.
> >>>>
> >>>>
> >>>>> One unclear part with this generalization is about the permission.
> >>>>> Do we open this interface to any process or only to those which
> >>>>> have assigned devices? If the latter, what would be the mechanism
> >>>>> to coordinate between this new interface and specific passthrough
> >>>>> frameworks?
> >>>> I'm not sure, but if you just want a permission, you probably can
> >>>> introduce new capability (CAP_XXX) for this.
> >>>>
> >>>>
> >>>>>     A more tricky case, vSVA support on ARM (Eric/Jean please
> >>>>> correct me) plans to do per-device PASID namespace which is built
> >>>>> on a bind_pasid_table iommu callback to allow guest fully manage
> >>>>> its PASIDs on a given passthrough device.
> >>>> I see, so I think the answer is to prepare for the namespace
> >>>> support from the start. (btw, I don't see how namespace is handled
> >>>> in current IOASID module?)
> >>> The PASID table is based on GPA when nested translation is enabled
> >>> on ARM SMMU. This design implies that the guest manages PASID table
> >>> thus PASIDs instead of going through host-side API on assigned
> >>> device. From this angle we don't need explicit namespace in the host
> >>> API. Just need a way to control how many PASIDs a process is allowed
> >>> to allocate in the global namespace. btw IOASID module already has
> >>> 'set' concept per-process and PASIDs are managed per-set. Then the
> >>> quota control can be easily introduced in the 'set' level.
> >>>
> >>>>>     I'm not sure
> >>>>> how such requirement can be unified w/o involving passthrough
> >>>>> frameworks, or whether ARM could also switch to global PASID
> >>>>> style...
> >>>>>
> >>>>> Second, IOMMU nested translation is a per IOMMU domain capability.
> >>>>> Since IOMMU domains are managed by VFIO/VDPA
> >>>>>     (alloc/free domain, attach/detach device, set/get domain
> >>>>> attribute, etc.), reporting/enabling the nesting capability is an
> >>>>> natural extension to the domain uAPI of existing passthrough frameworks.
> >>>>> Actually, VFIO already includes a nesting enable interface even
> >>>>> before this series. So it doesn't make sense to generalize this
> >>>>> uAPI out.
> >>>> So my understanding is that VFIO already:
> >>>>
> >>>> 1) use multiple fds
> >>>> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
> >>>> 3) provides API to associated devices/group with a container
> >>>>
> >>>> And all the proposal in this series is to reuse the container fd.
> >>>> It should be possible to replace e.g type1 IOMMU with a unified module.
> >>> yes, this is the alternative option that I raised in the last paragraph.
> >>>
> >>>>> Then the tricky part comes with the remaining operations (3/4/5),
> >>>>> which are all backed by iommu_ops thus effective only within an
> >>>>> IOMMU domain. To generalize them, the first thing is to find a way
> >>>>> to associate the sva_FD (opened through generic /dev/sva) with an
> >>>>> IOMMU domain that is created by VFIO/VDPA. The second thing is to
> >>>>> replicate {domain<->device/subdevice} association in /dev/sva path
> >>>>> because some operations (e.g. page fault) is triggered/handled per
> >>>>> device/subdevice.
> >>>> Is there any reason that the #PF can not be handled via SVA fd?
> >>> using per-device FDs or multiplexing all fault info through one
> >>> sva_FD is just an implementation choice. The key is to mark faults
> >>> per device/ subdevice thus anyway requires a userspace-visible
> >>> handle/tag to represent device/subdevice and the domain/device
> >>> association must be constructed in this new path.
> >>
> >> I don't get why it requires a userspace-visible handle/tag. The
> >> binding between SVA fd and device fd could be done either explicitly
> >> or implicitly. So userspace know which (sub)device that this SVA fd is for.
> >>
> >>
> >>>>>     Therefore, /dev/sva must provide both per- domain and
> >>>>> per-device uAPIs similar to what VFIO/VDPA already does. Moreover,
> >>>>> mapping page fault to subdevice requires pre- registering
> >>>>> subdevice fault data to IOMMU layer when binding guest page table,
> >>>>> while such fault data can be only retrieved from parent driver
> >>>>> through VFIO/VDPA.
> >>>>>
> >>>>> However, we failed to find a good way even at the 1st step about
> >>>>> domain association. The iommu domains are not exposed to the
> >>>>> userspace, and there is no 1:1 mapping between domain and device.
> >>>>> In VFIO, all devices within the same VFIO container share the
> >>>>> address space but they may be organized in multiple IOMMU domains
> >>>>> based on their bus type. How (should we let) the userspace know
> >>>>> the domain information and open an sva_FD for each domain is the
> >>>>> main problem here.
> >>>> The SVA fd is not necessarily opened by userspace. It could be get
> >>>> through subsystem specific uAPIs.
> >>>>
> >>>> E.g for vDPA if a vDPA device contains several vSVA-capable
> >>>> domains, we
> >> can:
> >>>> 1) introduce uAPI for userspace to know the number of vSVA-capable
> >>>> domain
> >>>> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each
> >>>> vSVA-capable domain
> >>> and also new interface to notify userspace when a domain disappears
> >>> or a device is detached?
> >>
> >> You need to deal with this case even in VFIO, isn't it?
> > No. VFIO doesn't expose domain knowledge to userspace.
> 
> 
> Neither did the above API I think.
> 
> 
> >
> >>
> >>>    Finally looks we are creating a completely set of new subsystem
> >>> specific uAPIs just for generalizing another set of subsystem
> >>> specific uAPIs. Remember after separating PASID mgmt.
> >>> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
> >>> API. Replicating them is much easier logic than developing a new
> >>> glue mechanism in each subsystem.
> >>
> >> As discussed, the point is more than just simple generalizing. It's
> >> about the limitation of current uAPI. So I have the following questions:
> >>
> >> Do we want a single PASID to be used by more than one devices?
> > Yes.
> >
> >> If yes, do we want those devices to share I/O page tables?
> > Yes.
> >
> >> If yes, which uAPI is  used to program the shared I/O page tables?
> >>
> > Page table binding needs to be done per-device, so the userspace will
> > use VFIO uAPI for VFIO device and vDPA uAPI for vDPA device.
> 
> 
> Any design considerations for this, I think it should be done per PASID instead
> (consider PASID is a global resource)?

Per device and per PASID. You may have a look at the arch below: the PASID
table is per device, and the binding of a page table is set in the PASID
table entry.

"
In VT-d implementation, PASID table is per device and maintained in the host.
Guest PASID table is shadowed in VMM where virtual IOMMU is emulated.

    .-------------.  .---------------------------.
    |   vIOMMU    |  | Guest process CR3, FL only|
    |             |  '---------------------------'
    .----------------/
    | PASID Entry |--- PASID cache flush -
    '-------------'                       |
    |             |                       V
    |             |                CR3 in GPA
    '-------------'
Guest
------| Shadow |--------------------------|--------
      v        v                          v
Host
    .-------------.  .----------------------.
    |   pIOMMU    |  | Bind FL for GVA-GPA  |
    |             |  '----------------------'
    .----------------/  |
    | PASID Entry |     V (Nested xlate)
    '----------------\.------------------------------.
    |             |   |SL for GPA-HPA, default domain|
    |             |   '------------------------------'
    '-------------'
Where:
 - FL = First level/stage one page tables
 - SL = Second level/stage two page tables
"
copied from https://lwn.net/Articles/807506/
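
To make "per device and per PASID" concrete, below is a rough illustration
of the data such a bind carries down for one PASID entry. The field names
are only loosely modeled on the nested-mode bind data and are not the exact
uAPI structures of this series:

  #include <linux/types.h>

  /*
   * Illustrative only: what the VMM hands down when shadowing one guest
   * PASID entry of one device.
   */
  struct gpasid_bind {
          __u64   gpgd;           /* guest FL page table root, in GPA */
          __u32   hpasid;         /* host PASID: index into this device's PASID table */
          __u32   gpasid;         /* guest PASID seen through the vIOMMU */
          __u32   addr_width;     /* guest address width */
          __u64   flags;          /* format/vendor specific bits */
  };

  /* the SL (GPA->HPA) stays whatever the device's default domain already has */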

> 
> > The binding request is initiated by the virtual IOMMU, when capturing
> > guest attempt of binding page table to a virtual PASID entry for a
> > given device.
> 
> 
> And for L2 page table programming, if PASID is use by both e.g VFIO and
> vDPA, user need to choose one of uAPI to build l2 mappings?

For L2 page table mappings, it's done by VFIO MAP/UNMAP; for vDPA, I guess
it is the IOTLB interface. So you are right. Keeping L1/L2 page table
management in a single uAPI set is also a reason for my current series,
which extends VFIO for L1 management.
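
For reference, keeping both levels under one uAPI set looks roughly like
below from userspace. VFIO_IOMMU_MAP_DMA is the existing type1 ioctl; the
L1 bind ioctl named at the end is a placeholder for what this series adds,
so please read it as an assumption rather than the final name:

  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  /* L2 (GPA->HPA): the existing type1 path, GPA used as IOVA in nested mode */
  static int map_l2(int container_fd, __u64 gpa, __u64 hva, __u64 size)
  {
          struct vfio_iommu_type1_dma_map map = {
                  .argsz = sizeof(map),
                  .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
                  .vaddr = hva,   /* HVA backing the guest memory */
                  .iova  = gpa,
                  .size  = size,
          };

          return ioctl(container_fd, VFIO_IOMMU_MAP_DMA, &map);
  }

  /*
   * L1 (GVA->GPA) is bound per device/PASID through the nesting uAPI this
   * series adds on the same container fd, e.g. (placeholder ioctl name):
   *
   *      ioctl(container_fd, VFIO_IOMMU_BIND_GUEST_PGTBL, &bind);
   */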

Regards,
Yi Liu

> Thanks
> 
> 
> >
> > Thanks
> > Kevin
> >
> >>
> >>>>> In the end we just realized that doing such generalization doesn't
> >>>>> really lead to a clear design and instead requires tight coordination
> >>>>> between /dev/sva and VFIO/VDPA for almost every new uAPI
> >>>>> (especially about synchronization when the domain/device
> >>>>> association is changed or when the device/subdevice is being reset/
> >>>>> drained). Finally it may become a usability burden to the userspace
> >>>>> on proper use of the two interfaces on the assigned device.
> >>>>>
> >>>>> Based on above analysis we feel that just generalizing PASID mgmt.
> >>>>> might be a good thing to look at while the remaining operations are
> >>>>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
> >>>>> just a subset of the page table management capabilities that an
> >>>>> IOMMU domain affords. Since all other aspects of the IOMMU domain
> >>>>> is managed by VFIO/VDPA already, continuing this path for new nesting
> >>>>> capability sounds natural. There is another option by generalizing the
> >>>>> entire IOMMU domain management (sort of the entire vfio_iommu_
> >>>>> type1), but it's unclear whether such intrusive change is worthwhile
> >>>>> (especially when VFIO/VDPA already goes different route even in legacy
> >>>>> mapping uAPI: map/unmap vs. IOTLB).
> >>>>>
> >>>>> Thoughts?
> >>>> I'm ok with starting with a unified PASID management and consider the
> >>>> unified vSVA/vIOMMU uAPI later.
> >>>>
> >>> Glad to see that we have consensus here. :)
> >>>
> >>> Thanks
> >>> Kevin


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-14  3:16 ` Tian, Kevin
@ 2020-10-16 15:36   ` Jason Gunthorpe
  2020-10-19  8:39     ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-16 15:36 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Wed, Oct 14, 2020 at 03:16:22AM +0000, Tian, Kevin wrote:
> Hi, Alex and Jason (G),
> 
> How about your opinion for this new proposal? For now looks both
> Jason (W) and Jean are OK with this direction and more discussions
> are possibly required for the new /dev/ioasid interface. Internally 
> we're doing a quick prototype to see any unforeseen issue with this
> separation. 

Assuming VDPA and VFIO will be the only two users so duplicating
everything only twice sounds pretty restricting to me.

> > Second, IOMMU nested translation is a per IOMMU domain
> > capability. Since IOMMU domains are managed by VFIO/VDPA
> >  (alloc/free domain, attach/detach device, set/get domain attribute,
> > etc.), reporting/enabling the nesting capability is an natural
> > extension to the domain uAPI of existing passthrough frameworks.
> > Actually, VFIO already includes a nesting enable interface even
> > before this series. So it doesn't make sense to generalize this uAPI
> > out.

The subsystem that obtains an IOMMU domain for a device would have to
register it with an open FD of the '/dev/sva'. That is the connection
between the two subsystems. It would be some simple kernel internal
stuff:

  sva = get_sva_from_file(fd);
  sva_register_device_to_pasid(sva, pasid, pci_device, iommu_domain);

Not sure why this is a roadblock?

How would this be any different from having some kernel libsva that
VDPA and VFIO would both rely on?

You don't plan to just open code all this stuff in VFIO, do you?
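
Spelled out a little more, the glue might look like below. Every helper and
type name here is hypothetical; it is only meant to show the shape, not a
real API:

  /* kernel-internal sketch, all names hypothetical */
  int vfio_sva_attach(struct vfio_device *vdev, int sva_fd, u32 pasid)
  {
          struct iommu_domain *domain = vfio_device_domain(vdev); /* hypothetical */
          struct sva *sva;
          int ret;

          sva = get_sva_from_file(sva_fd);
          if (IS_ERR(sva))
                  return PTR_ERR(sva);

          /* ties {device, iommu_domain, pasid} together inside /dev/sva */
          ret = sva_register_device_to_pasid(sva, pasid, vdev->dev, domain);
          if (ret)
                  sva_put(sva);   /* hypothetical */
          return ret;
  }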

> > Then the tricky part comes with the remaining operations (3/4/5),
> > which are all backed by iommu_ops thus effective only within an
> > IOMMU domain. To generalize them, the first thing is to find a way
> > to associate the sva_FD (opened through generic /dev/sva) with an
> > IOMMU domain that is created by VFIO/VDPA. The second thing is
> > to replicate {domain<->device/subdevice} association in /dev/sva
> > path because some operations (e.g. page fault) is triggered/handled
> > per device/subdevice. Therefore, /dev/sva must provide both per-
> > domain and per-device uAPIs similar to what VFIO/VDPA already
> > does. 

Yes, the point here was to move the general APIs out of VFIO and into
a sharable location. So, of course one would expect some duplication
during the transition period.

> > Moreover, mapping page fault to subdevice requires pre-
> > registering subdevice fault data to IOMMU layer when binding
> > guest page table, while such fault data can be only retrieved from
> > parent driver through VFIO/VDPA.

Not sure what this means, page fault should be tied to the PASID, any
hookup needed for that should be done in-kernel when the device is
connected to the PASID.
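
For instance, the in-kernel hookup could use the per-device fault reporting
API that already exists in the IOMMU layer; a rough sketch (the relay
context and queueing helper are hypothetical):

  #include <linux/iommu.h>

  /* sketch: hooked up when the device is connected to the PASID */
  static int sva_iommu_fault_handler(struct iommu_fault *fault, void *data)
  {
          struct sva *sva = data;         /* hypothetical per-bind context */

          if (fault->type != IOMMU_FAULT_PAGE_REQ)
                  return -EOPNOTSUPP;

          /* fault->prm carries pasid/addr/perm/grpid; relay to the vIOMMU owner */
          return sva_queue_page_request(sva, &fault->prm);        /* hypothetical */
  }

  /*
   * at bind time, in-kernel:
   *      iommu_register_device_fault_handler(dev, sva_iommu_fault_handler, sva);
   */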

> > space but they may be organized in multiple IOMMU domains based
> > on their bus type. How (should we let) the userspace know the
> > domain information and open an sva_FD for each domain is the main
> > problem here.

Why is one sva_FD per iommu domain required? The HW can attach the
same PASID to multiple iommu domains, right?

> > In the end we just realized that doing such generalization doesn't
> > really lead to a clear design and instead requires tight coordination
> > between /dev/sva and VFIO/VDPA for almost every new uAPI
> > (especially about synchronization when the domain/device
> > association is changed or when the device/subdevice is being reset/
> > drained). Finally it may become a usability burden to the userspace
> > on proper use of the two interfaces on the assigned device.

If you have a list of things that needs to be done to attach a PCI
device to a PASID then of course they should be tidy kernel APIs
already, and not just hard wired into VFIO.

The worst outcome would be to have VDPA and VFIO have two different
ways to do all of this, with a different set of bugs. Bug fixes/new
features in VFIO won't flow over to VDPA.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-16 15:36   ` Jason Gunthorpe
@ 2020-10-19  8:39     ` Liu, Yi L
  2020-10-19 14:25       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-19  8:39 UTC (permalink / raw)
  To: Jason Gunthorpe, Tian, Kevin
  Cc: Jason Wang, alex.williamson, eric.auger, baolu.lu, joro,
	jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin

Hi Jason,

Good to see your response.

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Friday, October 16, 2020 11:37 PM
> 
> On Wed, Oct 14, 2020 at 03:16:22AM +0000, Tian, Kevin wrote:
> > Hi, Alex and Jason (G),
> >
> > How about your opinion for this new proposal? For now looks both
> > Jason (W) and Jean are OK with this direction and more discussions
> > are possibly required for the new /dev/ioasid interface. Internally
> > we're doing a quick prototype to see any unforeseen issue with this
> > separation.
> 
> Assuming VDPA and VFIO will be the only two users so duplicating
> everything only twice sounds pretty restricting to me.
> 
> > > Second, IOMMU nested translation is a per IOMMU domain
> > > capability. Since IOMMU domains are managed by VFIO/VDPA
> > >  (alloc/free domain, attach/detach device, set/get domain attribute,
> > > etc.), reporting/enabling the nesting capability is an natural
> > > extension to the domain uAPI of existing passthrough frameworks.
> > > Actually, VFIO already includes a nesting enable interface even
> > > before this series. So it doesn't make sense to generalize this uAPI
> > > out.
> 
> The subsystem that obtains an IOMMU domain for a device would have to
> register it with an open FD of the '/dev/sva'. That is the connection
> between the two subsystems. It would be some simple kernel internal
> stuff:
> 
>   sva = get_sva_from_file(fd);

Is this fd provided by userspace? I suppose /dev/sva has a set of uAPIs
which will finally program page tables into the host iommu driver. That
seems awkward for a VFIO user: why should a VFIO user connect to a /dev/sva
fd after it has already set a proper iommu type on the opened container?
The VFIO container already stands for an iommu context through which
userspace can program page mappings into the host iommu.
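
i.e. today's flow is roughly the below, and the container fd is the natural
place to hang any new page-table uAPI (condensed, error checks omitted):

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/vfio.h>

  static int setup_container(void)
  {
          int container = open("/dev/vfio/vfio", O_RDWR);
          int group = open("/dev/vfio/26", O_RDWR);       /* "26" is just an example */

          ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
          ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1v2_IOMMU);

          /* from here on, VFIO_IOMMU_MAP_DMA on the container programs mappings */
          return container;
  }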

>   sva_register_device_to_pasid(sva, pasid, pci_device, iommu_domain);

So this is supposed to be called by VFIO/VDPA to register the info with
/dev/sva, right? And /dev/sva will then also maintain the device/iommu_domain
and pasid info? Won't that be duplicated with VFIO/VDPA?

> Not sure why this is a roadblock?
> 
> How would this be any different from having some kernel libsva that
> VDPA and VFIO would both rely on?
> 
> You don't plan to just open code all this stuff in VFIO, do you?
> 
> > > Then the tricky part comes with the remaining operations (3/4/5),
> > > which are all backed by iommu_ops thus effective only within an
> > > IOMMU domain. To generalize them, the first thing is to find a way
> > > to associate the sva_FD (opened through generic /dev/sva) with an
> > > IOMMU domain that is created by VFIO/VDPA. The second thing is
> > > to replicate {domain<->device/subdevice} association in /dev/sva
> > > path because some operations (e.g. page fault) is triggered/handled
> > > per device/subdevice. Therefore, /dev/sva must provide both per-
> > > domain and per-device uAPIs similar to what VFIO/VDPA already
> > > does.
> 
> Yes, the point here was to move the general APIs out of VFIO and into
> a sharable location. So, of course one would expect some duplication
> during the transition period.
> 
> > > Moreover, mapping page fault to subdevice requires pre-
> > > registering subdevice fault data to IOMMU layer when binding
> > > guest page table, while such fault data can be only retrieved from
> > > parent driver through VFIO/VDPA.
> 
> Not sure what this means, page fault should be tied to the PASID, any
> hookup needed for that should be done in-kernel when the device is
> connected to the PASID.

You may refer to chapter 7.4.1.1 of the VT-d spec. A page request is reported
to software together with the requestor id of the device. For the page request
to be injected into the guest, it needs the device info.
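
In other words, whoever relays the fault has to carry the device identity
along; an illustration with made-up names only:

  /* illustrative only, all names made up */
  struct page_req {
          u32 rid;        /* requestor id (bus/dev/fn) reported by the pIOMMU */
          u32 pasid;
          u64 addr;
          u32 perm;
  };

  static void inject_page_req(struct vm *vm, struct page_req *req)
  {
          /* only the passthrough framework knows host RID -> guest BDF */
          u32 vbdf = host_rid_to_guest_bdf(vm, req->rid);         /* hypothetical */

          viommu_report_page_req(vm, vbdf, req->pasid, req->addr, req->perm);
  }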

Regards,
Yi Liu

> 
> > > space but they may be organized in multiple IOMMU domains based
> > > on their bus type. How (should we let) the userspace know the
> > > domain information and open an sva_FD for each domain is the main
> > > problem here.
> 
> Why is one sva_FD per iommu domain required? The HW can attach the
> same PASID to multiple iommu domains, right?
> 
> > > In the end we just realized that doing such generalization doesn't
> > > really lead to a clear design and instead requires tight coordination
> > > between /dev/sva and VFIO/VDPA for almost every new uAPI
> > > (especially about synchronization when the domain/device
> > > association is changed or when the device/subdevice is being reset/
> > > drained). Finally it may become a usability burden to the userspace
> > > on proper use of the two interfaces on the assigned device.
> 
> If you have a list of things that needs to be done to attach a PCI
> device to a PASID then of course they should be tidy kernel APIs
> already, and not just hard wired into VFIO.
> 
> The worst outcome would be to have VDPA and VFIO have to different
> ways to do all of this with a different set of bugs. Bug fixes/new
> features in VFIO won't flow over to VDPA.
> 
> Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-19  8:39     ` Liu, Yi L
@ 2020-10-19 14:25       ` Jason Gunthorpe
  2020-10-20 10:21         ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-19 14:25 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jason Wang, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Mon, Oct 19, 2020 at 08:39:03AM +0000, Liu, Yi L wrote:
> Hi Jason,
> 
> Good to see your response.

Ah, I was away

> > > > Second, IOMMU nested translation is a per IOMMU domain
> > > > capability. Since IOMMU domains are managed by VFIO/VDPA
> > > >  (alloc/free domain, attach/detach device, set/get domain attribute,
> > > > etc.), reporting/enabling the nesting capability is an natural
> > > > extension to the domain uAPI of existing passthrough frameworks.
> > > > Actually, VFIO already includes a nesting enable interface even
> > > > before this series. So it doesn't make sense to generalize this uAPI
> > > > out.
> > 
> > The subsystem that obtains an IOMMU domain for a device would have to
> > register it with an open FD of the '/dev/sva'. That is the connection
> > between the two subsystems. It would be some simple kernel internal
> > stuff:
> > 
> >   sva = get_sva_from_file(fd);
> 
> Is this fd provided by userspace? I suppose the /dev/sva has a set of uAPIs
> which will finally program page table to host iommu driver. As far as I know,
> it's weird for VFIO user. Why should VFIO user connect to a /dev/sva fd after
> it sets a proper iommu type to the opened container. VFIO container already
> stands for an iommu context with which userspace could program page mapping
> to host iommu.

Again the point is to dis-aggregate the vIOMMU related stuff from VFIO
so it can be shared between more subsystems that need it. I'm sure
there will be some weird overlaps because we can't delete any of the
existing VFIO APIs, but that should not be a blocker.

Having VFIO run in a mode where '/dev/sva' provides all the IOMMU
handling is a possible path.

If your plan is to just open code everything into VFIO then I don't see
how VDPA will work well; and if proper in-kernel abstractions are
built, I fail to see how routing some of it through userspace is a
fundamental problem.

> >   sva_register_device_to_pasid(sva, pasid, pci_device, iommu_domain);
> 
> So this is supposed to be called by VFIO/VDPA to register the info to /dev/sva.
> right? And in dev/sva, it will also maintain the device/iommu_domain and pasid
> info? will it be duplicated with VFIO/VDPA?

Each part needs to have the information it needs? 

> > > > Moreover, mapping page fault to subdevice requires pre-
> > > > registering subdevice fault data to IOMMU layer when binding
> > > > guest page table, while such fault data can be only retrieved from
> > > > parent driver through VFIO/VDPA.
> > 
> > Not sure what this means, page fault should be tied to the PASID, any
> > hookup needed for that should be done in-kernel when the device is
> > connected to the PASID.
> 
> you may refer to chapter 7.4.1.1 of VT-d spec. Page request is reported to
> software together with the requestor id of the device. For the page request
> injects to guest, it should have the device info.

Whoever provides the vIOMMU emulation and relays the page fault to the
guest has to translate the RID - what does that have to do with VFIO?

How will VPDA provide the vIOMMU emulation?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-15 10:14           ` Liu, Yi L
@ 2020-10-20  6:18             ` Jason Wang
  2020-10-20  8:19               ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-10-20  6:18 UTC (permalink / raw)
  To: Liu, Yi L, Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin


On 2020/10/15 下午6:14, Liu, Yi L wrote:
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, October 15, 2020 4:41 PM
>>
>>
>> On 2020/10/15 下午3:58, Tian, Kevin wrote:
>>>> From: Jason Wang <jasowang@redhat.com>
>>>> Sent: Thursday, October 15, 2020 2:52 PM
>>>>
>>>>
>>>> On 2020/10/14 上午11:08, Tian, Kevin wrote:
>>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>>> Sent: Tuesday, October 13, 2020 2:22 PM
>>>>>>
>>>>>>
>>>>>> On 2020/10/12 下午4:38, Tian, Kevin wrote:
>>>>>>>> From: Jason Wang <jasowang@redhat.com>
>>>>>>>> Sent: Monday, September 14, 2020 12:20 PM
>>>>>>>>
>>>>>>> [...]
>>>>>>>      > If it's possible, I would suggest a generic uAPI instead of
>>>>>>> a VFIO
>>>>>>>> specific one.
>>>>>>>>
>>>>>>>> Jason suggest something like /dev/sva. There will be a lot of
>>>>>>>> other subsystems that could benefit from this (e.g vDPA).
>>>>>>>>
>>>>>>>> Have you ever considered this approach?
>>>>>>>>
>>>>>>> Hi, Jason,
>>>>>>>
>>>>>>> We did some study on this approach and below is the output. It's a
>>>>>>> long writing but I didn't find a way to further abstract w/o
>>>>>>> losing necessary context. Sorry about that.
>>>>>>>
>>>>>>> Overall the real purpose of this series is to enable IOMMU nested
>>>>>>> translation capability with vSVA as one major usage, through below
>>>>>>> new uAPIs:
>>>>>>> 	1) Report/enable IOMMU nested translation capability;
>>>>>>> 	2) Allocate/free PASID;
>>>>>>> 	3) Bind/unbind guest page table;
>>>>>>> 	4) Invalidate IOMMU cache;
>>>>>>> 	5) Handle IOMMU page request/response (not in this series);
>>>>>>> 1/3/4) is the minimal set for using IOMMU nested translation, with
>>>>>>> the other two optional. For example, the guest may enable vSVA on
>>>>>>> a device without using PASID. Or, it may bind its gIOVA page table
>>>>>>> which doesn't require page fault support. Finally, all operations
>>>>>>> can be applied to either physical device or subdevice.
>>>>>>>
>>>>>>> Then we evaluated each uAPI whether generalizing it is a good
>>>>>>> thing both in concept and regarding to complexity.
>>>>>>>
>>>>>>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
>>>>>>> allocation/free is through the IOASID sub-system.
>>>>>> A question here, is IOASID expected to be the single management
>>>>>> interface for PASID?
>>>>> yes
>>>>>
>>>>>> (I'm asking since there're already vendor specific IDA based PASID
>>>>>> allocator e.g amdgpu_pasid_alloc())
>>>>> That comes before IOASID core was introduced. I think it should be
>>>>> changed to use the new generic interface. Jacob/Jean can better
>>>>> comment if other reason exists for this exception.
>>>> If there's no exception it should be fixed.
>>>>
>>>>
>>>>>>>      From this angle
>>>>>>> we feel generalizing PASID management does make some sense.
>>>>>>> First, PASID is just a number and not related to any device before
>>>>>>> it's bound to a page table and IOMMU domain. Second, PASID is a
>>>>>>> global resource (at least on Intel VT-d),
>>>>>> I think we need a definition of "global" here. It looks to me for
>>>>>> vt-d the PASID table is per device.
>>>>> PASID table is per device, thus VT-d could support per-device PASIDs
>>>>> in concept.
>>>> I think that's the requirement of PCIE spec which said PASID + RID
>>>> identifies the process address space ID.
>>>>
>>>>
>>>>>     However on Intel platform we require PASIDs to be managed in
>>>>> system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
>>>>> and ENQCMD together.
>>>> Any reason for such requirement? (I'm not familiar with ENQCMD, but
>>>> my understanding is that vSVA, SIOV or SR-IOV doesn't have the
>>>> requirement for system-wide PASID).
>>> ENQCMD is a new instruction to allow multiple processes submitting
>>> workload to one shared workqueue. Each process has an unique PASID
>>> saved in a MSR, which is included in the ENQCMD payload to indicate
>>> the address space when the CPU sends to the device. As one process
>>> might issue ENQCMD to multiple devices, OS-wide PASID allocation is
>>> required both in host and guest side.
>>>
>>> When executing ENQCMD in the guest to a SIOV device, the guest
>>> programmed value in the PASID_MSR must be translated to a host PASID
>>> value for proper function/isolation as PASID represents the address
>>> space. The translation is done through a new VMCS PASID translation
>>> structure (per-VM, and 1:1 mapping). From this angle the host PASIDs
>>> must be allocated 'globally' cross all assigned devices otherwise it
>>> may lead to 1:N mapping when a guest process issues ENQCMD to multiple
>>> assigned devices/subdevices.
>>>
>>> There will be a KVM forum session for this topic btw.
>>
>> Thanks for the background. Now I see the restrict comes from ENQCMD.
>>
>>
>>>>> Thus the host creates only one 'global' PASID namespace but do use
>>>>> per-device PASID table to assure isolation between devices on Intel
>>>>> platforms. But ARM does it differently as Jean explained.
>>>>> They have a global namespace for host processes on all host-owned
>>>>> devices (same as Intel), but then per-device namespace when a device
>>>>> (and its PASID table) is assigned to userspace.
>>>>>
>>>>>> Another question, is this possible to have two DMAR hardware
>>>>>> unit(at least I can see two even in my laptop). In this case, is
>>>>>> PASID still a global resource?
>>>>> yes
>>>>>
>>>>>>>      while having separate VFIO/
>>>>>>> VDPA allocation interfaces may easily cause confusion in
>>>>>>> userspace, e.g. which interface to be used if both VFIO/VDPA devices exist.
>>>>>>> Moreover, an unified interface allows centralized control over how
>>>>>>> many PASIDs are allowed per process.
>>>>>> Yes.
>>>>>>
>>>>>>
>>>>>>> One unclear part with this generalization is about the permission.
>>>>>>> Do we open this interface to any process or only to those which
>>>>>>> have assigned devices? If the latter, what would be the mechanism
>>>>>>> to coordinate between this new interface and specific passthrough
>>>>>>> frameworks?
>>>>>> I'm not sure, but if you just want a permission, you probably can
>>>>>> introduce new capability (CAP_XXX) for this.
>>>>>>
>>>>>>
>>>>>>>      A more tricky case, vSVA support on ARM (Eric/Jean please
>>>>>>> correct me) plans to do per-device PASID namespace which is built
>>>>>>> on a bind_pasid_table iommu callback to allow guest fully manage
>>>>>>> its PASIDs on a given passthrough device.
>>>>>> I see, so I think the answer is to prepare for the namespace
>>>>>> support from the start. (btw, I don't see how namespace is handled
>>>>>> in current IOASID module?)
>>>>> The PASID table is based on GPA when nested translation is enabled
>>>>> on ARM SMMU. This design implies that the guest manages PASID table
>>>>> thus PASIDs instead of going through host-side API on assigned
>>>>> device. From this angle we don't need explicit namespace in the host
>>>>> API. Just need a way to control how many PASIDs a process is allowed
>>>>> to allocate in the global namespace. btw IOASID module already has
>>>>> 'set' concept per-process and PASIDs are managed per-set. Then the
>>>>> quota control can be easily introduced in the 'set' level.
>>>>>
>>>>>>>      I'm not sure
>>>>>>> how such requirement can be unified w/o involving passthrough
>>>>>>> frameworks, or whether ARM could also switch to global PASID
>>>>>>> style...
>>>>>>>
>>>>>>> Second, IOMMU nested translation is a per IOMMU domain capability.
>>>>>>> Since IOMMU domains are managed by VFIO/VDPA
>>>>>>>      (alloc/free domain, attach/detach device, set/get domain
>>>>>>> attribute, etc.), reporting/enabling the nesting capability is an
>>>>>>> natural extension to the domain uAPI of existing passthrough frameworks.
>>>>>>> Actually, VFIO already includes a nesting enable interface even
>>>>>>> before this series. So it doesn't make sense to generalize this
>>>>>>> uAPI out.
>>>>>> So my understanding is that VFIO already:
>>>>>>
>>>>>> 1) use multiple fds
>>>>>> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
>>>>>> 3) provides API to associated devices/group with a container
>>>>>>
>>>>>> And all the proposal in this series is to reuse the container fd.
>>>>>> It should be possible to replace e.g type1 IOMMU with a unified module.
>>>>> yes, this is the alternative option that I raised in the last paragraph.
>>>>>
>>>>>>> Then the tricky part comes with the remaining operations (3/4/5),
>>>>>>> which are all backed by iommu_ops thus effective only within an
>>>>>>> IOMMU domain. To generalize them, the first thing is to find a way
>>>>>>> to associate the sva_FD (opened through generic /dev/sva) with an
>>>>>>> IOMMU domain that is created by VFIO/VDPA. The second thing is to
>>>>>>> replicate {domain<->device/subdevice} association in /dev/sva path
>>>>>>> because some operations (e.g. page fault) is triggered/handled per
>>>>>>> device/subdevice.
>>>>>> Is there any reason that the #PF can not be handled via SVA fd?
>>>>> using per-device FDs or multiplexing all fault info through one
>>>>> sva_FD is just an implementation choice. The key is to mark faults
>>>>> per device/ subdevice thus anyway requires a userspace-visible
>>>>> handle/tag to represent device/subdevice and the domain/device
>>>>> association must be constructed in this new path.
>>>> I don't get why it requires a userspace-visible handle/tag. The
>>>> binding between SVA fd and device fd could be done either explicitly
>>>> or implicitly. So userspace know which (sub)device that this SVA fd is for.
>>>>
>>>>
>>>>>>>      Therefore, /dev/sva must provide both per- domain and
>>>>>>> per-device uAPIs similar to what VFIO/VDPA already does. Moreover,
>>>>>>> mapping page fault to subdevice requires pre- registering
>>>>>>> subdevice fault data to IOMMU layer when binding guest page table,
>>>>>>> while such fault data can be only retrieved from parent driver
>>>>>>> through VFIO/VDPA.
>>>>>>>
>>>>>>> However, we failed to find a good way even at the 1st step about
>>>>>>> domain association. The iommu domains are not exposed to the
>>>>>>> userspace, and there is no 1:1 mapping between domain and device.
>>>>>>> In VFIO, all devices within the same VFIO container share the
>>>>>>> address space but they may be organized in multiple IOMMU domains
>>>>>>> based on their bus type. How (should we let) the userspace know
>>>>>>> the domain information and open an sva_FD for each domain is the
>>>>>>> main problem here.
>>>>>> The SVA fd is not necessarily opened by userspace. It could be get
>>>>>> through subsystem specific uAPIs.
>>>>>>
>>>>>> E.g for vDPA if a vDPA device contains several vSVA-capable
>>>>>> domains, we
>>>> can:
>>>>>> 1) introduce uAPI for userspace to know the number of vSVA-capable
>>>>>> domain
>>>>>> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each
>>>>>> vSVA-capable domain
>>>>> and also new interface to notify userspace when a domain disappears
>>>>> or a device is detached?
>>>> You need to deal with this case even in VFIO, isn't it?
>>> No. VFIO doesn't expose domain knowledge to userspace.
>>
>> Neither did the above API I think.
>>
>>
>>>>>     Finally looks we are creating a completely set of new subsystem
>>>>> specific uAPIs just for generalizing another set of subsystem
>>>>> specific uAPIs. Remember after separating PASID mgmt.
>>>>> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
>>>>> API. Replicating them is much easier logic than developing a new
>>>>> glue mechanism in each subsystem.
>>>> As discussed, the point is more than just simple generalizing. It's
>>>> about the limitation of current uAPI. So I have the following questions:
>>>>
>>>> Do we want a single PASID to be used by more than one devices?
>>> Yes.
>>>
>>>> If yes, do we want those devices to share I/O page tables?
>>> Yes.
>>>
>>>> If yes, which uAPI is  used to program the shared I/O page tables?
>>>>
>>> Page table binding needs to be done per-device, so the userspace will
>>> use VFIO uAPI for VFIO device and vDPA uAPI for vDPA device.
>>
>> Any design considerations for this, I think it should be done per PASID instead
>> (consider PASID is a global resource)?
> per device and per PASID. you may have a look from the below arch. PASID
> table is per device, the binding of page table are set to PASID table
> entry.
>
> "
> In VT-d implementation, PASID table is per device and maintained in the host.
> Guest PASID table is shadowed in VMM where virtual IOMMU is emulated.
>
>      .-------------.  .---------------------------.
>      |   vIOMMU    |  | Guest process CR3, FL only|
>      |             |  '---------------------------'
>      .----------------/
>      | PASID Entry |--- PASID cache flush -
>      '-------------'                       |
>      |             |                       V
>      |             |                CR3 in GPA
>      '-------------'
> Guest
> ------| Shadow |--------------------------|--------
>        v        v                          v
> Host
>      .-------------.  .----------------------.
>      |   pIOMMU    |  | Bind FL for GVA-GPA  |
>      |             |  '----------------------'
>      .----------------/  |
>      | PASID Entry |     V (Nested xlate)
>      '----------------\.------------------------------.
>      |             |   |SL for GPA-HPA, default domain|
>      |             |   '------------------------------'
>      '-------------'
> Where:
>   - FL = First level/stage one page tables
>   - SL = Second level/stage two page tables
> "
> copied from https://lwn.net/Articles/807506/


Yes, but since PASID is a global identifier now, I think the kernel should 
track a device list per PASID. So for such binding, the PASID should be 
sufficient for the uAPI.
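
For instance, the kernel side could keep something like the below (sketch
only, all names made up):

  #include <linux/list.h>
  #include <linux/mutex.h>
  #include <linux/xarray.h>

  /* a global PASID -> entry table kept by the kernel */
  struct pasid_entry {
          u32                     pasid;
          struct list_head        devices;        /* every device bound to this PASID */
          struct mutex            lock;
  };

  static DEFINE_XARRAY(pasid_xa);

  /*
   * A bind uAPI keyed only by the PASID could then walk entry->devices
   * internally, instead of userspace repeating the bind per device fd.
   */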


>
>>> The binding request is initiated by the virtual IOMMU, when capturing
>>> guest attempt of binding page table to a virtual PASID entry for a
>>> given device.
>>
>> And for L2 page table programming, if PASID is use by both e.g VFIO and
>> vDPA, user need to choose one of uAPI to build l2 mappings?
> for L2 page table mappings, it's done by VFIO MAP/UNMAP. for vdpa, I guess
> it is tlb flush. so you are right. Keeping L1/L2 page table management in
> a single uAPI set is also a reason for my current series which extends VFIO
> for L1 management.


I'm afraid that would introduce confusion to userspace. E.g.:

1) when having only a vDPA device, it uses the vDPA uAPI to do L2 management
2) when vDPA shares a PASID with VFIO, it will use the VFIO uAPI to do the L2 
management?

Thanks


>
> Regards,
> Yi Liu
>
>> Thanks
>>
>>
>>> Thanks
>>> Kevin
>>>
>>>>>>> In the end we just realized that doing such generalization doesn't
>>>>>>> really lead to a clear design and instead requires tight coordination
>>>>>>> between /dev/sva and VFIO/VDPA for almost every new uAPI
>>>>>>> (especially about synchronization when the domain/device
>>>>>>> association is changed or when the device/subdevice is being reset/
>>>>>>> drained). Finally it may become a usability burden to the userspace
>>>>>>> on proper use of the two interfaces on the assigned device.
>>>>>>>
>>>>>>> Based on above analysis we feel that just generalizing PASID mgmt.
>>>>>>> might be a good thing to look at while the remaining operations are
>>>>>>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
>>>>>>> just a subset of the page table management capabilities that an
>>>>>>> IOMMU domain affords. Since all other aspects of the IOMMU domain
>>>>>>> is managed by VFIO/VDPA already, continuing this path for new nesting
>>>>>>> capability sounds natural. There is another option by generalizing the
>>>>>>> entire IOMMU domain management (sort of the entire vfio_iommu_
>>>>>>> type1), but it's unclear whether such intrusive change is worthwhile
>>>>>>> (especially when VFIO/VDPA already goes different route even in legacy
>>>>>>> mapping uAPI: map/unmap vs. IOTLB).
>>>>>>>
>>>>>>> Thoughts?
>>>>>> I'm ok with starting with a unified PASID management and consider the
>>>>>> unified vSVA/vIOMMU uAPI later.
>>>>>>
>>>>> Glad to see that we have consensus here. :)
>>>>>
>>>>> Thanks
>>>>> Kevin


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20  6:18             ` Jason Wang
@ 2020-10-20  8:19               ` Liu, Yi L
  2020-10-20  9:19                 ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-20  8:19 UTC (permalink / raw)
  To: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

Hey Jason,

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 20, 2020 2:18 PM
> 
> On 2020/10/15 下午6:14, Liu, Yi L wrote:
> >> From: Jason Wang <jasowang@redhat.com>
> >> Sent: Thursday, October 15, 2020 4:41 PM
> >>
> >>
> >> On 2020/10/15 下午3:58, Tian, Kevin wrote:
> >>>> From: Jason Wang <jasowang@redhat.com>
> >>>> Sent: Thursday, October 15, 2020 2:52 PM
> >>>>
> >>>>
> >>>> On 2020/10/14 上午11:08, Tian, Kevin wrote:
> >>>>>> From: Jason Wang <jasowang@redhat.com>
> >>>>>> Sent: Tuesday, October 13, 2020 2:22 PM
> >>>>>>
> >>>>>>
> >>>>>> On 2020/10/12 下午4:38, Tian, Kevin wrote:
> >>>>>>>> From: Jason Wang <jasowang@redhat.com>
> >>>>>>>> Sent: Monday, September 14, 2020 12:20 PM
> >>>>>>>>
> >>>>>>> [...]
> >>>>>>>      > If it's possible, I would suggest a generic uAPI instead of
> >>>>>>> a VFIO
> >>>>>>>> specific one.
> >>>>>>>>
> >>>>>>>> Jason suggest something like /dev/sva. There will be a lot of
> >>>>>>>> other subsystems that could benefit from this (e.g vDPA).
> >>>>>>>>
> >>>>>>>> Have you ever considered this approach?
> >>>>>>>>
> >>>>>>> Hi, Jason,
> >>>>>>>
> >>>>>>> We did some study on this approach and below is the output. It's a
> >>>>>>> long writing but I didn't find a way to further abstract w/o
> >>>>>>> losing necessary context. Sorry about that.
> >>>>>>>
> >>>>>>> Overall the real purpose of this series is to enable IOMMU nested
> >>>>>>> translation capability with vSVA as one major usage, through below
> >>>>>>> new uAPIs:
> >>>>>>> 	1) Report/enable IOMMU nested translation capability;
> >>>>>>> 	2) Allocate/free PASID;
> >>>>>>> 	3) Bind/unbind guest page table;
> >>>>>>> 	4) Invalidate IOMMU cache;
> >>>>>>> 	5) Handle IOMMU page request/response (not in this series);
> >>>>>>> 1/3/4) is the minimal set for using IOMMU nested translation, with
> >>>>>>> the other two optional. For example, the guest may enable vSVA on
> >>>>>>> a device without using PASID. Or, it may bind its gIOVA page table
> >>>>>>> which doesn't require page fault support. Finally, all operations
> >>>>>>> can be applied to either physical device or subdevice.
> >>>>>>>
> >>>>>>> Then we evaluated each uAPI whether generalizing it is a good
> >>>>>>> thing both in concept and regarding to complexity.
> >>>>>>>
> >>>>>>> First, unlike other uAPIs which are all backed by iommu_ops, PASID
> >>>>>>> allocation/free is through the IOASID sub-system.
> >>>>>> A question here, is IOASID expected to be the single management
> >>>>>> interface for PASID?
> >>>>> yes
> >>>>>
> >>>>>> (I'm asking since there're already vendor specific IDA based PASID
> >>>>>> allocator e.g amdgpu_pasid_alloc())
> >>>>> That comes before IOASID core was introduced. I think it should be
> >>>>> changed to use the new generic interface. Jacob/Jean can better
> >>>>> comment if other reason exists for this exception.
> >>>> If there's no exception it should be fixed.
> >>>>
> >>>>
> >>>>>>>      From this angle
> >>>>>>> we feel generalizing PASID management does make some sense.
> >>>>>>> First, PASID is just a number and not related to any device before
> >>>>>>> it's bound to a page table and IOMMU domain. Second, PASID is a
> >>>>>>> global resource (at least on Intel VT-d),
> >>>>>> I think we need a definition of "global" here. It looks to me for
> >>>>>> vt-d the PASID table is per device.
> >>>>> PASID table is per device, thus VT-d could support per-device PASIDs
> >>>>> in concept.
> >>>> I think that's the requirement of PCIE spec which said PASID + RID
> >>>> identifies the process address space ID.
> >>>>
> >>>>
> >>>>>     However on Intel platform we require PASIDs to be managed in
> >>>>> system-wide (cross host and guest) when combining vSVA, SIOV, SR-IOV
> >>>>> and ENQCMD together.
> >>>> Any reason for such requirement? (I'm not familiar with ENQCMD, but
> >>>> my understanding is that vSVA, SIOV or SR-IOV doesn't have the
> >>>> requirement for system-wide PASID).
> >>> ENQCMD is a new instruction to allow multiple processes submitting
> >>> workload to one shared workqueue. Each process has an unique PASID
> >>> saved in a MSR, which is included in the ENQCMD payload to indicate
> >>> the address space when the CPU sends to the device. As one process
> >>> might issue ENQCMD to multiple devices, OS-wide PASID allocation is
> >>> required both in host and guest side.
> >>>
> >>> When executing ENQCMD in the guest to a SIOV device, the guest
> >>> programmed value in the PASID_MSR must be translated to a host PASID
> >>> value for proper function/isolation as PASID represents the address
> >>> space. The translation is done through a new VMCS PASID translation
> >>> structure (per-VM, and 1:1 mapping). From this angle the host PASIDs
> >>> must be allocated 'globally' cross all assigned devices otherwise it
> >>> may lead to 1:N mapping when a guest process issues ENQCMD to multiple
> >>> assigned devices/subdevices.
> >>>
> >>> There will be a KVM forum session for this topic btw.
> >>
> >> Thanks for the background. Now I see the restrict comes from ENQCMD.
> >>
> >>
> >>>>> Thus the host creates only one 'global' PASID namespace but do use
> >>>>> per-device PASID table to assure isolation between devices on Intel
> >>>>> platforms. But ARM does it differently as Jean explained.
> >>>>> They have a global namespace for host processes on all host-owned
> >>>>> devices (same as Intel), but then per-device namespace when a device
> >>>>> (and its PASID table) is assigned to userspace.
> >>>>>
> >>>>>> Another question, is this possible to have two DMAR hardware
> >>>>>> unit(at least I can see two even in my laptop). In this case, is
> >>>>>> PASID still a global resource?
> >>>>> yes
> >>>>>
> >>>>>>>      while having separate VFIO/
> >>>>>>> VDPA allocation interfaces may easily cause confusion in
> >>>>>>> userspace, e.g. which interface to be used if both VFIO/VDPA devices
> exist.
> >>>>>>> Moreover, an unified interface allows centralized control over how
> >>>>>>> many PASIDs are allowed per process.
> >>>>>> Yes.
> >>>>>>
> >>>>>>
> >>>>>>> One unclear part with this generalization is about the permission.
> >>>>>>> Do we open this interface to any process or only to those which
> >>>>>>> have assigned devices? If the latter, what would be the mechanism
> >>>>>>> to coordinate between this new interface and specific passthrough
> >>>>>>> frameworks?
> >>>>>> I'm not sure, but if you just want a permission, you probably can
> >>>>>> introduce new capability (CAP_XXX) for this.
> >>>>>>
> >>>>>>
> >>>>>>>      A more tricky case, vSVA support on ARM (Eric/Jean please
> >>>>>>> correct me) plans to do per-device PASID namespace which is built
> >>>>>>> on a bind_pasid_table iommu callback to allow guest fully manage
> >>>>>>> its PASIDs on a given passthrough device.
> >>>>>> I see, so I think the answer is to prepare for the namespace
> >>>>>> support from the start. (btw, I don't see how namespace is handled
> >>>>>> in current IOASID module?)
> >>>>> The PASID table is based on GPA when nested translation is enabled
> >>>>> on ARM SMMU. This design implies that the guest manages PASID table
> >>>>> thus PASIDs instead of going through host-side API on assigned
> >>>>> device. From this angle we don't need explicit namespace in the host
> >>>>> API. Just need a way to control how many PASIDs a process is allowed
> >>>>> to allocate in the global namespace. btw IOASID module already has
> >>>>> 'set' concept per-process and PASIDs are managed per-set. Then the
> >>>>> quota control can be easily introduced in the 'set' level.
> >>>>>
> >>>>>>>      I'm not sure
> >>>>>>> how such requirement can be unified w/o involving passthrough
> >>>>>>> frameworks, or whether ARM could also switch to global PASID
> >>>>>>> style...
> >>>>>>>
> >>>>>>> Second, IOMMU nested translation is a per IOMMU domain capability.
> >>>>>>> Since IOMMU domains are managed by VFIO/VDPA
> >>>>>>>      (alloc/free domain, attach/detach device, set/get domain
> >>>>>>> attribute, etc.), reporting/enabling the nesting capability is an
> >>>>>>> natural extension to the domain uAPI of existing passthrough
> frameworks.
> >>>>>>> Actually, VFIO already includes a nesting enable interface even
> >>>>>>> before this series. So it doesn't make sense to generalize this
> >>>>>>> uAPI out.
> >>>>>> So my understanding is that VFIO already:
> >>>>>>
> >>>>>> 1) use multiple fds
> >>>>>> 2) separate IOMMU ops to a dedicated container fd (type1 iommu)
> >>>>>> 3) provides API to associated devices/group with a container
> >>>>>>
> >>>>>> And all the proposal in this series is to reuse the container fd.
> >>>>>> It should be possible to replace e.g type1 IOMMU with a unified module.
> >>>>> yes, this is the alternative option that I raised in the last paragraph.
> >>>>>
> >>>>>>> Then the tricky part comes with the remaining operations (3/4/5),
> >>>>>>> which are all backed by iommu_ops thus effective only within an
> >>>>>>> IOMMU domain. To generalize them, the first thing is to find a way
> >>>>>>> to associate the sva_FD (opened through generic /dev/sva) with an
> >>>>>>> IOMMU domain that is created by VFIO/VDPA. The second thing is to
> >>>>>>> replicate {domain<->device/subdevice} association in /dev/sva path
> >>>>>>> because some operations (e.g. page fault) is triggered/handled per
> >>>>>>> device/subdevice.
> >>>>>> Is there any reason that the #PF can not be handled via SVA fd?
> >>>>> using per-device FDs or multiplexing all fault info through one
> >>>>> sva_FD is just an implementation choice. The key is to mark faults
> >>>>> per device/ subdevice thus anyway requires a userspace-visible
> >>>>> handle/tag to represent device/subdevice and the domain/device
> >>>>> association must be constructed in this new path.
> >>>> I don't get why it requires a userspace-visible handle/tag. The
> >>>> binding between SVA fd and device fd could be done either explicitly
> >>>> or implicitly. So userspace know which (sub)device that this SVA fd is for.
> >>>>
> >>>>
> >>>>>>>      Therefore, /dev/sva must provide both per- domain and
> >>>>>>> per-device uAPIs similar to what VFIO/VDPA already does. Moreover,
> >>>>>>> mapping page fault to subdevice requires pre- registering
> >>>>>>> subdevice fault data to IOMMU layer when binding guest page table,
> >>>>>>> while such fault data can be only retrieved from parent driver
> >>>>>>> through VFIO/VDPA.
> >>>>>>>
> >>>>>>> However, we failed to find a good way even at the 1st step about
> >>>>>>> domain association. The iommu domains are not exposed to the
> >>>>>>> userspace, and there is no 1:1 mapping between domain and device.
> >>>>>>> In VFIO, all devices within the same VFIO container share the
> >>>>>>> address space but they may be organized in multiple IOMMU domains
> >>>>>>> based on their bus type. How (should we let) the userspace know
> >>>>>>> the domain information and open an sva_FD for each domain is the
> >>>>>>> main problem here.
> >>>>>> The SVA fd is not necessarily opened by userspace. It could be get
> >>>>>> through subsystem specific uAPIs.
> >>>>>>
> >>>>>> E.g for vDPA if a vDPA device contains several vSVA-capable
> >>>>>> domains, we
> >>>> can:
> >>>>>> 1) introduce uAPI for userspace to know the number of vSVA-capable
> >>>>>> domain
> >>>>>> 2) introduce e.g VDPA_GET_SVA_FD to get the fd for each
> >>>>>> vSVA-capable domain
> >>>>> and also new interface to notify userspace when a domain disappears
> >>>>> or a device is detached?
> >>>> You need to deal with this case even in VFIO, isn't it?
> >>> No. VFIO doesn't expose domain knowledge to userspace.
> >>
> >> Neither did the above API I think.
> >>
> >>
> >>>>>     Finally looks we are creating a completely set of new subsystem
> >>>>> specific uAPIs just for generalizing another set of subsystem
> >>>>> specific uAPIs. Remember after separating PASID mgmt.
> >>>>> out then most of remaining vSVA uAPIs are simpler wrapper of IOMMU
> >>>>> API. Replicating them is much easier logic than developing a new
> >>>>> glue mechanism in each subsystem.
> >>>> As discussed, the point is more than just simple generalizing. It's
> >>>> about the limitation of current uAPI. So I have the following questions:
> >>>>
> >>>> Do we want a single PASID to be used by more than one devices?
> >>> Yes.
> >>>
> >>>> If yes, do we want those devices to share I/O page tables?
> >>> Yes.
> >>>
> >>>> If yes, which uAPI is  used to program the shared I/O page tables?
> >>>>
> >>> Page table binding needs to be done per-device, so the userspace will
> >>> use VFIO uAPI for VFIO device and vDPA uAPI for vDPA device.
> >>
> >> Any design considerations for this, I think it should be done per PASID instead
> >> (consider PASID is a global resource)?
> > per device and per PASID. you may have a look from the below arch. PASID
> > table is per device, the binding of page table are set to PASID table
> > entry.
> >
> > "
> > In VT-d implementation, PASID table is per device and maintained in the host.
> > Guest PASID table is shadowed in VMM where virtual IOMMU is emulated.
> >
> >      .-------------.  .---------------------------.
> >      |   vIOMMU    |  | Guest process CR3, FL only|
> >      |             |  '---------------------------'
> >      .----------------/
> >      | PASID Entry |--- PASID cache flush -
> >      '-------------'                       |
> >      |             |                       V
> >      |             |                CR3 in GPA
> >      '-------------'
> > Guest
> > ------| Shadow |--------------------------|--------
> >        v        v                          v
> > Host
> >      .-------------.  .----------------------.
> >      |   pIOMMU    |  | Bind FL for GVA-GPA  |
> >      |             |  '----------------------'
> >      .----------------/  |
> >      | PASID Entry |     V (Nested xlate)
> >      '----------------\.------------------------------.
> >      |             |   |SL for GPA-HPA, default domain|
> >      |             |   '------------------------------'
> >      '-------------'
> > Where:
> >   - FL = First level/stage one page tables
> >   - SL = Second level/stage two page tables
> > "
> > copied from https://lwn.net/Articles/807506/
> 
> 
> Yes, but since PASID is a global identifier now, I think kernel should
> track the a device list per PASID?

We do have such tracking. It's done in the iommu driver; you can refer to
struct intel_svm. PASID is a global identifier, but that doesn't change the
fact that the PASID table is per-device.
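
Roughly, the tracking looks like this (a simplified sketch in the spirit of
struct intel_svm / struct intel_svm_dev; the names and exact fields below are
illustrative, not the kernel definitions):

  /*
   * One record per PASID, holding the shared address space and the
   * list of devices currently bound to it. Each bound device still
   * gets its own PASID table entry programmed by the IOMMU driver.
   */
  struct sva_binding {                  /* cf. struct intel_svm */
          u32 pasid;                    /* global PASID value */
          struct mm_struct *mm;         /* the shared address space */
          struct list_head devs;        /* devices bound to this PASID */
  };

  struct sva_bound_dev {                /* cf. struct intel_svm_dev */
          struct list_head list;        /* linked into sva_binding.devs */
          struct device *dev;
  };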

> So for such binding, PASID should be
> sufficient for uAPI.

I don't quite get it. A PASID may be bound to multiple devices, so how do
you figure out the target device if you don't provide such info?

> 
> 
> >
> >>> The binding request is initiated by the virtual IOMMU, when capturing
> >>> guest attempt of binding page table to a virtual PASID entry for a
> >>> given device.
> >>
> >> And for L2 page table programming, if PASID is use by both e.g VFIO and
> >> vDPA, user need to choose one of uAPI to build l2 mappings?
> > for L2 page table mappings, it's done by VFIO MAP/UNMAP. for vdpa, I guess
> > it is tlb flush. so you are right. Keeping L1/L2 page table management in
> > a single uAPI set is also a reason for my current series which extends VFIO
> > for L1 management.
> 
> 
> I'm afraid that would introduce confusing to userspace. E.g:
> 
> 1) when having only vDPA device, it uses vDPA uAPI to do l2 management
> 2) when vDPA shares PASID with VFIO, it will use VFIO uAPI to do the l2
> management?

I think vDPA will still use its own L2 management for the L2 mappings. I'm
not sure why vDPA would need to use VFIO's L2 management; I don't think that
is the case.

Regards,
Yi Liu

> Thanks
> 
> 
> >
> > Regards,
> > Yi Liu
> >
> >> Thanks
> >>
> >>
> >>> Thanks
> >>> Kevin
> >>>
> >>>>>>> In the end we just realized that doing such generalization doesn't
> >>>>>>> really lead to a clear design and instead requires tight coordination
> >>>>>>> between /dev/sva and VFIO/VDPA for almost every new uAPI
> >>>>>>> (especially about synchronization when the domain/device
> >>>>>>> association is changed or when the device/subdevice is being reset/
> >>>>>>> drained). Finally it may become a usability burden to the userspace
> >>>>>>> on proper use of the two interfaces on the assigned device.
> >>>>>>>
> >>>>>>> Based on above analysis we feel that just generalizing PASID mgmt.
> >>>>>>> might be a good thing to look at while the remaining operations are
> >>>>>>> better being VFIO/VDPA specific uAPIs. anyway in concept those are
> >>>>>>> just a subset of the page table management capabilities that an
> >>>>>>> IOMMU domain affords. Since all other aspects of the IOMMU domain
> >>>>>>> is managed by VFIO/VDPA already, continuing this path for new nesting
> >>>>>>> capability sounds natural. There is another option by generalizing the
> >>>>>>> entire IOMMU domain management (sort of the entire vfio_iommu_
> >>>>>>> type1), but it's unclear whether such intrusive change is worthwhile
> >>>>>>> (especially when VFIO/VDPA already goes different route even in legacy
> >>>>>>> mapping uAPI: map/unmap vs. IOTLB).
> >>>>>>>
> >>>>>>> Thoughts?
> >>>>>> I'm ok with starting with a unified PASID management and consider the
> >>>>>> unified vSVA/vIOMMU uAPI later.
> >>>>>>
> >>>>> Glad to see that we have consensus here. :)
> >>>>>
> >>>>> Thanks
> >>>>> Kevin


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20  8:19               ` Liu, Yi L
@ 2020-10-20  9:19                 ` Jason Wang
  2020-10-20  9:40                   ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-10-20  9:19 UTC (permalink / raw)
  To: Liu, Yi L, Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

Hi Yi:

On 2020/10/20 4:19 PM, Liu, Yi L wrote:
>> Yes, but since PASID is a global identifier now, I think kernel should
>> track the a device list per PASID?
> We have such track. It's done in iommu driver. You can refer to the
> struct intel_svm. PASID is a global identifier, but it doesn’t affect that
> the PASID table is per-device.
>
>> So for such binding, PASID should be
>> sufficient for uAPI.
> not quite get it. PASID may be bound to multiple devices, how do
> you figure out the target device if you don’t provide such info.


I may miss something, but is there any reason that userspace needs to
figure out the target device? PASID is about an address space, not a
specific device, I think.


>
>>>>> The binding request is initiated by the virtual IOMMU, when capturing
>>>>> guest attempt of binding page table to a virtual PASID entry for a
>>>>> given device.
>>>> And for L2 page table programming, if PASID is use by both e.g VFIO and
>>>> vDPA, user need to choose one of uAPI to build l2 mappings?
>>> for L2 page table mappings, it's done by VFIO MAP/UNMAP. for vdpa, I guess
>>> it is tlb flush. so you are right. Keeping L1/L2 page table management in
>>> a single uAPI set is also a reason for my current series which extends VFIO
>>> for L1 management.
>> I'm afraid that would introduce confusing to userspace. E.g:
>>
>> 1) when having only vDPA device, it uses vDPA uAPI to do l2 management
>> 2) when vDPA shares PASID with VFIO, it will use VFIO uAPI to do the l2
>> management?
> I think vDPA will still use its own l2 for the l2 mappings. not sure why you
> need vDPA use VFIO's l2 management. I don't think it is the case.


See the previous discussion with Kevin. If I understand correctly, you
expect a shared L2 table if vDPA and VFIO devices are using the same PASID.

In this case, if the L2 is still managed separately, there will be
duplicated map and unmap requests.

Thanks


>
> Regards,
> Yi Liu
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20  9:19                 ` Jason Wang
@ 2020-10-20  9:40                   ` Liu, Yi L
  2020-10-20 13:54                     ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-20  9:40 UTC (permalink / raw)
  To: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro
  Cc: jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

> From: Jason Wang <jasowang@redhat.com>
> Sent: Tuesday, October 20, 2020 5:20 PM
> 
> Hi Yi:
> 
> On 2020/10/20 4:19 PM, Liu, Yi L wrote:
> >> Yes, but since PASID is a global identifier now, I think kernel
> >> should track the a device list per PASID?
> > We have such track. It's done in iommu driver. You can refer to the
> > struct intel_svm. PASID is a global identifier, but it doesn’t affect
> > that the PASID table is per-device.
> >
> >> So for such binding, PASID should be
> >> sufficient for uAPI.
> > not quite get it. PASID may be bound to multiple devices, how do you
> > figure out the target device if you don’t provide such info.
> 
> 
> I may miss soemthing but is there any reason that userspace need to figure out
> the target device? PASID is about address space not a specific device I think.

If you have multiple devices assigned to a VM, you won't expect to bind all
of them to a PASID in a single bind operation, right? You may want to bind
only the devices you really mean. That is more flexible and
reasonable. :-)
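
In other words, a bind request is issued through the framework that owns the
device, and carries the PASID plus the guest page table info, roughly like
the sketch below (hypothetical names and layout only, not the exact uAPI
proposed in this series):

  /*
   * Hypothetical per-device bind of a guest page table to a PASID.
   * Because it is issued on the fd of a specific device/container,
   * the kernel knows exactly which device's PASID table to program.
   */
  struct gpasid_bind_req {
          __u32 argsz;
          __u32 flags;
          __u32 pasid;          /* allocated from the global PASID space */
          __u64 gpgd;           /* GPA of the guest first-level page table */
          __u32 addr_width;
  };

  /* e.g. ioctl(vfio_container_or_device_fd, <BIND_GPASID_OP>, &req); */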

> 
> >
> >>>>> The binding request is initiated by the virtual IOMMU, when
> >>>>> capturing guest attempt of binding page table to a virtual PASID
> >>>>> entry for a given device.
> >>>> And for L2 page table programming, if PASID is use by both e.g VFIO
> >>>> and vDPA, user need to choose one of uAPI to build l2 mappings?
> >>> for L2 page table mappings, it's done by VFIO MAP/UNMAP. for vdpa, I
> >>> guess it is tlb flush. so you are right. Keeping L1/L2 page table
> >>> management in a single uAPI set is also a reason for my current
> >>> series which extends VFIO for L1 management.
> >> I'm afraid that would introduce confusing to userspace. E.g:
> >>
> >> 1) when having only vDPA device, it uses vDPA uAPI to do l2
> >> management
> >> 2) when vDPA shares PASID with VFIO, it will use VFIO uAPI to do the
> >> l2 management?
> > I think vDPA will still use its own l2 for the l2 mappings. not sure
> > why you need vDPA use VFIO's l2 management. I don't think it is the case.
> 
> 
> See previous discussion with Kevin. If I understand correctly, you expect a shared
> L2 table if vDPA and VFIO device are using the same PASID.

L2 table sharing is not mandatory. The mappings are the same, but there is no
need to assume the L2 tables are shared, especially across VFIO/vDPA devices.
Even within a single passthrough framework like VFIO, if the attributes of the
backend IOMMUs are not the same, the L2 page tables are not shared, yet the
mappings are the same.

> In this case, if l2 is still managed separately, there will be duplicated request of
> map and unmap.

Yes, but this is not a functional issue, right? If we want to solve it, we
should have a single uAPI set which can handle both L1 and L2 management.
That's also why you proposed to replace the type1 driver, right?
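
For reference, the duplication being discussed would look roughly like this
today: the same GPA range mapped once through the VFIO container and once
through the vhost-vDPA IOTLB (a sketch assuming both fds are already set up
and hva/gpa/len are __u64 values; uses the <linux/vfio.h> and
<linux/vhost_types.h> definitions):

  /* Same GPA->HVA mapping programmed twice, once per framework. */
  struct vfio_iommu_type1_dma_map map = {
          .argsz = sizeof(map),
          .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
          .vaddr = hva,
          .iova  = gpa,
          .size  = len,
  };
  ioctl(vfio_container_fd, VFIO_IOMMU_MAP_DMA, &map);

  struct vhost_msg_v2 msg = {
          .type  = VHOST_IOTLB_MSG_V2,
          .iotlb = {
                  .iova  = gpa,
                  .size  = len,
                  .uaddr = hva,
                  .perm  = VHOST_ACCESS_RW,
                  .type  = VHOST_IOTLB_UPDATE,
          },
  };
  write(vhost_vdpa_fd, &msg, sizeof(msg));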

Regards,
Yi Liu

> 
> Thanks
> 
> 
> >
> > Regards,
> > Yi Liu
> >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-19 14:25       ` Jason Gunthorpe
@ 2020-10-20 10:21         ` Liu, Yi L
  2020-10-20 14:02           ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-20 10:21 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, October 19, 2020 10:25 PM
> 
> On Mon, Oct 19, 2020 at 08:39:03AM +0000, Liu, Yi L wrote:
> > Hi Jason,
> >
> > Good to see your response.
> 
> Ah, I was away

got it. :-)

> > > > > Second, IOMMU nested translation is a per IOMMU domain
> > > > > capability. Since IOMMU domains are managed by VFIO/VDPA
> > > > > (alloc/free domain, attach/detach device, set/get domain
> > > > > attribute, etc.), reporting/enabling the nesting capability is
> > > > > an natural extension to the domain uAPI of existing passthrough
> frameworks.
> > > > > Actually, VFIO already includes a nesting enable interface even
> > > > > before this series. So it doesn't make sense to generalize this
> > > > > uAPI out.
> > >
> > > The subsystem that obtains an IOMMU domain for a device would have
> > > to register it with an open FD of the '/dev/sva'. That is the
> > > connection between the two subsystems. It would be some simple
> > > kernel internal
> > > stuff:
> > >
> > >   sva = get_sva_from_file(fd);
> >
> > Is this fd provided by userspace? I suppose the /dev/sva has a set of
> > uAPIs which will finally program page table to host iommu driver. As
> > far as I know, it's weird for VFIO user. Why should VFIO user connect
> > to a /dev/sva fd after it sets a proper iommu type to the opened
> > container. VFIO container already stands for an iommu context with
> > which userspace could program page mapping to host iommu.
> 
> Again the point is to dis-aggregate the vIOMMU related stuff from VFIO so it
> can
> be shared between more subsystems that need it.

I understand you here. :-)

> I'm sure there will be some
> weird overlaps because we can't delete any of the existing VFIO APIs, but
> that
> should not be a blocker.

But that weirdness is exactly what we should consider. And it's perhaps not
just overlap; it may be a re-definition of the VFIO container. As I mentioned,
the VFIO container has been an IOMMU context from the day it was defined. That
could be the blocker. :-(

> Having VFIO run in a mode where '/dev/sva' provides all the IOMMU handling is
> a possible path.

This looks similar to the proposal from Jason Wang and Kevin Tian. The idea
is to add a "/dev/iommu" and delegate the IOMMU domain alloc and device
attach/detach, which are now in the passthrough frameworks, to an independent
kernel driver - just as Jason Wang said, replacing the vfio iommu type1 driver.

Jason Wang:
 "And all the proposal in this series is to reuse the container fd. It 
 should be possible to replace e.g type1 IOMMU with a unified module."
link: https://lore.kernel.org/kvm/20201019142526.GJ6219@nvidia.com/T/#md49fe9ac9d9eff6ddf5b8c2ee2f27eb2766f66f3

Kevin Tian:
 "Based on above, I feel a more reasonable way is to first make a 
 /dev/iommu uAPI supporting DMA map/unmap usages and then 
 introduce vSVA to it. Doing this order is because DMA map/unmap 
 is widely used thus can better help verify the core logic with 
 many existing devices."
link: https://lore.kernel.org/kvm/MWHPR11MB1645C702D148A2852B41FCA08C230@MWHPR11MB1645.namprd11.prod.outlook.com/

> 
> If your plan is to just opencode everything into VFIO then I don't
> see how VDPA will work well, and if proper in-kernel abstractions are built I
> fail to see how
> routing some of it through userspace is a fundamental problem.

I'm not an expert on vDPA for now, but I see that you three open source
veterans have a similar idea for a place to cover IOMMU handling, and I
think it may be a valuable thing to do. I said "may be" because I'm not
sure about Alex's opinion on such an idea. The sure thing is that this
idea may introduce weird overlap or even re-definition of existing
things, as I replied above. We need to evaluate the impact and mature
the idea step by step. That means it would take time, so perhaps we
could do it in a staged way: first have a "/dev/iommu" ready to handle
page MAP/UNMAP, which can be used by both VFIO and vDPA, and meanwhile
let VFIO grow (adding features) by itself and consider adopting the new
/dev/iommu later once /dev/iommu is competent. Of course this needs
Alex's approval. And then add new features to /dev/iommu, like SVA.
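
To make the staging idea concrete, the flow might look something like this
(purely hypothetical names, just to illustrate; none of these ioctls exist
today):

  /* Stage 1: /dev/iommu only covers address-space setup and MAP/UNMAP. */
  int iommu_fd = open("/dev/iommu", O_RDWR);
  int ioas     = ioctl(iommu_fd, IOMMU_ALLOC_IOAS, NULL);  /* an I/O address space */

  /* VFIO or vDPA attaches its device to that address space. */
  ioctl(vfio_device_fd, VFIO_DEVICE_ATTACH_IOAS, &ioas);
  ioctl(vhost_vdpa_fd,  VDPA_DEVICE_ATTACH_IOAS, &ioas);

  /* Plain DMA mappings now go through /dev/iommu for both frameworks. */
  ioctl(iommu_fd, IOMMU_MAP_DMA, &map_args);

  /* Stage 2 (later): PASID allocation, bind guest page table, and so on. */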

> 
> > >   sva_register_device_to_pasid(sva, pasid, pci_device,
> > > iommu_domain);
> >
> > So this is supposed to be called by VFIO/VDPA to register the info to
> > /dev/sva.
> > right? And in dev/sva, it will also maintain the device/iommu_domain
> > and pasid info? will it be duplicated with VFIO/VDPA?
> 
> Each part needs to have the information it needs?

Yeah, but it's the duplication that I'm not very keen on. Perhaps the idea
from Jason Wang and Kevin would avoid such duplication.

> > > > > Moreover, mapping page fault to subdevice requires pre-
> > > > > registering subdevice fault data to IOMMU layer when binding
> > > > > guest page table, while such fault data can be only retrieved
> > > > > from parent driver through VFIO/VDPA.
> > >
> > > Not sure what this means, page fault should be tied to the PASID,
> > > any hookup needed for that should be done in-kernel when the device
> > > is connected to the PASID.
> >
> > you may refer to chapter 7.4.1.1 of VT-d spec. Page request is
> > reported to software together with the requestor id of the device. For
> > the page request injects to guest, it should have the device info.
> 
> Whoever provides the vIOMMU emulation and relays the page fault to the guest
> has to translate the RID -

That's the point. The device info (especially the sub-device info) is
within the passthrough framework (e.g. VFIO), so page fault reporting needs
to go through the passthrough framework.
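
That is, whatever channel reports the fault to userspace has to carry a
device/sub-device tag alongside the PASID and address, roughly like this
(a hypothetical record layout for illustration, not an existing uAPI):

  /*
   * Hypothetical page-request record delivered to userspace. The
   * dev_id/subdev_id tags are exactly the piece that only the
   * passthrough framework, which created the (sub)device, can attach.
   */
  struct page_req_event {
          __u32 dev_id;         /* userspace-visible device handle */
          __u32 subdev_id;      /* sub-device (e.g. mdev/ADI) handle */
          __u32 pasid;
          __u32 grpid;          /* page request group, used for the response */
          __u64 addr;           /* faulting guest virtual address */
          __u32 perm;           /* requested access permissions */
  };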

> what does that have to do with VFIO?
> 
> How will VPDA provide the vIOMMU emulation?

Pardon me here. I believe vIOMMU emulation should be based on the IOMMU vendor
specification, right? You may correct me if I'm missing anything.

> Jason

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20  9:40                   ` Liu, Yi L
@ 2020-10-20 13:54                     ` Jason Gunthorpe
  2020-10-20 14:00                       ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-20 13:54 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Oct 20, 2020 at 09:40:14AM +0000, Liu, Yi L wrote:

> > See previous discussion with Kevin. If I understand correctly, you expect a shared
> > L2 table if vDPA and VFIO device are using the same PASID.
> 
> L2 table sharing is not mandatory. The mapping is the same, but no need to
> assume L2 tables are shared. Especially for VFIO/vDPA devices. Even within
> a passthru framework, like VFIO, if the attributes of backend IOMMU are not
> the same, the L2 page table are not shared, but the mapping is the same.

I think not being able to share the PASID shows exactly why this VFIO
centric approach is bad.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 13:54                     ` Jason Gunthorpe
@ 2020-10-20 14:00                       ` Liu, Yi L
  2020-10-20 14:05                         ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-20 14:00 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, October 20, 2020 9:55 PM
> 
> On Tue, Oct 20, 2020 at 09:40:14AM +0000, Liu, Yi L wrote:
> 
> > > See previous discussion with Kevin. If I understand correctly, you expect a
> shared
> > > L2 table if vDPA and VFIO device are using the same PASID.
> >
> > L2 table sharing is not mandatory. The mapping is the same, but no need to
> > assume L2 tables are shared. Especially for VFIO/vDPA devices. Even within
> > a passthru framework, like VFIO, if the attributes of backend IOMMU are not
> > the same, the L2 page table are not shared, but the mapping is the same.
> 
> I think not being able to share the PASID shows exactly why this VFIO
> centric approach is bad.

No, I didn't say the PASID is not sharable. My point is that sharing the L2
page table is not mandatory.

Regards,
Yi Liu

> Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 10:21         ` Liu, Yi L
@ 2020-10-20 14:02           ` Jason Gunthorpe
  2020-10-20 14:19             ` Liu, Yi L
  2020-10-20 16:24             ` Raj, Ashok
  0 siblings, 2 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-20 14:02 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Tian, Kevin, Jason Wang, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Tue, Oct 20, 2020 at 10:21:41AM +0000, Liu, Yi L wrote:

> > I'm sure there will be some
> > weird overlaps because we can't delete any of the existing VFIO APIs, but
> > that
> > should not be a blocker.
> 
> but the weird thing is what we should consider. And it perhaps not just
> overlap, it may be a re-definition of VFIO container. As I mentioned, VFIO
> container is IOMMU context from the day it was defined. It could be the
> blocker. :-(

So maybe you have to broaden the VFIO container to be usable by other
subsystems. The discussion here is about what the uAPI should look
like in a fairly abstract way. When we say '/dev/sva' it is just a
placeholder for a shared cdev that provides the necessary
dis-aggregated functionality.

It could be an existing cdev with broader functionality, it could
really be /dev/iommu, etc. This is up to the folks building it to
decide.

> I'm not expert on vDPA for now, but I saw you three open source
> veterans have a similar idea for a place to cover IOMMU handling,
> I think it may be a valuable thing to do. I said "may be" as I'm not
> sure about Alex's opinion on such idea. But the sure thing is this
> idea may introduce weird overlap even re-definition of existing
> thing as I replied above. We need to evaluate the impact and mature
> the idea step by step. 

This has happened before; uAPIs do get obsoleted and replaced with
more general/better versions. It is often too hard to create a uAPI
that lasts for decades when the HW landscape is constantly changing,
and sometimes a reset is needed.

The jump to shared PASID based IOMMU feels like one of those moments here.

> > Whoever provides the vIOMMU emulation and relays the page fault to the guest
> > has to translate the RID -
> 
> that's the point. But the device info (especially the sub-device info) is
> within the passthru framework (e.g. VFIO). So page fault reporting needs
> to go through passthru framework.
>
> > what does that have to do with VFIO?
> > 
> > How will VPDA provide the vIOMMU emulation?
> 
> a pardon here. I believe vIOMMU emulation should be based on IOMMU vendor
> specification, right? you may correct me if I'm missing anything.

I'm asking how VDPA will translate the RID when VDPA triggers a page
fault that has to be relayed to the guest. VDPA also has virtual PCI
devices that it creates.

We can't rely on VFIO to be the place that the vIOMMU lives because it
excludes/complicates everything that is not VFIO from using that
stuff.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 14:00                       ` Liu, Yi L
@ 2020-10-20 14:05                         ` Jason Gunthorpe
  2020-10-20 14:09                           ` Liu, Yi L
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-20 14:05 UTC (permalink / raw)
  To: Liu, Yi L
  Cc: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Oct 20, 2020 at 02:00:31PM +0000, Liu, Yi L wrote:
> > From: Jason Gunthorpe <jgg@nvidia.com>
> > Sent: Tuesday, October 20, 2020 9:55 PM
> > 
> > On Tue, Oct 20, 2020 at 09:40:14AM +0000, Liu, Yi L wrote:
> > 
> > > > See previous discussion with Kevin. If I understand correctly, you expect a
> > shared
> > > > L2 table if vDPA and VFIO device are using the same PASID.
> > >
> > > L2 table sharing is not mandatory. The mapping is the same, but no need to
> > > assume L2 tables are shared. Especially for VFIO/vDPA devices. Even within
> > > a passthru framework, like VFIO, if the attributes of backend IOMMU are not
> > > the same, the L2 page table are not shared, but the mapping is the same.
> > 
> > I think not being able to share the PASID shows exactly why this VFIO
> > centric approach is bad.
> 
> no, I didn't say PASID is not sharable. My point is sharing L2 page table is
> not mandatory.

IMHO a PASID should be 1:1 with a page table; what does it even mean
to share a PASID but have different page tables?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 14:05                         ` Jason Gunthorpe
@ 2020-10-20 14:09                           ` Liu, Yi L
  0 siblings, 0 replies; 55+ messages in thread
From: Liu, Yi L @ 2020-10-20 14:09 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Jason Wang, Tian, Kevin, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, October 20, 2020 10:05 PM
> To: Liu, Yi L <yi.l.liu@intel.com>
> 
> On Tue, Oct 20, 2020 at 02:00:31PM +0000, Liu, Yi L wrote:
> > > From: Jason Gunthorpe <jgg@nvidia.com>
> > > Sent: Tuesday, October 20, 2020 9:55 PM
> > >
> > > On Tue, Oct 20, 2020 at 09:40:14AM +0000, Liu, Yi L wrote:
> > >
> > > > > See previous discussion with Kevin. If I understand correctly,
> > > > > you expect a
> > > shared
> > > > > L2 table if vDPA and VFIO device are using the same PASID.
> > > >
> > > > L2 table sharing is not mandatory. The mapping is the same, but no
> > > > need to assume L2 tables are shared. Especially for VFIO/vDPA
> > > > devices. Even within a passthru framework, like VFIO, if the
> > > > attributes of backend IOMMU are not the same, the L2 page table are not
> shared, but the mapping is the same.
> > >
> > > I think not being able to share the PASID shows exactly why this
> > > VFIO centric approach is bad.
> >
> > no, I didn't say PASID is not sharable. My point is sharing L2 page
> > table is not mandatory.
> 
> IMHO a PASID should be 1:1 with a page table, what does it even mean to share
> a PASID but have different page tables?

A PASID is actually 1:1 with an address space; it doesn't really need to be
1:1 with a page table. :-)

Regards,
Yi Liu

> Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 14:02           ` Jason Gunthorpe
@ 2020-10-20 14:19             ` Liu, Yi L
  2020-10-21  2:21               ` Jason Wang
  2020-10-20 16:24             ` Raj, Ashok
  1 sibling, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-20 14:19 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, October 20, 2020 10:02 PM
[...]
> > > Whoever provides the vIOMMU emulation and relays the page fault to the
> guest
> > > has to translate the RID -
> >
> > that's the point. But the device info (especially the sub-device info) is
> > within the passthru framework (e.g. VFIO). So page fault reporting needs
> > to go through passthru framework.
> >
> > > what does that have to do with VFIO?
> > >
> > > How will VPDA provide the vIOMMU emulation?
> >
> > a pardon here. I believe vIOMMU emulation should be based on IOMMU
> vendor
> > specification, right? you may correct me if I'm missing anything.
> 
> I'm asking how will VDPA translate the RID when VDPA triggers a page
> fault that has to be relayed to the guest. VDPA also has virtual PCI
> devices it creates.

I've got a question: does vDPA work with a vIOMMU so far, e.g. the Intel
vIOMMU or another type of vIOMMU?

Regards,
Yi Liu


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 14:02           ` Jason Gunthorpe
  2020-10-20 14:19             ` Liu, Yi L
@ 2020-10-20 16:24             ` Raj, Ashok
  2020-10-20 17:03               ` Jason Gunthorpe
  1 sibling, 1 reply; 55+ messages in thread
From: Raj, Ashok @ 2020-10-20 16:24 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

Hi Jason,


On Tue, Oct 20, 2020 at 11:02:17AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 10:21:41AM +0000, Liu, Yi L wrote:
> 
> > > I'm sure there will be some
> > > weird overlaps because we can't delete any of the existing VFIO APIs, but
> > > that
> > > should not be a blocker.
> > 
> > but the weird thing is what we should consider. And it perhaps not just
> > overlap, it may be a re-definition of VFIO container. As I mentioned, VFIO
> > container is IOMMU context from the day it was defined. It could be the
> > blocker. :-(
> 
> So maybe you have to broaden the VFIO container to be usable by other
> subsystems. The discussion here is about what the uAPI should look
> like in a fairly abstract way. When we say 'dev/sva' it just some
> placeholder for a shared cdev that provides the necessary
> dis-aggregated functionality 
> 
> It could be an existing cdev with broader functionaltiy, it could
> really be /dev/iommu, etc. This is up to the folks building it to
> decide.
> 
> > I'm not expert on vDPA for now, but I saw you three open source
> > veterans have a similar idea for a place to cover IOMMU handling,
> > I think it may be a valuable thing to do. I said "may be" as I'm not
> > sure about Alex's opinion on such idea. But the sure thing is this
> > idea may introduce weird overlap even re-definition of existing
> > thing as I replied above. We need to evaluate the impact and mature
> > the idea step by step. 
> 
> This has happened before, uAPIs do get obsoleted and replaced with
> more general/better versions. It is often too hard to create a uAPI
> that lasts for decades when the HW landscape is constantly changing
> and sometime a reset is needed. 

I'm throwing this out with a lot of hesitation, but I'm going to :-)

So we have been discussing this for months now, with some high-level vision,
trying to get the uAPIs solidified around vDPA hardware that might
potentially have SIOV/SVM-like extensions - hardware which actually doesn't
exist today. Understood, people have plans.

That said, vDPA today has already diverged, duplicating use of the IOMMU
APIs without making an effort to gravitate toward the /dev/iommu you are
proposing.

I think we all understand that creating a permanent uAPI is hard, and that
uAPIs can evolve in the future.

Maybe we should first work on how to converge on and generalize the IOMMU
story with what we have today (vDPA + VFIO) and let it evolve with real
hardware and new features like SVM/SIOV in mind. This is going to take time,
and we can start by pulling the vDPA and VFIO pieces together first.

The question is: should we hold the current vSVM/vIOMMU efforts hostage
without even having made an effort at vDPA/VFIO convergence today?

> 
> The jump to shared PASID based IOMMU feels like one of those moments here.

As we have all noted, don't we have divergence today even without PASID?


> 
> > > Whoever provides the vIOMMU emulation and relays the page fault to the guest
> > > has to translate the RID -
> > 
> > that's the point. But the device info (especially the sub-device info) is
> > within the passthru framework (e.g. VFIO). So page fault reporting needs
> > to go through passthru framework.
> >
> > > what does that have to do with VFIO?
> > > 
> > > How will VPDA provide the vIOMMU emulation?
> > 
> > a pardon here. I believe vIOMMU emulation should be based on IOMMU vendor
> > specification, right? you may correct me if I'm missing anything.
> 
> I'm asking how will VDPA translate the RID when VDPA triggers a page
> fault that has to be relayed to the guest. VDPA also has virtual PCI
> devices it creates.
> 
> We can't rely on VFIO to be the place that the vIOMMU lives because it
> excludes/complicates everything that is not VFIO from using that
> stuff.
> 
> Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 16:24             ` Raj, Ashok
@ 2020-10-20 17:03               ` Jason Gunthorpe
  2020-10-20 19:51                 ` Raj, Ashok
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-20 17:03 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Tue, Oct 20, 2020 at 09:24:30AM -0700, Raj, Ashok wrote:
> Hi Jason,
> 
> 
> On Tue, Oct 20, 2020 at 11:02:17AM -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 20, 2020 at 10:21:41AM +0000, Liu, Yi L wrote:
> > 
> > > > I'm sure there will be some
> > > > weird overlaps because we can't delete any of the existing VFIO APIs, but
> > > > that
> > > > should not be a blocker.
> > > 
> > > but the weird thing is what we should consider. And it perhaps not just
> > > overlap, it may be a re-definition of VFIO container. As I mentioned, VFIO
> > > container is IOMMU context from the day it was defined. It could be the
> > > blocker. :-(
> > 
> > So maybe you have to broaden the VFIO container to be usable by other
> > subsystems. The discussion here is about what the uAPI should look
> > like in a fairly abstract way. When we say 'dev/sva' it just some
> > placeholder for a shared cdev that provides the necessary
> > dis-aggregated functionality 
> > 
> > It could be an existing cdev with broader functionaltiy, it could
> > really be /dev/iommu, etc. This is up to the folks building it to
> > decide.
> > 
> > > I'm not expert on vDPA for now, but I saw you three open source
> > > veterans have a similar idea for a place to cover IOMMU handling,
> > > I think it may be a valuable thing to do. I said "may be" as I'm not
> > > sure about Alex's opinion on such idea. But the sure thing is this
> > > idea may introduce weird overlap even re-definition of existing
> > > thing as I replied above. We need to evaluate the impact and mature
> > > the idea step by step. 
> > 
> > This has happened before, uAPIs do get obsoleted and replaced with
> > more general/better versions. It is often too hard to create a uAPI
> > that lasts for decades when the HW landscape is constantly changing
> > and sometime a reset is needed. 
> 
> I'm throwing this out with a lot of hesitation, but I'm going to :-)
> 
> So we have been disussing this for months now, with some high level vision
> trying to get the uAPI's solidified with a vDPA hardware that might
> potentially have SIOV/SVM like extensions in hardware which actualy doesn't
> exist today. Understood people have plans. 

> Given that vDPA today has diverged already with duplicating use of IOMMU
> api's without making an effort to gravitate to /dev/iommu as how you are
> proposing.

I see it more like: given that we already know we have multiple users
of the IOMMU, adding new IOMMU-focused features has to gravitate toward
some kind of convergence.

Currently things are not so bad, VDPA is just getting started and the
current IOMMU feature set is not so big.

PASID/vIOMMU/etc. are all stressing this more; I think the
responsibility falls to the people proposing these features to do the
architecture work.

> The question is should we hold hostage the current vSVM/vIOMMU efforts
> without even having made an effort for current vDPA/VFIO convergence. 

I don't think it is "held hostage"; it is a "no shortcuts" approach.
There was always a recognition that future VDPA drivers will need some
work to integrate with vIOMMU related stuff.

This is no different than the IMS discussion. The first proposed patch
was really simple, but a layering violation.

The correct solution was some wild 20-patch series modernizing how x86
interrupt handling works because it had outgrown itself. This general
approach of using the shared MSI infrastructure was pointed out at the
very beginning of IMS, BTW.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 17:03               ` Jason Gunthorpe
@ 2020-10-20 19:51                 ` Raj, Ashok
  2020-10-20 19:55                   ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Raj, Ashok @ 2020-10-20 19:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

On Tue, Oct 20, 2020 at 02:03:36PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 09:24:30AM -0700, Raj, Ashok wrote:
> > Hi Jason,
> > 
> > 
> > On Tue, Oct 20, 2020 at 11:02:17AM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 10:21:41AM +0000, Liu, Yi L wrote:
> > > 
> > > > > I'm sure there will be some
> > > > > weird overlaps because we can't delete any of the existing VFIO APIs, but
> > > > > that
> > > > > should not be a blocker.
> > > > 
> > > > but the weird thing is what we should consider. And it perhaps not just
> > > > overlap, it may be a re-definition of VFIO container. As I mentioned, VFIO
> > > > container is IOMMU context from the day it was defined. It could be the
> > > > blocker. :-(
> > > 
> > > So maybe you have to broaden the VFIO container to be usable by other
> > > subsystems. The discussion here is about what the uAPI should look
> > > like in a fairly abstract way. When we say 'dev/sva' it just some
> > > placeholder for a shared cdev that provides the necessary
> > > dis-aggregated functionality 
> > > 
> > > It could be an existing cdev with broader functionaltiy, it could
> > > really be /dev/iommu, etc. This is up to the folks building it to
> > > decide.
> > > 
> > > > I'm not expert on vDPA for now, but I saw you three open source
> > > > veterans have a similar idea for a place to cover IOMMU handling,
> > > > I think it may be a valuable thing to do. I said "may be" as I'm not
> > > > sure about Alex's opinion on such idea. But the sure thing is this
> > > > idea may introduce weird overlap even re-definition of existing
> > > > thing as I replied above. We need to evaluate the impact and mature
> > > > the idea step by step. 
> > > 
> > > This has happened before, uAPIs do get obsoleted and replaced with
> > > more general/better versions. It is often too hard to create a uAPI
> > > that lasts for decades when the HW landscape is constantly changing
> > > and sometime a reset is needed. 
> > 
> > I'm throwing this out with a lot of hesitation, but I'm going to :-)
> > 
> > So we have been disussing this for months now, with some high level vision
> > trying to get the uAPI's solidified with a vDPA hardware that might
> > potentially have SIOV/SVM like extensions in hardware which actualy doesn't
> > exist today. Understood people have plans. 
> 
> > Given that vDPA today has diverged already with duplicating use of IOMMU
> > api's without making an effort to gravitate to /dev/iommu as how you are
> > proposing.
> 
> I see it more like, given that we already know we have multiple users
> of IOMMU, adding new IOMMU focused features has to gravitate toward
> some kind of convergance.
> 
> Currently things are not so bad, VDPA is just getting started and the
> current IOMMU feature set is not so big.
> 
> PASID/vIOMMU/etc/et are all stressing this more, I think the
> responsibility falls to the people proposing these features to do the
> architecture work.
> 
> > The question is should we hold hostage the current vSVM/vIOMMU efforts
> > without even having made an effort for current vDPA/VFIO convergence. 
> 
> I don't think it is "held hostage" it is a "no shortcuts" approach,
> there was always a recognition that future VDPA drivers will need some
> work to integrated with vIOMMU realted stuff.

I think we agreed (or agree to disagree and commit) that for the device types
we have for SIOV, the VFIO-based approach works well without having to
re-invent another way to do the same things. Not looking for a shortcut by any
means, but we do need to plan around existing hardware. Looks like vDPA took
some shortcuts then by not abstracting the iommu uAPI instead :-)? If all the
necessary hardware were available, this would be a solved puzzle.


> 
> This is no different than the IMS discussion. The first proposed patch
> was really simple, but a layering violation.
> 
> The correct solution was some wild 20 patch series modernizing how x86

That was more like 48 patches, not 20 :-). But we had a real device with
IMS to model and create these new abstractions and test them against. 

For vDPA+SVM we have non-intersecting conversations at the moment with no
real hardware to model our discussion around. 

> interrupts works because it had outgrown itself. This general approach
> to use the shared MSI infrastructure was pointed out at the very
> beginning of IMS, BTW.

Agreed, and thankfully Thomas worked hard and made it a lot easier :-).
Today IMS only deals with on-device store, although IMS could also mean
simply having system memory hold the interrupt attributes. This is how
some of the graphics devices would work, with the context holding the
interrupt attributes.

But we are certainly not rushing this, since we need a REAL user to exist
before we support DEV_MSI using msg_addr/msg_data held in system memory.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 19:51                 ` Raj, Ashok
@ 2020-10-20 19:55                   ` Jason Gunthorpe
  2020-10-20 20:08                     ` Raj, Ashok
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-20 19:55 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> I think we agreed (or agree to disagree and commit) for device types that 
> we have for SIOV, VFIO based approach works well without having to re-invent 
> another way to do the same things. Not looking for a shortcut by any means, 
> but we need to plan around existing hardware though. Looks like vDPA took 
> some shortcuts then to not abstract iommu uAPI instead :-)? When all
> necessary hardware was available.. This would be a solved puzzle. 

I think it is the opposite: vIOMMU and related functionality have outgrown
VFIO as the "home" and need to stand alone.

Apparently the HW that will need PASID for vDPA is Intel HW, so if
more is needed to do a good design you are probably the only one that
can get it/do it.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 19:55                   ` Jason Gunthorpe
@ 2020-10-20 20:08                     ` Raj, Ashok
  2020-10-20 20:14                       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Raj, Ashok @ 2020-10-20 20:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > I think we agreed (or agree to disagree and commit) for device types that 
> > we have for SIOV, VFIO based approach works well without having to re-invent 
> > another way to do the same things. Not looking for a shortcut by any means, 
> > but we need to plan around existing hardware though. Looks like vDPA took 
> > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > necessary hardware was available.. This would be a solved puzzle. 
> 
> I think it is the opposite, vIOMMU and related has outgrown VFIO as
> the "home" and needs to stand alone.
> 
> Apparently the HW that will need PASID for vDPA is Intel HW, so if

So just to make this clear, I did check internally whether there are any plans
for vDPA + SVM. There are none at the moment. It seems like you have
better insight into our plans ;-). Please do let me know who confirmed the
vDPA roadmap with you, and I would love to talk to them to clear the air.


Cheers,
Ashok

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 20:08                     ` Raj, Ashok
@ 2020-10-20 20:14                       ` Jason Gunthorpe
  2020-10-20 20:27                         ` Raj, Ashok
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-20 20:14 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > I think we agreed (or agree to disagree and commit) for device types that 
> > > we have for SIOV, VFIO based approach works well without having to re-invent 
> > > another way to do the same things. Not looking for a shortcut by any means, 
> > > but we need to plan around existing hardware though. Looks like vDPA took 
> > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > necessary hardware was available.. This would be a solved puzzle. 
> > 
> > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > the "home" and needs to stand alone.
> > 
> > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> 
> So just to make this clear, I did check internally if there are any plans
> for vDPA + SVM. There are none at the moment. 

Not SVM, SIOV.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 20:14                       ` Jason Gunthorpe
@ 2020-10-20 20:27                         ` Raj, Ashok
  2020-10-21 11:48                           ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Raj, Ashok @ 2020-10-20 20:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > I think we agreed (or agree to disagree and commit) for device types that 
> > > > we have for SIOV, VFIO based approach works well without having to re-invent 
> > > > another way to do the same things. Not looking for a shortcut by any means, 
> > > > but we need to plan around existing hardware though. Looks like vDPA took 
> > > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > > necessary hardware was available.. This would be a solved puzzle. 
> > > 
> > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > the "home" and needs to stand alone.
> > > 
> > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > 
> > So just to make this clear, I did check internally if there are any plans
> > for vDPA + SVM. There are none at the moment. 
> 
> Not SVM, SIOV.

... And that was included. I should have said vDPA + PASID; there are no
current plans. I have no idea who set expectations with you. Can you please
put me in touch with that person? Privately is fine.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 14:19             ` Liu, Yi L
@ 2020-10-21  2:21               ` Jason Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Wang @ 2020-10-21  2:21 UTC (permalink / raw)
  To: Liu, Yi L, Jason Gunthorpe
  Cc: Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro,
	jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin, Zhu,
	Lingshan


On 2020/10/20 10:19 PM, Liu, Yi L wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Tuesday, October 20, 2020 10:02 PM
> [...]
>>>> Whoever provides the vIOMMU emulation and relays the page fault to the
>> guest
>>>> has to translate the RID -
>>> that's the point. But the device info (especially the sub-device info) is
>>> within the passthru framework (e.g. VFIO). So page fault reporting needs
>>> to go through passthru framework.
>>>
>>>> what does that have to do with VFIO?
>>>>
>>>> How will VPDA provide the vIOMMU emulation?
>>> a pardon here. I believe vIOMMU emulation should be based on IOMMU
>> vendor
>>> specification, right? you may correct me if I'm missing anything.
>> I'm asking how will VDPA translate the RID when VDPA triggers a page
>> fault that has to be relayed to the guest. VDPA also has virtual PCI
>> devices it creates.
> I've got a question. Does vDPA work with vIOMMU so far? e.g. Intel vIOMMU
> or other type vIOMMU.


The kernel code is ready. Note that vhost support for the vIOMMU landed even
earlier than VFIO's.

The API is designed to be generic and is not limited to any specific type of
vIOMMU.

For qemu, it just needs a patch to implement the map/unmap notifier as
VFIO did.

Thanks



>
> Regards,
> Yi Liu
>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-20 20:27                         ` Raj, Ashok
@ 2020-10-21 11:48                           ` Jason Gunthorpe
  2020-10-21 17:51                             ` Raj, Ashok
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-21 11:48 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
> On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> > On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > > I think we agreed (or agree to disagree and commit) for device types that 
> > > > > we have for SIOV, VFIO based approach works well without having to re-invent 
> > > > > another way to do the same things. Not looking for a shortcut by any means, 
> > > > > but we need to plan around existing hardware though. Looks like vDPA took 
> > > > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > > > necessary hardware was available.. This would be a solved puzzle. 
> > > > 
> > > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > > the "home" and needs to stand alone.
> > > > 
> > > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > > 
> > > So just to make this clear, I did check internally if there are any plans
> > > for vDPA + SVM. There are none at the moment. 
> > 
> > Not SVM, SIOV.
> 
> ... And that included.. I should have said vDPA + PASID, No current plans. 
> I have no idea who set expectations with you. Can you please put me in touch 
> with that person, privately is fine.

It was the team that argued VDPA had to be done through VFIO - SIOV
and PASID were among their reasons it had to be VFIO; check the list
archives.

If they didn't plan to use it, bit of a strawman argument, right?

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-21 11:48                           ` Jason Gunthorpe
@ 2020-10-21 17:51                             ` Raj, Ashok
  2020-10-21 18:24                               ` Jason Gunthorpe
  2020-10-22  2:55                               ` Jason Wang
  0 siblings, 2 replies; 55+ messages in thread
From: Raj, Ashok @ 2020-10-21 17:51 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

On Wed, Oct 21, 2020 at 08:48:29AM -0300, Jason Gunthorpe wrote:
> On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
> > On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
> > > On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
> > > > On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
> > > > > On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
> > > > > > I think we agreed (or agree to disagree and commit) for device types that 
> > > > > > we have for SIOV, VFIO based approach works well without having to re-invent 
> > > > > > another way to do the same things. Not looking for a shortcut by any means, 
> > > > > > but we need to plan around existing hardware though. Looks like vDPA took 
> > > > > > some shortcuts then to not abstract iommu uAPI instead :-)? When all
> > > > > > necessary hardware was available.. This would be a solved puzzle. 
> > > > > 
> > > > > I think it is the opposite, vIOMMU and related has outgrown VFIO as
> > > > > the "home" and needs to stand alone.
> > > > > 
> > > > > Apparently the HW that will need PASID for vDPA is Intel HW, so if
> > > > 
> > > > So just to make this clear, I did check internally if there are any plans
> > > > for vDPA + SVM. There are none at the moment. 
> > > 
> > > Not SVM, SIOV.
> > 
> > ... And that included.. I should have said vDPA + PASID, No current plans. 
> > I have no idea who set expectations with you. Can you please put me in touch 
> > with that person, privately is fine.
> 
> It was the team that argued VDPA had to be done through VFIO - SIOV
> and PASID were among their reasons it had to be VFIO; check the list
> archives.

Humm... I could search the archives, but the point is I'm confirming that
there is no forward-looking plan!

And whoever argued that was probably basing it on a hypothetical strawman
argument that wasn't grounded in reality.

> 
> If they didn't plan to use it, bit of a strawman argument, right?

This doesn't need to continue like the debates :-) Pun intended :-)

I don't think it makes any sense to have an abstract, strawman design
discussion. Yi is looking into PASID management alone. The rest of the
IOMMU-related topics should wait until we have another *real* use case
requiring consolidation.

Contrary to your argument, vDPA going with a half-blown, device-only
IOMMU user, without considering existing abstractions like containers
and such in VFIO, is part of the reason the gap is so big at the moment.
You might not agree, but that's beside the point.

Rather than pivoting ourselves around hypothetical, strawman, non-intersecting
proposals, suggesting an architecture without having done a proof of
concept to validate it should stop. We have to ground ourselves
in reality.

For the use cases we have so far for SIOV, VFIO and mdev seem to be the
right candidates, and they address them well. You might disagree, but as
noted we all agreed to move past this.


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-21 17:51                             ` Raj, Ashok
@ 2020-10-21 18:24                               ` Jason Gunthorpe
  2020-10-21 20:03                                 ` Raj, Ashok
  2020-10-22  2:55                               ` Jason Wang
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-21 18:24 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Wed, Oct 21, 2020 at 10:51:46AM -0700, Raj, Ashok wrote:

> > If they didn't plan to use it, bit of a strawman argument, right?
> 
> This doesn't need to continue like the debates :-) Pun intended :-)
> 
> I don't think it makes any sense to have an abstract strawman argument
> design discussion. Yi is looking into for pasid management alone. Rest 
> of the IOMMU related topics should wait until we have another 
> *real* use requiring consolidation. 

Actually I'm really annoyed right now that the other Intel team wasted
quite a lot of our time arguing about vDPA and VFIO with no actual
interest in this technology.

So you'll excuse me if I'm not particularly enamored with this
discussion right now.

> Contrary to your argument, vDPA going with a half-blown, device-only
> IOMMU user, without considering existing abstractions like containers

VDPA IOMMU was done *for Intel*, as the kind of half-architected thing
you are advocating should be allowed for IDXD here. Not sure why you
think bashing that work is going to help your case here.

I'm saying Intel needs to get its architecture together and stop
creating this mess across the kernel to support Intel devices.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-21 18:24                               ` Jason Gunthorpe
@ 2020-10-21 20:03                                 ` Raj, Ashok
  2020-10-21 23:32                                   ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: Raj, Ashok @ 2020-10-21 20:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

On Wed, Oct 21, 2020 at 03:24:42PM -0300, Jason Gunthorpe wrote:
> 
> > Contrary to your argument, vDPA going with a half-blown, device-only
> > IOMMU user, without considering existing abstractions like containers
> 
> VDPA IOMMU was done *for Intel*, as the kind of half-architected thing
> you are advocating should be allowed for IDXD here. Not sure why you
> think bashing that work is going to help your case here.

I'm not bashing that work; sorry if it came out that way. It just feels
like a double standard.

I'm not sure why you tie in IDXD and VDPA here. How IDXD uses native
SVM is orthogonal to how we achieve mdev passthrough to guest and vSVM. 
We visited that exact thing multiple times. Doing SVM is quite simple and
doesn't carry the weight of the long list of things (Kevin explained this
in detail not too long ago) we need to accomplish for mdev passthrough.

For SVM, the driver just needs access to the hardware MMIO and a bind_mm
call to get a PASID bound in the IOMMU.
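Roughly, that is (an illustrative sketch only, not IDXD code; the
iommu_sva_* signatures are from memory and approximate):

/*
 * Illustrative sketch of the native SVM path in a device driver, not IDXD
 * code. iommu_sva_* signatures are from memory and approximate.
 */
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/sched.h>

static struct iommu_sva *example_sva_handle;

static int example_enable_svm(struct device *dev, u32 *pasid)
{
	/* Bind the current process mm to the device through the IOMMU. */
	example_sva_handle = iommu_sva_bind_device(dev, current->mm, NULL);
	if (IS_ERR(example_sva_handle))
		return PTR_ERR(example_sva_handle);

	/* Program this PASID into the device via MMIO / work descriptors. */
	*pasid = iommu_sva_get_pasid(example_sva_handle);
	return 0;
}

static void example_disable_svm(void)
{
	iommu_sva_unbind_device(example_sva_handle);
}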

For IDXD, creating passthrough devices for guest access and vSVM goes
through the VFIO path.

For guest SVM, we expose mdevs to the guest OS, and idxd in the guest
provides vSVM services. vSVM is *not* built around native SVM interfaces.

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-21 20:03                                 ` Raj, Ashok
@ 2020-10-21 23:32                                   ` Jason Gunthorpe
  2020-10-21 23:53                                     ` Raj, Ashok
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-10-21 23:32 UTC (permalink / raw)
  To: Raj, Ashok
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan

On Wed, Oct 21, 2020 at 01:03:15PM -0700, Raj, Ashok wrote:

> I'm not sure why you tie in IDXD and VDPA here. How IDXD uses native
> SVM is orthogonal to how we achieve mdev passthrough to guest and
> vSVM.

Everyone assumes that vIOMMU and SIOV aka PASID is going to be needed
on the VDPA side as well, I think that is why JasonW brought this up
in the first place.

We may not see vSVA for VDPA, but that seems like some special sub
mode of all the other vIOMMU and PASID stuff, and not a completely
distinct thing.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-21 23:32                                   ` Jason Gunthorpe
@ 2020-10-21 23:53                                     ` Raj, Ashok
  0 siblings, 0 replies; 55+ messages in thread
From: Raj, Ashok @ 2020-10-21 23:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, Jason Wang, alex.williamson, eric.auger,
	baolu.lu, joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin, Zhu, Lingshan, Ashok Raj

On Wed, Oct 21, 2020 at 08:32:18PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 21, 2020 at 01:03:15PM -0700, Raj, Ashok wrote:
> 
> > I'm not sure why you tie in IDXD and VDPA here. How IDXD uses native
> > SVM is orthogonal to how we achieve mdev passthrough to guest and
> > vSVM.
> 
> Everyone assumes that vIOMMU and SIOV aka PASID is going to be needed
> on the VDPA side as well, I think that is why JasonW brought this up
> in the first place.

True. To that effect we are working on moving PASID allocation outside
of VFIO, so that both agents, VFIO and vDPA with PASID, when that becomes
available, can share one way to allocate and manage PASIDs from user
space.

Since the IOASID allocator is almost standalone, this is possible.
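For illustration only (signatures from memory; the ioasid_set and range
handling below are assumptions), the shared allocation path could be as
thin as:

/*
 * Illustration only: a passthrough framework (VFIO or vDPA) allocating a
 * guest PASID from the standalone IOASID allocator. Signatures are from
 * memory; the ioasid_set and the 1..1023 range are assumptions.
 */
#include <linux/ioasid.h>

static ioasid_t example_alloc_guest_pasid(struct ioasid_set *vm_set,
					   void *private)
{
	/* The usable range would come from the platform/vIOMMU. */
	return ioasid_alloc(vm_set, 1, 1023, private);
}

static void example_free_guest_pasid(ioasid_t pasid)
{
	if (pasid != INVALID_IOASID)
		ioasid_free(pasid);
}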

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-21 17:51                             ` Raj, Ashok
  2020-10-21 18:24                               ` Jason Gunthorpe
@ 2020-10-22  2:55                               ` Jason Wang
  2020-10-22  3:54                                 ` Liu, Yi L
  1 sibling, 1 reply; 55+ messages in thread
From: Jason Wang @ 2020-10-22  2:55 UTC (permalink / raw)
  To: Raj, Ashok, Jason Gunthorpe
  Cc: Liu, Yi L, Tian, Kevin, alex.williamson, eric.auger, baolu.lu,
	joro, jacob.jun.pan, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Michael S. Tsirkin, Zhu,
	Lingshan


On 2020/10/22 1:51 AM, Raj, Ashok wrote:
> On Wed, Oct 21, 2020 at 08:48:29AM -0300, Jason Gunthorpe wrote:
>> On Tue, Oct 20, 2020 at 01:27:13PM -0700, Raj, Ashok wrote:
>>> On Tue, Oct 20, 2020 at 05:14:03PM -0300, Jason Gunthorpe wrote:
>>>> On Tue, Oct 20, 2020 at 01:08:44PM -0700, Raj, Ashok wrote:
>>>>> On Tue, Oct 20, 2020 at 04:55:57PM -0300, Jason Gunthorpe wrote:
>>>>>> On Tue, Oct 20, 2020 at 12:51:46PM -0700, Raj, Ashok wrote:
>>>>>>> I think we agreed (or agree to disagree and commit) for device types that
>>>>>>> we have for SIOV, VFIO based approach works well without having to re-invent
>>>>>>> another way to do the same things. Not looking for a shortcut by any means,
>>>>>>> but we need to plan around existing hardware though. Looks like vDPA took
>>>>>>> some shortcuts then to not abstract iommu uAPI instead :-)? When all
>>>>>>> necessary hardware was available.. This would be a solved puzzle.
>>>>>> I think it is the opposite, vIOMMU and related has outgrown VFIO as
>>>>>> the "home" and needs to stand alone.
>>>>>>
>>>>>> Apparently the HW that will need PASID for vDPA is Intel HW, so if
>>>>> So just to make this clear, I did check internally if there are any plans
>>>>> for vDPA + SVM. There are none at the moment.
>>>> Not SVM, SIOV.
>>> ... And that included.. I should have said vDPA + PASID, No current plans.
>>> I have no idea who set expectations with you. Can you please put me in touch
>>> with that person, privately is fine.
>> It was the team that argued VDPA had to be done through VFIO - SIOV
>> and PASID were among their reasons it had to be VFIO; check the list
>> archives.
> Humm... I could search the archives, but the point is I'm confirming that
> there is no forward-looking plan!
>
> And whoever argued that was probably basing it on a hypothetical strawman
> argument that wasn't grounded in reality.
>
>> If they didn't plan to use it, bit of a strawman argument, right?
> This doesn't need to continue like the debates :-) Pun intended :-)
>
> I don't think it makes any sense to have an abstract, strawman design
> discussion. Yi is looking into PASID management alone. The rest of the
> IOMMU-related topics should wait until we have another *real* use case
> requiring consolidation.
>
> Contrary to your argument, vDPA going with a half-blown, device-only
> IOMMU user, without considering existing abstractions like containers
> and such in VFIO, is part of the reason the gap is so big at the moment.
> You might not agree, but that's beside the point.


Can you explain why it must care about VFIO abstractions? vDPA is trying
to hide device details, which is fundamentally different from what VFIO
wants to do. vDPA allows the parent to deal with the IOMMU itself, and if
necessary the parent can talk to the IOMMU drivers directly via the IOMMU
APIs.


>   
>
> Rather than pivoting ourselves around hypothetical, strawman, non-intersecting
> proposals, suggesting an architecture without having done a proof of
> concept to validate it should stop. We have to ground ourselves
> in reality.


The reality is VFIO should not be the only user of (v)SVA/SIOV/PASID.
The kernel already has other users, such as GPU drivers and uacce.


>
> For the use cases we have so far for SIOV, VFIO and mdev seem to be the
> right candidates, and they address them well. You might disagree, but as
> noted we all agreed to move past this.


The mdev is not perfect for sure, but it's another topic.

If you (Intel) don't have a plan to do vDPA, you should not prevent other
vendors from supporting PASID-capable hardware through a non-VFIO
subsystem/uAPI on top of your SIOV architecture, should you?

So if Intel is willing to collaborate on the PoC, I'd be happy to help.
E.g. it's not hard to have a PASID-capable virtio device through QEMU,
and we can start from there.

Thanks


>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* RE: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-22  2:55                               ` Jason Wang
@ 2020-10-22  3:54                                 ` Liu, Yi L
  2020-10-22  4:38                                   ` Jason Wang
  0 siblings, 1 reply; 55+ messages in thread
From: Liu, Yi L @ 2020-10-22  3:54 UTC (permalink / raw)
  To: Jason Wang, Raj, Ashok, Jason Gunthorpe
  Cc: Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro,
	jacob.jun.pan, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, Wu,
	Hao, stefanha, iommu, kvm, Michael S. Tsirkin, Zhu, Lingshan

Hi Jason,

> From: Jason Wang <jasowang@redhat.com>
> Sent: Thursday, October 22, 2020 10:56 AM
> 
[...]
> If you (Intel) don't have a plan to do vDPA, you should not prevent other vendors
> from supporting PASID-capable hardware through a non-VFIO subsystem/uAPI
> on top of your SIOV architecture, should you?

yes, that's true.

> So if Intel is willing to collaborate on the PoC, I'd be happy to help. E.g. it's not
> hard to have a PASID-capable virtio device through QEMU, and we can start from
> there.

Actually, I'm already doing a PoC to move the PASID allocation/free interface
out of VFIO, so that other users can use it as well. I think this is also
what you suggested previously. :-) I'll send it out when it's ready and seek
your help in maturing it; a rough sketch of the direction is below. Does that
sound good to you?
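Everything here ("/dev/ioasid", the ioctl numbers, the struct layout) is a
made-up placeholder until the PoC settles:

/*
 * Purely illustrative userspace sketch of a standalone PASID allocation
 * interface outside VFIO. The device node, ioctl numbers and structure
 * below are hypothetical placeholders, not a decided uAPI.
 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

struct ioasid_alloc_req {			/* hypothetical */
	uint32_t min;
	uint32_t max;
	uint32_t pasid;				/* returned on success */
};

#define IOASID_IOCTL_ALLOC _IOWR('i', 0x00, struct ioasid_alloc_req)
#define IOASID_IOCTL_FREE  _IOW('i', 0x01, uint32_t)

int main(void)
{
	struct ioasid_alloc_req req = { .min = 1, .max = 1023 };
	int fd = open("/dev/ioasid", O_RDWR);	/* hypothetical node */

	if (fd < 0 || ioctl(fd, IOASID_IOCTL_ALLOC, &req) < 0) {
		perror("pasid alloc");
		return 1;
	}
	printf("allocated PASID %u\n", req.pasid);

	/* VFIO or vDPA would then bind page tables to req.pasid. */
	ioctl(fd, IOASID_IOCTL_FREE, &req.pasid);
	close(fd);
	return 0;
}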

Regards,
Yi Liu

> 
> Thanks
> 
> 
> >


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-22  3:54                                 ` Liu, Yi L
@ 2020-10-22  4:38                                   ` Jason Wang
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Wang @ 2020-10-22  4:38 UTC (permalink / raw)
  To: Liu, Yi L, Raj, Ashok, Jason Gunthorpe
  Cc: Tian, Kevin, alex.williamson, eric.auger, baolu.lu, joro,
	jacob.jun.pan, Tian, Jun J, Sun, Yi Y, jean-philippe, peterx, Wu,
	Hao, stefanha, iommu, kvm, Michael S. Tsirkin, Zhu, Lingshan


On 2020/10/22 11:54 AM, Liu, Yi L wrote:
> Hi Jason,
>
>> From: Jason Wang <jasowang@redhat.com>
>> Sent: Thursday, October 22, 2020 10:56 AM
>>
> [...]
>> If you (Intel) don't have a plan to do vDPA, you should not prevent other vendors
>> from supporting PASID-capable hardware through a non-VFIO subsystem/uAPI
>> on top of your SIOV architecture, should you?
> yes, that's true.
>
>> So if Intel is willing to collaborate on the PoC, I'd be happy to help. E.g. it's not
>> hard to have a PASID-capable virtio device through QEMU, and we can start from
>> there.
> Actually, I'm already doing a PoC to move the PASID allocation/free interface
> out of VFIO, so that other users can use it as well. I think this is also
> what you suggested previously. :-) I'll send it out when it's ready and seek
> your help in maturing it. Does that sound good to you?


Yes, fine with me.

Thanks


>
> Regards,
> Yi Liu
>
>> Thanks
>>
>>


^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-10-12  8:38 (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Tian, Kevin
                   ` (2 preceding siblings ...)
  2020-10-14  3:16 ` Tian, Kevin
@ 2020-11-03  9:52 ` joro
  2020-11-03 12:56   ` Jason Gunthorpe
  3 siblings, 1 reply; 55+ messages in thread
From: joro @ 2020-11-03  9:52 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Jason Wang, Liu, Yi L, alex.williamson, eric.auger, baolu.lu,
	jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y, jean-philippe,
	peterx, Wu, Hao, stefanha, iommu, kvm, Jason Gunthorpe,
	Michael S. Tsirkin

On Mon, Oct 12, 2020 at 08:38:54AM +0000, Tian, Kevin wrote:
> > From: Jason Wang <jasowang@redhat.com>

> > Jason suggest something like /dev/sva. There will be a lot of other
> > subsystems that could benefit from this (e.g vDPA).

Honestly, I fail to see the benefit of offloading these IOMMU specific
setup tasks to user-space.

The ways PASID and the device partitioning it allows are used are very
device specific. A GPU will be partitioned completely differently from a
network card. So the device drivers should use the (v)SVA APIs to set up
the partitioning in a way which makes sense for the device.

And VFIO is of course a user by itself, as it allows assigning device
partitions to guests, or even assigning complete devices and letting the
guests partition them themselves.

So having said this, what is the benefit of exposing those SVA internals
to user-space?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03  9:52 ` joro
@ 2020-11-03 12:56   ` Jason Gunthorpe
  2020-11-03 13:18     ` joro
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-11-03 12:56 UTC (permalink / raw)
  To: joro
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 10:52:09AM +0100, joro@8bytes.org wrote:
> So having said this, what is the benefit of exposing those SVA internals
> to user-space?

Only the device use of the PASID is device specific, the actual PASID
and everything on the IOMMU side is generic.

There is enough API there it doesn't make sense to duplicate it into
every single SVA driver.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 12:56   ` Jason Gunthorpe
@ 2020-11-03 13:18     ` joro
  2020-11-03 13:23       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: joro @ 2020-11-03 13:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 08:56:43AM -0400, Jason Gunthorpe wrote:
> On Tue, Nov 03, 2020 at 10:52:09AM +0100, joro@8bytes.org wrote:
> > So having said this, what is the benefit of exposing those SVA internals
> > to user-space?
> 
> Only the device use of the PASID is device specific, the actual PASID
> and everything on the IOMMU side is generic.
> 
> There is enough API there it doesn't make sense to duplicate it into
> every single SVA driver.

What generic things have to be done by the drivers besides
allocating/deallocating PASIDs and binding an address space to it?

Is there anything which isn't better handled in a kernel-internal
library which drivers just use?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 13:18     ` joro
@ 2020-11-03 13:23       ` Jason Gunthorpe
  2020-11-03 14:03         ` joro
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-11-03 13:23 UTC (permalink / raw)
  To: joro
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 02:18:52PM +0100, joro@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 08:56:43AM -0400, Jason Gunthorpe wrote:
> > On Tue, Nov 03, 2020 at 10:52:09AM +0100, joro@8bytes.org wrote:
> > > So having said this, what is the benefit of exposing those SVA internals
> > > to user-space?
> > 
> > Only the device use of the PASID is device specific, the actual PASID
> > and everything on the IOMMU side is generic.
> > 
> > There is enough API there it doesn't make sense to duplicate it into
> > every single SVA driver.
> 
> What generic things have to be done by the drivers besides
> allocating/deallocating PASIDs and binding an address space to it?
> 
> Is there anything which isn't better handled in a kernel-internal
> library which drivers just use?

Userspace needs fine-grained control over the composition of the page
table behind the PASID; 1:1 with the mm_struct is only one use case.

Userspace needs to be able to handle IOMMU faults, apparently

The Intel guys had a bunch of other stuff too; looking through the new
API they are proposing for VFIO gives some flavour of what they think is
needed.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 13:23       ` Jason Gunthorpe
@ 2020-11-03 14:03         ` joro
  2020-11-03 14:06           ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: joro @ 2020-11-03 14:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 09:23:35AM -0400, Jason Gunthorpe wrote:
> Userspace needs fine grained control over the composition of the page
> table behind the PASID, 1:1 with the mm_struct is only one use case.

VFIO already offers an interface for that. It shouldn't be too
complicated to expand that for PASID-bound page-tables.

> Userspace needs to be able to handle IOMMU faults, apparently

Could be implemented by a fault-fd handed out by VFIO.
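(Purely to illustrate the idea, not a proposed interface: a consumer of such
a fault-fd could be as small as the loop below, with the record layout
entirely made up.)

/*
 * Illustration only: consuming IOMMU faults from a hypothetical fault-fd.
 * The record layout and its semantics are made up for this example.
 */
#include <poll.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

struct fault_record {			/* hypothetical wire format */
	uint32_t pasid;
	uint32_t flags;
	uint64_t address;
};

static void handle_faults(int fault_fd)
{
	struct pollfd pfd = { .fd = fault_fd, .events = POLLIN };
	struct fault_record rec;

	while (poll(&pfd, 1, -1) > 0) {
		if (read(fault_fd, &rec, sizeof(rec)) != sizeof(rec))
			break;
		/* Relay into the guest vIOMMU, or fix up and respond. */
		printf("fault: pasid=%u addr=0x%llx\n", rec.pasid,
		       (unsigned long long)rec.address);
	}
}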

> The Intel guys had a bunch of other stuff too, looking through the new
> API they are proposing for vfio gives some flavour what they think is
> needed..

I really don't think that user-space should have to deal with details
like PASIDs or other IOMMU internals, unless absolutely necessary. This
is an OS we work on, and the idea behind an OS is to abstract the
hardware away.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 14:03         ` joro
@ 2020-11-03 14:06           ` Jason Gunthorpe
  2020-11-03 14:35             ` joro
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-11-03 14:06 UTC (permalink / raw)
  To: joro
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 03:03:18PM +0100, joro@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 09:23:35AM -0400, Jason Gunthorpe wrote:
> > Userspace needs fine grained control over the composition of the page
> > table behind the PASID, 1:1 with the mm_struct is only one use case.
> 
> VFIO already offers an interface for that. It shouldn't be too
> complicated to expand that for PASID-bound page-tables.
> 
> > Userspace needs to be able to handle IOMMU faults, apparently
> 
> Could be implemented by a fault-fd handed out by VFIO.

The point is that other places beyond VFIO need this

> I really don't think that user-space should have to deal with details
> like PASIDs or other IOMMU internals, unless absolutely necessary. This
> is an OS we work on, and the idea behind an OS is to abstract the
> hardware away.

Sure, but sometimes it is necessary, and in those cases the answer
can't be "rewrite a SVA driver to use vfio"

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 14:06           ` Jason Gunthorpe
@ 2020-11-03 14:35             ` joro
  2020-11-03 15:22               ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: joro @ 2020-11-03 14:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 10:06:42AM -0400, Jason Gunthorpe wrote:
> The point is that other places beyond VFIO need this

Which and why?

> Sure, but sometimes it is necessary, and in those cases the answer
> can't be "rewrite a SVA driver to use vfio"

This is getting too abstract. Can you come up with an example where
handling this in VFIO or an endpoint device kernel driver does not work?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 14:35             ` joro
@ 2020-11-03 15:22               ` Jason Gunthorpe
  2020-11-03 16:55                 ` joro
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-11-03 15:22 UTC (permalink / raw)
  To: joro
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 03:35:32PM +0100, joro@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 10:06:42AM -0400, Jason Gunthorpe wrote:
> > The point is that other places beyond VFIO need this
> 
> Which and why?
>
> > Sure, but sometimes it is necessary, and in those cases the answer
> > can't be "rewrite a SVA driver to use vfio"
> 
> This is getting too abstract. Can you come up with an example where
> handling this in VFIO or an endpoint device kernel driver does not work?

This whole thread was brought up by IDXD, which has an SVA driver and
now wants to add a vfio-mdev driver too. SVA devices that want to be
plugged into VMs are going to be common - an architecture where an SVA
driver cannot cover the KVM case seems problematic.

Yes, everything can have an SVA driver and a vfio-mdev; it works just
fine, but it is not very clean or simple.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 15:22               ` Jason Gunthorpe
@ 2020-11-03 16:55                 ` joro
  2020-11-03 17:48                   ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: joro @ 2020-11-03 16:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 11:22:23AM -0400, Jason Gunthorpe wrote:
> This whole thread was brought up by IDXD which has a SVA driver and
> now wants to add a vfio-mdev driver too. SVA devices that want to be
> plugged into VMs are going to be common - this architecture that a SVA
> driver cannot cover the kvm case seems problematic.

Isn't that the same pattern as having separate drivers for VFs and the
parent device in SR-IOV?

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 16:55                 ` joro
@ 2020-11-03 17:48                   ` Jason Gunthorpe
  2020-11-03 19:14                     ` joro
  0 siblings, 1 reply; 55+ messages in thread
From: Jason Gunthorpe @ 2020-11-03 17:48 UTC (permalink / raw)
  To: joro
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 05:55:40PM +0100, joro@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 11:22:23AM -0400, Jason Gunthorpe wrote:
> > This whole thread was brought up by IDXD which has a SVA driver and
> > now wants to add a vfio-mdev driver too. SVA devices that want to be
> > plugged into VMs are going to be common - this architecture that a SVA
> > driver cannot cover the kvm case seems problematic.
> 
> Isn't that the same pattern as having separate drivers for VFs and the
> parent device in SR-IOV?

I think the same PCI driver with a small flag to support the PF or
VF is not the same as two completely different drivers in different
subsystems

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 17:48                   ` Jason Gunthorpe
@ 2020-11-03 19:14                     ` joro
  2020-11-04 19:29                       ` Jason Gunthorpe
  0 siblings, 1 reply; 55+ messages in thread
From: joro @ 2020-11-03 19:14 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 01:48:51PM -0400, Jason Gunthorpe wrote:
> I think the same PCI driver with a small flag to support the PF or
> VF is not the same as two completely different drivers in different
> subsystems

There are counter-examples: ixgbe vs. ixgbevf.

Note also that a single driver can support both an SVA device and an
mdev device, sharing code for accessing parts of the device like queues
and for handling interrupts.

Regards,

	Joerg

^ permalink raw reply	[flat|nested] 55+ messages in thread

* Re: (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs
  2020-11-03 19:14                     ` joro
@ 2020-11-04 19:29                       ` Jason Gunthorpe
  0 siblings, 0 replies; 55+ messages in thread
From: Jason Gunthorpe @ 2020-11-04 19:29 UTC (permalink / raw)
  To: joro
  Cc: Tian, Kevin, Jason Wang, Liu, Yi L, alex.williamson, eric.auger,
	baolu.lu, jacob.jun.pan, Raj, Ashok, Tian, Jun J, Sun, Yi Y,
	jean-philippe, peterx, Wu, Hao, stefanha, iommu, kvm,
	Michael S. Tsirkin

On Tue, Nov 03, 2020 at 08:14:29PM +0100, joro@8bytes.org wrote:
> On Tue, Nov 03, 2020 at 01:48:51PM -0400, Jason Gunthorpe wrote:
> > I think the same PCI driver with a small flag to support the PF or
> > VF is not the same as two completely different drivers in different
> > subsystems
> 
> There are counter-examples: ixgbe vs. ixgbevf.
>
> Note that also a single driver can support both, an SVA device and an
> mdev device, sharing code for accessing parts of the device like queues
> and handling interrupts.

Needing an mdev device at all is the larger issue; mdev means the
kernel must carry a lot of emulation code, depending on how the SVA
device is designed. E.g. creating queues may require an emulated BAR.

Shifting that code to userspace and having a single clean 'SVA'
interface from the kernel for the device makes a lot more sense,
especially from a security perspective.

Forcing all vIOMMU stuff to only use VFIO permanently closes this as
an option.

Jason

^ permalink raw reply	[flat|nested] 55+ messages in thread

end of thread, other threads:[~2020-11-04 19:29 UTC | newest]

Thread overview: 55+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-12  8:38 (proposal) RE: [PATCH v7 00/16] vfio: expose virtual Shared Virtual Addressing to VMs Tian, Kevin
2020-10-13  6:22 ` Jason Wang
2020-10-14  3:08   ` Tian, Kevin
2020-10-14 23:10     ` Alex Williamson
2020-10-15  7:02       ` Jason Wang
2020-10-15  6:52     ` Jason Wang
2020-10-15  7:58       ` Tian, Kevin
2020-10-15  8:40         ` Jason Wang
2020-10-15 10:14           ` Liu, Yi L
2020-10-20  6:18             ` Jason Wang
2020-10-20  8:19               ` Liu, Yi L
2020-10-20  9:19                 ` Jason Wang
2020-10-20  9:40                   ` Liu, Yi L
2020-10-20 13:54                     ` Jason Gunthorpe
2020-10-20 14:00                       ` Liu, Yi L
2020-10-20 14:05                         ` Jason Gunthorpe
2020-10-20 14:09                           ` Liu, Yi L
2020-10-13 10:27 ` Jean-Philippe Brucker
2020-10-14  2:11   ` Tian, Kevin
2020-10-14  3:16 ` Tian, Kevin
2020-10-16 15:36   ` Jason Gunthorpe
2020-10-19  8:39     ` Liu, Yi L
2020-10-19 14:25       ` Jason Gunthorpe
2020-10-20 10:21         ` Liu, Yi L
2020-10-20 14:02           ` Jason Gunthorpe
2020-10-20 14:19             ` Liu, Yi L
2020-10-21  2:21               ` Jason Wang
2020-10-20 16:24             ` Raj, Ashok
2020-10-20 17:03               ` Jason Gunthorpe
2020-10-20 19:51                 ` Raj, Ashok
2020-10-20 19:55                   ` Jason Gunthorpe
2020-10-20 20:08                     ` Raj, Ashok
2020-10-20 20:14                       ` Jason Gunthorpe
2020-10-20 20:27                         ` Raj, Ashok
2020-10-21 11:48                           ` Jason Gunthorpe
2020-10-21 17:51                             ` Raj, Ashok
2020-10-21 18:24                               ` Jason Gunthorpe
2020-10-21 20:03                                 ` Raj, Ashok
2020-10-21 23:32                                   ` Jason Gunthorpe
2020-10-21 23:53                                     ` Raj, Ashok
2020-10-22  2:55                               ` Jason Wang
2020-10-22  3:54                                 ` Liu, Yi L
2020-10-22  4:38                                   ` Jason Wang
2020-11-03  9:52 ` joro
2020-11-03 12:56   ` Jason Gunthorpe
2020-11-03 13:18     ` joro
2020-11-03 13:23       ` Jason Gunthorpe
2020-11-03 14:03         ` joro
2020-11-03 14:06           ` Jason Gunthorpe
2020-11-03 14:35             ` joro
2020-11-03 15:22               ` Jason Gunthorpe
2020-11-03 16:55                 ` joro
2020-11-03 17:48                   ` Jason Gunthorpe
2020-11-03 19:14                     ` joro
2020-11-04 19:29                       ` Jason Gunthorpe
