All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC] virtio-iommu version 0.4
@ 2017-08-04 18:19 Jean-Philippe Brucker
  0 siblings, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

This is the continuation of my proposal for virtio-iommu, the para-
virtualized IOMMU. Here is a summary of the changes since last time [1]:

* The virtio-iommu document now resembles an actual specification. It is
  split into a formal description of the virtio device, and implementation
  notes. Please find sources and binaries at [2].

* Added a probe request to describe to the guest different properties that
  do not fit in firmware or in the virtio config space. This is a
  necessary stepping stone for extending the virtio-iommu.

* There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
  Bhushan.

You can find the Linux driver and kvmtool device at [4] and [5]. I
plan to rework driver and kvmtool device slightly before sending the
patches.

To understand the virtio-iommu, I advise to first read introduction and
motivation, then skim through implementation notes and finally look at the
device specification.

I wasn't sure how to organize the review. For those who prefer to comment
inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
to this thread. They are the biggest chunks of the document. But LaTeX
isn't very pleasant to read, so you can simply send a list of comments in
relation to section numbers and a few words of context, we'll manage.

---
Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
more easily differences since the RFC (see [6]), but haven't been made
public so far. This is the first public posting since initial proposal
[1], and the following describes all changes.

## v0.1 ##

Content is the same as the RFC, but formatted to LaTeX. 'make' generates
one PDF and one HTML document.

## v0.2 ##

Add introductions, improve topology example and firmware description based
on feedback and a number of useful discussions.

## v0.3 ##

Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
the device and driver behaviour. Unmap semantics are consolidated; they
are now closer to VFIO Type1 v2 semantics.

## v0.4 ##

Introduce PROBE requests. They provide per-endpoint information to the
driver that couldn't be described otherwise.

For the moment, they allow to handle MSIs on x86 virtual platforms (see
3.2). To do that we communicate reserved IOVA regions, that will also be
useful for describing regions that cannot be mapped for a given endpoint,
for instance addresses that correspond to a PCI bridge window.

Introducing such a large framework for this tiny feature may seem
overkill, but it is needed for future extensions of the virtio-iommu and I
believe it really is worth the effort.

## Future ##

Other extensions are in preparation. I won't detail them here because v0.4
already is a lot to digest, but in short, building on top of PROBE:

* First, since the IOMMU is paravirtualized, the device can expose some
  properties of the physical topology to the guest, and let it allocate
  resources more efficiently. For example, when the virtio-iommu manages
  both physical and emulated endpoints, with different underlying IOMMUs,
  we now have a way to describe multiple page and block granularities,
  instead of forcing the guest to use the most restricted one for all
  endpoints. This will most likely be in v0.5.

* Then on top of that, a major improvement will describe hardware
  acceleration features available to the guest. There is what I call "Page
  Table Handover" (or simply, from the host POV, "Nested"), the ability
  for the guest to manipulate its own page tables instead of sending
  MAP/UNMAP requests to the host. This, along with IO Page Fault
  reporting, will also permit SVM virtualization on different platforms.

Thanks,
Jean

[1] http://www.spinics.net/lists/kvm/msg147990.html
[2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
    http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
    I reiterate the disclaimers: don't use this document as a reference,
    it's a draft. It's also not an OASIS document yet. It may be riddled
    with mistakes. As this is a working draft, it is unstable and I do not
    guarantee backward compatibility of future versions.
[3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
[4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
    Warning: UAPI headers have changed! They didn't follow the spec,
    please update. (Use branch v0.1, that has the old headers, for the
    Qemu prototype [3])
[5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
    Warning: command-line has changed! Use --viommu vfio[,opts] and
    --viommu virtio[,opts] to instantiate a device.
[6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
                   ` (6 preceding siblings ...)
  2017-09-12 17:13 ` Auger Eric
@ 2017-10-03 13:04 ` Auger Eric
  7 siblings, 0 replies; 26+ messages in thread
From: Auger Eric @ 2017-10-03 13:04 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
When rebasing the v0.4 driver on master I observe a regression: commands
are not received properly by QEMU (typically an attach command is
received with a type of 0). After a bisection of the guest kernel the
first commit the problem appears is:

commit e3067861ba6650a566a6273738c23c956ad55c02
arm64: add basic VMAP_STACK support

After reverting this patch, things resume working.

I observe the problem with a 4kB page guest kernel.

Thanks

Eric


> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-20  9:37     ` Auger Eric
  2017-09-21 10:29       ` Jean-Philippe Brucker
@ 2017-09-21 10:29       ` Jean-Philippe Brucker
  1 sibling, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-21 10:29 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

On 20/09/17 10:37, Auger Eric wrote:
> Hi Jean,
> On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
>> Hi Eric,
>>
>> On 12/09/17 18:13, Auger Eric wrote:
>>> 2.6.7
>>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>>> At the moment struct virtio_iommu_req_probe flags is missing in your
>>> header. As such I understood the ACK protocol was not implemented by the
>>> driver in your branch.
>>
>> Uh indeed. And yet I could swear I've written that code... somewhere. I
>> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>>
>>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>>> header too.
>>
>> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
>> (though it is a mouthful).
>>
>>> 2.6.8.2:
>>> - I am really confused about what the device should report as resv
>>> regions depending on the PE nature (VFIO or not VFIO)
>>>
>>> In other iommu drivers, the resv regions are populated by the iommu
>>> driver through its get_resv_regions callback. They are usually composed
>>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>>> specific (device specific) reserved regions:
>>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>>> are the guest reserved regions.
>>>
>>> First in the current virtio-iommu driver I don't see the
>>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>>> driver should compute the non IOMMU specific MSI regions ie. this is
>>> not the responsability of the virtio-iommu device.
>>
>> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
>> mapping the MSI doorbell. But the driver has to know whether the doorbell
>> region is translated or bypassed.
> Sorry I was talking about *non* IOMMU specific MSI regions, typically
> the regions corresponding to guest PCI host bridge windows. This is
> normally computed in the iommu driver and I didn't that that in the
> existing virtio-iommu driver.

Ah right, I don't think the virtio-iommu device has to report the windows
of the emulated host bridge. RESV is useful for things that are not
obvious to the guest, for instance the physical PCI bridges that are
hidden from the guest.

It's an interesting point though, I can imagine non-Linux guest that are
not as well-equipped with dealing with things like PCI bridge windows, and
could benefit from the device reporting them.

In the end it is up to the device implementation to decide what regions to
report, and make sure that the guest is aware of the various traps in IOVA
space. For things like emulated bridges, the device can expect the guest
to find out about them. For physical bridges/SW_MSI of the host, it should
report the region and make sure that the guest doesn't map them. I'll add
a few more examples to the Implementation Notes, but I suspect reading
your Qemu source code will always be more helpful to people.

>>> Then why is it more the job of the device to return the guest iommu
>>> specific region rather than the driver itself?
>>
>> The MSI region is architectural on x86 IOMMUs, but
>> implementation-defined on virtio-iommu. It depends which platform the host
>> is emulating. In Linux, x86 IOMMU drivers register the bypass region
>> because there always is an IOAPIC on the other end, with a fixed MSI
>> address. But virtio-iommu may be either behind a GIC, an APIC or some
>> other IRQ chip.
>>
>> The driver *could* go over all the irqchips/platforms it knows and try to
>> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
>> them, but it would look horrible. I much prefer having a well-defined way
>> of doing this, so a description from the device.
> 
> This means I must have target specific code in the virtio-iommu device
> which is unusual, right? I was initially thinking you could handle that
> on the driver side using a config set for ARM|ARM64. But on the other
> hand you should communicate the info to the device ...

But the device has to know that it has a region that DMA transactions
bypass, right? If you want to implement MSI bypass, then you already have
to add a special case to your device code, and reporting it in a probe
shouldn't require a lot more work. For example, amdvi_translate() has a
special case for amdvi_is_interrupt_addr().

>>> Then I understand this is the responsability of the virtio-iommu device
>>> to gather information about the host resv regions in case of VFIO EP.
>>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
> So to summarize if the probe request is sent to an emulated device, we
> should return the target specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
> 
> For a VFIO device, we would return all reserved regions of the group the
> device belongs to. Is that correct?

Yes

>>
>>> I really think the spec should clarify what exact resv regions the
>>> device should return in case of VFIO device and non VFIO device.
>>
>> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
>> example in Implementation Notes. Do you think the MSI examples at the end
>> need improvement as well? I can try to explain that RESV_MSI regions in
>> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
>> regions from the host.
> 
> I think I would expect an explanation detailing returned reserved
> regions for pure emulated devices and HW/VFIO devices.
> 
> Another unrelated remark:
> - you should add a permission violation error.

Do you mean reporting errors from device to driver? It's on the top of my
list, I might add it to v0.5 (without recoverable/PRI mode at the moment)

> - wrt the probe request ACK protocol, this looks pretty heavy as both
> the driver and the device need to parse the req_probe buffer. The device
> need to fill in the output buffer and then read the same info on the
> input buffer. Couldn't we imagine something simpler?

I can't, but I can get rid of the ack protocol and leave space if someone
needs to re-introduce it. I quite like it but don't think it'll be useful
to the few features I'm planning to add. A future re-introduction will
simply have to add a global feature bit VIRTIO_IOMMU_F_PROBE_ACK.

Note that the device isn't required to implement the ACK phase, we simply
give it a chance to verify that the driver understood its information. It
could simply check if the ACK bit is present and ignore the request in
this case.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-20  9:37     ` Auger Eric
@ 2017-09-21 10:29       ` Jean-Philippe Brucker
  2017-09-21 10:29       ` Jean-Philippe Brucker
  1 sibling, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-21 10:29 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

On 20/09/17 10:37, Auger Eric wrote:
> Hi Jean,
> On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
>> Hi Eric,
>>
>> On 12/09/17 18:13, Auger Eric wrote:
>>> 2.6.7
>>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>>> At the moment struct virtio_iommu_req_probe flags is missing in your
>>> header. As such I understood the ACK protocol was not implemented by the
>>> driver in your branch.
>>
>> Uh indeed. And yet I could swear I've written that code... somewhere. I
>> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
>>
>>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>>> header too.
>>
>> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
>> (though it is a mouthful).
>>
>>> 2.6.8.2:
>>> - I am really confused about what the device should report as resv
>>> regions depending on the PE nature (VFIO or not VFIO)
>>>
>>> In other iommu drivers, the resv regions are populated by the iommu
>>> driver through its get_resv_regions callback. They are usually composed
>>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>>> specific (device specific) reserved regions:
>>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>>> are the guest reserved regions.
>>>
>>> First in the current virtio-iommu driver I don't see the
>>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>>> driver should compute the non IOMMU specific MSI regions ie. this is
>>> not the responsability of the virtio-iommu device.
>>
>> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
>> mapping the MSI doorbell. But the driver has to know whether the doorbell
>> region is translated or bypassed.
> Sorry I was talking about *non* IOMMU specific MSI regions, typically
> the regions corresponding to guest PCI host bridge windows. This is
> normally computed in the iommu driver and I didn't that that in the
> existing virtio-iommu driver.

Ah right, I don't think the virtio-iommu device has to report the windows
of the emulated host bridge. RESV is useful for things that are not
obvious to the guest, for instance the physical PCI bridges that are
hidden from the guest.

It's an interesting point though, I can imagine non-Linux guest that are
not as well-equipped with dealing with things like PCI bridge windows, and
could benefit from the device reporting them.

In the end it is up to the device implementation to decide what regions to
report, and make sure that the guest is aware of the various traps in IOVA
space. For things like emulated bridges, the device can expect the guest
to find out about them. For physical bridges/SW_MSI of the host, it should
report the region and make sure that the guest doesn't map them. I'll add
a few more examples to the Implementation Notes, but I suspect reading
your Qemu source code will always be more helpful to people.

>>> Then why is it more the job of the device to return the guest iommu
>>> specific region rather than the driver itself?
>>
>> The MSI region is architectural on x86 IOMMUs, but
>> implementation-defined on virtio-iommu. It depends which platform the host
>> is emulating. In Linux, x86 IOMMU drivers register the bypass region
>> because there always is an IOAPIC on the other end, with a fixed MSI
>> address. But virtio-iommu may be either behind a GIC, an APIC or some
>> other IRQ chip.
>>
>> The driver *could* go over all the irqchips/platforms it knows and try to
>> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
>> them, but it would look horrible. I much prefer having a well-defined way
>> of doing this, so a description from the device.
> 
> This means I must have target specific code in the virtio-iommu device
> which is unusual, right? I was initially thinking you could handle that
> on the driver side using a config set for ARM|ARM64. But on the other
> hand you should communicate the info to the device ...

But the device has to know that it has a region that DMA transactions
bypass, right? If you want to implement MSI bypass, then you already have
to add a special case to your device code, and reporting it in a probe
shouldn't require a lot more work. For example, amdvi_translate() has a
special case for amdvi_is_interrupt_addr().

>>> Then I understand this is the responsability of the virtio-iommu device
>>> to gather information about the host resv regions in case of VFIO EP.
>>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
>>
>> Yes, all regions reported in sysfs reserved_regions in the host would be
>> reported as RESV_T_RESERVED by virtio-iommu.
> So to summarize if the probe request is sent to an emulated device, we
> should return the target specific MSI window. We can't and don't return
> the non IOMMU specific guest reserved windows.
> 
> For a VFIO device, we would return all reserved regions of the group the
> device belongs to. Is that correct?

Yes

>>
>>> I really think the spec should clarify what exact resv regions the
>>> device should return in case of VFIO device and non VFIO device.
>>
>> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
>> example in Implementation Notes. Do you think the MSI examples at the end
>> need improvement as well? I can try to explain that RESV_MSI regions in
>> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
>> regions from the host.
> 
> I think I would expect an explanation detailing returned reserved
> regions for pure emulated devices and HW/VFIO devices.
> 
> Another unrelated remark:
> - you should add a permission violation error.

Do you mean reporting errors from device to driver? It's on the top of my
list, I might add it to v0.5 (without recoverable/PRI mode at the moment)

> - wrt the probe request ACK protocol, this looks pretty heavy as both
> the driver and the device need to parse the req_probe buffer. The device
> need to fill in the output buffer and then read the same info on the
> input buffer. Couldn't we imagine something simpler?

I can't, but I can get rid of the ack protocol and leave space if someone
needs to re-introduce it. I quite like it but don't think it'll be useful
to the few features I'm planning to add. A future re-introduction will
simply have to add a global feature bit VIRTIO_IOMMU_F_PROBE_ACK.

Note that the device isn't required to implement the ACK phase, we simply
give it a chance to verify that the driver understood its information. It
could simply check if the ACK bit is present and ignore the request in
this case.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-06 11:48   ` [virtio-dev] " Jean-Philippe Brucker
@ 2017-09-21  6:41     ` Tian, Kevin
  0 siblings, 0 replies; 26+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:41 UTC (permalink / raw)
  To: Jean-Philippe Brucker, Auger Eric, iommu, kvm, virtualization,
	virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, September 6, 2017 7:49 PM
> 
> 
> > 2.6.8.2.1
> > Multiple overlapping RESV_MEM properties are merged together. Device
> > requirement? if same types I assume?
> 
> Combination rules apply to device and driver. When the driver puts
> multiple endpoints in the same domain, combination rules apply. They will
> become important once the guest attempts to do things like combining
> endpoints with different PASID capabilities in the same domain. I'm
> considering these endpoints to be behind different physical IOMMUs.
> 
> For reserved regions, we specify what the driver should do and what the
> device should enforce with regard to mapping IOVAs of overlapping regions.
> For example, if a device has some virtual address RESV_T_MSI and an other
> device has the same virtual address RESV_T_IDENTITY, what should the
> driver do?
> 
> I think it should apply the RESV_T_IDENTITY. RESV_T_MSI is just a special
> case of RESV_T_RESERVED, it's a hint for the IRQ subsystem and doesn't
> have a meaning within a domain. From DMA mappings point of view, it is
> effectively the same as RESV_T_RESERVED. When merging
> RESV_T_RESERVED and
> RESV_T_IDENTITY, we should make it RESV_T_IDENTITY. Because it is
> required
> for one endpoint to work (the one with IDENTITY) and is not accessed by
> the other (the driver must not allocate this IOVA range for DMA).
> 
> More generally, I'd like to add the following combination table to the
> spec, that shows the resulting reserved region type after merging regions
> from two endpoints. N: no reserved region, R: RESV_T_RESERVED, M:
> RESV_T_MSI, I: RESV_T_IDENTITY. It is symmetric so I didn't fill the
> bottom half.
> 
>       | N | R | M | I
>     ------------------
>     N | N | R | M | ?
>     ------------------
>     R |   | R | R | I
>     ------------------
>     M |   |   | M | I
>     ------------------
>     I |   |   |   | I
> 
> The 'I' column is problematic. If one endpoint has an IDENTITY region and
> the other doesn't have anything at that virtual address, then making that
> region an identity mapping will result in the second endpoint being able
> to access firmware memory. If using nested translation, stage-2 can help
> us here. But in general we should allow the device to reject an attach
> that would result in a N+I combination if the host considers it too dodgy.
> I think the other combinations are safe enough.
> 

will overlap happen in real? Can we simplify the spec to have device
not reporting overlapping regions?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-06 11:54     ` Jean-Philippe Brucker
  2017-09-21  6:27       ` Tian, Kevin
@ 2017-09-21  6:27       ` Tian, Kevin
  1 sibling, 0 replies; 26+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page faults handling as future work. Seems you
> considered
> > only recoverable fault (since "aka. PCI PRI" being used). What about other
> > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

what about adding a placeholder for now? Though same mechanism
can be reused, it's an essential part to make virtio-iommu architecture
complete even before talking support for recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear when
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare metal
> IOMMU?
> > usually reserved memory is reported to CPU through other method (e.g.
> e820
> > on x86 platform). Of course MSI is a special case which is covered by
> BYPASS
> > and MSI flag... If yes, maybe you can also include an example in
> implementation
> > notes.
> 
> The RESV_MEM regions only describes IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent
> DMA
> addresses that shouldn't be accessed by endpoints because they won't
> reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance.
> With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing by
> hand
> the static guest-physical map...)
> 
> That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It
> describes
> reserved IOVAs for a specific endpoint, and therefore removes the need to
> carve the window out of the whole guest.

Understand and thanks for elaboration.

> 
> > Another thing I want to ask your opinion, about whether there is value of
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to Reserved Memory Region Reporting
> > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> > memory ranges which may be DMA target and has to be identity mapped
> > when DMA remapping is enabled. I'm not sure whether ARM has similar
> > capability and whether there might be a general usage beyond VT-d. For
> > now the only usage in my mind is to assign a device with RMRR associated
> > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> > propagated to the guest (since identity mapping also means reservation
> > of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

VTd spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for GPU happens in legacy VGA usage in early
boot time before actual graphics driver is loaded)

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think you really meant
"host iommu driver can create identity mapping for assigned device",
correct?

Then yes, I think above works.

> table
> control over to the guest, the host won't be in control of IOVA->GPA
> mappings and will need to gracefully ask the guest to do it.
> 
> I'm not aware of any firmware description resembling Intel RMRR or AMD
> IVMD on ARM platforms. I do think ARM platforms could need
> MEM_T_IDENTITY
> for requesting the guest to map MSI windows when page-table handover is
> in
> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
> mapping must be installed by the guest). But since a vSMMU would need a
> solution as well, I think I'll try to implement something more generic.

curious do you need identity mapping full IOVA->GPA->HPA translation, 
or just in GPA->HPA stage sufficient for above MSI scenario?

> 
> > 2.6.8.2.3 Device Requirements: Property RESV_MEM
> >
> > --citation start--
> > If an endpoint is attached to an address space, the device SHOULD leave
> > any access targeting one of its
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
> > regions pass through untranslated. In other words, the device SHOULD
> > handle such a region as if it was identity-mapped (virtual address equal to
> > physical address). If the endpoint is not attached to any address space,
> > then the device MAY abort the transaction.
> > --citation end
> >
> > I have a question for the last sentence. From definition of BYPASS, it's
> > orthogonal to whether there is an address space attached, then should
> > we still allow "May abort" behavior?
> 
> The behavior is left as an implementation choice, and I'm not sure it's
> worth enforcing in the architecture. If the endpoint isn't attached to any
> domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
> necessarily able to do DMA at all. The virtio-iommu device may setup DMA
> mastering lazily, in which case any DMA transaction would abort, or have
> setup DMA already, in which case the endpoint can access MEM_T_BYPASS
> regions.
> 

fair enough. thanks

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-06 11:54     ` Jean-Philippe Brucker
@ 2017-09-21  6:27       ` Tian, Kevin
  2017-09-21  6:27       ` Tian, Kevin
  1 sibling, 0 replies; 26+ messages in thread
From: Tian, Kevin @ 2017-09-21  6:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

> From: Jean-Philippe Brucker
> Sent: Wednesday, September 6, 2017 7:55 PM
> 
> Hi Kevin,
> 
> On 28/08/17 08:39, Tian, Kevin wrote:
> > Here comes some comments:
> >
> > 1.1 Motivation
> >
> > You describe I/O page faults handling as future work. Seems you
> considered
> > only recoverable fault (since "aka. PCI PRI" being used). What about other
> > unrecoverable faults e.g. what to do if a virtual DMA request doesn't find
> > a valid mapping? Even when there is no PRI support, we need some basic
> > form of fault reporting mechanism to indicate such errors to guest.
> 
> I am considering recoverable faults as the end goal, but reporting
> unrecoverable faults should use the same queue, with slightly different
> fields and no need for the driver to reply to the device.

what about adding a placeholder for now? Though same mechanism
can be reused, it's an essential part to make virtio-iommu architecture
complete even before talking support for recoverable faults. :-)

> 
> > 2.6.8.2 Property RESV_MEM
> >
> > I'm not immediately clear when
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> > should be explicitly reported. Is there any real example on bare metal
> IOMMU?
> > usually reserved memory is reported to CPU through other method (e.g.
> e820
> > on x86 platform). Of course MSI is a special case which is covered by
> BYPASS
> > and MSI flag... If yes, maybe you can also include an example in
> implementation
> > notes.
> 
> The RESV_MEM regions only describes IOVA space for the moment, not
> guest-physical, so I guess it provides different information than e820.
> 
> I think a useful example is the PCI bridge windows reported by the Linux
> host to userspace using RESV_RESERVED regions (see
> iommu_dma_get_resv_regions). If I understand correctly, they represent
> DMA
> addresses that shouldn't be accessed by endpoints because they won't
> reach
> the IOMMU. These are specific to the physical topology: a device will have
> different reserved regions depending on the PCI slot it occupies.
> 
> When handled properly, PCI bridge windows quickly become a nuisance.
> With
> kvmtool we observed that carving out their addresses globally removes a
> lot of useful GPA space from the guest. Without a virtual IOMMU we can
> either ignore them and hope everything will be fine, or remove all
> reserved regions from the GPA space (which currently means editing by
> hand
> the static guest-physical map...)
> 
> That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It
> describes
> reserved IOVAs for a specific endpoint, and therefore removes the need to
> carve the window out of the whole guest.

Understand and thanks for elaboration.

> 
> > Another thing I want to ask your opinion, about whether there is value of
> > adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> > in the address space. It's similar to Reserved Memory Region Reporting
> > (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> > memory ranges which may be DMA target and has to be identity mapped
> > when DMA remapping is enabled. I'm not sure whether ARM has similar
> > capability and whether there might be a general usage beyond VT-d. For
> > now the only usage in my mind is to assign a device with RMRR associated
> > on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> > propagated to the guest (since identity mapping also means reservation
> > of virtual address space).
> 
> Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
> used for both iGPU and USB controllers on my x86 machines. Do you know
> more precisely what they are used for by the firmware?

VTd spec has a clear description:

3.14 Handling Requests to Reserved System Memory
Reserved system memory regions are typically allocated by BIOS at boot 
time and reported to OS as reserved address ranges in the system memory 
map. Requests-without-PASID to these reserved regions may either occur 
as a result of operations performed by the system software driver (for 
example in the case of DMA from unified memory access (UMA) graphics 
controllers to graphics reserved memory), or may be initiated by non 
system software (for example in case of DMA performed by a USB 
controller under BIOS SMM control for legacy keyboard emulation). 
For proper functioning of these legacy reserved memory usages, when 
system software enables DMA remapping, the second-level translation 
structures for the respective devices are expected to be set up to provide
identity mapping for the specified reserved memory regions with read 
and write permissions.

(one specific example for GPU happens in legacy VGA usage in early
boot time before actual graphics driver is loaded)

> 
> It's not necessary with the base virtio-iommu device though (v0.4),
> because the device can create the identity mappings itself and report them
> to the guest as MEM_T_BYPASS. However, when we start handing page

when you say "the device can create ...", I think you really meant
"host iommu driver can create identity mapping for assigned device",
correct?

Then yes, I think above works.

> table
> control over to the guest, the host won't be in control of IOVA->GPA
> mappings and will need to gracefully ask the guest to do it.
> 
> I'm not aware of any firmware description resembling Intel RMRR or AMD
> IVMD on ARM platforms. I do think ARM platforms could need
> MEM_T_IDENTITY
> for requesting the guest to map MSI windows when page-table handover is
> in
> use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
> mapping must be installed by the guest). But since a vSMMU would need a
> solution as well, I think I'll try to implement something more generic.

curious do you need identity mapping full IOVA->GPA->HPA translation, 
or just in GPA->HPA stage sufficient for above MSI scenario?

> 
> > 2.6.8.2.3 Device Requirements: Property RESV_MEM
> >
> > --citation start--
> > If an endpoint is attached to an address space, the device SHOULD leave
> > any access targeting one of its
> VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS
> > regions pass through untranslated. In other words, the device SHOULD
> > handle such a region as if it was identity-mapped (virtual address equal to
> > physical address). If the endpoint is not attached to any address space,
> > then the device MAY abort the transaction.
> > --citation end
> >
> > I have a question for the last sentence. From definition of BYPASS, it's
> > orthogonal to whether there is an address space attached, then should
> > we still allow "May abort" behavior?
> 
> The behavior is left as an implementation choice, and I'm not sure it's
> worth enforcing in the architecture. If the endpoint isn't attached to any
> domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
> necessarily able to do DMA at all. The virtio-iommu device may setup DMA
> mastering lazily, in which case any DMA transaction would abort, or have
> setup DMA already, in which case the endpoint can access MEM_T_BYPASS
> regions.
> 

fair enough. thanks

Kevin

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-19 10:47   ` Jean-Philippe Brucker
  2017-09-20  9:37     ` Auger Eric
@ 2017-09-20  9:37     ` Auger Eric
  2017-09-21 10:29       ` Jean-Philippe Brucker
  2017-09-21 10:29       ` Jean-Philippe Brucker
  1 sibling, 2 replies; 26+ messages in thread
From: Auger Eric @ 2017-09-20  9:37 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>> are the guest reserved regions.
>>
>> First in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions ie. this is
>> not the responsability of the virtio-iommu device.
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry I was talking about *non* IOMMU specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver and I didn't that that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target specific code in the virtio-iommu device
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand this is the responsability of the virtio-iommu device
>> to gather information about the host resv regions in case of VFIO EP.
>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize if the probe request is sent to an emulated device, we
should return the target specific MSI window. We can't and don't return
the non IOMMU specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in case of VFIO device and non VFIO device.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy as both
the driver and the device need to parse the req_probe buffer. The device
need to fill in the output buffer and then read the same info on the
input buffer. Couldn't we imagine something simpler?

Thanks

Eric
> 
> Thanks,
> Jean
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-19 10:47   ` Jean-Philippe Brucker
@ 2017-09-20  9:37     ` Auger Eric
  2017-09-20  9:37     ` Auger Eric
  1 sibling, 0 replies; 26+ messages in thread
From: Auger Eric @ 2017-09-20  9:37 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

Hi Jean,
On 19/09/2017 12:47, Jean-Philippe Brucker wrote:
> Hi Eric,
> 
> On 12/09/17 18:13, Auger Eric wrote:
>> 2.6.7
>> - As I am currently integrating v0.4 in QEMU here are some other comments:
>> At the moment struct virtio_iommu_req_probe flags is missing in your
>> header. As such I understood the ACK protocol was not implemented by the
>> driver in your branch.
> 
> Uh indeed. And yet I could swear I've written that code... somewhere. I
> will add it to the batch of v0.5 changes, it shouldn't be too invasive.
> 
>> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
>> header too.
> 
> Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
> (though it is a mouthful).
> 
>> 2.6.8.2:
>> - I am really confused about what the device should report as resv
>> regions depending on the PE nature (VFIO or not VFIO)
>>
>> In other iommu drivers, the resv regions are populated by the iommu
>> driver through its get_resv_regions callback. They are usually composed
>> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
>> specific (device specific) reserved regions:
>> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
>> are the guest reserved regions.
>>
>> First in the current virtio-iommu driver I don't see the
>> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
>> driver should compute the non IOMMU specific MSI regions ie. this is
>> not the responsability of the virtio-iommu device.
> 
> For SW_MSI, certainly. The driver allocates a fixed IOVA region for
> mapping the MSI doorbell. But the driver has to know whether the doorbell
> region is translated or bypassed.
Sorry I was talking about *non* IOMMU specific MSI regions, typically
the regions corresponding to guest PCI host bridge windows. This is
normally computed in the iommu driver and I didn't that that in the
existing virtio-iommu driver.
> 
>> Then why is it more the job of the device to return the guest iommu
>> specific region rather than the driver itself?
> 
> The MSI region is architectural on x86 IOMMUs, but
> implementation-defined on virtio-iommu. It depends which platform the host
> is emulating. In Linux, x86 IOMMU drivers register the bypass region
> because there always is an IOAPIC on the other end, with a fixed MSI
> address. But virtio-iommu may be either behind a GIC, an APIC or some
> other IRQ chip.
> 
> The driver *could* go over all the irqchips/platforms it knows and try to
> guess if there is a fixed doorbell or if it needs to reserve an IOVA for
> them, but it would look horrible. I much prefer having a well-defined way
> of doing this, so a description from the device.

This means I must have target specific code in the virtio-iommu device
which is unusual, right? I was initially thinking you could handle that
on the driver side using a config set for ARM|ARM64. But on the other
hand you should communicate the info to the device ...

> 
>> Then I understand this is the responsability of the virtio-iommu device
>> to gather information about the host resv regions in case of VFIO EP.
>> Typically the host PCIe host bridge windows cannot be used for IOVA.
>> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> Yes, all regions reported in sysfs reserved_regions in the host would be
> reported as RESV_T_RESERVED by virtio-iommu.
So to summarize if the probe request is sent to an emulated device, we
should return the target specific MSI window. We can't and don't return
the non IOMMU specific guest reserved windows.

For a VFIO device, we would return all reserved regions of the group the
device belongs to. Is that correct?
> 
>> I really think the spec should clarify what exact resv regions the
>> device should return in case of VFIO device and non VFIO device.
> 
> Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
> example in Implementation Notes. Do you think the MSI examples at the end
> need improvement as well? I can try to explain that RESV_MSI regions in
> virtio-iommu are only those of the emulated platform, not the HW or SW MSI
> regions from the host.

I think I would expect an explanation detailing returned reserved
regions for pure emulated devices and HW/VFIO devices.

Another unrelated remark:
- you should add a permission violation error.
- wrt the probe request ACK protocol, this looks pretty heavy as both
the driver and the device need to parse the req_probe buffer. The device
need to fill in the output buffer and then read the same info on the
input buffer. Couldn't we imagine something simpler?

Thanks

Eric
> 
> Thanks,
> Jean
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13 ` Auger Eric
  2017-09-13  3:47   ` Bharat Bhushan
       [not found]   ` <072f5a14-baae-57a9-9c5b-b93163c67075-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-09-19 10:47   ` Jean-Philippe Brucker
  2017-09-20  9:37     ` Auger Eric
  2017-09-20  9:37     ` Auger Eric
  2017-09-19 10:47   ` Jean-Philippe Brucker
  3 siblings, 2 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-19 10:47 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Will Deacon, Robin Murphy, Lorenzo Pieralisi, mst, jasowang,
	Marc Zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.

Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.

> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.

Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).

> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
>
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is
> not the responsability of the virtio-iommu device.

For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.

> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?

The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.

> Then I understand this is the responsability of the virtio-iommu device
> to gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.

Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.

> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13 ` Auger Eric
                     ` (2 preceding siblings ...)
  2017-09-19 10:47   ` Jean-Philippe Brucker
@ 2017-09-19 10:47   ` Jean-Philippe Brucker
  3 siblings, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-19 10:47 UTC (permalink / raw)
  To: Auger Eric, iommu, kvm, virtualization, virtio-dev
  Cc: Lorenzo Pieralisi, mst, Marc Zyngier, Will Deacon, Robin Murphy,
	eric.auger.pro

Hi Eric,

On 12/09/17 18:13, Auger Eric wrote:
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.

Uh indeed. And yet I could swear I've written that code... somewhere. I
will add it to the batch of v0.5 changes, it shouldn't be too invasive.

> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
> header too.

Yes, keeping VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is probably best
(though it is a mouthful).

> 2.6.8.2:
> - I am really confused about what the device should report as resv
> regions depending on the PE nature (VFIO or not VFIO)
>
> In other iommu drivers, the resv regions are populated by the iommu
> driver through its get_resv_regions callback. They are usually composed
> of an iommu specific MSI region (mapped or bypassed) and non IOMMU
> specific (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
>
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is
> not the responsability of the virtio-iommu device.

For SW_MSI, certainly. The driver allocates a fixed IOVA region for
mapping the MSI doorbell. But the driver has to know whether the doorbell
region is translated or bypassed.

> Then why is it more the job of the device to return the guest iommu
> specific region rather than the driver itself?

The MSI region is architectural on x86 IOMMUs, but
implementation-defined on virtio-iommu. It depends which platform the host
is emulating. In Linux, x86 IOMMU drivers register the bypass region
because there always is an IOAPIC on the other end, with a fixed MSI
address. But virtio-iommu may be either behind a GIC, an APIC or some
other IRQ chip.

The driver *could* go over all the irqchips/platforms it knows and try to
guess if there is a fixed doorbell or if it needs to reserve an IOVA for
them, but it would look horrible. I much prefer having a well-defined way
of doing this, so a description from the device.

> Then I understand this is the responsability of the virtio-iommu device
> to gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.

Yes, all regions reported in sysfs reserved_regions in the host would be
reported as RESV_T_RESERVED by virtio-iommu.

> I really think the spec should clarify what exact resv regions the
> device should return in case of VFIO device and non VFIO device.

Agreed. I will add something about RESV_T_RESERVED with the PCI bridge
example in Implementation Notes. Do you think the MSI examples at the end
need improvement as well? I can try to explain that RESV_MSI regions in
virtio-iommu are only those of the emulated platform, not the HW or SW MSI
regions from the host.

Thanks,
Jean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
       [not found]   ` <072f5a14-baae-57a9-9c5b-b93163c67075-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
@ 2017-09-13  3:47     ` Bharat Bhushan
  0 siblings, 0 replies; 26+ messages in thread
From: Bharat Bhushan @ 2017-09-13  3:47 UTC (permalink / raw)
  To: Auger Eric, Jean-Philippe Brucker,
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, marc.zyngier-5wv7dgnIgG8,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker-5wv7dgnIgG8@public.gmane.org>;
> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; kvm-u79uwXL29TY76Z2rM5mHXA@public.gmane.org;
> virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org; virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b@public.gmane.org
> Cc: will.deacon-5wv7dgnIgG8@public.gmane.org; robin.murphy-5wv7dgnIgG8@public.gmane.org;
> lorenzo.pieralisi-5wv7dgnIgG8@public.gmane.org; mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; jasowang-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org;
> marc.zyngier-5wv7dgnIgG8@public.gmane.org; eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org; Bharat Bhushan
> <bharat.bhushan-3arQi8VN3Tc@public.gmane.org>; peterx-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org; kevin.tian-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time [1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It is
> >   split into a formal description of the virtio device, and implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise to first read introduction
> > and motivation, then skim through implementation notes and finally
> > look at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can simply
> > send a list of comments in relation to section numbers and a few words of
> context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to
> > compare more easily differences since the RFC (see [6]), but haven't
> > been made public so far. This is the first public posting since
> > initial proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow to handle MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is
> VIRTIO_IOMMU_T_MASK in your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is not
> the responsability of the virtio-iommu device.
> 
> Then why is it more the job of the device to return the guest iommu specific
> region rather than the driver itself?
> 
> Then I understand this is the responsability of the virtio-iommu device to
> gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> I really think the spec should clarify what exact resv regions the device should
> return in case of VFIO device and non VFIO device.

My understanding is that reserved regions are 1) Host reserved regions ( This is applicable for VFIO) 2) Guest MSI reserved region (This is applicable for emulated devices as well)
Does this make sense to have reserved regions per address-space/domain?

Thanks
-Bharat

> 
> Thanks
> 
> Eric
> 
> >
> > ## Future ##
> >
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> >
> > * Then on top of that, a major improvement will describe hardware
> >   acceleration features available to the guest. There is what I call "Page
> >   Table Handover" (or simply, from the host POV, "Nested"), the ability
> >   for the guest to manipulate its own page tables instead of sending
> >   MAP/UNMAP requests to the host. This, along with IO Page Fault
> >   reporting, will also permit SVM virtualization on different platforms.
> >
> > Thanks,
> > Jean
> >
> > [1] http://www.spinics.net/lists/kvm/msg147990.html
> > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >     I reiterate the disclaimers: don't use this document as a reference,
> >     it's a draft. It's also not an OASIS document yet. It may be riddled
> >     with mistakes. As this is a working draft, it is unstable and I do not
> >     guarantee backward compatibility of future versions.
> > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >     Warning: UAPI headers have changed! They didn't follow the spec,
> >     please update. (Use branch v0.1, that has the old headers, for the
> >     Qemu prototype [3])
> > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >     --viommu virtio[,opts] to instantiate a device.
> > [6]
> > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-09-12 17:13 ` Auger Eric
@ 2017-09-13  3:47   ` Bharat Bhushan
       [not found]   ` <072f5a14-baae-57a9-9c5b-b93163c67075-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: Bharat Bhushan @ 2017-09-13  3:47 UTC (permalink / raw)
  To: Auger Eric, Jean-Philippe Brucker, iommu, kvm, virtualization,
	virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi Eric,

> -----Original Message-----
> From: Auger Eric [mailto:eric.auger@redhat.com]
> Sent: Tuesday, September 12, 2017 10:43 PM
> To: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>;
> iommu@lists.linux-foundation.org; kvm@vger.kernel.org;
> virtualization@lists.linux-foundation.org; virtio-dev@lists.oasis-open.org
> Cc: will.deacon@arm.com; robin.murphy@arm.com;
> lorenzo.pieralisi@arm.com; mst@redhat.com; jasowang@redhat.com;
> marc.zyngier@arm.com; eric.auger.pro@gmail.com; Bharat Bhushan
> <bharat.bhushan@nxp.com>; peterx@redhat.com; kevin.tian@intel.com
> Subject: Re: [RFC] virtio-iommu version 0.4
> 
> Hi jean,
> 
> On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> > This is the continuation of my proposal for virtio-iommu, the para-
> > virtualized IOMMU. Here is a summary of the changes since last time [1]:
> >
> > * The virtio-iommu document now resembles an actual specification. It is
> >   split into a formal description of the virtio device, and implementation
> >   notes. Please find sources and binaries at [2].
> >
> > * Added a probe request to describe to the guest different properties that
> >   do not fit in firmware or in the virtio config space. This is a
> >   necessary stepping stone for extending the virtio-iommu.
> >
> > * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
> >   Bhushan.
> >
> > You can find the Linux driver and kvmtool device at [4] and [5]. I
> > plan to rework driver and kvmtool device slightly before sending the
> > patches.
> >
> > To understand the virtio-iommu, I advise to first read introduction
> > and motivation, then skim through implementation notes and finally
> > look at the device specification.
> >
> > I wasn't sure how to organize the review. For those who prefer to
> > comment inline, I attached v0.4 of device-operations.tex and
> > topology.tex+MSI.tex to this thread. They are the biggest chunks of
> > the document. But LaTeX isn't very pleasant to read, so you can simply
> > send a list of comments in relation to section numbers and a few words of
> context, we'll manage.
> >
> > ---
> > Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to
> > compare more easily differences since the RFC (see [6]), but haven't
> > been made public so far. This is the first public posting since
> > initial proposal [1], and the following describes all changes.
> >
> > ## v0.1 ##
> >
> > Content is the same as the RFC, but formatted to LaTeX. 'make'
> > generates one PDF and one HTML document.
> >
> > ## v0.2 ##
> >
> > Add introductions, improve topology example and firmware description
> > based on feedback and a number of useful discussions.
> >
> > ## v0.3 ##
> >
> > Add normative sections (MUST, SHOULD, etc). Clarify some things,
> > tighten the device and driver behaviour. Unmap semantics are
> > consolidated; they are now closer to VFIO Type1 v2 semantics.
> >
> > ## v0.4 ##
> >
> > Introduce PROBE requests. They provide per-endpoint information to the
> > driver that couldn't be described otherwise.
> >
> > For the moment, they allow to handle MSIs on x86 virtual platforms
> > (see 3.2). To do that we communicate reserved IOVA regions, that will
> > also be useful for describing regions that cannot be mapped for a
> > given endpoint, for instance addresses that correspond to a PCI bridge
> window.
> >
> > Introducing such a large framework for this tiny feature may seem
> > overkill, but it is needed for future extensions of the virtio-iommu
> > and I believe it really is worth the effort.
> 
> 2.6.7
> - As I am currently integrating v0.4 in QEMU here are some other comments:
> At the moment struct virtio_iommu_req_probe flags is missing in your
> header. As such I understood the ACK protocol was not implemented by the
> driver in your branch.
> - VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is
> VIRTIO_IOMMU_T_MASK in your header too.
> 2.6.8.2:
> - I am really confused about what the device should report as resv regions
> depending on the PE nature (VFIO or not VFIO)
> 
> In other iommu drivers, the resv regions are populated by the iommu driver
> through its get_resv_regions callback. They are usually composed of an
> iommu specific MSI region (mapped or bypassed) and non IOMMU specific
> (device specific) reserved regions:
> iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
> are the guest reserved regions.
> 
> First in the current virtio-iommu driver I don't see the
> iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
> driver should compute the non IOMMU specific MSI regions. ie. this is not
> the responsability of the virtio-iommu device.
> 
> Then why is it more the job of the device to return the guest iommu specific
> region rather than the driver itself?
> 
> Then I understand this is the responsability of the virtio-iommu device to
> gather information about the host resv regions in case of VFIO EP.
> Typically the host PCIe host bridge windows cannot be used for IOVA.
> Also the host MSI reserved IOVA window cannot be used. Do you agree.
> 
> I really think the spec should clarify what exact resv regions the device should
> return in case of VFIO device and non VFIO device.

My understanding is that reserved regions are 1) Host reserved regions ( This is applicable for VFIO) 2) Guest MSI reserved region (This is applicable for emulated devices as well)
Does this make sense to have reserved regions per address-space/domain?

Thanks
-Bharat

> 
> Thanks
> 
> Eric
> 
> >
> > ## Future ##
> >
> > Other extensions are in preparation. I won't detail them here because
> > v0.4 already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> >
> > * Then on top of that, a major improvement will describe hardware
> >   acceleration features available to the guest. There is what I call "Page
> >   Table Handover" (or simply, from the host POV, "Nested"), the ability
> >   for the guest to manipulate its own page tables instead of sending
> >   MAP/UNMAP requests to the host. This, along with IO Page Fault
> >   reporting, will also permit SVM virtualization on different platforms.
> >
> > Thanks,
> > Jean
> >
> > [1] http://www.spinics.net/lists/kvm/msg147990.html
> > [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
> >     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
> >     I reiterate the disclaimers: don't use this document as a reference,
> >     it's a draft. It's also not an OASIS document yet. It may be riddled
> >     with mistakes. As this is a working draft, it is unstable and I do not
> >     guarantee backward compatibility of future versions.
> > [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> > [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
> >     Warning: UAPI headers have changed! They didn't follow the spec,
> >     please update. (Use branch v0.1, that has the old headers, for the
> >     Qemu prototype [3])
> > [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
> >     Warning: command-line has changed! Use --viommu vfio[,opts] and
> >     --viommu virtio[,opts] to instantiate a device.
> > [6]
> > http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> >

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
                   ` (4 preceding siblings ...)
  2017-08-23 13:55 ` Auger Eric
@ 2017-09-12 17:13 ` Auger Eric
  2017-09-13  3:47   ` Bharat Bhushan
                     ` (3 more replies)
  2017-09-12 17:13 ` Auger Eric
  2017-10-03 13:04 ` Auger Eric
  7 siblings, 4 replies; 26+ messages in thread
From: Auger Eric @ 2017-09-12 17:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments:
At the moment struct virtio_iommu_req_probe flags is missing in your
header. As such I understood the ACK protocol was not implemented by the
driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
are the guest reserved regions.

First in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions. ie. this is
not the responsability of the virtio-iommu device.

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand this is the responsability of the virtio-iommu device
to gather information about the host resv regions in case of VFIO EP.
Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree.

I really think the spec should clarify what exact resv regions the
device should return in case of VFIO device and non VFIO device.

Thanks

Eric

> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
                   ` (5 preceding siblings ...)
  2017-09-12 17:13 ` Auger Eric
@ 2017-09-12 17:13 ` Auger Eric
  2017-10-03 13:04 ` Auger Eric
  7 siblings, 0 replies; 26+ messages in thread
From: Auger Eric @ 2017-09-12 17:13 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, robin.murphy,
	eric.auger.pro

Hi jean,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.

2.6.7
- As I am currently integrating v0.4 in QEMU here are some other comments:
At the moment struct virtio_iommu_req_probe flags is missing in your
header. As such I understood the ACK protocol was not implemented by the
driver in your branch.
- VIRTIO_IOMMU_PROBE_PROPERTY_TYPE_MASK is VIRTIO_IOMMU_T_MASK in your
header too.
2.6.8.2:
- I am really confused about what the device should report as resv
regions depending on the PE nature (VFIO or not VFIO)

In other iommu drivers, the resv regions are populated by the iommu
driver through its get_resv_regions callback. They are usually composed
of an iommu specific MSI region (mapped or bypassed) and non IOMMU
specific (device specific) reserved regions:
iommu_dma_get_resv_regions(). In the case of virtio-iommu driver, those
are the guest reserved regions.

First in the current virtio-iommu driver I don't see the
iommu_dma_get_resv_regions call. Do you agree that the virtio-iommu
driver should compute the non IOMMU specific MSI regions. ie. this is
not the responsability of the virtio-iommu device.

Then why is it more the job of the device to return the guest iommu
specific region rather than the driver itself?

Then I understand this is the responsability of the virtio-iommu device
to gather information about the host resv regions in case of VFIO EP.
Typically the host PCIe host bridge windows cannot be used for IOVA.
Also the host MSI reserved IOVA window cannot be used. Do you agree.

I really think the spec should clarify what exact resv regions the
device should return in case of VFIO device and non VFIO device.

Thanks

Eric

> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-28  7:39   ` Tian, Kevin
@ 2017-09-06 11:54     ` Jean-Philippe Brucker
  2017-09-21  6:27       ` Tian, Kevin
  2017-09-21  6:27       ` Tian, Kevin
  2017-09-06 11:54     ` Jean-Philippe Brucker
  1 sibling, 2 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:54 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here comes some comments:
> 
> 1.1 Motivation
> 
> You describe I/O page faults handling as future work. Seems you considered
> only recoverable fault (since "aka. PCI PRI" being used). What about other
> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
> 
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal IOMMU?
> usually reserved memory is reported to CPU through other method (e.g. e820
> on x86 platform). Of course MSI is a special case which is covered by BYPASS 
> and MSI flag... If yes, maybe you can also include an example in implementation 
> notes.

The RESV_MEM regions only describes IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest. Without a virtual IOMMU we can
either ignore them and hope everything will be fine, or remove all
reserved regions from the GPA space (which currently means editing by hand
the static guest-physical map...)

That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It describes
reserved IOVAs for a specific endpoint, and therefore removes the need to
carve the window out of the whole guest.

> Another thing I want to ask your opinion, about whether there is value of
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> memory ranges which may be DMA target and has to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with RMRR associated
> on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> propagated to the guest (since identity mapping also means reservation
> of virtual address space).

Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
for requesting the guest to map MSI windows when page-table handover is in
use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
mapping must be installed by the guest). But since a vSMMU would need a
solution as well, I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
> 
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave 
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
> regions pass through untranslated. In other words, the device SHOULD 
> handle such a region as if it was identity-mapped (virtual address equal to
> physical address). If the endpoint is not attached to any address space, 
> then the device MAY abort the transaction.
> --citation end
> 
> I have a question for the last sentence. From definition of BYPASS, it's
> orthogonal to whether there is an address space attached, then should
> we still allow "May abort" behavior? 

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
necessarily able to do DMA at all. The virtio-iommu device may setup DMA
mastering lazily, in which case any DMA transaction would abort, or have
setup DMA already, in which case the endpoint can access MEM_T_BYPASS regions.

Thanks!
Jean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-28  7:39   ` Tian, Kevin
  2017-09-06 11:54     ` Jean-Philippe Brucker
@ 2017-09-06 11:54     ` Jean-Philippe Brucker
  1 sibling, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-09-06 11:54 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

Hi Kevin,

On 28/08/17 08:39, Tian, Kevin wrote:
> Here comes some comments:
> 
> 1.1 Motivation
> 
> You describe I/O page faults handling as future work. Seems you considered
> only recoverable fault (since "aka. PCI PRI" being used). What about other
> unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
> a valid mapping? Even when there is no PRI support, we need some basic
> form of fault reporting mechanism to indicate such errors to guest.

I am considering recoverable faults as the end goal, but reporting
unrecoverable faults should use the same queue, with slightly different
fields and no need for the driver to reply to the device.

> 2.6.8.2 Property RESV_MEM
> 
> I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
> should be explicitly reported. Is there any real example on bare metal IOMMU?
> usually reserved memory is reported to CPU through other method (e.g. e820
> on x86 platform). Of course MSI is a special case which is covered by BYPASS 
> and MSI flag... If yes, maybe you can also include an example in implementation 
> notes.

The RESV_MEM regions only describes IOVA space for the moment, not
guest-physical, so I guess it provides different information than e820.

I think a useful example is the PCI bridge windows reported by the Linux
host to userspace using RESV_RESERVED regions (see
iommu_dma_get_resv_regions). If I understand correctly, they represent DMA
addresses that shouldn't be accessed by endpoints because they won't reach
the IOMMU. These are specific to the physical topology: a device will have
different reserved regions depending on the PCI slot it occupies.

When handled properly, PCI bridge windows quickly become a nuisance. With
kvmtool we observed that carving out their addresses globally removes a
lot of useful GPA space from the guest. Without a virtual IOMMU we can
either ignore them and hope everything will be fine, or remove all
reserved regions from the GPA space (which currently means editing by hand
the static guest-physical map...)

That's where RESV_MEM_T_ABORT comes handy with virtio-iommu. It describes
reserved IOVAs for a specific endpoint, and therefore removes the need to
carve the window out of the whole guest.

> Another thing I want to ask your opinion, about whether there is value of
> adding another subtype (MEM_T_IDENTITY), asking for identity mapping
> in the address space. It's similar to Reserved Memory Region Reporting
> (RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
> memory ranges which may be DMA target and has to be identity mapped
> when DMA remapping is enabled. I'm not sure whether ARM has similar
> capability and whether there might be a general usage beyond VT-d. For
> now the only usage in my mind is to assign a device with RMRR associated
> on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
> propagated to the guest (since identity mapping also means reservation
> of virtual address space).

Yes I think adding MEM_T_IDENTITY will be necessary. I can see they are
used for both iGPU and USB controllers on my x86 machines. Do you know
more precisely what they are used for by the firmware?

It's not necessary with the base virtio-iommu device though (v0.4),
because the device can create the identity mappings itself and report them
to the guest as MEM_T_BYPASS. However, when we start handing page table
control over to the guest, the host won't be in control of IOVA->GPA
mappings and will need to gracefully ask the guest to do it.

I'm not aware of any firmware description resembling Intel RMRR or AMD
IVMD on ARM platforms. I do think ARM platforms could need MEM_T_IDENTITY
for requesting the guest to map MSI windows when page-table handover is in
use (MSI addresses are translated by the physical SMMU, so a IOVA->GPA
mapping must be installed by the guest). But since a vSMMU would need a
solution as well, I think I'll try to implement something more generic.

> 2.6.8.2.3 Device Requirements: Property RESV_MEM
> 
> --citation start--
> If an endpoint is attached to an address space, the device SHOULD leave 
> any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
> regions pass through untranslated. In other words, the device SHOULD 
> handle such a region as if it was identity-mapped (virtual address equal to
> physical address). If the endpoint is not attached to any address space, 
> then the device MAY abort the transaction.
> --citation end
> 
> I have a question for the last sentence. From definition of BYPASS, it's
> orthogonal to whether there is an address space attached, then should
> we still allow "May abort" behavior? 

The behavior is left as an implementation choice, and I'm not sure it's
worth enforcing in the architecture. If the endpoint isn't attached to any
domain then (unless VIRTIO_IOMMU_F_BYPASS is negotiated), it isn't
necessarily able to do DMA at all. The virtio-iommu device may setup DMA
mastering lazily, in which case any DMA transaction would abort, or have
setup DMA already, in which case the endpoint can access MEM_T_BYPASS regions.

Thanks!
Jean

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-08-23 10:01 ` Jean-Philippe Brucker
@ 2017-08-28  7:39   ` Tian, Kevin
  2017-09-06 11:54     ` Jean-Philippe Brucker
  2017-09-06 11:54     ` Jean-Philippe Brucker
  0 siblings, 2 replies; 26+ messages in thread
From: Tian, Kevin @ 2017-08-28  7:39 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Wednesday, August 23, 2017 6:01 PM
> 
> On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> > Other extensions are in preparation. I won't detail them here because
> v0.4
> > already is a lot to digest, but in short, building on top of PROBE:
> >
> > * First, since the IOMMU is paravirtualized, the device can expose some
> >   properties of the physical topology to the guest, and let it allocate
> >   resources more efficiently. For example, when the virtio-iommu
> manages
> >   both physical and emulated endpoints, with different underlying
> IOMMUs,
> >   we now have a way to describe multiple page and block granularities,
> >   instead of forcing the guest to use the most restricted one for all
> >   endpoints. This will most likely be in v0.5.
> 
> In order to extend requests with PASIDs and (later) nested mode, I intend
> to rename "address_space" field to "domain", since it is a lot more
> precise about what the field is referring to and the current name would
> make these extensions confusing. Please find the rationale at [1].
> "ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS"
> will
> be "VIRTIO_IOMMU_F_DOMAIN_BITS".
> 
> For those that had time to read this version, do you have other comments
> and suggestions about v0.4? Otherwise it is the only update I have for
> v0.5 (along with fine-grained address range and page size properties from
> the quoted text) and I will send it soon.
> 
> In particular, please tell me now if you see the need for other
> destructive changes like this one. They will be impossible to introduce
> once a driver or device is upstream.
> 
> Thanks,
> Jean
> 
> [1] https://www.spinics.net/lists/kvm/msg154573.html

Here comes some comments:

1.1 Motivation

You describe I/O page faults handling as future work. Seems you considered
only recoverable fault (since "aka. PCI PRI" being used). What about other
unrecoverable faults e.g. what to do if a virtual DMA request doesn't find 
a valid mapping? Even when there is no PRI support, we need some basic
form of fault reporting mechanism to indicate such errors to guest.

2.6.8.2 Property RESV_MEM

I'm not immediately clear when VIRTIO_IOMMU_PROBE_RESV_MEM_T_ABORT
should be explicitly reported. Is there any real example on bare metal IOMMU?
usually reserved memory is reported to CPU through other method (e.g. e820
on x86 platform). Of course MSI is a special case which is covered by BYPASS 
and MSI flag... If yes, maybe you can also include an example in implementation 
notes.

Another thing I want to ask your opinion, about whether there is value of
adding another subtype (MEM_T_IDENTITY), asking for identity mapping
in the address space. It's similar to Reserved Memory Region Reporting
(RMRR) structure defined in VT-d, to indicate BIOS allocated reserved
memory ranges which may be DMA target and has to be identity mapped
when DMA remapping is enabled. I'm not sure whether ARM has similar
capability and whether there might be a general usage beyond VT-d. For
now the only usage in my mind is to assign a device with RMRR associated
on VT-d (Intel GPU, or some USB controllers) where the RMRR info needs
propagated to the guest (since identity mapping also means reservation
of virtual address space).

2.6.8.2.3 Device Requirements: Property RESV_MEM

--citation start--
If an endpoint is attached to an address space, the device SHOULD leave 
any access targeting one of its VIRTIO_IOMMU_PROBE_RESV_MEM_T_BYPASS 
regions pass through untranslated. In other words, the device SHOULD 
handle such a region as if it was identity-mapped (virtual address equal to
physical address). If the endpoint is not attached to any address space, 
then the device MAY abort the transaction.
--citation end

I have a question for the last sentence. From definition of BYPASS, it's
orthogonal to whether there is an address space attached, then should
we still allow "May abort" behavior? 

Thanks
Kevin 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
                   ` (3 preceding siblings ...)
  2017-08-23 10:01 ` Jean-Philippe Brucker
@ 2017-08-23 13:55 ` Auger Eric
  2017-09-06 11:48   ` [virtio-dev] " Jean-Philippe Brucker
  2017-09-12 17:13 ` Auger Eric
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Auger Eric @ 2017-08-23 13:55 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger.pro, bharat.bhushan, peterx, kevin.tian

Hi Jean-Philippe,

On 04/08/2017 20:19, Jean-Philippe Brucker wrote:
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.

Please find some comments/questions below:

2.6.7:1
I do not understand the footnode #6 sentence: 'Without a specific
definition of the ACK requirements for a given property type, it simply
means that the driver read all fields of that property."
2.6.7:2
what if the driver has not provided a big enough buffer and the device
cannot report all supported properties?
2.6.8.2:
Couldn't we use the same terminology as iommu_resv_type in iommu.h?
the negative form is not the easiest to understand for me.
Why F_MSI?
2.6.8.2.1
Multiple overlapping RESV_MEM properties are merged together. Device
requirement? if same types I assume?

what if the device is a physical assigned device. does the device report
the host doorbell or the guest doorbell, may be worth to clarify.

3.1.1 last paragraph
you talk about device ID but this is a terminology only known by ARM I
think. Actually it is introduced later in 3.1.2.

3.1.2
About ACPI integration. Isn't IORT ARM specific? Will other
architectures use that table? In IORT we also have Named Component node
which look pretty the same. Could you elaborate of the difference?
Device Object name: isn't it the path to the LNR0005 in the namespace?

Thanks

Eric
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.
> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.
> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
> 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
                   ` (2 preceding siblings ...)
  2017-08-23 10:01 ` Jean-Philippe Brucker
@ 2017-08-23 10:01 ` Jean-Philippe Brucker
  2017-08-28  7:39   ` Tian, Kevin
  2017-08-23 13:55 ` Auger Eric
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-23 10:01 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

In order to extend requests with PASIDs and (later) nested mode, I intend
to rename "address_space" field to "domain", since it is a lot more
precise about what the field is referring to and the current name would
make these extensions confusing. Please find the rationale at [1].
"ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
be "VIRTIO_IOMMU_F_DOMAIN_BITS".

For those that had time to read this version, do you have other comments
and suggestions about v0.4? Otherwise it is the only update I have for
v0.5 (along with fine-grained address range and page size properties from
the quoted text) and I will send it soon.

In particular, please tell me now if you see the need for other
destructive changes like this one. They will be impossible to introduce
once a driver or device is upstream.

Thanks,
Jean

[1] https://www.spinics.net/lists/kvm/msg154573.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
  2017-08-14  8:27 ` Tian, Kevin
  2017-08-14  8:27 ` Tian, Kevin
@ 2017-08-23 10:01 ` Jean-Philippe Brucker
  2017-08-23 10:01 ` Jean-Philippe Brucker
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-23 10:01 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

On 04/08/17 19:19, Jean-Philippe Brucker wrote:
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

In order to extend requests with PASIDs and (later) nested mode, I intend
to rename "address_space" field to "domain", since it is a lot more
precise about what the field is referring to and the current name would
make these extensions confusing. Please find the rationale at [1].
"ioasid_bits" will be "domain_bits" and "VIRTIO_IOMMU_F_IOASID_BITS" will
be "VIRTIO_IOMMU_F_DOMAIN_BITS".

For those that had time to read this version, do you have other comments
and suggestions about v0.4? Otherwise it is the only update I have for
v0.5 (along with fine-grained address range and page size properties from
the quoted text) and I will send it soon.

In particular, please tell me now if you see the need for other
destructive changes like this one. They will be impossible to introduce
once a driver or device is upstream.

Thanks,
Jean

[1] https://www.spinics.net/lists/kvm/msg154573.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
       [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-08-14  9:38     ` Jean-Philippe Brucker
  0 siblings, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-14  9:38 UTC (permalink / raw)
  To: Tian, Kevin, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	kvm-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	virtio-dev-sDuHXQ4OtrM4h7I2RyI4rWD2FQJk+8+b
  Cc: mst-H+wXaHxf7aLQT0dZR+AlfA, marc.zyngier-5wv7dgnIgG8,
	jasowang-H+wXaHxf7aLQT0dZR+AlfA, will.deacon-5wv7dgnIgG8,
	eric.auger.pro-Re5JQEeQqe8AvxtiuMwx3w

On 14/08/17 09:27, Tian, Kevin wrote:
>> * First, since the IOMMU is paravirtualized, the device can expose some
>>   properties of the physical topology to the guest, and let it allocate
>>   resources more efficiently. For example, when the virtio-iommu manages
>>   both physical and emulated endpoints, with different underlying IOMMUs,
>>   we now have a way to describe multiple page and block granularities,
>>   instead of forcing the guest to use the most restricted one for all
>>   endpoints. This will most likely be in v0.5.
> 
> emulated IOMMU has similar requirement, e.g. available PASID bits,
> address widths, etc. which may break guest usage if not aligned to
> physical limitation. Suppose we can introduce a general interface
> through VFIO for all vIOMMU incarnations. 

A nice location for this kind of info would be sysfs, as discussed in the
SVM virtualization thread [1]. Properties of an IOMMU could be described
in /sys/class/iommu/<dev>. Properties of a PCI device are available in its
PASID/PRI capabilities. For platform devices we'll have to look at DT and
ACPI properties in /sys/firmware.

>> * Then on top of that, a major improvement will describe hardware
>>   acceleration features available to the guest. There is what I call "Page
>>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>>   for the guest to manipulate its own page tables instead of sending
>>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>>   reporting, will also permit SVM virtualization on different platforms.
> 
> what's your planned cadence for future versions? :-)

Hard to say, it depends on a number of things. I have various other tasks
eating up my bandwidth at the moment and I may have to considerably rework
this version depending on the feedback it gets. Ideally, I would like to
get the base driver merged and a proposal for hardware acceleration out by
the end of the year, but I obviously can't make any guarantee.

Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [RFC] virtio-iommu version 0.4
  2017-08-14  8:27 ` Tian, Kevin
       [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2017-08-14  9:38   ` Jean-Philippe Brucker
  1 sibling, 0 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-14  9:38 UTC (permalink / raw)
  To: Tian, Kevin, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

On 14/08/17 09:27, Tian, Kevin wrote:
>> * First, since the IOMMU is paravirtualized, the device can expose some
>>   properties of the physical topology to the guest, and let it allocate
>>   resources more efficiently. For example, when the virtio-iommu manages
>>   both physical and emulated endpoints, with different underlying IOMMUs,
>>   we now have a way to describe multiple page and block granularities,
>>   instead of forcing the guest to use the most restricted one for all
>>   endpoints. This will most likely be in v0.5.
> 
> emulated IOMMU has similar requirement, e.g. available PASID bits,
> address widths, etc. which may break guest usage if not aligned to
> physical limitation. Suppose we can introduce a general interface
> through VFIO for all vIOMMU incarnations. 

A nice location for this kind of info would be sysfs, as discussed in the
SVM virtualization thread [1]. Properties of an IOMMU could be described
in /sys/class/iommu/<dev>. Properties of a PCI device are available in its
PASID/PRI capabilities. For platform devices we'll have to look at DT and
ACPI properties in /sys/firmware.

>> * Then on top of that, a major improvement will describe hardware
>>   acceleration features available to the guest. There is what I call "Page
>>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>>   for the guest to manipulate its own page tables instead of sending
>>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>>   reporting, will also permit SVM virtualization on different platforms.
> 
> what's your planned cadence for future versions? :-)

Hard to say, it depends on a number of things. I have various other tasks
eating up my bandwidth at the moment and I may have to considerably rework
this version depending on the feedback it gets. Ideally, I would like to
get the base driver merged and a proposal for hardware acceleration out by
the end of the year, but I obviously can't make any guarantee.

Thanks,
Jean

[1] https://lists.gnu.org/archive/html/qemu-devel/2017-07/msg05731.html

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
@ 2017-08-14  8:27 ` Tian, Kevin
       [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2017-08-14  9:38   ` Jean-Philippe Brucker
  2017-08-14  8:27 ` Tian, Kevin
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Tian, Kevin @ 2017-08-14  8:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Saturday, August 5, 2017 2:19 AM
> 
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description
> based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

emulated IOMMU has similar requirement, e.g. available PASID bits,
address widths, etc. which may break guest usage if not aligned to
physical limitation. Suppose we can introduce a general interface
through VFIO for all vIOMMU incarnations. 

> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.

what's your planned cadence for future versions? :-)

> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: [RFC] virtio-iommu version 0.4
  2017-08-04 18:19 Jean-Philippe Brucker
  2017-08-14  8:27 ` Tian, Kevin
@ 2017-08-14  8:27 ` Tian, Kevin
  2017-08-23 10:01 ` Jean-Philippe Brucker
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Tian, Kevin @ 2017-08-14  8:27 UTC (permalink / raw)
  To: Jean-Philippe Brucker, iommu, kvm, virtualization, virtio-dev
  Cc: lorenzo.pieralisi, mst, marc.zyngier, will.deacon, eric.auger,
	robin.murphy, eric.auger.pro

> From: Jean-Philippe Brucker [mailto:jean-philippe.brucker@arm.com]
> Sent: Saturday, August 5, 2017 2:19 AM
> 
> This is the continuation of my proposal for virtio-iommu, the para-
> virtualized IOMMU. Here is a summary of the changes since last time [1]:
> 
> * The virtio-iommu document now resembles an actual specification. It is
>   split into a formal description of the virtio device, and implementation
>   notes. Please find sources and binaries at [2].
> 
> * Added a probe request to describe to the guest different properties that
>   do not fit in firmware or in the virtio config space. This is a
>   necessary stepping stone for extending the virtio-iommu.
> 
> * There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
>   Bhushan.
> 
> You can find the Linux driver and kvmtool device at [4] and [5]. I
> plan to rework driver and kvmtool device slightly before sending the
> patches.
> 
> To understand the virtio-iommu, I advise to first read introduction and
> motivation, then skim through implementation notes and finally look at the
> device specification.
> 
> I wasn't sure how to organize the review. For those who prefer to comment
> inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
> to this thread. They are the biggest chunks of the document. But LaTeX
> isn't very pleasant to read, so you can simply send a list of comments in
> relation to section numbers and a few words of context, we'll manage.
> 
> ---
> Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
> more easily differences since the RFC (see [6]), but haven't been made
> public so far. This is the first public posting since initial proposal
> [1], and the following describes all changes.
> 
> ## v0.1 ##
> 
> Content is the same as the RFC, but formatted to LaTeX. 'make' generates
> one PDF and one HTML document.
> 
> ## v0.2 ##
> 
> Add introductions, improve topology example and firmware description
> based
> on feedback and a number of useful discussions.
> 
> ## v0.3 ##
> 
> Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
> the device and driver behaviour. Unmap semantics are consolidated; they
> are now closer to VFIO Type1 v2 semantics.
> 
> ## v0.4 ##
> 
> Introduce PROBE requests. They provide per-endpoint information to the
> driver that couldn't be described otherwise.
> 
> For the moment, they allow to handle MSIs on x86 virtual platforms (see
> 3.2). To do that we communicate reserved IOVA regions, that will also be
> useful for describing regions that cannot be mapped for a given endpoint,
> for instance addresses that correspond to a PCI bridge window.
> 
> Introducing such a large framework for this tiny feature may seem
> overkill, but it is needed for future extensions of the virtio-iommu and I
> believe it really is worth the effort.
> 
> ## Future ##
> 
> Other extensions are in preparation. I won't detail them here because v0.4
> already is a lot to digest, but in short, building on top of PROBE:
> 
> * First, since the IOMMU is paravirtualized, the device can expose some
>   properties of the physical topology to the guest, and let it allocate
>   resources more efficiently. For example, when the virtio-iommu manages
>   both physical and emulated endpoints, with different underlying IOMMUs,
>   we now have a way to describe multiple page and block granularities,
>   instead of forcing the guest to use the most restricted one for all
>   endpoints. This will most likely be in v0.5.

emulated IOMMU has similar requirement, e.g. available PASID bits,
address widths, etc. which may break guest usage if not aligned to
physical limitation. Suppose we can introduce a general interface
through VFIO for all vIOMMU incarnations. 

> 
> * Then on top of that, a major improvement will describe hardware
>   acceleration features available to the guest. There is what I call "Page
>   Table Handover" (or simply, from the host POV, "Nested"), the ability
>   for the guest to manipulate its own page tables instead of sending
>   MAP/UNMAP requests to the host. This, along with IO Page Fault
>   reporting, will also permit SVM virtualization on different platforms.

what's your planned cadence for future versions? :-)

> 
> Thanks,
> Jean
> 
> [1] http://www.spinics.net/lists/kvm/msg147990.html
> [2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
>     http://www.linux-arm.org/git?p=virtio-
> iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
>     I reiterate the disclaimers: don't use this document as a reference,
>     it's a draft. It's also not an OASIS document yet. It may be riddled
>     with mistakes. As this is a working draft, it is unstable and I do not
>     guarantee backward compatibility of future versions.
> [3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
> [4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
>     Warning: UAPI headers have changed! They didn't follow the spec,
>     please update. (Use branch v0.1, that has the old headers, for the
>     Qemu prototype [3])
> [5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
>     Warning: command-line has changed! Use --viommu vfio[,opts] and
>     --viommu virtio[,opts] to instantiate a device.
> [6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [RFC] virtio-iommu version 0.4
@ 2017-08-04 18:19 Jean-Philippe Brucker
  2017-08-14  8:27 ` Tian, Kevin
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Jean-Philippe Brucker @ 2017-08-04 18:19 UTC (permalink / raw)
  To: iommu, kvm, virtualization, virtio-dev
  Cc: will.deacon, robin.murphy, lorenzo.pieralisi, mst, jasowang,
	marc.zyngier, eric.auger, eric.auger.pro, bharat.bhushan, peterx,
	kevin.tian

This is the continuation of my proposal for virtio-iommu, the para-
virtualized IOMMU. Here is a summary of the changes since last time [1]:

* The virtio-iommu document now resembles an actual specification. It is
  split into a formal description of the virtio device, and implementation
  notes. Please find sources and binaries at [2].

* Added a probe request to describe to the guest different properties that
  do not fit in firmware or in the virtio config space. This is a
  necessary stepping stone for extending the virtio-iommu.

* There is a working Qemu prototype [3], thanks to Eric Auger and Bharat
  Bhushan.

You can find the Linux driver and kvmtool device at [4] and [5]. I
plan to rework driver and kvmtool device slightly before sending the
patches.

To understand the virtio-iommu, I advise to first read introduction and
motivation, then skim through implementation notes and finally look at the
device specification.

I wasn't sure how to organize the review. For those who prefer to comment
inline, I attached v0.4 of device-operations.tex and topology.tex+MSI.tex
to this thread. They are the biggest chunks of the document. But LaTeX
isn't very pleasant to read, so you can simply send a list of comments in
relation to section numbers and a few words of context, we'll manage.

---
Version numbers 0.1-0.4 are arbitrary. I'm hoping they allow to compare
more easily differences since the RFC (see [6]), but haven't been made
public so far. This is the first public posting since initial proposal
[1], and the following describes all changes.

## v0.1 ##

Content is the same as the RFC, but formatted to LaTeX. 'make' generates
one PDF and one HTML document.

## v0.2 ##

Add introductions, improve topology example and firmware description based
on feedback and a number of useful discussions.

## v0.3 ##

Add normative sections (MUST, SHOULD, etc). Clarify some things, tighten
the device and driver behaviour. Unmap semantics are consolidated; they
are now closer to VFIO Type1 v2 semantics.

## v0.4 ##

Introduce PROBE requests. They provide per-endpoint information to the
driver that couldn't be described otherwise.

For the moment, they allow to handle MSIs on x86 virtual platforms (see
3.2). To do that we communicate reserved IOVA regions, that will also be
useful for describing regions that cannot be mapped for a given endpoint,
for instance addresses that correspond to a PCI bridge window.

Introducing such a large framework for this tiny feature may seem
overkill, but it is needed for future extensions of the virtio-iommu and I
believe it really is worth the effort.

## Future ##

Other extensions are in preparation. I won't detail them here because v0.4
already is a lot to digest, but in short, building on top of PROBE:

* First, since the IOMMU is paravirtualized, the device can expose some
  properties of the physical topology to the guest, and let it allocate
  resources more efficiently. For example, when the virtio-iommu manages
  both physical and emulated endpoints, with different underlying IOMMUs,
  we now have a way to describe multiple page and block granularities,
  instead of forcing the guest to use the most restricted one for all
  endpoints. This will most likely be in v0.5.

* Then on top of that, a major improvement will describe hardware
  acceleration features available to the guest. There is what I call "Page
  Table Handover" (or simply, from the host POV, "Nested"), the ability
  for the guest to manipulate its own page tables instead of sending
  MAP/UNMAP requests to the host. This, along with IO Page Fault
  reporting, will also permit SVM virtualization on different platforms.

Thanks,
Jean

[1] http://www.spinics.net/lists/kvm/msg147990.html
[2] git://linux-arm.org/virtio-iommu.git branch viommu/v0.4
    http://www.linux-arm.org/git?p=virtio-iommu.git;a=blob;f=dist/v0.4/virtio-iommu-v0.4.pdf
    I reiterate the disclaimers: don't use this document as a reference,
    it's a draft. It's also not an OASIS document yet. It may be riddled
    with mistakes. As this is a working draft, it is unstable and I do not
    guarantee backward compatibility of future versions.
[3] https://lists.gnu.org/archive/html/qemu-arm/2017-08/msg00004.html
[4] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.4
    Warning: UAPI headers have changed! They didn't follow the spec,
    please update. (Use branch v0.1, that has the old headers, for the
    Qemu prototype [3])
[5] git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.4
    Warning: command-line has changed! Use --viommu vfio[,opts] and
    --viommu virtio[,opts] to instantiate a device.
[6] http://www.linux-arm.org/git?p=virtio-iommu.git;a=tree;f=dist/diffs

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2017-10-03 13:04 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-08-04 18:19 [RFC] virtio-iommu version 0.4 Jean-Philippe Brucker
2017-08-04 18:19 Jean-Philippe Brucker
2017-08-14  8:27 ` Tian, Kevin
     [not found]   ` <AADFC41AFE54684AB9EE6CBC0274A5D190D7173C-0J0gbvR4kThpB2pF5aRoyrfspsVTdybXVpNB7YpNyf8@public.gmane.org>
2017-08-14  9:38     ` Jean-Philippe Brucker
2017-08-14  9:38   ` Jean-Philippe Brucker
2017-08-14  8:27 ` Tian, Kevin
2017-08-23 10:01 ` Jean-Philippe Brucker
2017-08-23 10:01 ` Jean-Philippe Brucker
2017-08-28  7:39   ` Tian, Kevin
2017-09-06 11:54     ` Jean-Philippe Brucker
2017-09-21  6:27       ` Tian, Kevin
2017-09-21  6:27       ` Tian, Kevin
2017-09-06 11:54     ` Jean-Philippe Brucker
2017-08-23 13:55 ` Auger Eric
2017-09-06 11:48   ` [virtio-dev] " Jean-Philippe Brucker
2017-09-21  6:41     ` Tian, Kevin
2017-09-12 17:13 ` Auger Eric
2017-09-13  3:47   ` Bharat Bhushan
     [not found]   ` <072f5a14-baae-57a9-9c5b-b93163c67075-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2017-09-13  3:47     ` Bharat Bhushan
2017-09-19 10:47   ` Jean-Philippe Brucker
2017-09-20  9:37     ` Auger Eric
2017-09-20  9:37     ` Auger Eric
2017-09-21 10:29       ` Jean-Philippe Brucker
2017-09-21 10:29       ` Jean-Philippe Brucker
2017-09-19 10:47   ` Jean-Philippe Brucker
2017-09-12 17:13 ` Auger Eric
2017-10-03 13:04 ` Auger Eric

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.