* Explicit IOVA management from a PCIe endpoint driver
From: Stephen Warren @ 2018-09-17 21:36 UTC (permalink / raw)
  To: Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel
  Cc: Kishon Vijay Abraham I, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, Vidya Sagar, iommu, Jingoo Han, Joao Pinto

Joerg, Christoph, Marek, Robin,

I believe that the driver for our PCIe endpoint controller hardware will 
need to explicitly manage its IOVA space more than current APIs allow. 
I'd like to discuss how to make that possible.

First some background on our hardware:

NVIDIA's Xavier SoC contains a Synopsys DesignWare PCIe controller. This 
can operate in either root port or endpoint mode. I'm particularly 
interested in endpoint mode.

Our particular instantiation of this controller exposes a single 
function with a single software-controlled PCIe BAR to the PCIe bus 
(there are also BARs for access to DMA controller registers and outbound 
MSI configuration, which can both be enabled/disabled but not used for 
any other purpose). When a transaction is received from the PCIe bus, 
the following happens:

1) Transaction is matched against the BAR base/size (in PCIe address 
space) to determine whether it "hits" this BAR or not.

2) The transaction's address is processed by the PCIe controller's ATU 
(Address Translation Unit), which can re-write the address that the 
transaction accesses.

Our particular instantiation of the hardware only has 2 entries in the 
ATU mapping table, which gives very little flexibility in setting up a 
mapping.

As an FYI, ATU entries can match PCIe transactions either:
a) Any transaction received on a particular BAR.
b) Any transaction received within a single contiguous window of PCIe 
address space. This kind of mapping entry obviously has to be set up 
after device enumeration is complete so that it can match the correct 
PCIe address.

Each ATU entry maps a single contiguous set of PCIe addresses to a 
single contiguous set of IOVAs which are passed to the IOMMU. 
Transactions can pass through the ATU without being translated if desired.

3) The transaction is passed to the IOMMU, which can again re-write the 
address that the transaction accesses.

4) The transaction is passed to the memory controller and reads/writes DRAM.

In general, we want to be able to expose a large and dynamic set of data 
buffers to the PCIe bus; certainly /far/ more than two separate buffers 
(the number of ATU table entries). With current Linux APIs, these 
buffers will not be located in contiguous or adjacent physical (DRAM) or 
virtual (IOVA) addresses, nor in any particular window of physical or 
IOVA addresses. However, the ATU's mapping from PCIe to IOVA can only 
expose one or two contiguous ranges of IOVA space. These two sets of 
requirements are at odds!

So, I'd like to propose some new APIs that the PCIe endpoint driver can use:

1) Allocate/reserve an IOVA range of specified size, but don't map 
anything into the IOVA range.

2) De-allocate the IOVA range allocated in (1).

3) Map a specific set (scatter-gather list I suppose) of 
already-allocated/extant physical addresses into part of an IOVA range 
allocated in (1).

4) Unmap a portion of an IOVA range that was mapped by (3).
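
To make the shape of this concrete, here is a rough sketch of what such 
an API could look like. The names and signatures below are purely 
illustrative; they are not existing kernel interfaces:

#include <linux/device.h>
#include <linux/scatterlist.h>
#include <linux/types.h>

/* (1) Reserve 'size' bytes of IOVA space for dev; nothing is mapped yet. */
int pci_epc_iova_alloc(struct device *dev, size_t size, dma_addr_t *iova);

/* (2) Release a range previously obtained from (1), once it is unmapped. */
void pci_epc_iova_free(struct device *dev, dma_addr_t iova, size_t size);

/* (3) Map already-allocated physical memory (an SG list) at an offset
 *     within a range reserved by (1). */
int pci_epc_iova_map_sg(struct device *dev, dma_addr_t iova,
			struct scatterlist *sgl, int nents, int prot);

/* (4) Unmap part of a reserved range that was populated by (3). */
void pci_epc_iova_unmap(struct device *dev, dma_addr_t iova, size_t size);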

One final note:

The memory controller can translate accesses to a small region of DRAM 
address space into accesses to an interrupt generation module. This 
allows devices attached to the PCIe bus to generate interrupts to 
software running on the system with the PCIe endpoint controller. Thus I 
deliberately described API 3 above as mapping a specific physical 
address into IOVA space, as opposed to mapping an existing DRAM 
allocation into IOVA space, in order to allow mapping this interrupt 
generation address space into IOVA space. If we needed separate APIs to 
map physical addresses vs. DRAM allocations into IOVA space, that would 
likely be fine too.
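
For instance, a separate physical-address variant of API 3 might look 
roughly like the following. Again the names are hypothetical, and 
'irq_gen_phys'/'irq_doorbell_off' simply stand in for the interrupt 
generation region's address and its chosen offset within the reserved 
IOVA range:

/* Hypothetical variant of (3) for a raw physical address rather than an
 * SG list of DRAM pages. */
int pci_epc_iova_map_phys(struct device *dev, dma_addr_t iova,
			  phys_addr_t phys, size_t size, int prot);

/* Expose the (write-only) interrupt generation region to the PCIe host;
 * the 4 KiB size is an assumption for illustration only. */
err = pci_epc_iova_map_phys(dev, iova_base + irq_doorbell_off,
			    irq_gen_phys, SZ_4K, IOMMU_WRITE);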

Does this API proposal sound reasonable?

I have heard from some NVIDIA developers that the above APIs rather go 
against the principle that individual drivers should not be aware of the 
presence/absence of an IOMMU, and hence direct management of IOVA 
allocation/layout is deliberately avoided, and so there hasn't been a 
need/desire for this kind of API in the past. However, I think our 
current hardware design and use-case rather requires it. Do you agree?

* Re: Explicit IOVA management from a PCIe endpoint driver
From: poza @ 2018-09-18  8:37 UTC (permalink / raw)
  To: Stephen Warren
  Cc: Christoph Hellwig, Marek Szyprowski, Robin Murphy, Joerg Roedel,
	Kishon Vijay Abraham I, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, Vidya Sagar, iommu, Jingoo Han, Joao Pinto,
	linux-pci-owner


On 2018-09-18 03:06, Stephen Warren wrote:
> Joerg, Christoph, Marek, Robin,
> 
> I believe that the driver for our PCIe endpoint controller hardware
> will need to explicitly manage its IOVA space more than current APIs
> allow. I'd like to discuss how to make that possible.
> 
> First some background on our hardware:
> 
> NVIDIA's Xavier SoC contains a Synopsis Designware PCIe controller.
> This can operate in either root port or endpoint mode. I'm
> particularly interested in endpoint mode.
> 
> Our particular instantiation of this controller exposes a single
> function with a single software-controlled PCIe BAR to the PCIe bus
> (there are also BARs for access to DMA controller registers and
> outbound MSI configuration, which can both be enabled/disabled but not
> used for any other purpose). When a transaction is received from the
> PCIe bus, the following happens:
> 
> 1) Transaction is matched against the BAR base/size (in PCIe address
> space) to determine whether it "hits" this BAR or not.
> 
> 2) The transaction's address is processed by the PCIe controller's ATU
> (Address Translation Unit), which can re-write the address that the
> transaction accesses.
> 
> Our particular instantiation of the hardware only has 2 entries in the
> ATU mapping table, which gives very little flexibility in setting up a
> mapping.
> 
> As an FYI, ATU entries can match PCIe transactions either:
> a) Any transaction received on a particular BAR.
> b) Any transaction received within a single contiguous window of PCIe
> address space. This kind of mapping entry obviously has to be set up
> after device enumeration is complete so that it can match the correct
> PCIe address.
> 
> Each ATU entry maps a single contiguous set of PCIe addresses to a
> single contiguous set of IOVAs which are passed to the IOMMU.
> Transactions can pass through the ATU without being translated if
> desired.
> 
> 3) The transaction is passed to the IOMMU, which can again re-write
> the address that the transaction accesses.
> 
> 4) The transaction is passed to the memory controller and reads/writes 
> DRAM.
> 
> In general, we want to be able to expose a large and dynamic set of
> data buffers to the PCIe bus; certainly /far/ more than two separate
> buffers (the number of ATU table entries). With current Linux APIs,
> these buffers will not be located in contiguous or adjacent physical
> (DRAM) or virtual (IOVA) addresses, nor in any particular window of
> physical or IOVA addresses. However, the ATU's mapping from PCIe to
> IOVA can only expose one or two contiguous ranges of IOVA space. These
> two sets of requirements are at odds!
> 
> So, I'd like to propose some new APIs that the PCIe endpoint driver can 
> use:
> 
> 1) Allocate/reserve an IOVA range of specified size, but don't map
> anything into the IOVA range.

I had done some work on this in the past; those patches were tested on 
Broadcom HW:

https://lkml.org/lkml/2017/5/16/23,
https://lkml.org/lkml/2017/5/16/21,
https://lkml.org/lkml/2017/5/16/19

I could not pursue it further, since I do not have the same HW to test 
it. Although we now use a Synopsys DesignWare PCIe controller in our 
Qualcomm SoC, we don't restrict the inbound address range for our SoC.

Of course, these patches can easily be ported and extended; they 
basically reserve IOVA ranges based on the inbound dma-ranges DT 
property.

Regards,
Oza.

> 
> 2) De-allocate the IOVA range allocated in (1).
> 
> 3) Map a specific set (scatter-gather list I suppose) of
> already-allocated/extant physical addresses into part of an IOVA range
> allocated in (1).
> 
> 4) Unmap a portion of an IOVA range that was mapped by (3).
> 
> One final note:
> 
> The memory controller can translate accesses to a small region of DRAM
> address space into accesses to an interrupt generation module. This
> allows devices attached to the PCIe bus to generate interrupts to
> software running on the system with the PCIe endpoint controller. Thus
> I deliberately described API 3 above as mapping a specific physical
> address into IOVA space, as opposed to mapping an existing DRAM
> allocation into IOVA space, in order to allow mapping this interrupt
> generation address space into IOVA space. If we needed separate APIs
> to map physical addresses vs. DRAM allocations into IOVA space, that
> would likely be fine too.
> 
> Does this API proposal sound reasonable?
> 
> I have heard from some NVIDIA developers that the above APIs rather go
> against the principle that individual drivers should not be aware of
> the presence/absence of an IOMMU, and hence direct management of IOVA
> allocation/layout is deliberately avoided, and hence there hasn't been
> a need/desire for this kind of API in the past. However, I think our
> current hardware design and use-case rather requires it. Do you agree?

* Re: Explicit IOVA management from a PCIe endpoint driver
From: Robin Murphy @ 2018-09-18 10:59 UTC (permalink / raw)
  To: Stephen Warren, Christoph Hellwig, Marek Szyprowski, Joerg Roedel
  Cc: Kishon Vijay Abraham I, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, Vidya Sagar, iommu, Jingoo Han, Joao Pinto

Hi Stephen,

On 17/09/18 22:36, Stephen Warren wrote:
> Joerg, Christoph, Marek, Robin,
> 
> I believe that the driver for our PCIe endpoint controller hardware will 
> need to explicitly manage its IOVA space more than current APIs allow. 
> I'd like to discuss how to make that possible.
> 
> First some background on our hardware:
> 
> NVIDIA's Xavier SoC contains a Synopsis Designware PCIe controller. This 
> can operate in either root port or endpoint mode. I'm particularly 
> interested in endpoint mode.
> 
> Our particular instantiation of this controller exposes a single 
> function with a single software-controlled PCIe BAR to the PCIe bus 
> (there are also BARs for access to DMA controller registers and outbound 
> MSI configuration, which can both be enabled/disabled but not used for 
> any other purpose). When a transaction is received from the PCIe bus, 
> the following happens:
> 
> 1) Transaction is matched against the BAR base/size (in PCIe address 
> space) to determine whether it "hits" this BAR or not.
> 
> 2) The transaction's address is processed by the PCIe controller's ATU 
> (Address Translation Unit), which can re-write the address that the 
> transaction accesses.
> 
> Our particular instantiation of the hardware only has 2 entries in the 
> ATU mapping table, which gives very little flexibility in setting up a 
> mapping.
> 
> As an FYI, ATU entries can match PCIe transactions either:
> a) Any transaction received on a particular BAR.
> b) Any transaction received within a single contiguous window of PCIe 
> address space. This kind of mapping entry obviously has to be set up 
> after device enumeration is complete so that it can match the correct 
> PCIe address.
> 
> Each ATU entry maps a single contiguous set of PCIe addresses to a 
> single contiguous set of IOVAs which are passed to the IOMMU. 
> Transactions can pass through the ATU without being translated if desired.
> 
> 3) The transaction is passed to the IOMMU, which can again re-write the 
> address that the transaction accesses.
> 
> 4) The transaction is passed to the memory controller and reads/writes 
> DRAM.
> 
> In general, we want to be able to expose a large and dynamic set of data 
> buffers to the PCIe bus; certainly /far/ more than two separate buffers 
> (the number of ATU table entries). With current Linux APIs, these 
> buffers will not be located in contiguous or adjacent physical (DRAM) or 
> virtual (IOVA) addresses, nor in any particular window of physical or 
> IOVA addresses. However, the ATU's mapping from PCIe to IOVA can only 
> expose one or two contiguous ranges of IOVA space. These two sets of 
> requirements are at odds!
> 
> So, I'd like to propose some new APIs that the PCIe endpoint driver can 
> use:
> 
> 1) Allocate/reserve an IOVA range of specified size, but don't map 
> anything into the IOVA range.
> 
> 2) De-allocate the IOVA range allocated in (1).
> 
> 3) Map a specific set (scatter-gather list I suppose) of 
> already-allocated/extant physical addresses into part of an IOVA range 
> allocated in (1).
> 
> 4) Unmap a portion of an IOVA range that was mapped by (3).

That all sounds perfectly reasonable - basically it sounds like the 
endpoint framework wants the option to do the same as VFIO or many DRM 
drivers, i.e. set up its own IOMMU domain, attach the endpoint's group, 
and explicitly manage its mappings via IOMMU API calls. Provided you can 
assume cache-coherent PCI, that should be enough to get things going - 
supporting non-coherent endpoints is a little trickier in terms of 
making sure the endpoint controller and/or device gets the right DMA ops 
to only ever perform cache maintenance once you add streaming DMA 
mappings into the mix, but that's not insurmountable (and I think it's 
something we still need to address for DRM anyway, at least on arm64).
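
Very roughly, and purely as an illustration rather than anything 
specific to your hardware (the helper name below is made up and error 
handling is minimal), that pattern looks like:

#include <linux/device.h>
#include <linux/iommu.h>

static struct iommu_domain *ep_take_over_iommu(struct device *dev)
{
	struct iommu_domain *domain;

	/* Allocate an unmanaged domain and attach the endpoint device to
	 * it, replacing the default DMA-API-managed domain. */
	domain = iommu_domain_alloc(dev->bus);
	if (!domain)
		return NULL;

	if (iommu_attach_device(domain, dev)) {	/* or iommu_attach_group() */
		iommu_domain_free(domain);
		return NULL;
	}
	return domain;
}

After that the driver owns the address space outright:

	iommu_map(domain, iova, phys, size, IOMMU_READ | IOMMU_WRITE);
	...
	iommu_unmap(domain, iova, size);

and on teardown, iommu_detach_device() followed by iommu_domain_free().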

> One final note:
> 
> The memory controller can translate accesses to a small region of DRAM 
> address space into accesses to an interrupt generation module. This 
> allows devices attached to the PCIe bus to generate interrupts to 
> software running on the system with the PCIe endpoint controller. Thus I 
> deliberately described API 3 above as mapping a specific physical 
> address into IOVA space, as opposed to mapping an existing DRAM 
> allocation into IOVA space, in order to allow mapping this interrupt 
> generation address space into IOVA space. If we needed separate APIs to 
> map physical addresses vs. DRAM allocations into IOVA space, that would 
> likely be fine too.

If that's the standard DesignWare MSI dingaling, then all you should 
need to do is ensure your IOVA is reserved in your allocator (if it can 
be entirely outside the EP BAR, even better) - AFAIK the writes get 
completely intercepted such that they never go out to the SMMU side at 
all, and thus no actual mapping is even needed.

> Does this API proposal sound reasonable?

Indeed, as I say apart from using streaming DMA for coherency management 
(which I think could be added in pretty much orthogonally later), this 
sounds like something you could plumb into the endpoint framework right 
now with no dependent changes elsewhere.

> I have heard from some NVIDIA developers that the above APIs rather go 
> against the principle that individual drivers should not be aware of the 
> presence/absence of an IOMMU, and hence direct management of IOVA 
> allocation/layout is deliberately avoided, and hence there hasn't been a 
> need/desire for this kind of API in the past. However, I think our 
> current hardware design and use-case rather requires it. Do you agree?

If there is a principle, it's more the inverse - the point of things 
like SWIOTLB and iommu-dma is that we don't want to *have* to add 
IOMMU-awareness or explicit bounce-buffering to every driver or 
subsystem which might ever find itself on a machine with more memory 
than its device can address natively. Thus drivers which only need to 
use the DMA API can continue to do so and the arch code hooks up this 
stuff automatically to make sure that just works. However, drivers which 
*do* expect their device to have an IOMMU, and have good cause to manage 
it themselves to do things that simple DMA API calls can't, should of 
course be welcome to implement that extra code and depend on IOMMU_API 
if they so wish. Again, DRM drivers are the prime example (er, no pun 
intended) - simple ones let drm_gem_cma_helper et al do all the heavy 
lifting for them, more complex ones get their hands dirty.

Robin.

* Re: Explicit IOVA management from a PCIe endpoint driver
From: Stephen Warren @ 2018-09-18 18:16 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
	Kishon Vijay Abraham I, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, Vidya Sagar, iommu, Jingoo Han, Joao Pinto

On 09/18/2018 04:59 AM, Robin Murphy wrote:
> Hi Stephen,
> 
> On 17/09/18 22:36, Stephen Warren wrote:
>> Joerg, Christoph, Marek, Robin,
>>
>> I believe that the driver for our PCIe endpoint controller hardware 
>> will need to explicitly manage its IOVA space more than current APIs 
>> allow. I'd like to discuss how to make that possible.
...
>> One final note:
>>
>> The memory controller can translate accesses to a small region of DRAM 
>> address space into accesses to an interrupt generation module. This 
>> allows devices attached to the PCIe bus to generate interrupts to 
>> software running on the system with the PCIe endpoint controller. Thus 
>> I deliberately described API 3 above as mapping a specific physical 
>> address into IOVA space, as opposed to mapping an existing DRAM 
>> allocation into IOVA space, in order to allow mapping this interrupt 
>> generation address space into IOVA space. If we needed separate APIs 
>> to map physical addresses vs. DRAM allocations into IOVA space, that 
>> would likely be fine too.
> 
> If that's the standard DesignWare MSI dingaling, then all you should 
> need to do is ensure you IOVA is reserved in your allocator (if it can 
> be entirely outside the EP BAR, even better) - AFAIK the writes get 
> completely intercepted such that they never go out to the SMMU side at 
> all, and thus no actual mapping is even needed.

Unfortunately it's not. We have some custom hardware module (that 
already existed for other purposes, such as interaction/synchronization 
between various graphics modules) that we will slightly repurpose as a 
plain interrupt generator for PCIe endpoint use-cases.

>> Does this API proposal sound reasonable?
> 
> Indeed, as I say apart from using streaming DMA for coherency management 
> (which I think could be added in pretty much orthogonally later), this 
> sounds like something you could plumb into the endpoint framework right 
> now with no dependent changes elsewhere.

Great. I'll take a look at Oza's code and see about getting this 
implemented.

Thanks.

* Re: Explicit IOVA management from a PCIe endpoint driver
From: Stephen Warren @ 2018-09-28 20:39 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Christoph Hellwig, Marek Szyprowski, Joerg Roedel,
	Kishon Vijay Abraham I, Lorenzo Pieralisi, linux-pci,
	Bjorn Helgaas, Vidya Sagar, iommu, Jingoo Han, Joao Pinto

On 09/18/2018 12:16 PM, Stephen Warren wrote:
> On 09/18/2018 04:59 AM, Robin Murphy wrote:
>> Hi Stephen,
>>
>> On 17/09/18 22:36, Stephen Warren wrote:
>>> Joerg, Christoph, Marek, Robin,
>>>
>>> I believe that the driver for our PCIe endpoint controller hardware 
>>> will need to explicitly manage its IOVA space more than current APIs 
>>> allow. I'd like to discuss how to make that possible.
...
>>> Does this API proposal sound reasonable?
>>
>> Indeed, as I say apart from using streaming DMA for coherency 
>> management (which I think could be added in pretty much orthogonally 
>> later), this sounds like something you could plumb into the endpoint 
>> framework right now with no dependent changes elsewhere.
> 
> Great. I'll take a look at Oza's code and see about getting this 
> implemented.

I took a longer look at the various APIs in iommu.h and dma-iommu.h. As 
you said, I think most of it is already there. I think we just need to 
add functions iommu_dma_alloc/free_iova() [1] that drivers can call to 
acquire an IOVA range that is guaranteed not to be used by any other 
device that shares the same IOVA domain (i.e. IOMMU ASID). After the 
driver calls that, it can just use iommu_map() and iommu_map_sg() on the 
IOVA range that was reserved. Does that sound reasonable?

[1] there's already a static function of that name for internal use in 
dma-iommu.c. I guess I'd rename that to __iommu_dma_alloc_iova() and 
have the new function be a thin wrapper on top of it.
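
Roughly, the new exported helpers I have in mind would look something 
like this (prototypes only; the names and signatures are tentative and 
do not exist in mainline today):

/* Reserve 'size' bytes of IOVA space in dev's IOVA domain; map nothing. */
dma_addr_t iommu_dma_alloc_iova(struct device *dev, size_t size);

/* Return the reservation once everything mapped within it is unmapped. */
void iommu_dma_free_iova(struct device *dev, dma_addr_t iova, size_t size);

The actual mappings would then be made with the existing iommu_map() / 
iommu_map_sg() calls on the domain returned by 
iommu_get_domain_for_dev(dev).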
