* Device address specific mapping of arm,mmu-500
@ 2017-05-30  1:18 ` Ray Jui
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-05-30  1:18 UTC (permalink / raw)
  To: Will Deacon, Robin Murphy, Mark Rutland, Marc Zyngier,
	Joerg Roedel, linux-arm-kernel, iommu, linux-kernel, ray.jui

Hi All,

I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
in v4.12-rc Linux can support a mapping that applies only to a
particular physical address range, while leaving the rest to be handled
by the client device. I believe this can already be expressed through
the device tree binding of the generic IOMMU framework; however, it is
not clear to me whether the arm-smmu.c driver can support it.

To give you some background information:

We have a SoC whose PCIe root complex has a built-in logic block that
forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
block has a HW bug that causes MSI writes to be parsed incorrectly and
can potentially corrupt data in the internal FIFO. A workaround is to
have the ARM MMU-500 take care of all inbound transactions. I found
that this works after hooking up our PCIe root complex to the MMU-500;
however, even with the optimized arm-smmu driver in v4.12, I'm still
seeing a significant Ethernet throughput drop in both the TX and RX
directions. The drop is around 50% (already much improved over prior
kernel versions, where it was 70~90%).

One alternative is to use the MMU-500 only for MSI writes towards the
GITS_TRANSLATER register in the GICv3. That is, if I can define a
specific physical address region that I want the MMU-500 to act on and
leave the rest of the inbound transactions to be handled directly by
our PCIe controller, I can potentially work around the HW bug and at
the same time achieve optimal throughput.

Any feedback from you is greatly appreciated!

Best regards,

Ray

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-30 15:14   ` Will Deacon
  0 siblings, 0 replies; 38+ messages in thread
From: Will Deacon @ 2017-05-30 15:14 UTC (permalink / raw)
  To: Ray Jui
  Cc: Robin Murphy, Mark Rutland, Marc Zyngier, Joerg Roedel,
	linux-arm-kernel, iommu, linux-kernel

On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
> I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
> in v4.12-rc Linux can support a mapping that applies only to a
> particular physical address range, while leaving the rest to be handled
> by the client device. I believe this can already be expressed through
> the device tree binding of the generic IOMMU framework; however, it is
> not clear to me whether the arm-smmu.c driver can support it.
> 
> To give you some background information:
> 
> We have a SoC whose PCIe root complex has a built-in logic block that
> forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
> block has a HW bug that causes MSI writes to be parsed incorrectly and
> can potentially corrupt data in the internal FIFO. A workaround is to
> have the ARM MMU-500 take care of all inbound transactions. I found
> that this works after hooking up our PCIe root complex to the MMU-500;
> however, even with the optimized arm-smmu driver in v4.12, I'm still
> seeing a significant Ethernet throughput drop in both the TX and RX
> directions. The drop is around 50% (already much improved over prior
> kernel versions, where it was 70~90%).

Did Robin's experiments help at all with this?

http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf

> One alternative is to use the MMU-500 only for MSI writes towards the
> GITS_TRANSLATER register in the GICv3. That is, if I can define a
> specific physical address region that I want the MMU-500 to act on and
> leave the rest of the inbound transactions to be handled directly by
> our PCIe controller, I can potentially work around the HW bug and at
> the same time achieve optimal throughput.

I don't think you can bypass the SMMU for MSIs unless you give them their
own StreamIDs, which is likely to break things horribly in the kernel. You
could try to create an identity mapping, but you'll still have the
translation overhead and you'd probably end up having to supply your own DMA
ops to manage the address space. I'm assuming that you need to prevent the
physical address of the ITS from being allocated as an IOVA?
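
For illustration only, a minimal sketch of what that identity mapping
could look like via the generic IOMMU API (this is not arm-smmu code;
the function and its error handling are hypothetical, and it only
covers the mapping side, so you would still need your own DMA ops and
to keep the ITS address out of IOVA allocation):

#include <linux/iommu.h>
#include <linux/device.h>

/*
 * Hypothetical sketch: allocate an unmanaged domain, identity-map one
 * physical window (IOVA == PA) and attach the device to it.  The range
 * passed in is whatever the caller wants to see untranslated.
 */
static int identity_map_range(struct device *dev, phys_addr_t base,
			      size_t size)
{
	struct iommu_domain *dom;
	int ret;

	dom = iommu_domain_alloc(dev->bus);
	if (!dom)
		return -ENOMEM;

	/* IOVA == PA, so the device's view of this window is unchanged */
	ret = iommu_map(dom, base, base, size, IOMMU_READ | IOMMU_WRITE);
	if (ret)
		goto err_free;

	ret = iommu_attach_device(dom, dev);
	if (ret)
		goto err_free;

	return 0;

err_free:
	iommu_domain_free(dom);
	return ret;
}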

> Any feedback from you is greatly appreciated!

Fix the hardware ;)

Will

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-30 16:49     ` Ray Jui via iommu
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-05-30 16:49 UTC (permalink / raw)
  To: Will Deacon
  Cc: Robin Murphy, Mark Rutland, Marc Zyngier, Joerg Roedel,
	linux-arm-kernel, iommu, linux-kernel

Hi Will,

On 5/30/17 8:14 AM, Will Deacon wrote:
> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>> I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
>> in v4.12-rc Linux can support a mapping that applies only to a
>> particular physical address range, while leaving the rest to be handled
>> by the client device. I believe this can already be expressed through
>> the device tree binding of the generic IOMMU framework; however, it is
>> not clear to me whether the arm-smmu.c driver can support it.
>>
>> To give you some background information:
>>
>> We have a SoC whose PCIe root complex has a built-in logic block that
>> forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
>> block has a HW bug that causes MSI writes to be parsed incorrectly and
>> can potentially corrupt data in the internal FIFO. A workaround is to
>> have the ARM MMU-500 take care of all inbound transactions. I found
>> that this works after hooking up our PCIe root complex to the MMU-500;
>> however, even with the optimized arm-smmu driver in v4.12, I'm still
>> seeing a significant Ethernet throughput drop in both the TX and RX
>> directions. The drop is around 50% (already much improved over prior
>> kernel versions, where it was 70~90%).
> 
> Did Robin's experiments help at all with this?
> 
> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
> 

It looks like these are new optimizations that have not yet been merged
in v4.12? I'm going to give it a try.

>> One alternative is to use the MMU-500 only for MSI writes towards the
>> GITS_TRANSLATER register in the GICv3. That is, if I can define a
>> specific physical address region that I want the MMU-500 to act on and
>> leave the rest of the inbound transactions to be handled directly by
>> our PCIe controller, I can potentially work around the HW bug and at
>> the same time achieve optimal throughput.
> 
> I don't think you can bypass the SMMU for MSIs unless you give them their
> own StreamIDs, which is likely to break things horribly in the kernel. You
> could try to create an identity mapping, but you'll still have the
> translation overhead and you'd probably end up having to supply your own DMA
> ops to manage the address space. I'm assuming that you need to prevent the
> physical address of the ITS from being allocated as an IOVA?

Will, is it a HW limitation that the SMMU cannot be used only for MSI
writes, given that in our ASIC the physical address range involved is
very specific and falls within the device memory region (e.g., below
0x80000000)?

In fact, what I need in this case is a static IOMMU mapping of the
physical address of the GITS_TRANSLATER register of the GICv3 ITS,
which is the address that MSI writes go to. This would bypass the MSI
forwarding logic in our PCIe controller. At the same time, I could
leave the rest of the inbound transactions to be handled by our PCIe
controller without going through the MMU.
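
Concretely, what I am picturing is something as small as the sketch
below (the doorbell address is a made-up placeholder and SZ_64K just an
example granule; this is not existing code in our tree):

#include <linux/iommu.h>
#include <linux/sizes.h>

/* Illustrative placeholder for the GITS_TRANSLATER physical address */
#define ITS_DOORBELL_PA		0x63c30000UL

/*
 * Sketch of the static mapping I have in mind: identity-map just the
 * ITS doorbell page (IOVA == PA) in the domain the PCIe root complex
 * uses, so that MSI writes reach the real GITS_TRANSLATER address.
 * Whether everything else can bypass the SMMU is the open question.
 */
static int map_its_doorbell(struct iommu_domain *dom)
{
	return iommu_map(dom, ITS_DOORBELL_PA, ITS_DOORBELL_PA, SZ_64K,
			 IOMMU_WRITE | IOMMU_MMIO);
}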

> 
>> Any feedback from you is greatly appreciated!
> 
> Fix the hardware ;)

Indeed that has to happen with the next revision of the ASIC. But as you
can see I'm getting quite desperate here trying to find an interim solution.

> 
> Will
> 

Thanks for the help!

Ray

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-30 16:59       ` Marc Zyngier
  0 siblings, 0 replies; 38+ messages in thread
From: Marc Zyngier @ 2017-05-30 16:59 UTC (permalink / raw)
  To: Ray Jui, Will Deacon
  Cc: Robin Murphy, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

On 30/05/17 17:49, Ray Jui wrote:
> Hi Will,
> 
> On 5/30/17 8:14 AM, Will Deacon wrote:
>> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>>> I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
>>> in v4.12-rc Linux can support a mapping that applies only to a
>>> particular physical address range, while leaving the rest to be handled
>>> by the client device. I believe this can already be expressed through
>>> the device tree binding of the generic IOMMU framework; however, it is
>>> not clear to me whether the arm-smmu.c driver can support it.
>>>
>>> To give you some background information:
>>>
>>> We have a SoC whose PCIe root complex has a built-in logic block that
>>> forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
>>> block has a HW bug that causes MSI writes to be parsed incorrectly and
>>> can potentially corrupt data in the internal FIFO. A workaround is to
>>> have the ARM MMU-500 take care of all inbound transactions. I found
>>> that this works after hooking up our PCIe root complex to the MMU-500;
>>> however, even with the optimized arm-smmu driver in v4.12, I'm still
>>> seeing a significant Ethernet throughput drop in both the TX and RX
>>> directions. The drop is around 50% (already much improved over prior
>>> kernel versions, where it was 70~90%).
>>
>> Did Robin's experiments help at all with this?
>>
>> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
>>
> 
> It looks like these are new optimizations that have not yet been merged
> in v4.12? I'm going to give it a try.
> 
>>> One alternative is to use the MMU-500 only for MSI writes towards the
>>> GITS_TRANSLATER register in the GICv3. That is, if I can define a
>>> specific physical address region that I want the MMU-500 to act on and
>>> leave the rest of the inbound transactions to be handled directly by
>>> our PCIe controller, I can potentially work around the HW bug and at
>>> the same time achieve optimal throughput.
>>
>> I don't think you can bypass the SMMU for MSIs unless you give them their
>> own StreamIDs, which is likely to break things horribly in the kernel. You
>> could try to create an identity mapping, but you'll still have the
>> translation overhead and you'd probably end up having to supply your own DMA
>> ops to manage the address space. I'm assuming that you need to prevent the
>> physical address of the ITS from being allocated as an IOVA?
> 
> Will, is it a HW limitation that the SMMU cannot be used only for MSI
> writes, given that in our ASIC the physical address range involved is
> very specific and falls within the device memory region (e.g., below
> 0x80000000)?
> 
> In fact, what I need in this case is a static IOMMU mapping of the
> physical address of the GITS_TRANSLATER register of the GICv3 ITS,
> which is the address that MSI writes go to. This would bypass the MSI
> forwarding logic in our PCIe controller. At the same time, I could
> leave the rest of the inbound transactions to be handled by our PCIe
> controller without going through the MMU.

How is that going to work for DMA? I imagine your network interfaces do
have to access memory, don't they? How can the transactions be
terminated in the PCIe controller?

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-30 17:16         ` Ray Jui via iommu
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-05-30 17:16 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon
  Cc: Robin Murphy, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

Hi Marc,

On 5/30/17 9:59 AM, Marc Zyngier wrote:
> On 30/05/17 17:49, Ray Jui wrote:
>> Hi Will,
>>
>> On 5/30/17 8:14 AM, Will Deacon wrote:
>>> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>>>> I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
>>>> in v4.12-rc Linux can support a mapping that applies only to a
>>>> particular physical address range, while leaving the rest to be handled
>>>> by the client device. I believe this can already be expressed through
>>>> the device tree binding of the generic IOMMU framework; however, it is
>>>> not clear to me whether the arm-smmu.c driver can support it.
>>>>
>>>> To give you some background information:
>>>>
>>>> We have a SoC whose PCIe root complex has a built-in logic block that
>>>> forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
>>>> block has a HW bug that causes MSI writes to be parsed incorrectly and
>>>> can potentially corrupt data in the internal FIFO. A workaround is to
>>>> have the ARM MMU-500 take care of all inbound transactions. I found
>>>> that this works after hooking up our PCIe root complex to the MMU-500;
>>>> however, even with the optimized arm-smmu driver in v4.12, I'm still
>>>> seeing a significant Ethernet throughput drop in both the TX and RX
>>>> directions. The drop is around 50% (already much improved over prior
>>>> kernel versions, where it was 70~90%).
>>>
>>> Did Robin's experiments help at all with this?
>>>
>>> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
>>>
>>
>> It looks like these are new optimizations that have not yet been merged
>> in v4.12? I'm going to give it a try.
>>
>>>> One alternative is to use the MMU-500 only for MSI writes towards the
>>>> GITS_TRANSLATER register in the GICv3. That is, if I can define a
>>>> specific physical address region that I want the MMU-500 to act on and
>>>> leave the rest of the inbound transactions to be handled directly by
>>>> our PCIe controller, I can potentially work around the HW bug and at
>>>> the same time achieve optimal throughput.
>>>
>>> I don't think you can bypass the SMMU for MSIs unless you give them their
>>> own StreamIDs, which is likely to break things horribly in the kernel. You
>>> could try to create an identity mapping, but you'll still have the
>>> translation overhead and you'd probably end up having to supply your own DMA
>>> ops to manage the address space. I'm assuming that you need to prevent the
>>> physical address of the ITS from being allocated as an IOVA?
>>
>> Will, is it a HW limitation that the SMMU cannot be used only for MSI
>> writes, given that in our ASIC the physical address range involved is
>> very specific and falls within the device memory region (e.g., below
>> 0x80000000)?
>>
>> In fact, what I need in this case is a static IOMMU mapping of the
>> physical address of the GITS_TRANSLATER register of the GICv3 ITS,
>> which is the address that MSI writes go to. This would bypass the MSI
>> forwarding logic in our PCIe controller. At the same time, I could
>> leave the rest of the inbound transactions to be handled by our PCIe
>> controller without going through the MMU.
> 
> How is that going to work for DMA? I imagine your network interfaces do
> have to access memory, don't they? How can the transactions be
> terminated in the PCIe controller?

Sorry, I may not have phrased this properly. These inbound transactions
(DMA writes to DDR from the endpoint) do not terminate in the PCIe
controller. They are taken by the PCIe controller as PCIe transactions
and are carried towards the designated memory on the host.

> 
> Thanks,
> 
> 	M.
> 

* Re: Device address specific mapping of arm,mmu-500
  2017-05-30 16:49     ` Ray Jui via iommu
@ 2017-05-30 17:27       ` Robin Murphy
  -1 siblings, 0 replies; 38+ messages in thread
From: Robin Murphy @ 2017-05-30 17:27 UTC (permalink / raw)
  To: Ray Jui, Will Deacon
  Cc: Mark Rutland, Marc Zyngier, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

On 30/05/17 17:49, Ray Jui wrote:
> Hi Will,
> 
> On 5/30/17 8:14 AM, Will Deacon wrote:
>> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>>> I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
>>> in v4.12-rc Linux can support a mapping that applies only to a
>>> particular physical address range, while leaving the rest to be handled
>>> by the client device. I believe this can already be expressed through
>>> the device tree binding of the generic IOMMU framework; however, it is
>>> not clear to me whether the arm-smmu.c driver can support it.
>>>
>>> To give you some background information:
>>>
>>> We have a SoC whose PCIe root complex has a built-in logic block that
>>> forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
>>> block has a HW bug that causes MSI writes to be parsed incorrectly and
>>> can potentially corrupt data in the internal FIFO. A workaround is to
>>> have the ARM MMU-500 take care of all inbound transactions. I found
>>> that this works after hooking up our PCIe root complex to the MMU-500;
>>> however, even with the optimized arm-smmu driver in v4.12, I'm still
>>> seeing a significant Ethernet throughput drop in both the TX and RX
>>> directions. The drop is around 50% (already much improved over prior
>>> kernel versions, where it was 70~90%).
>>
>> Did Robin's experiments help at all with this?
>>
>> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
>>
> 
> It looks like these are new optimizations that have not yet been merged
> in v4.12? I'm going to give it a try.

Actually, most of the stuff there did land in 4.12 - only the
iommu/pgtable part is experimental stuff which hasn't been on the list
yet (but hopefully should be soon).

>>> One alternative is to use the MMU-500 only for MSI writes towards the
>>> GITS_TRANSLATER register in the GICv3. That is, if I can define a
>>> specific physical address region that I want the MMU-500 to act on and
>>> leave the rest of the inbound transactions to be handled directly by
>>> our PCIe controller, I can potentially work around the HW bug and at
>>> the same time achieve optimal throughput.
>>
>> I don't think you can bypass the SMMU for MSIs unless you give them their
>> own StreamIDs, which is likely to break things horribly in the kernel. You
>> could try to create an identity mapping, but you'll still have the
>> translation overhead and you'd probably end up having to supply your own DMA
>> ops to manage the address space. I'm assuming that you need to prevent the
>> physical address of the ITS from being allocated as an IOVA?
> 
> Will, is it a HW limitation that the SMMU cannot be used only for MSI
> writes, given that in our ASIC the physical address range involved is
> very specific and falls within the device memory region (e.g., below
> 0x80000000)?

Yes, either translation is enabled or it isn't - we don't have
GART-style apertures. To segregate by address the best you can do is set
up the page tables to identity-map all of the "untranslated" address
space. As Will mentioned, if MSI writes could be distinguished from DMA
writes by Stream ID, rather than by address, then there would be more
options, but in the PCI case at least that's not generally possible.
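
As a rough illustration (the DRAM windows below are assumptions, not
your real memory map), "identity-map everything else" just means
something along these lines for every window that should pass through
unchanged:

#include <linux/iommu.h>
#include <linux/kernel.h>
#include <linux/sizes.h>

/* Assumed DRAM windows to pass through unchanged; purely illustrative */
static const struct { phys_addr_t base; size_t size; } passthrough[] = {
	{ 0x080000000UL, SZ_2G },
	{ 0x880000000UL, SZ_2G },
};

/* Identity-map every "untranslated" window so that IOVA == PA for DMA */
static int identity_map_untranslated(struct iommu_domain *dom)
{
	int i, ret;

	for (i = 0; i < ARRAY_SIZE(passthrough); i++) {
		ret = iommu_map(dom, passthrough[i].base,
				passthrough[i].base, passthrough[i].size,
				IOMMU_READ | IOMMU_WRITE | IOMMU_CACHE);
		if (ret)
			return ret;
	}
	return 0;
}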

Robin.

> In fact, what I need in this case is a static IOMMU mapping of the
> physical address of the GITS_TRANSLATER register of the GICv3 ITS,
> which is the address that MSI writes go to. This would bypass the MSI
> forwarding logic in our PCIe controller. At the same time, I could
> leave the rest of the inbound transactions to be handled by our PCIe
> controller without going through the MMU.
> 
>>
>>> Any feedback from you is greatly appreciated!
>>
>> Fix the hardware ;)
> 
> Indeed that has to happen with the next revision of the ASIC. But as you
> can see I'm getting quite desperate here trying to find an interim solution.
> 
>>
>> Will
>>
> 
> Thanks for the help!
> 
> Ray
> 

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-30 17:27           ` Marc Zyngier
  0 siblings, 0 replies; 38+ messages in thread
From: Marc Zyngier @ 2017-05-30 17:27 UTC (permalink / raw)
  To: Ray Jui, Will Deacon
  Cc: Robin Murphy, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

On 30/05/17 18:16, Ray Jui wrote:
> Hi Marc,
> 
> On 5/30/17 9:59 AM, Marc Zyngier wrote:
>> On 30/05/17 17:49, Ray Jui wrote:
>>> Hi Will,
>>>
>>> On 5/30/17 8:14 AM, Will Deacon wrote:
>>>> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>>>>> I'm writing to check whether the latest arm-smmu.c driver for SMMU-500
>>>>> in v4.12-rc Linux can support a mapping that applies only to a
>>>>> particular physical address range, while leaving the rest to be handled
>>>>> by the client device. I believe this can already be expressed through
>>>>> the device tree binding of the generic IOMMU framework; however, it is
>>>>> not clear to me whether the arm-smmu.c driver can support it.
>>>>>
>>>>> To give you some background information:
>>>>>
>>>>> We have a SoC whose PCIe root complex has a built-in logic block that
>>>>> forwards MSI writes to the ARM GICv3 ITS. Unfortunately, this logic
>>>>> block has a HW bug that causes MSI writes to be parsed incorrectly and
>>>>> can potentially corrupt data in the internal FIFO. A workaround is to
>>>>> have the ARM MMU-500 take care of all inbound transactions. I found
>>>>> that this works after hooking up our PCIe root complex to the MMU-500;
>>>>> however, even with the optimized arm-smmu driver in v4.12, I'm still
>>>>> seeing a significant Ethernet throughput drop in both the TX and RX
>>>>> directions. The drop is around 50% (already much improved over prior
>>>>> kernel versions, where it was 70~90%).
>>>>
>>>> Did Robin's experiments help at all with this?
>>>>
>>>> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
>>>>
>>>
>>> It looks like these are new optimizations that have not yet been merged
>>> in v4.12? I'm going to give it a try.
>>>
>>>>> One alternative is to only use MMU-500 for MSI writes towards
>>>>> GITS_TRANSLATER register in the GICv3, i.e., if I can define a specific
>>>>> region of physical address that I want MMU-500 to act on and leave the
>>>>> rest of inbound transactions to be handled directly by our PCIe
>>>>> controller, it can potentially work around the HW bug we have and at the
>>>>> same time achieve optimal throughput.
>>>>
>>>> I don't think you can bypass the SMMU for MSIs unless you give them their
>>>> own StreamIDs, which is likely to break things horribly in the kernel. You
>>>> could try to create an identity mapping, but you'll still have the
>>>> translation overhead and you'd probably end up having to supply your own DMA
>>>> ops to manage the address space. I'm assuming that you need to prevent the
>>>> physical address of the ITS from being allocated as an IOVA?
>>>
>>> Will, is that a HW limitation that the SMMU cannot be used, only for MSI
>>> writes, in which case, the physical address range is very specific in
>>> our ASIC that falls in the device memory region (e.g., below 0x80000000)?
>>>
>>> In fact, what I need in this case is a static mapping from IOMMU on the
>>> physical address of the GITS_TRANSLATER of the GICv3 ITS, which is the
>>> address that MSI writes go to. This is to bypass the MSI forwarding
>>> logic in our PCIe controller. At the same time, I can leave the rest of
>>> inbound transactions to be handled by our PCIe controller without going
>>> through the MMU.
>>
>> How is that going to work for DMA? I imagine your network interfaces do
>> have to access memory, don't they? How can the transactions be
>> terminated in the PCIe controller?
> 
> Sorry, I may not phrase this properly. These inbound transactions (DMA
> write to DDR, from endpoint) do not terminate in the PCIe controller.
> They are taken by the PCIe controller as PCIe transactions and will be
> carried towards the designated memory on the host.

So what is the StreamID used for these transactions? Is that a different
StreamID from that of the DMAing device? If you want to avoid the SMMU
effect on the transaction, you must make sure it doesn't match anything
there.
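
For reference, stream matching in an SMMUv2 implementation such as
MMU-500 is a masked compare against the Stream Match Registers -
roughly, and purely as an illustration:

#include <linux/types.h>

static bool smr_matches(u16 sid, u16 smr_id, u16 smr_mask)
{
        /* Bits covered by the mask are ignored in the comparison */
        return ((sid ^ smr_id) & ~smr_mask) == 0;
}

so any StreamID that falls under an existing SMR id/mask pair will be
translated along with everything else.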

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-30 22:06             ` Ray Jui via iommu
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-05-30 22:06 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon
  Cc: Robin Murphy, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

Hi Marc/Robin/Will,

On 5/30/17 10:27 AM, Marc Zyngier wrote:
> On 30/05/17 18:16, Ray Jui wrote:
>> Hi Marc,
>>
>> On 5/30/17 9:59 AM, Marc Zyngier wrote:
>>> On 30/05/17 17:49, Ray Jui wrote:
>>>> Hi Will,
>>>>
>>>> On 5/30/17 8:14 AM, Will Deacon wrote:
>>>>> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>>>>>> I'm writing to check with you to see if the latest arm-smmu.c driver in
>>>>>> v4.12-rc Linux for smmu-500 can support mapping that is only specific to
>>>>>> a particular physical address range while leave the rest still to be
>>>>>> handled by the client device. I believe this can already be supported by
>>>>>> the device tree binding of the generic IOMMU framework; however, it is
>>>>>> not clear to me whether or not the arm-smmu.c driver can support it.
>>>>>>
>>>>>> To give you some background information:
>>>>>>
>>>>>> We have a SoC that has PCIe root complex that has a build-in logic block
>>>>>> to forward MSI writes to ARM GICv3 ITS. Unfortunately, this logic block
>>>>>> has a HW bug that causes the MSI writes not parsed properly and can
>>>>>> potentially corrupt data in the internal FIFO. A workaround is to have
>>>>>> ARM MMU-500 takes care of all inbound transactions. I found that is
>>>>>> working after hooking up our PCIe root complex to MMU-500; however, even
>>>>>> with this optimized arm-smmu driver in v4.12, I'm still seeing a
>>>>>> significant Ethernet throughput drop in both the TX and RX directions.
>>>>>> The throughput drop is very significant at around 50% (but is already
>>>>>> much improved compared to other prior kernel versions at 70~90%).
>>>>>
>>>>> Did Robin's experiments help at all with this?
>>>>>
>>>>> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
>>>>>
>>>>
>>>> It looks like these are new optimizations that have not yet been merged
>>>> in v4.12? I'm going to give it a try.
>>>>
>>>>>> One alternative is to only use MMU-500 for MSI writes towards
>>>>>> GITS_TRANSLATER register in the GICv3, i.e., if I can define a specific
>>>>>> region of physical address that I want MMU-500 to act on and leave the
>>>>>> rest of inbound transactions to be handled directly by our PCIe
>>>>>> controller, it can potentially work around the HW bug we have and at the
>>>>>> same time achieve optimal throughput.
>>>>>
>>>>> I don't think you can bypass the SMMU for MSIs unless you give them their
>>>>> own StreamIDs, which is likely to break things horribly in the kernel. You
>>>>> could try to create an identity mapping, but you'll still have the
>>>>> translation overhead and you'd probably end up having to supply your own DMA
>>>>> ops to manage the address space. I'm assuming that you need to prevent the
>>>>> physical address of the ITS from being allocated as an IOVA?
>>>>
>>>> Will, is that a HW limitation that the SMMU cannot be used, only for MSI
>>>> writes, in which case, the physical address range is very specific in
>>>> our ASIC that falls in the device memory region (e.g., below 0x80000000)?
>>>>
>>>> In fact, what I need in this case is a static mapping from IOMMU on the
>>>> physical address of the GITS_TRANSLATER of the GICv3 ITS, which is the
>>>> address that MSI writes go to. This is to bypass the MSI forwarding
>>>> logic in our PCIe controller. At the same time, I can leave the rest of
>>>> inbound transactions to be handled by our PCIe controller without going
>>>> through the MMU.
>>>
>>> How is that going to work for DMA? I imagine your network interfaces do
>>> have to access memory, don't they? How can the transactions be
>>> terminated in the PCIe controller?
>>
>> Sorry, I may not phrase this properly. These inbound transactions (DMA
>> write to DDR, from endpoint) do not terminate in the PCIe controller.
>> They are taken by the PCIe controller as PCIe transactions and will be
>> carried towards the designated memory on the host.
> 
> So what is the StreamID used for these transactions? Is that a different
> StreamID from that of the DMAing device? If you want to avoid the SMMU
> effect on the transaction, you must make sure if doesn't match anything
> there.
> 
> Thanks,
> 
> 	M.
> 

Thanks for the reply. I'm checking with our ASIC team, but from my
understanding, the stream ID in our ASIC is constructed from some custom
fields that a developer can program plus some standard PCIe BDF fields.
That is, I don't think we can make the stream ID from the same PF differ
between MSI writes and DMA writes, as you have already predicted.
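
Hypothetically, the construction is something like the following, with
completely made-up field positions and widths:

#include <linux/pci.h>

static u32 make_stream_id(struct pci_dev *pdev, u32 custom_bits)
{
        /* Programmable bits combined with the PCIe Requester ID (BDF);
         * MSI and DMA writes from the same function therefore carry
         * the same StreamID. */
        return (custom_bits << 16) |
               (pdev->bus->number << 8) | pdev->devfn;
}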

It sounds like I do not have many options here...

Thanks,

Ray

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-31  6:13               ` Ray Jui via iommu
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-05-31  6:13 UTC (permalink / raw)
  To: Marc Zyngier, Will Deacon
  Cc: Robin Murphy, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

Hi Marc/Robin/Will,

I did a little more digging myself and I think I now understand what you 
meant by identity mapping, i.e., configuring the MMU-500 with 1:1 
mapping between the DMA address and the IOVA address.

I think that should work. In the end, due to this MSI write parsing
issue in our PCIe controller, the reason to use the IOMMU is to allow
the cache attributes (AxCACHE) of the MSI writes towards the GICv3 ITS
to be modified by the IOMMU to Device type, while leaving the rest of
the inbound reads/writes from/to DDR with more optimized cache attribute
settings, so that I/O coherency can remain enabled for the PCIe
controller. In fact, the PCIe controller itself is fully capable of DMA
to/from the full address space of our SoC, including both DDR and any
device memory.

The 1:1 mapping will still incur some translation overhead, as you
suggested; however, the overhead of allocating page tables and locking
will be gone. This sounds like the best option I have currently.

May I ask how I should start getting this identity mapping to work as
an experiment and proof of concept? Any pointer or advice is highly
appreciated, as you can see I'm not very experienced with this. I found
that Will recently added IOMMU_DOMAIN_IDENTITY support to the arm-smmu
driver, but I suppose that bypasses the SMMU completely, instead of
still going through the MMU with a 1:1 translation. Is my understanding
correct?
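
Is the starting point for such an experiment roughly the following?
These are only my guesses, not something I have tried:

#include <linux/iommu.h>
#include <linux/pci.h>

/* Untested guess: allocate an unmanaged domain and attach the device
 * to it; that domain would then be populated with the 1:1 mappings
 * plus a Device-type mapping for the ITS doorbell. */
static struct iommu_domain *attach_identity_domain(struct pci_dev *pdev)
{
        struct iommu_domain *dom = iommu_domain_alloc(&pci_bus_type);

        if (!dom)
                return NULL;

        if (iommu_attach_device(dom, &pdev->dev)) {
                iommu_domain_free(dom);
                return NULL;
        }

        return dom;
}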

Thanks,

Ray

On 5/30/2017 3:06 PM, Ray Jui wrote:
> Hi Marc/Robin/Will,
>
> On 5/30/17 10:27 AM, Marc Zyngier wrote:
>> On 30/05/17 18:16, Ray Jui wrote:
>>> Hi Marc,
>>>
>>> On 5/30/17 9:59 AM, Marc Zyngier wrote:
>>>> On 30/05/17 17:49, Ray Jui wrote:
>>>>> Hi Will,
>>>>>
>>>>> On 5/30/17 8:14 AM, Will Deacon wrote:
>>>>>> On Mon, May 29, 2017 at 06:18:45PM -0700, Ray Jui wrote:
>>>>>>> I'm writing to check with you to see if the latest arm-smmu.c driver in
>>>>>>> v4.12-rc Linux for smmu-500 can support mapping that is only specific to
>>>>>>> a particular physical address range while leave the rest still to be
>>>>>>> handled by the client device. I believe this can already be supported by
>>>>>>> the device tree binding of the generic IOMMU framework; however, it is
>>>>>>> not clear to me whether or not the arm-smmu.c driver can support it.
>>>>>>>
>>>>>>> To give you some background information:
>>>>>>>
>>>>>>> We have a SoC that has PCIe root complex that has a build-in logic block
>>>>>>> to forward MSI writes to ARM GICv3 ITS. Unfortunately, this logic block
>>>>>>> has a HW bug that causes the MSI writes not parsed properly and can
>>>>>>> potentially corrupt data in the internal FIFO. A workaround is to have
>>>>>>> ARM MMU-500 takes care of all inbound transactions. I found that is
>>>>>>> working after hooking up our PCIe root complex to MMU-500; however, even
>>>>>>> with this optimized arm-smmu driver in v4.12, I'm still seeing a
>>>>>>> significant Ethernet throughput drop in both the TX and RX directions.
>>>>>>> The throughput drop is very significant at around 50% (but is already
>>>>>>> much improved compared to other prior kernel versions at 70~90%).
>>>>>> Did Robin's experiments help at all with this?
>>>>>>
>>>>>> http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/perf
>>>>>>
>>>>> It looks like these are new optimizations that have not yet been merged
>>>>> in v4.12? I'm going to give it a try.
>>>>>
>>>>>>> One alternative is to only use MMU-500 for MSI writes towards
>>>>>>> GITS_TRANSLATER register in the GICv3, i.e., if I can define a specific
>>>>>>> region of physical address that I want MMU-500 to act on and leave the
>>>>>>> rest of inbound transactions to be handled directly by our PCIe
>>>>>>> controller, it can potentially work around the HW bug we have and at the
>>>>>>> same time achieve optimal throughput.
>>>>>> I don't think you can bypass the SMMU for MSIs unless you give them their
>>>>>> own StreamIDs, which is likely to break things horribly in the kernel. You
>>>>>> could try to create an identity mapping, but you'll still have the
>>>>>> translation overhead and you'd probably end up having to supply your own DMA
>>>>>> ops to manage the address space. I'm assuming that you need to prevent the
>>>>>> physical address of the ITS from being allocated as an IOVA?
>>>>> Will, is that a HW limitation that the SMMU cannot be used, only for MSI
>>>>> writes, in which case, the physical address range is very specific in
>>>>> our ASIC that falls in the device memory region (e.g., below 0x80000000)?
>>>>>
>>>>> In fact, what I need in this case is a static mapping from IOMMU on the
>>>>> physical address of the GITS_TRANSLATER of the GICv3 ITS, which is the
>>>>> address that MSI writes go to. This is to bypass the MSI forwarding
>>>>> logic in our PCIe controller. At the same time, I can leave the rest of
>>>>> inbound transactions to be handled by our PCIe controller without going
>>>>> through the MMU.
>>>> How is that going to work for DMA? I imagine your network interfaces do
>>>> have to access memory, don't they? How can the transactions be
>>>> terminated in the PCIe controller?
>>> Sorry, I may not phrase this properly. These inbound transactions (DMA
>>> write to DDR, from endpoint) do not terminate in the PCIe controller.
>>> They are taken by the PCIe controller as PCIe transactions and will be
>>> carried towards the designated memory on the host.
>> So what is the StreamID used for these transactions? Is that a different
>> StreamID from that of the DMAing device? If you want to avoid the SMMU
>> effect on the transaction, you must make sure if doesn't match anything
>> there.
>>
>> Thanks,
>>
>> 	M.
>>
> Thanks for the reply. I'm checking with our ASIC team, but from my
> understanding, the stream ID in our ASIC is constructed based on the
> some custom fields that a developer can program + some standard PCIe BDF
> fields. That is, I don't think we can make the stream ID from the same
> PF different between MSI writes and DMA writes, as you have already
> predicted.
>
> It sounds like I do not have much option here...
>
> Thanks,
>
> Ray

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-31 12:44                 ` Will Deacon
  0 siblings, 0 replies; 38+ messages in thread
From: Will Deacon @ 2017-05-31 12:44 UTC (permalink / raw)
  To: Ray Jui
  Cc: Marc Zyngier, Robin Murphy, Mark Rutland, Joerg Roedel,
	linux-arm-kernel, iommu, linux-kernel

On Tue, May 30, 2017 at 11:13:36PM -0700, Ray Jui wrote:
> I did a little more digging myself and I think I now understand what you
> meant by identity mapping, i.e., configuring the MMU-500 with 1:1 mapping
> between the DMA address and the IOVA address.
> 
> I think that should work. In the end, due to this MSI write parsing issue in
> our PCIe controller, the reason to use IOMMU is to allow the cache
> attributes (AxCACHE) of the MSI writes towards GICv3 ITS to be modified by
> the IOMMU to be device type, while leaving the rest of inbound reads/writes
> from/to DDR with more optimized cache attributes setting, to allow I/O
> coherency to be still enabled for the PCIe controller. In fact, the PCIe
> controller itself is fully capable of DMA to/from the full address space of
> our SoC including both DDR and any device memory.
> 
> The 1:1 mapping will still pose some translation overhead like you
> suggested; however, the overhead of allocating page tables and locking will
> be gone. This sounds like the best possible option I have currently.

It might end up being pretty invasive to work around a hardware bug, so
we'll have to see what it looks like. Ideally, we could just use the SMMU
for everything as-is and work on clawing back the lost performance (it
should be possible to get ~95% of the perf if we sort out the locking, which
we *are* working on).

> May I ask, how do I start to try to get this identity mapping to work as an
> experiment and proof of concept? Any pointer or advise is highly appreciated
> as you can see I'm not very experienced with this. I found Will recently
> added the IOMMU_DOMAIN_IDENTITY support to the arm-smmu driver. But I
> suppose that is to bypass the SMMU completely, instead of still going
> through the MMU with 1:1 translation. Is my understanding correct?

Yes, I don't think IOMMU_DOMAIN_IDENTITY is what you need because you
actually need per-page control of memory attributes.

Robin might have a better idea, but I think you'll have to hack dma-iommu.c
so that you can have a version of the DMA ops that:

  * Initialises the identity map (I guess as normal WB cacheable?)
  * Reserves and maps the MSI region appropriately
  * Just returns the physical address for the dma address for map requests
    (return error for the MSI region)
  * Does nothing for unmap requests
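
A minimal, untested sketch of that shape, where all of the names are
illustrative rather than the real dma-iommu.c interfaces:

#include <linux/dma-mapping.h>
#include <linux/mm.h>

/* Hypothetical helper: true if [phys, phys + size) overlaps the
 * reserved MSI doorbell window. */
extern bool phys_overlaps_msi_region(phys_addr_t phys, size_t size);

static dma_addr_t identity_map_page(struct device *dev, struct page *page,
                                    unsigned long offset, size_t size,
                                    enum dma_data_direction dir,
                                    unsigned long attrs)
{
        phys_addr_t phys = page_to_phys(page) + offset;

        /* The MSI region must never be handed out as a DMA address */
        if (WARN_ON(phys_overlaps_msi_region(phys, size)))
                return DMA_ERROR_CODE;

        /* IOVA == PA thanks to the pre-built identity map */
        return (dma_addr_t)phys;
}

static void identity_unmap_page(struct device *dev, dma_addr_t handle,
                                size_t size, enum dma_data_direction dir,
                                unsigned long attrs)
{
        /* Nothing to undo: the identity map is static */
}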

But my strong preference would be to fix the locking overhead from the
SMMU so that the perf hit is acceptable.

Will

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
@ 2017-05-31 17:32                   ` Ray Jui via iommu
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-05-31 17:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Marc Zyngier, Robin Murphy, Mark Rutland, Joerg Roedel,
	linux-arm-kernel, iommu, linux-kernel

Hi Will,

On 5/31/17 5:44 AM, Will Deacon wrote:
> On Tue, May 30, 2017 at 11:13:36PM -0700, Ray Jui wrote:
>> I did a little more digging myself and I think I now understand what you
>> meant by identity mapping, i.e., configuring the MMU-500 with 1:1 mapping
>> between the DMA address and the IOVA address.
>>
>> I think that should work. In the end, due to this MSI write parsing issue in
>> our PCIe controller, the reason to use IOMMU is to allow the cache
>> attributes (AxCACHE) of the MSI writes towards GICv3 ITS to be modified by
>> the IOMMU to be device type, while leaving the rest of inbound reads/writes
>> from/to DDR with more optimized cache attributes setting, to allow I/O
>> coherency to be still enabled for the PCIe controller. In fact, the PCIe
>> controller itself is fully capable of DMA to/from the full address space of
>> our SoC including both DDR and any device memory.
>>
>> The 1:1 mapping will still pose some translation overhead like you
>> suggested; however, the overhead of allocating page tables and locking will
>> be gone. This sounds like the best possible option I have currently.
> 
> It might end up being pretty invasive to work around a hardware bug, so
> we'll have to see what it looks like. Ideally, we could just use the SMMU
> for everything as-is and work on clawing back the lost performance (it
> should be possible to get ~95% of the perf if we sort out the locking, which
> we *are* working on).
> 

If 95% of performance can be achieved by fixing the locking in the
driver, then that's great news.

If you have anything that you want me to help test, feel free to send
it out. I will be more than happy to help test it and let you know the
performance numbers. :)

>> May I ask, how do I start to try to get this identity mapping to work as an
>> experiment and proof of concept? Any pointer or advise is highly appreciated
>> as you can see I'm not very experienced with this. I found Will recently
>> added the IOMMU_DOMAIN_IDENTITY support to the arm-smmu driver. But I
>> suppose that is to bypass the SMMU completely, instead of still going
>> through the MMU with 1:1 translation. Is my understanding correct?
> 
> Yes, I don't think IOMMU_DOMAIN_IDENTITY is what you need because you
> actally need per-page control of memory attributes.
> 
> Robin might have a better idea, but I think you'll have to hack dma-iommu.c
> so that you can have a version of the DMA ops that:
> 
>   * Initialises the identity map (I guess as normal WB cacheable?)
>   * Reserves and maps the MSI region appropriately
>   * Just returns the physical address for the dma address for map requests
>     (return error for the MSI region)
>   * Does nothing for unmap requests
> 
> But my strong preference would be to fix the locking overhead from the
> SMMU so that the perf hit is acceptable.

Yes, I agree that we want to be able to use the SMMU in the intended way.
Do you have a timeline for when the locking issue may be fixed (or
improved)? Depending on the timeline, we may still need to go with
identity mapping on our side as a temporary solution until the fix lands.

> 
> Will
> 

Thanks,

Ray

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
@ 2017-06-05 18:03                     ` Ray Jui via iommu
  0 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-06-05 18:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: Marc Zyngier, Robin Murphy, Mark Rutland, Joerg Roedel,
	linux-arm-kernel, iommu, linux-kernel

Hi Will/Robin,

Just wanted to check with you on this again. Do you have a very rough
timeline for when the excessive locking in the IOMMU driver may be fixed
(so we can recover the expected ~95% of performance)?

Thanks,

Ray


On 5/31/17 10:32 AM, Ray Jui wrote:
> Hi Will,
> 
> On 5/31/17 5:44 AM, Will Deacon wrote:
>> On Tue, May 30, 2017 at 11:13:36PM -0700, Ray Jui wrote:
>>> I did a little more digging myself and I think I now understand what you
>>> meant by identity mapping, i.e., configuring the MMU-500 with 1:1 mapping
>>> between the DMA address and the IOVA address.
>>>
>>> I think that should work. In the end, due to this MSI write parsing issue in
>>> our PCIe controller, the reason to use IOMMU is to allow the cache
>>> attributes (AxCACHE) of the MSI writes towards GICv3 ITS to be modified by
>>> the IOMMU to be device type, while leaving the rest of inbound reads/writes
>>> from/to DDR with more optimized cache attributes setting, to allow I/O
>>> coherency to be still enabled for the PCIe controller. In fact, the PCIe
>>> controller itself is fully capable of DMA to/from the full address space of
>>> our SoC including both DDR and any device memory.
>>>
>>> The 1:1 mapping will still pose some translation overhead like you
>>> suggested; however, the overhead of allocating page tables and locking will
>>> be gone. This sounds like the best possible option I have currently.
>>
>> It might end up being pretty invasive to work around a hardware bug, so
>> we'll have to see what it looks like. Ideally, we could just use the SMMU
>> for everything as-is and work on clawing back the lost performance (it
>> should be possible to get ~95% of the perf if we sort out the locking, which
>> we *are* working on).
>>
> 
> If 95% of performance can be achieved by fixing the locking in the
> driver, then that's great news.
> 
> If you have anything that you want me to help test, feel free to send it
> out. I will be more than happy to help testing it and let you know about
> the performance numbers, :)
> 
>>> May I ask, how do I start to try to get this identity mapping to work as an
>>> experiment and proof of concept? Any pointer or advise is highly appreciated
>>> as you can see I'm not very experienced with this. I found Will recently
>>> added the IOMMU_DOMAIN_IDENTITY support to the arm-smmu driver. But I
>>> suppose that is to bypass the SMMU completely, instead of still going
>>> through the MMU with 1:1 translation. Is my understanding correct?
>>
>> Yes, I don't think IOMMU_DOMAIN_IDENTITY is what you need because you
>> actally need per-page control of memory attributes.
>>
>> Robin might have a better idea, but I think you'll have to hack dma-iommu.c
>> so that you can have a version of the DMA ops that:
>>
>>   * Initialises the identity map (I guess as normal WB cacheable?)
>>   * Reserves and maps the MSI region appropriately
>>   * Just returns the physical address for the dma address for map requests
>>     (return error for the MSI region)
>>   * Does nothing for unmap requests
>>
>> But my strong preference would be to fix the locking overhead from the
>> SMMU so that the perf hit is acceptable.
> 
> Yes, I agree, we want to be able to use the SMMU the intended way. Do
> you have a timeline on when the locking issue may be fixed (or
> improved)? Depending on the timeline, on our side, we may still need to
> go for identity mapping as a temporary solution until the fix.
> 
>>
>> Will
>>
> 
> Thanks,
> 
> Ray
> 

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
  2017-06-05 18:03                     ` Ray Jui via iommu
@ 2017-06-06 10:02                       ` Robin Murphy
  -1 siblings, 0 replies; 38+ messages in thread
From: Robin Murphy @ 2017-06-06 10:02 UTC (permalink / raw)
  To: Ray Jui, Will Deacon
  Cc: Marc Zyngier, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

Hi Ray,

On 05/06/17 19:03, Ray Jui wrote:
> Hi Will/Robin,
> 
> Just want to check with you on this again. Do you have a very rough
> timeline on when the excessive locking in the IOMMU driver may be fixed
> (so we can restore expected up to 95% performance)?

I've currently got some experimental patches pushed out here:

    git://linux-arm.org/linux-rm  iommu/pgtable

So far, there's still one silly bug (which doesn't affect DMA ops usage)
and an awkward race for non-coherent table walks that will need
resolving before I have anything to post properly; I hope that will be
within the next couple of weeks. In the meantime, though, it already
seems to work well enough in practice, so any feedback is welcome!

Robin.

> 
> Thanks,
> 
> Ray
> 
> 
> On 5/31/17 10:32 AM, Ray Jui wrote:
>> Hi Will,
>>
>> On 5/31/17 5:44 AM, Will Deacon wrote:
>>> On Tue, May 30, 2017 at 11:13:36PM -0700, Ray Jui wrote:
>>>> I did a little more digging myself and I think I now understand what you
>>>> meant by identity mapping, i.e., configuring the MMU-500 with 1:1 mapping
>>>> between the DMA address and the IOVA address.
>>>>
>>>> I think that should work. In the end, due to this MSI write parsing issue in
>>>> our PCIe controller, the reason to use IOMMU is to allow the cache
>>>> attributes (AxCACHE) of the MSI writes towards GICv3 ITS to be modified by
>>>> the IOMMU to be device type, while leaving the rest of inbound reads/writes
>>>> from/to DDR with more optimized cache attributes setting, to allow I/O
>>>> coherency to be still enabled for the PCIe controller. In fact, the PCIe
>>>> controller itself is fully capable of DMA to/from the full address space of
>>>> our SoC including both DDR and any device memory.
>>>>
>>>> The 1:1 mapping will still pose some translation overhead like you
>>>> suggested; however, the overhead of allocating page tables and locking will
>>>> be gone. This sounds like the best possible option I have currently.
>>>
>>> It might end up being pretty invasive to work around a hardware bug, so
>>> we'll have to see what it looks like. Ideally, we could just use the SMMU
>>> for everything as-is and work on clawing back the lost performance (it
>>> should be possible to get ~95% of the perf if we sort out the locking, which
>>> we *are* working on).
>>>
>>
>> If 95% of performance can be achieved by fixing the locking in the
>> driver, then that's great news.
>>
>> If you have anything that you want me to help test, feel free to send it
>> out. I will be more than happy to help testing it and let you know about
>> the performance numbers, :)
>>
>>>> May I ask, how do I start to try to get this identity mapping to work as an
>>>> experiment and proof of concept? Any pointer or advise is highly appreciated
>>>> as you can see I'm not very experienced with this. I found Will recently
>>>> added the IOMMU_DOMAIN_IDENTITY support to the arm-smmu driver. But I
>>>> suppose that is to bypass the SMMU completely, instead of still going
>>>> through the MMU with 1:1 translation. Is my understanding correct?
>>>
>>> Yes, I don't think IOMMU_DOMAIN_IDENTITY is what you need because you
>>> actally need per-page control of memory attributes.
>>>
>>> Robin might have a better idea, but I think you'll have to hack dma-iommu.c
>>> so that you can have a version of the DMA ops that:
>>>
>>>   * Initialises the identity map (I guess as normal WB cacheable?)
>>>   * Reserves and maps the MSI region appropriately
>>>   * Just returns the physical address for the dma address for map requests
>>>     (return error for the MSI region)
>>>   * Does nothing for unmap requests
>>>
>>> But my strong preference would be to fix the locking overhead from the
>>> SMMU so that the perf hit is acceptable.
>>
>> Yes, I agree, we want to be able to use the SMMU the intended way. Do
>> you have a timeline on when the locking issue may be fixed (or
>> improved)? Depending on the timeline, on our side, we may still need to
>> go for identity mapping as a temporary solution until the fix.
>>
>>>
>>> Will
>>>
>>
>> Thanks,
>>
>> Ray
>>

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: Device address specific mapping of arm,mmu-500
  2017-06-06 10:02                       ` Robin Murphy
@ 2017-06-07  6:20                         ` Ray Jui
  -1 siblings, 0 replies; 38+ messages in thread
From: Ray Jui @ 2017-06-07  6:20 UTC (permalink / raw)
  To: Robin Murphy, Will Deacon
  Cc: Marc Zyngier, Mark Rutland, Joerg Roedel, linux-arm-kernel,
	iommu, linux-kernel

Hi Robin,


On 6/6/2017 3:02 AM, Robin Murphy wrote:
> I've currently got some experimental patches pushed out here:
>
>      git://linux-arm.org/linux-rm  iommu/pgtable
>
> So far, there's still one silly bug (which doesn't affect DMA ops usage)
> and an awkward race for non-coherent table walks which will need
> resolving before I have anything to post properly, which I hope will be
> within the next couple of weeks. In the meantime, though, it already
> seems to work well enough in practice, so any feedback is welcome!
>
> Robin.
Excellent! I'm going to find time to test it out (likely next week). I'll
report back with the test results after that.

Thanks,

Ray

^ permalink raw reply	[flat|nested] 38+ messages in thread

Thread overview (14 messages):
2017-05-30  1:18 Device address specific mapping of arm,mmu-500 Ray Jui
2017-05-30 15:14 ` Will Deacon
2017-05-30 16:49   ` Ray Jui
2017-05-30 16:59     ` Marc Zyngier
2017-05-30 17:16       ` Ray Jui
2017-05-30 17:27         ` Marc Zyngier
2017-05-30 22:06           ` Ray Jui
2017-05-31  6:13             ` Ray Jui
2017-05-31 12:44               ` Will Deacon
2017-05-31 17:32                 ` Ray Jui
2017-06-05 18:03                   ` Ray Jui
2017-06-06 10:02                     ` Robin Murphy
2017-06-07  6:20                       ` Ray Jui
2017-05-30 17:27     ` Robin Murphy
