linux-arm-kernel.lists.infradead.org archive mirror
* SMMU performance
@ 2019-09-30 11:00 Russell King - ARM Linux admin
  2019-09-30 11:45 ` Robin Murphy
  0 siblings, 1 reply; 6+ messages in thread
From: Russell King - ARM Linux admin @ 2019-09-30 11:00 UTC (permalink / raw)
  To: linux-arm-kernel, Will Deacon, Robin Murphy

Hi,

While using iperf on a platform using the ARM SMMU (v2), I notice the
following behaviour on v5.3 with Will's iommu patch set merged, kernel
lock debugging disabled.

With iommu.passthrough=1, three consecutive runs:
[  3]  0.0-10.0 sec  4.51 GBytes  3.87 Gbits/sec
[  3]  0.0-10.0 sec  4.53 GBytes  3.89 Gbits/sec
[  3]  0.0-10.0 sec  4.49 GBytes  3.86 Gbits/sec

With iommu.passthrough=0:
[  3]  0.0-10.0 sec  1.77 GBytes  1.52 Gbits/sec
[  3]  0.0-10.0 sec  1.82 GBytes  1.56 Gbits/sec
[  3]  0.0-10.0 sec  1.69 GBytes  1.45 Gbits/sec

Running perf record -a -g ... followed by perf report --no-children
shows:

-   15.72%  iperf            [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestor
   - _raw_spin_unlock_irqrestore
      - 8.95% arm_smmu_tlb_sync_context
           arm_smmu_iotlb_sync
         - __iommu_dma_unmap
            + 4.54% iommu_dma_unmap_sg
            + 4.41% iommu_dma_unmap_page
      - 2.92% alloc_iova_fast
         - iommu_dma_alloc_iova.isra.26
            + 1.54% iommu_dma_map_sg
            + 1.38% __iommu_dma_map
      - 2.64% free_iova_fast
           iommu_dma_free_iova
         - __iommu_dma_unmap
            + 1.35% iommu_dma_unmap_sg
            + 1.29% iommu_dma_unmap_page

which seems to be pointing to the SMMU code as a bottleneck.
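
For reference, the sync path that the profile points at boils down to
roughly this (my paraphrase of the v5.3 arm-smmu code, with the timeout
and error handling trimmed; arm_smmu_cb_base() is just a stand-in for
the real context-bank address lookup):

/* Every unmap on the domain ends up here: take the per-context lock,
 * then spin until the SMMU reports the invalidations have drained. */
static void arm_smmu_tlb_sync_context(void *cookie)
{
        struct arm_smmu_domain *smmu_domain = cookie;
        void __iomem *cb = arm_smmu_cb_base(smmu_domain);
        unsigned long flags;

        spin_lock_irqsave(&smmu_domain->cb_lock, flags);

        writel_relaxed(0, cb + ARM_SMMU_CB_TLBSYNC);
        while (readl_relaxed(cb + ARM_SMMU_CB_TLBSTATUS) & sTLBGSTATUS_GSACTIVE)
                cpu_relax();    /* the CPU sits here, under the lock */

        spin_unlock_irqrestore(&smmu_domain->cb_lock, flags);
}

so with several CPUs unmapping on the same domain, everything serialises
on cb_lock and on the hardware drain time.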

Will suggests that his iommu changes (in his for-joerg/arm-smmu/updates
branch) allow IOMMU driver modifications that may have a beneficial
effect.  Any thoughts?

Thanks.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up


* Re: SMMU performance
  2019-09-30 11:00 SMMU performance Russell King - ARM Linux admin
@ 2019-09-30 11:45 ` Robin Murphy
  2019-09-30 11:54   ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Robin Murphy @ 2019-09-30 11:45 UTC (permalink / raw)
  To: Russell King - ARM Linux admin, linux-arm-kernel, Will Deacon

Hi Russell,

On 30/09/2019 12:00, Russell King - ARM Linux admin wrote:
> Hi,
> 
> While using iperf on a platform using the ARM SMMU (v2), I notice the
> following behaviour on v5.3 with Will's iommu patch set merged, kernel
> lock debugging disabled.
> 
> With iommu.passthrough=1, three consecutive runs:
> [  3]  0.0-10.0 sec  4.51 GBytes  3.87 Gbits/sec
> [  3]  0.0-10.0 sec  4.53 GBytes  3.89 Gbits/sec
> [  3]  0.0-10.0 sec  4.49 GBytes  3.86 Gbits/sec
> 
> With iommu.passthrough=0:
> [  3]  0.0-10.0 sec  1.77 GBytes  1.52 Gbits/sec
> [  3]  0.0-10.0 sec  1.82 GBytes  1.56 Gbits/sec
> [  3]  0.0-10.0 sec  1.69 GBytes  1.45 Gbits/sec
> 
> Running perf record -a -g ... followed by perf report --no-children
> shows:
> 
> -   15.72%  iperf            [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestor
>     - _raw_spin_unlock_irqrestore
>        - 8.95% arm_smmu_tlb_sync_context
>             arm_smmu_iotlb_sync
>           - __iommu_dma_unmap
>              + 4.54% iommu_dma_unmap_sg
>              + 4.41% iommu_dma_unmap_page
>        - 2.92% alloc_iova_fast
>           - iommu_dma_alloc_iova.isra.26
>              + 1.54% iommu_dma_map_sg
>              + 1.38% __iommu_dma_map
>        - 2.64% free_iova_fast
>             iommu_dma_free_iova
>           - __iommu_dma_unmap
>              + 1.35% iommu_dma_unmap_sg
>              + 1.29% iommu_dma_unmap_page
> 
> which seems to be pointing to the SMMU code as a bottleneck.
> 
> Will suggests that his iommu changes (in his for-joerg/arm-smmu/updates
> branch) allow IOMMU driver modifications that may have a beneficial
> effect.  Any thoughts?

We default to synchronous invalidation on unmaps, since it gives the 
greatest degree of security against misbehaving devices (and proves 
quite useful for smoking out dodgy drivers too). If you're happy with 
deferred invalidation as x86 defaults to, try "iommu.strict=0" - that 
should avoid the main serialising bottleneck. As for the IOVA allocation 
overhead, that's probably about as low as it's likely to get now - what 
remains is the inevitable "doing anything vs. doing nothing" tradeoff.
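
To illustrate, the unmap path boils down to roughly the following
(paraphrased from the 5.3-era dma-iommu code, with the alignment
fiddling and the new gather plumbing elided):

static void __iommu_dma_unmap(struct device *dev, dma_addr_t dma_addr,
                              size_t size)
{
        struct iommu_domain *domain = iommu_get_dma_domain(dev);
        struct iommu_dma_cookie *cookie = domain->iova_cookie;
        struct iova_domain *iovad = &cookie->iovad;

        iommu_unmap_fast(domain, dma_addr, size);

        if (!cookie->fq_domain) {
                /* strict (default): TLBI + SYNC inline, for every unmap */
                iommu_tlb_sync(domain);
                free_iova_fast(iovad, iova_pfn(iovad, dma_addr),
                               size >> iova_shift(iovad));
        } else {
                /* iommu.strict=0: just queue the IOVA; the flush queue
                 * batches the invalidate-and-sync and frees IOVAs later */
                queue_iova(iovad, iova_pfn(iovad, dma_addr),
                           size >> iova_shift(iovad), 0);
        }
}

With iommu.strict=0 the DMA domain gets a flush queue when it's set up,
so the expensive sync only happens when a batch of queued IOVAs is
eventually reclaimed, rather than once per unmap.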

The major changes in 5.4 are for SMMUv3, so won't impact your platform.

Robin.


* Re: SMMU performance
  2019-09-30 11:45 ` Robin Murphy
@ 2019-09-30 11:54   ` Will Deacon
  2019-09-30 12:00     ` Robin Murphy
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2019-09-30 11:54 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Russell King - ARM Linux admin, linux-arm-kernel

On Mon, Sep 30, 2019 at 12:45:28PM +0100, Robin Murphy wrote:
> On 30/09/2019 12:00, Russell King - ARM Linux admin wrote:
> > While using iperf on a platform using the ARM SMMU (v2), I notice the
> > following behaviour on v5.3 with Will's iommu patch set merged, kernel
> > lock debugging disabled.
> > 
> > With iommu.passthrough=1, three consecutive runs:
> > [  3]  0.0-10.0 sec  4.51 GBytes  3.87 Gbits/sec
> > [  3]  0.0-10.0 sec  4.53 GBytes  3.89 Gbits/sec
> > [  3]  0.0-10.0 sec  4.49 GBytes  3.86 Gbits/sec
> > 
> > With iommu.passthrough=0:
> > [  3]  0.0-10.0 sec  1.77 GBytes  1.52 Gbits/sec
> > [  3]  0.0-10.0 sec  1.82 GBytes  1.56 Gbits/sec
> > [  3]  0.0-10.0 sec  1.69 GBytes  1.45 Gbits/sec
> > 
> > Running perf record -a -g ... followed by perf report --no-children
> > shows:
> > 
> > -   15.72%  iperf            [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestor
> >     - _raw_spin_unlock_irqrestore
> >        - 8.95% arm_smmu_tlb_sync_context
> >             arm_smmu_iotlb_sync
> >           - __iommu_dma_unmap
> >              + 4.54% iommu_dma_unmap_sg
> >              + 4.41% iommu_dma_unmap_page
> >        - 2.92% alloc_iova_fast
> >           - iommu_dma_alloc_iova.isra.26
> >              + 1.54% iommu_dma_map_sg
> >              + 1.38% __iommu_dma_map
> >        - 2.64% free_iova_fast
> >             iommu_dma_free_iova
> >           - __iommu_dma_unmap
> >              + 1.35% iommu_dma_unmap_sg
> >              + 1.29% iommu_dma_unmap_page
> > 
> > which seems to be pointing to the SMMU code as a bottleneck.
> > 
> > Will suggests that his iommu changes (in his for-joerg/arm-smmu/updates
> > branch) allow IOMMU driver modifications that may have a beneficial
> > effect.  Any thoughts?
> 
> We default to synchronous invalidation on unmaps, since it gives the
> greatest degree of security against misbehaving devices (and proves quite
> useful for smoking out dodgy drivers too). If you're happy with deferred
> invalidation as x86 defaults to, try "iommu.strict=0" - that should avoid
> the main serialising bottleneck. As for the IOVA allocation overhead, that's
> probably about as low as it's likely to get now - what remains is the
> inevitable "doing anything vs. doing nothing" tradeoff.
> 
> The major changes in 5.4 are for SMMUv3, so won't impact your platform.

I was wondering whether rigging up the gather stuff would help here but,
looking at the backtrace, the time is spent on the sync itself so I suspect
it won't help. Hmm... I wonder if we can do better using a sequence number
so that we can ride off the back of somebody else's sync?

Will


* Re: SMMU performance
  2019-09-30 11:54   ` Will Deacon
@ 2019-09-30 12:00     ` Robin Murphy
  2019-10-02  9:02       ` Will Deacon
  0 siblings, 1 reply; 6+ messages in thread
From: Robin Murphy @ 2019-09-30 12:00 UTC (permalink / raw)
  To: Will Deacon; +Cc: Russell King - ARM Linux admin, linux-arm-kernel

On 30/09/2019 12:54, Will Deacon wrote:
> On Mon, Sep 30, 2019 at 12:45:28PM +0100, Robin Murphy wrote:
>> On 30/09/2019 12:00, Russell King - ARM Linux admin wrote:
>>> While using iperf on a platform using the ARM SMMU (v2), I notice the
>>> following behaviour on v5.3 with Will's iommu patch set merged, kernel
>>> lock debugging disabled.
>>>
>>> With iommu.passthrough=1, three consecutive runs:
>>> [  3]  0.0-10.0 sec  4.51 GBytes  3.87 Gbits/sec
>>> [  3]  0.0-10.0 sec  4.53 GBytes  3.89 Gbits/sec
>>> [  3]  0.0-10.0 sec  4.49 GBytes  3.86 Gbits/sec
>>>
>>> With iommu.passthrough=0:
>>> [  3]  0.0-10.0 sec  1.77 GBytes  1.52 Gbits/sec
>>> [  3]  0.0-10.0 sec  1.82 GBytes  1.56 Gbits/sec
>>> [  3]  0.0-10.0 sec  1.69 GBytes  1.45 Gbits/sec
>>>
>>> Running perf record -a -g ... followed by perf report --no-children
>>> shows:
>>>
>>> -   15.72%  iperf            [kernel.vmlinux]    [k] _raw_spin_unlock_irqrestor
>>>      - _raw_spin_unlock_irqrestore
>>>         - 8.95% arm_smmu_tlb_sync_context
>>>              arm_smmu_iotlb_sync
>>>            - __iommu_dma_unmap
>>>               + 4.54% iommu_dma_unmap_sg
>>>               + 4.41% iommu_dma_unmap_page
>>>         - 2.92% alloc_iova_fast
>>>            - iommu_dma_alloc_iova.isra.26
>>>               + 1.54% iommu_dma_map_sg
>>>               + 1.38% __iommu_dma_map
>>>         - 2.64% free_iova_fast
>>>              iommu_dma_free_iova
>>>            - __iommu_dma_unmap
>>>               + 1.35% iommu_dma_unmap_sg
>>>               + 1.29% iommu_dma_unmap_page
>>>
>>> which seems to be pointing to the SMMU code as a bottleneck.
>>>
>>> Will suggests that his iommu changes (in his for-joerg/arm-smmu/updates
>>> branch) allow IOMMU driver modifications that may have a beneficial
>>> effect.  Any thoughts?
>>
>> We default to synchronous invalidation on unmaps, since it gives the
>> greatest degree of security against misbehaving devices (and proves quite
>> useful for smoking out dodgy drivers too). If you're happy with deferred
>> invalidation as x86 defaults to, try "iommu.strict=0" - that should avoid
>> the main serialising bottleneck. As for the IOVA allocation overhead, that's
>> probably about as low as it's likely to get now - what remains is the
>> inevitable "doing anything vs. doing nothing" tradeoff.
>>
>> The major changes in 5.4 are for SMMUv3, so won't impact your platform.
> 
> I was wondering whether rigging up the gather stuff would help here but,
> looking at the backtrace, the time is spent on the sync itself so I suspect
> it won't help. Hmm... I wonder if we can do better using a sequence number
> so that we can ride off the back of somebody else's sync?

The trouble with v2 is that then we'd have to introduce locking around 
the invalidates as well, in order to keep track of what the last 
'command' issued in each context was - that's almost certainly going to 
have far more overhead than eliding syncs could possibly save.

Robin.


* Re: SMMU performance
  2019-09-30 12:00     ` Robin Murphy
@ 2019-10-02  9:02       ` Will Deacon
  2019-10-02 11:09         ` Robin Murphy
  0 siblings, 1 reply; 6+ messages in thread
From: Will Deacon @ 2019-10-02  9:02 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Russell King - ARM Linux admin, linux-arm-kernel

On Mon, Sep 30, 2019 at 01:00:00PM +0100, Robin Murphy wrote:
> On 30/09/2019 12:54, Will Deacon wrote:
> > On Mon, Sep 30, 2019 at 12:45:28PM +0100, Robin Murphy wrote:
> > > The major changes in 5.4 are for SMMUv3, so won't impact your platform.
> > 
> > I was wondering whether rigging up the gather stuff would help here but,
> > looking at the backtrace, the time is spent on the sync itself so I suspect
> > it won't help. Hmm... I wonder if we can do better using a sequence number
> > so that we can ride off the back of somebody else's sync?
> 
> The trouble with v2 is that then we'd have to introduce locking around the
> invalidates as well, in order to keep track of what the last 'command'
> issued in each context was - that's almost certainly going to have far more
> overhead than eliding syncs could possibly save.

I was thinking along the lines of allocating an ID to each flush, and then
updating a sync ID on sync, so you can elide the sync if the sync ID is
greater than your flush ID. But it's vague and I didn't try to implement
anything.
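
Roughly this sort of thing, purely as a strawman (made-up names, and it
ignores the ordering between CPUs entirely):

struct sync_elide {
        atomic64_t      flush_seq;      /* ID handed out to each TLBI batch */
        atomic64_t      sync_seq;       /* highest flush ID known to be synced */
};

/* Where we currently queue the TLBIs for an unmap. */
static u64 flush_range(struct sync_elide *s)
{
        /* ... issue the TLBI commands ... */
        return atomic64_inc_return(&s->flush_seq);
}

/* Where we currently do TLBSYNC + poll TLBSTATUS unconditionally. */
static void maybe_sync(struct sync_elide *s, u64 my_flush)
{
        /* Somebody else already synced past our flushes: skip ours. */
        if (atomic64_read(&s->sync_seq) >= my_flush)
                return;

        /* ... write TLBSYNC and spin on TLBSTATUS as today ... */

        /* Let later callers ride off the back of this sync. */
        if (atomic64_read(&s->sync_seq) < my_flush)
                atomic64_set(&s->sync_seq, my_flush);
}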

Will


* Re: SMMU performance
  2019-10-02  9:02       ` Will Deacon
@ 2019-10-02 11:09         ` Robin Murphy
  0 siblings, 0 replies; 6+ messages in thread
From: Robin Murphy @ 2019-10-02 11:09 UTC (permalink / raw)
  To: Will Deacon; +Cc: Russell King - ARM Linux admin, linux-arm-kernel

On 02/10/2019 10:02, Will Deacon wrote:
> On Mon, Sep 30, 2019 at 01:00:00PM +0100, Robin Murphy wrote:
>> On 30/09/2019 12:54, Will Deacon wrote:
>>> On Mon, Sep 30, 2019 at 12:45:28PM +0100, Robin Murphy wrote:
>>>> The major changes in 5.4 are for SMMUv3, so won't impact your platform.
>>>
>>> I was wondering whether rigging up the gather stuff would help here but,
>>> looking at the backtrace, the time is spent on the sync itself so I suspect
>>> it won't help. Hmm... I wonder if we can do better using a sequence number
>>> so that we can ride off the back of somebody else's sync?
>>
>> The trouble with v2 is that then we'd have to introduce locking around the
>> invalidates as well, in order to keep track of what the last 'command'
>> issued in each context was - that's almost certainly going to have far more
>> overhead than eliding syncs could possibly save.
> 
> I was thinking along the lines of allocating an ID to each flush, and then
> updating a sync ID on sync, so you can elide the sync if the sync ID is
> greater than your flush ID. But it's vague and I didn't try to implement
> anything.

I don't think that works:

	  A		  B

	start flush 1
	TLBI
			start flush 2
	TLBI
			TLBI
			SYNC(2)
	TLBI
	TLBI
	...
	SYNC(1)

In that interleaving, B's sync completes before A has even finished 
issuing its invalidations, so by the time A reaches SYNC(1) the sync ID 
is already ahead of its flush ID and A would wrongly skip a sync that 
its later TLBIs still need. Even considering your idea upside-down, it 
seems unlikely to be beneficial for thread B to sit and wait however 
long for sync 1 to be issued just to nominally save issuing sync 2.

Robin.


end of thread, other threads:[~2019-10-02 11:09 UTC | newest]

Thread overview: 6+ messages
2019-09-30 11:00 SMMU performance Russell King - ARM Linux admin
2019-09-30 11:45 ` Robin Murphy
2019-09-30 11:54   ` Will Deacon
2019-09-30 12:00     ` Robin Murphy
2019-10-02  9:02       ` Will Deacon
2019-10-02 11:09         ` Robin Murphy
