* IOMMU in bypass mode by default on ARM64, instead of command line option
@ 2017-03-04 16:59 Sunil Kovvuri
  2017-03-06 12:34 ` Robin Murphy
  0 siblings, 1 reply; 5+ messages in thread
From: Sunil Kovvuri @ 2017-03-04 16:59 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

I saw some patches submitted earlier that let the user boot the kernel
with the SMMU in bypass mode via a command line parameter.

The idea sounds good, but I am wondering whether it would be better to do
things the other way around, i.e. put the SMMU in bypass mode by default on
the host and give the user an option to enable it via bootargs. The reason I
suggest this is that with the SMMU enabled on the host, device performance
drops to very low levels, not because of the hardware but mostly because of
the ARM IOMMU implementation in the kernel.

On a Cavium ARM64 platform, below are some performance numbers
with and without the SMMU enabled on the host.
=======================================================
Iperf numbers with Intel 40G NIC: Without SMMU: 31.5Gbps, with SMMU: 820Mbps
FIO perf with Intel NVMe disk:

With SMMU on
=============

Random read:
--------------------

TEST:Rand MIX:0 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:69579 MBPS:284 TOTALCPU:
TEST:Rand MIX:0 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:25299 MBPS:414 TOTALCPU:
TEST:Rand MIX:0 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:6264 MBPS:410 TOTALCPU:
TEST:Rand MIX:0 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:381 MBPS:399 TOTALCPU:

Random write
-------------------

TEST:Rand MIX:100 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:77159 MBPS:316 TOTALCPU:
TEST:Rand MIX:100 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:27283 MBPS:447 TOTALCPU:
TEST:Rand MIX:100 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:6634 MBPS:434 TOTALCPU:
TEST:Rand MIX:100 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:385 MBPS:403 TOTALCPU:


With SMMU off
==============

Random read
----------------------

TEST:Rand MIX:0 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:410392 MBPS:1680 TOTALCPU:
TEST:Rand MIX:0 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:152583 MBPS:2499 TOTALCPU:
TEST:Rand MIX:0 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:37144 MBPS:2434 TOTALCPU:
TEST:Rand MIX:0 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:2386 MBPS:2501 TOTALCPU:

Random write
----------------------

TEST:Rand MIX:100 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:99678 MBPS:408 TOTALCPU:
TEST:Rand MIX:100 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:113912 MBPS:1866 TOTALCPU:
TEST:Rand MIX:100 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:28871 MBPS:1892 TOTALCPU:
TEST:Rand MIX:100 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:1828 MBPS:1916 TOTALCPU:
=====================================================

With the SMMU enabled, performance drops to less than a fifth on the NVMe
disk, and on the NIC it is down to almost nothing. The problem seems to be
high contention on the locks used for IOVA maintenance and for updating the
translation tables inside the ARM SMMUv2/v3 drivers.

The Intel and AMD IOMMU drivers make use of per-CPU IOVA caches to
reduce the impact of 'iova_rbtree_lock', but ARM has yet to catch up:
http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/iommu/iova.c?id=9257b4a206fc0229dd5f84b78e4d1ebf3f91d270

The other lock is 'pgtbl_lock', used in the ARM SMMUv2/v3 drivers.
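
To make the per-CPU cache idea concrete, here is a tiny userspace model I
put together (this is only an illustration of the pattern, not the kernel
code; the names, the cache depth and the 4K granule are made up): a single
lock-protected allocator stands in for the rbtree behind 'iova_rbtree_lock',
and a small per-thread cache absorbs most alloc/free pairs so they never
touch that lock.

/*
 * Toy model of per-CPU IOVA caching, for illustration only.
 * Build with: gcc -O2 -pthread iova_cache_model.c
 */
#include <pthread.h>
#include <stdio.h>

#define CACHE_DEPTH 16
#define NTHREADS 32

/* Global allocator: hands out 4K-spaced "IOVAs" under one lock, standing
 * in for the rbtree protected by iova_rbtree_lock. */
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long next_iova = 0x100000UL;

static unsigned long global_alloc(void)
{
        unsigned long iova;

        pthread_mutex_lock(&global_lock);
        iova = next_iova;
        next_iova += 0x1000;
        pthread_mutex_unlock(&global_lock);
        return iova;
}

/* Per-thread cache ("magazine") of recently freed IOVAs, analogous to the
 * per-CPU rcaches added for the Intel driver. */
struct iova_cache {
        unsigned long slot[CACHE_DEPTH];
        int depth;
};

static __thread struct iova_cache cache;

static unsigned long cached_alloc(void)
{
        if (cache.depth > 0)                    /* fast path: no lock taken */
                return cache.slot[--cache.depth];
        return global_alloc();                  /* slow path: global lock */
}

static void cached_free(unsigned long iova)
{
        if (cache.depth < CACHE_DEPTH)          /* fast path: keep it local */
                cache.slot[cache.depth++] = iova;
        /* A real implementation would hand surplus entries back to the
         * global pool here; the sketch just drops them to stay short. */
}

static void *worker(void *arg)
{
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
                unsigned long iova = cached_alloc();    /* "dma_map"   */
                cached_free(iova);                      /* "dma_unmap" */
        }
        return NULL;
}

int main(void)
{
        pthread_t threads[NTHREADS];

        for (int i = 0; i < NTHREADS; i++)
                pthread_create(&threads[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
                pthread_join(threads[i], NULL);
        printf("done, next_iova=%#lx\n", next_iova);
        return 0;
}

With 32 threads hammering map/unmap, the variant without the per-thread
cache serialises every allocation on global_lock; with the cache the lock
is barely touched after the first iteration, which is the effect the Intel
rcache work has on 'iova_rbtree_lock'.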

I would appreciate any feedback on the following:
# Has anyone done any performance benchmarking before and seen
   similar results?
# Is there scope for improvement here, and has anyone already started
   working on it?
# Until things improve in this area, would it be a good idea to use the
   ARM IOMMU only with VFIO/virtualization? A kernel command line parameter
   might be fine from a developer's perspective, but in a deployment
   environment with a standard distro I don't think it's a feasible option.

Thanks,
Sunil.


* IOMMU in bypass mode by default on ARM64, instead of command line option
  2017-03-04 16:59 IOMMU in bypass mode by default on ARM64, instead of command line option Sunil Kovvuri
@ 2017-03-06 12:34 ` Robin Murphy
  2017-03-07 11:32   ` Will Deacon
  0 siblings, 1 reply; 5+ messages in thread
From: Robin Murphy @ 2017-03-06 12:34 UTC (permalink / raw)
  To: linux-arm-kernel

On 04/03/17 16:59, Sunil Kovvuri wrote:
> Hi,
> 
> I saw some patches submitted earlier that let the user boot the kernel
> with the SMMU in bypass mode via a command line parameter.
> 
> The idea sounds good, but I am wondering whether it would be better to do
> things the other way around, i.e. put the SMMU in bypass mode by default on
> the host and give the user an option to enable it via bootargs. The reason I
> suggest this is that with the SMMU enabled on the host, device performance
> drops to very low levels, not because of the hardware but mostly because of
> the ARM IOMMU implementation in the kernel.
> 
> On a Cavium ARM64 platform, below are some performance numbers
> with and without the SMMU enabled on the host.
> =======================================================
> Iperf numbers with Intel 40G NIC: Without SMMU: 31.5Gbps, with SMMU: 820Mbps
> FIO perf with Intel NVMe disk:
> 
> With SMMU on
> =============
> 
> Random read:
> --------------------
> 
> TEST:Rand MIX:0 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:69579 MBPS:284 TOTALCPU:
> TEST:Rand MIX:0 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:25299 MBPS:414 TOTALCPU:
> TEST:Rand MIX:0 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:6264 MBPS:410 TOTALCPU:
> TEST:Rand MIX:0 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:381 MBPS:399 TOTALCPU:
> 
> Random write
> -------------------
> 
> TEST:Rand MIX:100 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:77159 MBPS:316 TOTALCPU:
> TEST:Rand MIX:100 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:27283 MBPS:447 TOTALCPU:
> TEST:Rand MIX:100 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:6634 MBPS:434 TOTALCPU:
> TEST:Rand MIX:100 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:385 MBPS:403 TOTALCPU:
> 
> 
> With SMMU off
> ==============
> 
> Random read
> ----------------------
> 
> TEST:Rand MIX:0 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:410392 MBPS:1680 TOTALCPU:
> TEST:Rand MIX:0 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:152583 MBPS:2499 TOTALCPU:
> TEST:Rand MIX:0 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:37144 MBPS:2434 TOTALCPU:
> TEST:Rand MIX:0 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:2386 MBPS:2501 TOTALCPU:
> 
> Random write
> ----------------------
> 
> TEST:Rand MIX:100 BLOCK_SIZE:4096 THREADS:32 IODEPTH:32 IOPS:99678 MBPS:408 TOTALCPU:
> TEST:Rand MIX:100 BLOCK_SIZE:16384 THREADS:32 IODEPTH:32 IOPS:113912 MBPS:1866 TOTALCPU:
> TEST:Rand MIX:100 BLOCK_SIZE:65536 THREADS:32 IODEPTH:32 IOPS:28871 MBPS:1892 TOTALCPU:
> TEST:Rand MIX:100 BLOCK_SIZE:1048576 THREADS:32 IODEPTH:32 IOPS:1828 MBPS:1916 TOTALCPU:
> =====================================================
> 
> With the SMMU enabled, performance drops to less than a fifth on the NVMe
> disk, and on the NIC it is down to almost nothing. The problem seems to be
> high contention on the locks used for IOVA maintenance and for updating the
> translation tables inside the ARM SMMUv2/v3 drivers.

Yes, we're well aware that there's a whole load of potential lock
contention where there doesn't really need to be. The performance
optimisation effort has mostly been waiting for sufficiently big
systems/workloads to start appearing in order to measure and justify it.
Consider yourself the winner :)

> The Intel and AMD IOMMU drivers make use of per-CPU IOVA caches to
> reduce the impact of 'iova_rbtree_lock', but ARM has yet to catch up:
> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/iommu/iova.c?id=9257b4a206fc0229dd5f84b78e4d1ebf3f91d270
> 
> The other lock is 'pgtbl_lock', used in the ARM SMMUv2/v3 drivers.

Do you have any measurements to show how the contention on those two
locks compares? That would be useful to start with.

> I would appreciate any feedback on the following:
> # Has anyone done any performance benchmarking before and seen
>    similar results?
> # Is there scope for improvement here, and has anyone already started
>    working on it?

Very much so. I've got a patch to convert iommu-dma over to
alloc_iova_fast() which I need to rebase and finish debugging, but hope
to send out in the next couple of weeks. Will has some ideas for
io-pgtable scalability that we've not yet found the time to look into
properly (I still have an old experiment up at [1], but I think we
identified some case in which it's broken).
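
For the io-pgtable side, one general direction (and this is only me
sketching the pattern in made-up userspace code, not the real io-pgtable
code or necessarily what we'll end up doing) is to install table entries
with an atomic compare-and-swap instead of taking a single per-domain
pgtbl_lock, so that map() calls touching different parts of the address
space never contend and a race on the same empty slot only costs the loser
a wasted allocation:

#include <stdatomic.h>
#include <stdlib.h>

#define PTES_PER_TABLE 512

struct table {
        _Atomic(struct table *) slot[PTES_PER_TABLE];
};

/* Return the next-level table for 'index', installing one if the slot is
 * still empty. No lock is taken on either path. */
static struct table *get_or_install(struct table *parent, unsigned int index)
{
        struct table *old = atomic_load(&parent->slot[index]);

        if (old)                        /* already populated: fast path */
                return old;

        struct table *new = calloc(1, sizeof(*new));
        if (!new)
                return NULL;

        /* Try to publish our table; if another thread won the race, free
         * ours and use theirs ('old' is updated to the winning pointer). */
        if (atomic_compare_exchange_strong(&parent->slot[index], &old, new))
                return new;

        free(new);
        return old;
}

int main(void)
{
        static struct table root;       /* zero slots = nothing mapped yet */
        struct table *a = get_or_install(&root, 7);
        struct table *b = get_or_install(&root, 7);

        /* Both lookups must yield the same table, installed without a lock. */
        return (a && a == b) ? 0 : 1;
}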

> # Until things improve in this area, would it be a good idea to use the
>    ARM IOMMU only with VFIO/virtualization? A kernel command line parameter
>    might be fine from a developer's perspective, but in a deployment
>    environment with a standard distro I don't think it's a feasible option.

If you want to disable IOMMU DMA ops by default, you'll first have to
resolve things with the video/display/etc. folks who needed the IOMMU
DMA ops for their stuff to work properly at all ;)

Robin.

[1]: http://www.linux-arm.org/git?p=linux-rm.git;a=shortlog;h=refs/heads/iommu/pgtable

> 
> Thanks,
> Sunil.
> 


* IOMMU in bypass mode by default on ARM64, instead of command line option
  2017-03-06 12:34 ` Robin Murphy
@ 2017-03-07 11:32   ` Will Deacon
  2017-03-07 12:55     ` Sunil Kovvuri
  0 siblings, 1 reply; 5+ messages in thread
From: Will Deacon @ 2017-03-07 11:32 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Mar 06, 2017 at 12:34:36PM +0000, Robin Murphy wrote:
> On 04/03/17 16:59, Sunil Kovvuri wrote:
> > I saw some patches submitted earlier that let the user boot the kernel
> > with the SMMU in bypass mode via a command line parameter.
> > 
> > The idea sounds good, but I am wondering whether it would be better to do
> > things the other way around, i.e. put the SMMU in bypass mode by default on
> > the host and give the user an option to enable it via bootargs. The reason I
> > suggest this is that with the SMMU enabled on the host, device performance
> > drops to very low levels, not because of the hardware but mostly because of
> > the ARM IOMMU implementation in the kernel.

[...]

> > With the SMMU enabled, performance drops to less than a fifth on the NVMe
> > disk, and on the NIC it is down to almost nothing. The problem seems to be
> > high contention on the locks used for IOVA maintenance and for updating the
> > translation tables inside the ARM SMMUv2/v3 drivers.
> 
> Yes, we're well aware that there's a whole load of potential lock
> contention where there doesn't really need to be. The performance
> optimisation effort has mostly been waiting for sufficiently big
> systems/workloads to start appearing in order to measure and justify it.
> Consider yourself the winner :)

Yup. Given that you have numbers (thank you!), we'll bump the priority
of this work and try to get something out that you can test.

I'll also respin my bypass series ASAP.

Will


* IOMMU in bypass mode by default on ARM64, instead of command line option
  2017-03-07 11:32   ` Will Deacon
@ 2017-03-07 12:55     ` Sunil Kovvuri
  2017-03-07 12:59       ` Sunil Kovvuri
  0 siblings, 1 reply; 5+ messages in thread
From: Sunil Kovvuri @ 2017-03-07 12:55 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Mar 7, 2017 at 5:02 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Mon, Mar 06, 2017 at 12:34:36PM +0000, Robin Murphy wrote:
>> On 04/03/17 16:59, Sunil Kovvuri wrote:
>> > I saw some patches submitted earlier that let the user boot the kernel
>> > with the SMMU in bypass mode via a command line parameter.
>> >
>> > The idea sounds good, but I am wondering whether it would be better to do
>> > things the other way around, i.e. put the SMMU in bypass mode by default on
>> > the host and give the user an option to enable it via bootargs. The reason I
>> > suggest this is that with the SMMU enabled on the host, device performance
>> > drops to very low levels, not because of the hardware but mostly because of
>> > the ARM IOMMU implementation in the kernel.
>
> [...]
>
>> > With the SMMU enabled, performance drops to less than a fifth on the NVMe
>> > disk, and on the NIC it is down to almost nothing. The problem seems to be
>> > high contention on the locks used for IOVA maintenance and for updating the
>> > translation tables inside the ARM SMMUv2/v3 drivers.
>>
>> Yes, we're well aware that there's a whole load of potential lock
>> contention where there doesn't really need to be. The performance
>> optimisation effort has mostly been waiting for sufficiently big
>> systems/workloads to start appearing in order to measure and justify it.
>> Consider yourself the winner :)
>
> Yup. Given that you have numbers (thank you!), we'll bump the priority
> of this work and try to get something out that you can test.
>
> I'll also respin my bypass series ASAP.
>
> Will

Good to know that this is a known problem.
We will be more than happy to test your patches.

Thanks,
Sunil.


* IOMMU in bypass mode by default on ARM64, instead of command line option
  2017-03-07 12:55     ` Sunil Kovvuri
@ 2017-03-07 12:59       ` Sunil Kovvuri
  0 siblings, 0 replies; 5+ messages in thread
From: Sunil Kovvuri @ 2017-03-07 12:59 UTC (permalink / raw)
  To: linux-arm-kernel

>> The problem seems to be high contention on the locks used for IOVA
>> maintenance and for updating the translation tables inside the ARM
>> SMMUv2/v3 drivers.

>Yes, we're well aware that there's a whole load of potential lock
>contention where there doesn't really need to be. The performance
>optimisation effort has mostly been waiting for sufficiently big
>systems/workloads to start appearing in order to measure and justify it.
>Consider yourself the winner :)

>> The Intel and AMD IOMMU drivers make use of per-CPU IOVA caches to
>> reduce the impact of 'iova_rbtree_lock', but ARM has yet to catch up:
>> http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/drivers/iommu/iova.c?id=9257b4a206fc0229dd5f84b78e4d1ebf3f91d270
>>
>> The other lock is 'pgtbl_lock', used in the ARM SMMUv2/v3 drivers.

>Do you have any measurements to show how the contention on those two
>locks compares? That would be useful to start with.

Sure, I will capture fresh perf data and send it to you tomorrow.

Thanks,
Sunil.


