Re: [RFC PATCH v2 00/19] Try to reduce lock contention on the SMMUv3 command queue

From: Ganapatrao Kulkarni <gklkml16@gmail.com>
To: Will Deacon <will@kernel.org>
Cc: Vijay Kilary <vkilari@codeaurora.org>,
	Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
	Jon Masters <jcm@redhat.com>, Jan Glauber <jglauber@marvell.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	iommu@lists.linux-foundation.org,
	Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
	Robin Murphy <robin.murphy@arm.com>
Date: Fri, 19 Jul 2019 09:55:39 +0530
Message-ID: <CAKTKpr58zHi0Nw=Fb8d4xHUenW1d76V2pkQ_0+BqWQ0OfBmtCQ@mail.gmail.com>
In-Reply-To: <20190711171927.28803-1-will@kernel.org>

Hi Will,

On Thu, Jul 11, 2019 at 10:58 PM Will Deacon <will@kernel.org> wrote:
>
> Hi everyone,
>
> This is a significant rework of the RFC I previously posted here:
>
>   https://lkml.kernel.org/r/20190611134603.4253-1-will.deacon@arm.com
>
> But this time, it looks like it might actually be worthwhile according
> to my perf profiles, where __iommu_unmap() falls a long way down the
> profile for a multi-threaded netperf run. I'm still relying on others to
> confirm this is useful, however.
>
> Some of the changes since last time are:
>
>   * Support for constructing and submitting a list of commands in the
>     driver
>
>   * Numerous changes to the IOMMU and io-pgtable APIs so that we can
>     submit commands in batches
>
>   * Removal of cmpxchg() from cmdq_shared_lock() fast-path
>
>   * Code restructuring and cleanups
>
> This currently applies against my iommu/devel branch that Joerg has pulled
> for 5.3. If you want to test it out, I've put everything here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq
>
> Feedback welcome. I appreciate that we're in the merge window, but I
> wanted to get this on the list for people to look at as an RFC.

I have tried the iommu/cmdq branch on ThunderX2. I see a drastic
reduction in the CPU time consumed by the SMMU CMDQ helper functions
(from 15-20% down to 1-2% in perf top) when I run iperf with 64 or
more clients (-P 64). However, I have not noticed any measurable
performance improvement in the iperf results. IMO, this is still likely
to help the performance of I/O-intensive workloads.

FWIW, you can add:
Tested-by: Ganapatrao Kulkarni <gkulkarni@marvell.com>

>
> Cheers,
>
> Will
>
> --->8
>
> Cc: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
> Cc: Robin Murphy <robin.murphy@arm.com>
> Cc: Jayachandran Chandrasekharan Nair <jnair@marvell.com>
> Cc: Jan Glauber <jglauber@marvell.com>
> Cc: Jon Masters <jcm@redhat.com>
> Cc: Eric Auger <eric.auger@redhat.com>
> Cc: Zhen Lei <thunder.leizhen@huawei.com>
> Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Cc: Vijay Kilary <vkilari@codeaurora.org>
> Cc: Joerg Roedel <joro@8bytes.org>
> Cc: John Garry <john.garry@huawei.com>
> Cc: Alex Williamson <alex.williamson@redhat.com>
>
> Will Deacon (19):
>   iommu: Remove empty iommu_tlb_range_add() callback from iommu_ops
>   iommu/io-pgtable-arm: Remove redundant call to io_pgtable_tlb_sync()
>   iommu/io-pgtable: Rename iommu_gather_ops to iommu_flush_ops
>   iommu: Introduce struct iommu_iotlb_gather for batching TLB flushes
>   iommu: Introduce iommu_iotlb_gather_add_page()
>   iommu: Pass struct iommu_iotlb_gather to ->unmap() and ->iotlb_sync()
>   iommu/io-pgtable: Introduce tlb_flush_walk() and tlb_flush_leaf()
>   iommu/io-pgtable: Hook up ->tlb_flush_walk() and ->tlb_flush_leaf() in
>     drivers
>   iommu/io-pgtable-arm: Call ->tlb_flush_walk() and ->tlb_flush_leaf()
>   iommu/io-pgtable: Replace ->tlb_add_flush() with ->tlb_add_page()
>   iommu/io-pgtable: Remove unused ->tlb_sync() callback
>   iommu/io-pgtable: Pass struct iommu_iotlb_gather to ->unmap()
>   iommu/io-pgtable: Pass struct iommu_iotlb_gather to ->tlb_add_page()
>   iommu/arm-smmu-v3: Separate s/w and h/w views of prod and cons indexes
>   iommu/arm-smmu-v3: Drop unused 'q' argument from Q_OVF macro
>   iommu/arm-smmu-v3: Move low-level queue fields out of arm_smmu_queue
>   iommu/arm-smmu-v3: Operate directly on low-level queue where possible
>   iommu/arm-smmu-v3: Reduce contention during command-queue insertion
>   iommu/arm-smmu-v3: Defer TLB invalidation until ->iotlb_sync()
>
>  drivers/gpu/drm/panfrost/panfrost_mmu.c |  24 +-
>  drivers/iommu/amd_iommu.c               |  11 +-
>  drivers/iommu/arm-smmu-v3.c             | 856 ++++++++++++++++++++++++--------
>  drivers/iommu/arm-smmu.c                | 103 +++-
>  drivers/iommu/dma-iommu.c               |   9 +-
>  drivers/iommu/exynos-iommu.c            |   3 +-
>  drivers/iommu/intel-iommu.c             |   3 +-
>  drivers/iommu/io-pgtable-arm-v7s.c      |  57 +--
>  drivers/iommu/io-pgtable-arm.c          |  48 +-
>  drivers/iommu/iommu.c                   |  24 +-
>  drivers/iommu/ipmmu-vmsa.c              |  28 +-
>  drivers/iommu/msm_iommu.c               |  42 +-
>  drivers/iommu/mtk_iommu.c               |  45 +-
>  drivers/iommu/mtk_iommu_v1.c            |   3 +-
>  drivers/iommu/omap-iommu.c              |   2 +-
>  drivers/iommu/qcom_iommu.c              |  44 +-
>  drivers/iommu/rockchip-iommu.c          |   2 +-
>  drivers/iommu/s390-iommu.c              |   3 +-
>  drivers/iommu/tegra-gart.c              |  12 +-
>  drivers/iommu/tegra-smmu.c              |   2 +-
>  drivers/vfio/vfio_iommu_type1.c         |  27 +-
>  include/linux/io-pgtable.h              |  57 ++-
>  include/linux/iommu.h                   |  92 +++-
>  23 files changed, 1090 insertions(+), 407 deletions(-)
>
> --
> 2.11.0
>

Thanks,
Ganapat
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu