From: John Garry <john.garry@huawei.com>
To: Will Deacon <will@kernel.org>
Cc: Vijay Kilari <vkilari@codeaurora.org>,
	Jean-Philippe Brucker <jean-philippe.brucker@arm.com>,
	Jon Masters <jcm@redhat.com>, Jan Glauber <jglauber@marvell.com>,
	Alex Williamson <alex.williamson@redhat.com>,
	iommu@lists.linux-foundation.org,
	Jayachandran Chandrasekharan Nair <jnair@marvell.com>,
	Robin Murphy <robin.murphy@arm.com>
Subject: Re: [RFC PATCH v2 00/19] Try to reduce lock contention on the SMMUv3 command queue
Date: Thu, 25 Jul 2019 11:11:45 +0100
Message-ID: <7cf4d92f-fcaf-dacc-5e34-ed315ef52abe@huawei.com>
In-Reply-To: <20190724144817.kecc6kx7lhitaaac@willie-the-truck>

On 24/07/2019 15:48, Will Deacon wrote:
> On Wed, Jul 24, 2019 at 03:25:07PM +0100, John Garry wrote:
>> On 24/07/2019 13:20, Will Deacon wrote:
>>> On Wed, Jul 24, 2019 at 10:58:26AM +0100, John Garry wrote:
>>>> On 11/07/2019 18:19, Will Deacon wrote:
>>>>> This is a significant rework of the RFC I previously posted here:
>>>>>
>>>>>   https://lkml.kernel.org/r/20190611134603.4253-1-will.deacon@arm.com
>>>>>
>>>>> But this time, it looks like it might actually be worthwhile according
>>>>> to my perf profiles, where __iommu_unmap() falls a long way down the
>>>>> profile for a multi-threaded netperf run. I'm still relying on others to
>>>>> confirm this is useful, however.
>>>>>
>>>>> Some of the changes since last time are:
>>>>>
>>>>>   * Support for constructing and submitting a list of commands in the
>>>>>     driver
>>>>>
>>>>>   * Numerous changes to the IOMMU and io-pgtable APIs so that we can
>>>>>     submit commands in batches
>>>>>
>>>>>   * Removal of cmpxchg() from cmdq_shared_lock() fast-path
>>>>>
>>>>>   * Code restructuring and cleanups
>>>>>
>>>>> This currently applies against my iommu/devel branch that Joerg has pulled
>>>>> for 5.3. If you want to test it out, I've put everything here:
>>>>>
>>>>>   https://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git/log/?h=iommu/cmdq
>>>>>
>>>>> Feedback welcome. I appreciate that we're in the merge window, but I
>>>>> wanted to get this on the list for people to look at as an RFC.
>>>>>
>>>>
>>>> I tested storage performance on this series, which I think is a better
>>>> scenario to test than network performance, since the latter is generally
>>>> limited by the network link speed.
>>>
>>> Interesting, thanks for sharing. Do you also see a similar drop in CPU time
>>> to the one reported by Ganapat?
>>
>> Not really, CPU load reported by fio is mostly the same.
>
> That's a pity. Maybe the cmdq isn't actually getting hit very heavily by
> fio.
>
>>>> Baseline performance (will/iommu/devel, commit 9e6ea59f3)
>>>> 8x SAS disks D05	839K IOPS
>>>> 1x NVMe D05		454K IOPS
>>>> 1x NVMe D06		442K IOPS
>>>>
>>>> Patchset performance (will/iommu/cmdq)
>>>> 8x SAS disks D05	835K IOPS
>>>> 1x NVMe D05		472K IOPS
>>>> 1x NVMe D06		459K IOPS
>>>>
>>>> So we see a bit of an NVMe boost, but about the same for 8x disks. With no
>>>> IOMMU, performance is about 918K IOPS for 8x disks, so we are not limited
>>>> by the medium.
>>>
>>> It would be nice to know if this performance gap is because of Linux, or
>>> simply because of the translation overhead in the SMMU hardware. Are you
>>> able to get a perf profile to see where we're spending time?
>>
>> I'll look to do that, but I'd really expect it to be down to the time Linux
>> spends on the DMA maps and unmaps.
>
> Right, and it would be good to see how much of that is in SMMUv3-specific
> code. Another interesting thing to try would be reducing the depth of the
> io-pgtable. We currently key that off VA_BITS, which may be much larger
> than you need (by virtue of being a compile-time value).
>

The perf call-graphs before and after are below. It's interesting to see
how long the mapping takes, so changing the io-pgtable depth may help here.
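
For reference, the io-pgtable depth follows directly from the input
address size and the granule. A rough sketch of the level calculation
(my own illustration, assuming a 4K granule and 8-byte LPAE PTEs, in
the spirit of the DIV_ROUND_UP logic in io-pgtable-arm):

#include <stdio.h>

/* Each level resolves (pg_shift - 3) bits of VA: log2(4096 / 8) = 9
 * bits per level for a 4K granule. */
static int lpae_levels(int ias, int pg_shift)
{
	int bits_per_level = pg_shift - 3;
	int va_bits = ias - pg_shift;

	return (va_bits + bits_per_level - 1) / bits_per_level;
}

int main(void)
{
	/* A compile-time VA_BITS=48 gives 4 levels, but a device that
	 * only needs a 39-bit IAS could walk 3. */
	printf("ias=48 -> %d levels\n", lpae_levels(48, 12));
	printf("ias=39 -> %d levels\n", lpae_levels(39, 12));
	return 0;
}

So dropping from a 48-bit to a 39-bit input address size would save a
level on every table walk.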

The unmapping speed improves - I don't know why that doesn't help the
throughput in the 8x disk scenario, which these profiles are based on.

JFYI, our HW guys are quite concerned with the SMMU programming
interface, specifically its single submission queue, which requires
locking or some other form of synchronization. Obviously it's not the
sole culprit for the time consumed.
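
To illustrate the locking concern: the batching in this series helps
because space on the single queue can be claimed for a whole list of
commands with one atomic operation, rather than holding a spinlock per
command. A minimal sketch of that idea (my own simplification, not the
actual driver code; real submission also needs consumer-space checks,
wrap handling and an ordered doorbell write):

#include <stdatomic.h>
#include <stdint.h>

#define CMDQ_SHIFT	10
#define CMDQ_ENTRIES	(1u << CMDQ_SHIFT)

struct cmd { uint64_t w[2]; };		/* 16-byte command, as on SMMUv3 */

struct cmdq {
	_Atomic uint32_t prod;		/* shared producer index */
	struct cmd ent[CMDQ_ENTRIES];
};

/* One fetch_add reserves 'n' contiguous slots; each CPU then fills
 * its own slots without further synchronization. */
static void cmdq_write_batch(struct cmdq *q, const struct cmd *cmds,
			     uint32_t n)
{
	uint32_t base = atomic_fetch_add_explicit(&q->prod, n,
						  memory_order_relaxed);

	for (uint32_t i = 0; i < n; i++)
		q->ent[(base + i) & (CMDQ_ENTRIES - 1)] = cmds[i];

	/* ...then publish to the hardware with release ordering. */
}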

before map
--3.02%--__iommu_map_sg_attrs
     |
     --2.99%--iommu_dma_map_sg
          |
          --2.71%--iommu_map_sg
               |
               --2.58%--iommu_map
                    |
                    --2.50%--arm_smmu_map
                         |
                         --2.48%--arm_lpae_map
                              |
                              --0.66%--__arm_lpae_map
                                   |
                                   --0.50%--__arm_lpae_map


after map
--3.25%--__iommu_map_sg_attrs
     |
     --3.20%--iommu_dma_map_sg
          |
          --2.77%--iommu_map_sg
               |
               --2.62%--iommu_map
                    |
                    --2.53%--arm_smmu_map
                         |
                         --2.49%--arm_lpae_map
                              |
                              --0.72%--__arm_lpae_map
                                   |
                                   --0.56%--__arm_lpae_map

before unmap
--5.64%--__iommu_unmap_sg_attrs
     |
     --5.63%--iommu_dma_unmap_sg
          |
          --5.53%--__iommu_dma_unmap
               |
               --4.97%--iommu_unmap_fast
                    |
                    --4.94%--__iommu_unmap
                         |
                         --4.91%--arm_smmu_unmap
                              |
                              --4.83%--arm_lpae_unmap
                                   |
                                   --4.80%--__arm_lpae_unmap
                                        |
                                        --4.62%--__arm_lpae_unmap
                                             |
                                             --4.50%--__arm_lpae_unmap
                                                  |
                                                  --4.33%--__arm_lpae_unmap
                                                       |
                                                       --4.13%--arm_smmu_tlb_inv_range_nosync
                                                            |
                                                            --3.97%--arm_smmu_cmdq_issue_cmd
                                                                 |
                                                                 --3.74%--_raw_spin_unlock_irqrestore

after unmap
--2.75%--__iommu_unmap_sg_attrs
     |
     --2.74%--iommu_dma_unmap_sg
          |
          --2.59%--__iommu_dma_unmap
               |
               |--1.82%--arm_smmu_iotlb_sync
               |     |
               |     --1.78%--arm_smmu_tlb_inv_range
               |          |
               |          --1.75%--arm_smmu_cmdq_issue_cmdlist
               |
               --0.65%--iommu_unmap_fast
                    |
                    --0.64%--__iommu_unmap
                         |
                         --0.62%--arm_smmu_unmap
                              |
                              --0.55%--arm_lpae_unmap
                                   |
                                   --0.54%--__arm_lpae_unmap


I tried to capture a record for direct DMA, but dma_direct_map_sg et al.
weren't even showing in the graph. That's the target...

John

> Will
>



