From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S932248AbdGSKXT (ORCPT );
	Wed, 19 Jul 2017 06:23:19 -0400
Received: from foss.arm.com ([217.140.101.70]:37186 "EHLO foss.arm.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1753343AbdGSKXR (ORCPT ); Wed, 19 Jul 2017 06:23:17 -0400
Subject: Re: [PATCH 0/4] Optimise 64-bit IOVA allocations
To: Ard Biesheuvel
Cc: Joerg Roedel, iommu@lists.linux-foundation.org,
	"linux-arm-kernel@lists.infradead.org",
	"linux-kernel@vger.kernel.org", David Woodhouse, Zhen Lei,
	Lorenzo Pieralisi, Jonathan.Cameron@huawei.com,
	nwatters@codeaurora.org, ray.jui@broadcom.com
References: 
From: Robin Murphy
Message-ID: <19661034-093e-a744-b6fb-3d23a285ebe3@arm.com>
Date: Wed, 19 Jul 2017 11:23:12 +0100
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101
	Thunderbird/52.2.1
MIME-Version: 1.0
In-Reply-To: 
Content-Type: text/plain; charset=utf-8
Content-Language: en-GB
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On 19/07/17 09:37, Ard Biesheuvel wrote:
> On 18 July 2017 at 17:57, Robin Murphy wrote:
>> Hi all,
>>
>> In the wake of the ARM SMMU optimisation efforts, it seems that certain
>> workloads (e.g. storage I/O with large scatterlists) probably remain quite
>> heavily influenced by IOVA allocation performance. Separately, Ard also
>> reported massive performance drops for a graphical desktop on AMD Seattle
>> when enabling SMMUs via IORT, which we traced to dma_32bit_pfn in the DMA
>> ops domain getting initialised differently for ACPI vs. DT, and exposing
>> the overhead of the rbtree slow path. Whilst we could go around trying to
>> close up all the little gaps that lead to hitting the slowest case, it
>> seems a much better idea to simply make said slowest case a lot less slow.
>>
>> I had a go at rebasing Leizhen's last IOVA series[1], but ended up finding
>> the changes rather too hard to follow, so I've taken the liberty here of
>> picking the whole thing up and reimplementing the main part in a rather
>> less invasive manner.
>>
>> Robin.
>>
>> [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg17753.html
>>
>> Robin Murphy (1):
>>   iommu/iova: Extend rbtree node caching
>>
>> Zhen Lei (3):
>>   iommu/iova: Optimise rbtree searching
>>   iommu/iova: Optimise the padding calculation
>>   iommu/iova: Make dma_32bit_pfn implicit
>>
>>  drivers/gpu/drm/tegra/drm.c      |   3 +-
>>  drivers/gpu/host1x/dev.c         |   3 +-
>>  drivers/iommu/amd_iommu.c        |   7 +--
>>  drivers/iommu/dma-iommu.c        |  18 +------
>>  drivers/iommu/intel-iommu.c      |  11 ++--
>>  drivers/iommu/iova.c             | 112 ++++++++++++++++-----------------------
>>  drivers/misc/mic/scif/scif_rma.c |   3 +-
>>  include/linux/iova.h             |   8 +--
>>  8 files changed, 60 insertions(+), 105 deletions(-)
>>
>
> These patches look suspiciously like the ones I have been using over
> the past couple of weeks (modulo the tegra and host1x changes) from
> your git tree. They work fine on my AMD Overdrive B1, both in DT and
> in ACPI/IORT modes, although it is difficult to quantify any
> performance deltas on my setup.

Indeed - this is a rebase (to account for those new callers) with a
couple of trivial tweaks to error paths and corner cases that normal
usage shouldn't have been hitting anyway. "No longer unusably awful" is
a good enough performance delta for me :)

> Tested-by: Ard Biesheuvel

Thanks!
Robin.
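
For context on the "rbtree slow path" the cover letter refers to: before this
series, allocations limited to the 32-bit boundary could resume from a cached
rbtree node, while larger (64-bit) allocations fell back to walking the whole
IOVA tree from the top on every call; the series extends the node caching so
the fast path applies generally. The standalone C sketch below illustrates
that caching idea only in spirit - it is not the kernel code from
drivers/iommu/iova.c. A sorted linked list stands in for the rbtree, and every
name in it (struct range, struct allocator, alloc_pfns, and so on) is invented
for the example.

	/* Build and run:  cc -o iova_sketch iova_sketch.c && ./iova_sketch */
	#include <stdio.h>
	#include <stdlib.h>

	struct range {
		unsigned long lo, hi;	/* inclusive PFN range already allocated */
		struct range *next;	/* next range further down the space */
	};

	struct allocator {
		struct range *head;	/* sentinel just above the top of the space */
		struct range *cached;	/* node inserted by the previous allocation */
		unsigned long floor;	/* lowest allocatable PFN */
	};

	static void allocator_init(struct allocator *a, unsigned long floor,
				   unsigned long top)
	{
		a->head = malloc(sizeof(*a->head));
		a->head->lo = a->head->hi = top + 1;	/* fake "range" above the space */
		a->head->next = NULL;
		a->cached = NULL;
		a->floor = floor;
	}

	/*
	 * Allocate @size PFNs at the highest free address <= @limit, working
	 * downwards.  Returns the base PFN, or 0 on failure.
	 */
	static unsigned long alloc_pfns(struct allocator *a, unsigned long size,
					unsigned long limit)
	{
		struct range *cur, *new;

		/*
		 * Fast path: if the requested limit lies at or below the node
		 * left behind by the previous allocation, no gap above that
		 * node can satisfy the request, so resume the walk there.
		 * Slow path: walk everything from the very top, which is what
		 * the pre-series code ended up doing for 64-bit allocations.
		 */
		cur = (a->cached && limit <= a->cached->hi) ? a->cached : a->head;

		for (; cur; cur = cur->next) {
			unsigned long gap_hi = cur->lo - 1;
			unsigned long gap_lo = cur->next ? cur->next->hi + 1
							 : a->floor;

			if (gap_hi > limit)
				gap_hi = limit;
			if (gap_hi < gap_lo || gap_hi - gap_lo + 1 < size)
				continue;

			new = malloc(sizeof(*new));	/* error handling omitted */
			new->hi = gap_hi;
			new->lo = gap_hi - size + 1;
			new->next = cur->next;
			cur->next = new;
			a->cached = new;	/* remember where this walk ended */
			return new->lo;
		}
		return 0;
	}

	int main(void)
	{
		struct allocator a;
		int i;

		allocator_init(&a, 1, 1UL << 20);	/* toy 1M-PFN space */
		for (i = 0; i < 4; i++)
			printf("allocated 32 PFNs at base 0x%lx\n",
			       alloc_pfns(&a, 32, 1UL << 20));
		return 0;
	}

The point is the single cached pointer: as long as the requested limit does
not reach above the previous allocation, the walk resumes where it left off
instead of re-traversing every range above it, which is what turns the worst
case from a full-tree walk into a short local search.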