From: Dmitry Osipenko <digetx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> To: Joerg Roedel <joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, Jonathan Hunter <jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org, Thierry Reding <thierry.reding-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>, linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org Subject: Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Date: Mon, 7 May 2018 18:51:38 +0300 [thread overview] Message-ID: <a6e26d17-97ac-d02e-7bf4-f009af1a25dc@gmail.com> (raw) In-Reply-To: <20180507080420.GB18595-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> On 07.05.2018 11:04, Joerg Roedel wrote: > On Mon, May 07, 2018 at 12:19:01AM +0300, Dmitry Osipenko wrote: >> Probably the best variant would be to give an explicit control over syncing to a >> user of the IOMMU API, like for example device driver may perform multiple >> mappings / unmappings and then sync/flush in the end. I'm not sure that it's >> really worth the hassle to shuffle the API right now, maybe we can implement it >> later if needed. Joerg, do you have objections to a 'compound page' approach? > > Have you measured the performance difference on both variants? The > compound-page approach only works for cases when the physical memory you > map contiguous and correctly aligned. Yes, previously I actually only tested mapping of the contiguous allocations (used for memory isolation purposes). But now I've re-tested all variants and got somewhat interesting results. Firstly it is not that easy to test a really sparse mapping simply because memory allocator produces sparse allocation only when memory is _really_ fragmented. Pretty much all of the time the sparse allocations are contiguous or they consist of a very few chunks that do not impose any noticeable performance impact. Secondly, the interesting part is that mapping / unmapping of a contiguous allocation (CMA using DMA API) is slower by ~50% then doing it for a sparse allocation (get_pages using bare IOMMU API). /I think/ it's a shortcoming of the arch/arm/mm/dma-mapping.c, which also suffers from other inflexibilities that Thierry faced recently. Though I haven't really tried to figure out what is the bottleneck yet and Thierry was going to re-write ARM's dma-mapping implementation anyway, I'll take a closer look at this issue a bit later. I've implemented the iotlb_sync_map() and tested things with it. The end result is the same as for the compound page approach, simply because actual allocations are pretty much always contiguous. > If it is really needed I would prefer a separate iotlb_sync_map() > call-back that is just NULL when not needed. This way all users that > don't need it only get a minimal penalty in the mapping path and you > don't have any requirements on the physical memory you map to get good > performance. Summarizing, the iotlb_sync_map() is indeed better way. As you rightly noticed, that approach is also optimal for the non-contiguous cases as we won't have to flush on mapping of each contiguous chunk of the sparse allocation, but after the whole mapping is done. Thierry, Robin and Joerg - thanks for your input, I'll prepare patches implementing the iotlb_sync_map.
WARNING: multiple messages have this Message-ID (diff)
From: Dmitry Osipenko <digetx@gmail.com> To: Joerg Roedel <joro@8bytes.org> Cc: Robin Murphy <robin.murphy@arm.com>, Thierry Reding <thierry.reding@gmail.com>, linux-tegra@vger.kernel.org, iommu@lists.linux-foundation.org, linux-kernel@vger.kernel.org, Jonathan Hunter <jonathanh@nvidia.com> Subject: Re: [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Date: Mon, 7 May 2018 18:51:38 +0300 [thread overview] Message-ID: <a6e26d17-97ac-d02e-7bf4-f009af1a25dc@gmail.com> (raw) In-Reply-To: <20180507080420.GB18595@8bytes.org> On 07.05.2018 11:04, Joerg Roedel wrote: > On Mon, May 07, 2018 at 12:19:01AM +0300, Dmitry Osipenko wrote: >> Probably the best variant would be to give an explicit control over syncing to a >> user of the IOMMU API, like for example device driver may perform multiple >> mappings / unmappings and then sync/flush in the end. I'm not sure that it's >> really worth the hassle to shuffle the API right now, maybe we can implement it >> later if needed. Joerg, do you have objections to a 'compound page' approach? > > Have you measured the performance difference on both variants? The > compound-page approach only works for cases when the physical memory you > map contiguous and correctly aligned. Yes, previously I actually only tested mapping of the contiguous allocations (used for memory isolation purposes). But now I've re-tested all variants and got somewhat interesting results. Firstly it is not that easy to test a really sparse mapping simply because memory allocator produces sparse allocation only when memory is _really_ fragmented. Pretty much all of the time the sparse allocations are contiguous or they consist of a very few chunks that do not impose any noticeable performance impact. Secondly, the interesting part is that mapping / unmapping of a contiguous allocation (CMA using DMA API) is slower by ~50% then doing it for a sparse allocation (get_pages using bare IOMMU API). /I think/ it's a shortcoming of the arch/arm/mm/dma-mapping.c, which also suffers from other inflexibilities that Thierry faced recently. Though I haven't really tried to figure out what is the bottleneck yet and Thierry was going to re-write ARM's dma-mapping implementation anyway, I'll take a closer look at this issue a bit later. I've implemented the iotlb_sync_map() and tested things with it. The end result is the same as for the compound page approach, simply because actual allocations are pretty much always contiguous. > If it is really needed I would prefer a separate iotlb_sync_map() > call-back that is just NULL when not needed. This way all users that > don't need it only get a minimal penalty in the mapping path and you > don't have any requirements on the physical memory you map to get good > performance. Summarizing, the iotlb_sync_map() is indeed better way. As you rightly noticed, that approach is also optimal for the non-contiguous cases as we won't have to flush on mapping of each contiguous chunk of the sparse allocation, but after the whole mapping is done. Thierry, Robin and Joerg - thanks for your input, I'll prepare patches implementing the iotlb_sync_map.
next prev parent reply other threads:[~2018-05-07 15:51 UTC|newest] Thread overview: 34+ messages / expand[flat|nested] mbox.gz Atom feed top 2018-04-09 20:07 [PATCH v1 0/4] Tegra GART fixes and improvements Dmitry Osipenko 2018-04-09 20:07 ` Dmitry Osipenko [not found] ` <cover.1523304324.git.digetx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2018-04-09 20:07 ` [PATCH v1 1/4] iommu/tegra: gart: Add debugging facility Dmitry Osipenko 2018-04-09 20:07 ` Dmitry Osipenko 2018-04-27 9:46 ` Thierry Reding 2018-04-09 20:07 ` [PATCH v1 2/4] iommu/tegra: gart: Fix gart_iommu_unmap() Dmitry Osipenko 2018-04-09 20:07 ` Dmitry Osipenko [not found] ` <dd25a9ff7bad7c92b345c0c0ce2bf235c4c3b6e8.1523304324.git.digetx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2018-04-27 9:43 ` Thierry Reding 2018-04-27 9:43 ` Thierry Reding 2018-04-09 20:07 ` [PATCH v1 3/4] iommu/tegra: gart: Constify number of GART pages Dmitry Osipenko 2018-04-09 20:07 ` Dmitry Osipenko [not found] ` <954659be6760130f6ffd5e733db2ad58cbb8e6e4.1523304324.git.digetx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2018-04-27 9:49 ` Thierry Reding 2018-04-27 9:49 ` Thierry Reding 2018-04-09 20:07 ` [PATCH v1 4/4] iommu/tegra: gart: Optimize map/unmap Dmitry Osipenko 2018-04-09 20:07 ` Dmitry Osipenko [not found] ` <f21a7b6a8f141b87f75687904a76f3728ea639a8.1523304324.git.digetx-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2018-04-27 10:02 ` Thierry Reding 2018-04-27 10:02 ` Thierry Reding 2018-04-27 12:01 ` Dmitry Osipenko 2018-04-27 12:36 ` Robin Murphy 2018-04-27 12:36 ` Robin Murphy [not found] ` <716edf58-38a7-21e5-1668-b866bf392e34-5wv7dgnIgG8@public.gmane.org> 2018-05-06 21:19 ` Dmitry Osipenko 2018-05-06 21:19 ` Dmitry Osipenko [not found] ` <6827bda3-1aa2-da60-a749-8e2dd2e595f3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2018-05-07 8:04 ` Joerg Roedel 2018-05-07 8:04 ` Joerg Roedel [not found] ` <20180507080420.GB18595-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> 2018-05-07 15:51 ` Dmitry Osipenko [this message] 2018-05-07 15:51 ` Dmitry Osipenko [not found] ` <a6e26d17-97ac-d02e-7bf4-f009af1a25dc-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> 2018-05-07 17:38 ` Dmitry Osipenko 2018-05-07 17:38 ` Dmitry Osipenko 2018-05-07 7:59 ` Joerg Roedel 2018-05-07 7:59 ` Joerg Roedel [not found] ` <20180507075920.GA18595-zLv9SwRftAIdnm+yROfE0A@public.gmane.org> 2018-05-07 15:46 ` Dmitry Osipenko 2018-05-07 15:46 ` Dmitry Osipenko 2018-05-03 12:52 ` [PATCH v1 0/4] Tegra GART fixes and improvements Joerg Roedel 2018-05-03 12:52 ` Joerg Roedel
Reply instructions: You may reply publicly to this message via plain-text email using any one of the following methods: * Save the following mbox file, import it into your mail client, and reply-to-all from there: mbox Avoid top-posting and favor interleaved quoting: https://en.wikipedia.org/wiki/Posting_style#Interleaved_style * Reply using the --to, --cc, and --in-reply-to switches of git-send-email(1): git send-email \ --in-reply-to=a6e26d17-97ac-d02e-7bf4-f009af1a25dc@gmail.com \ --to=digetx-re5jqeeqqe8avxtiumwx3w@public.gmane.org \ --cc=iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org \ --cc=jonathanh-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org \ --cc=joro-zLv9SwRftAIdnm+yROfE0A@public.gmane.org \ --cc=linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \ --cc=linux-tegra-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \ --cc=thierry.reding-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org \ /path/to/YOUR_REPLY https://kernel.org/pub/software/scm/git/docs/git-send-email.html * If your mail client supports setting the In-Reply-To header via mailto: links, try the mailto: linkBe sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.