linux-kernel.vger.kernel.org archive mirror
* [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
@ 2018-11-01 21:35 Nicolin Chen
  2018-11-02 16:54 ` Robin Murphy
  2018-11-04 15:50 ` Christoph Hellwig
  0 siblings, 2 replies; 10+ messages in thread
From: Nicolin Chen @ 2018-11-01 21:35 UTC (permalink / raw)
  To: joro; +Cc: vdumpa, iommu, linux-kernel

The __GFP_ZERO flag is passed down to the generic page allocation
routine, which zeros everything page by page. This is safe as a
generic approach, but it is not efficient for the iommu allocation
path, which organizes contiguous pages using a scatterlist.

So this change drops __GFP_ZERO from the flags and adds a manual
memset after the page/sg allocations, using the length of each
scatterlist entry.

My test of a 2.5MB allocation shows that iommu_dma_alloc() takes
46% less time, reduced on average from 925 usec to 500 usec.

Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
---
 drivers/iommu/dma-iommu.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b04753b204..e48d995e65c5 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -551,10 +551,13 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 	struct iommu_domain *domain = iommu_get_dma_domain(dev);
 	struct iommu_dma_cookie *cookie = domain->iova_cookie;
 	struct iova_domain *iovad = &cookie->iovad;
+	struct scatterlist *s;
 	struct page **pages;
 	struct sg_table sgt;
 	dma_addr_t iova;
 	unsigned int count, min_size, alloc_sizes = domain->pgsize_bitmap;
+	bool gfp_zero = false;
+	int i;
 
 	*handle = IOMMU_MAPPING_ERROR;
 
@@ -568,6 +571,15 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 	if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
 		alloc_sizes = min_size;
 
+	/*
+	 * The generic zeroing in a length of one page size is slow,
+	 * so do it manually in a length of scatterlist size instead
+	 */
+	if (gfp & __GFP_ZERO) {
+		gfp &= ~__GFP_ZERO;
+		gfp_zero = true;
+	}
+
 	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
 	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
 	if (!pages)
@@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
 	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
 		goto out_free_iova;
 
+	if (gfp_zero) {
+		/* Now zero all the pages in the scatterlist */
+		for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
+			memset(sg_virt(s), 0, s->length);
+	}
+
 	if (!(prot & IOMMU_CACHE)) {
 		struct sg_mapping_iter miter;
 		/*
-- 
2.17.1



* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-01 21:35 [PATCH] iommu/dma: Zero pages manually in a length of scatterlist Nicolin Chen
@ 2018-11-02 16:54 ` Robin Murphy
  2018-11-02 23:36   ` Nicolin Chen
  2018-11-04 15:50 ` Christoph Hellwig
  1 sibling, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2018-11-02 16:54 UTC (permalink / raw)
  To: Nicolin Chen, joro; +Cc: iommu, linux-kernel

On 01/11/2018 21:35, Nicolin Chen wrote:
> The __GFP_ZERO flag is passed down to the generic page allocation
> routine, which zeros everything page by page. This is safe as a
> generic approach, but it is not efficient for the iommu allocation
> path, which organizes contiguous pages using a scatterlist.
> 
> So this change drops __GFP_ZERO from the flags and adds a manual
> memset after the page/sg allocations, using the length of each
> scatterlist entry.
> 
> My test of a 2.5MB allocation shows that iommu_dma_alloc() takes
> 46% less time, reduced on average from 925 usec to 500 usec.

Assuming this is for arm64, I'm somewhat surprised that memset() could 
be that much faster than clear_page(), since they should effectively 
amount to the same thing (a DC ZVA loop). What hardware is this on? 
Profiling to try and see exactly where the extra time goes would be 
interesting too.

> Signed-off-by: Nicolin Chen <nicoleotsuka@gmail.com>
> ---
>   drivers/iommu/dma-iommu.c | 18 ++++++++++++++++++
>   1 file changed, 18 insertions(+)
> 
> diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
> index d1b04753b204..e48d995e65c5 100644
> --- a/drivers/iommu/dma-iommu.c
> +++ b/drivers/iommu/dma-iommu.c
> @@ -551,10 +551,13 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
>   	struct iommu_domain *domain = iommu_get_dma_domain(dev);
>   	struct iommu_dma_cookie *cookie = domain->iova_cookie;
>   	struct iova_domain *iovad = &cookie->iovad;
> +	struct scatterlist *s;
>   	struct page **pages;
>   	struct sg_table sgt;
>   	dma_addr_t iova;
>   	unsigned int count, min_size, alloc_sizes = domain->pgsize_bitmap;
> +	bool gfp_zero = false;
> +	int i;
>   
>   	*handle = IOMMU_MAPPING_ERROR;
>   
> @@ -568,6 +571,15 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
>   	if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
>   		alloc_sizes = min_size;
>   
> +	/*
> +	 * The generic zeroing in a length of one page size is slow,
> +	 * so do it manually in a length of scatterlist size instead
> +	 */
> +	if (gfp & __GFP_ZERO) {
> +		gfp &= ~__GFP_ZERO;
> +		gfp_zero = true;
> +	}

Or just mask it out in __iommu_dma_alloc_pages()?
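
A minimal sketch of that idea, assuming the existing
__iommu_dma_alloc_pages(count, order_mask, gfp) signature, with the
bulk zeroing still done by the caller:

static struct page **__iommu_dma_alloc_pages(unsigned int count,
		unsigned long order_mask, gfp_t gfp)
{
	/*
	 * The caller zeroes the whole buffer once the scatterlist is
	 * built, so don't ask the page allocator to do it page by page.
	 */
	gfp &= ~__GFP_ZERO;

	/* ... existing allocation loop unchanged ... */
}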

> +
>   	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
>   	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
>   	if (!pages)
> @@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
>   	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
>   		goto out_free_iova;
>   
> +	if (gfp_zero) {
> +		/* Now zero all the pages in the scatterlist */
> +		for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
> +			memset(sg_virt(s), 0, s->length);

What if the pages came from highmem? I know that doesn't happen on arm64 
today, but the point of this code *is* to be generic, and other users 
will arrive eventually.

Robin.

> +	}
> +
>   	if (!(prot & IOMMU_CACHE)) {
>   		struct sg_mapping_iter miter;
>   		/*
> 


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-02 16:54 ` Robin Murphy
@ 2018-11-02 23:36   ` Nicolin Chen
  2018-11-05 14:58     ` Christoph Hellwig
  2018-11-06 18:27     ` Robin Murphy
  0 siblings, 2 replies; 10+ messages in thread
From: Nicolin Chen @ 2018-11-02 23:36 UTC (permalink / raw)
  To: Robin Murphy; +Cc: joro, iommu, linux-kernel

On Fri, Nov 02, 2018 at 04:54:07PM +0000, Robin Murphy wrote:
> On 01/11/2018 21:35, Nicolin Chen wrote:
> > The __GFP_ZERO flag is passed down to the generic page allocation
> > routine, which zeros everything page by page. This is safe as a
> > generic approach, but it is not efficient for the iommu allocation
> > path, which organizes contiguous pages using a scatterlist.
> > 
> > So this change drops __GFP_ZERO from the flags and adds a manual
> > memset after the page/sg allocations, using the length of each
> > scatterlist entry.
> > 
> > My test of a 2.5MB allocation shows that iommu_dma_alloc() takes
> > 46% less time, reduced on average from 925 usec to 500 usec.
> 
> Assuming this is for arm64, I'm somewhat surprised that memset() could be
> that much faster than clear_page(), since they should effectively amount to
> the same thing (a DC ZVA loop). What hardware is this on? Profiling to try

I am running with tegra186-p2771-0000.dtb so it's arm64 yes.

> and see exactly where the extra time goes would be interesting too.

I re-ran the test to get some accuracy within the function and got:
1) pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
   // reduced from 422 usec to 56 usec == 366 usec less
2) if (!(prot & IOMMU_CACHE)) {...}	//flush routine
   // reduced from 439 usec to 236 usec == 203 usec less
Note: new memset takes about 164 usec, resulting in 400 usec diff
      for the entire iommu_dma_alloc() function call.

It looks like this might be more than the diff between clear_page
and memset, and might be related to mapping and cache. Any idea?

> > @@ -568,6 +571,15 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
> >   	if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
> >   		alloc_sizes = min_size;
> > +	/*
> > +	 * The generic zeroing in a length of one page size is slow,
> > +	 * so do it manually in a length of scatterlist size instead
> > +	 */
> > +	if (gfp & __GFP_ZERO) {
> > +		gfp &= ~__GFP_ZERO;
> > +		gfp_zero = true;
> > +	}
> 
> Or just mask it out in __iommu_dma_alloc_pages()?

Yea, the change here would be neater then.

> > @@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
> >   	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
> >   		goto out_free_iova;
> > +	if (gfp_zero) {
> > +		/* Now zero all the pages in the scatterlist */
> > +		for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
> > +			memset(sg_virt(s), 0, s->length);
> 
> What if the pages came from highmem? I know that doesn't happen on arm64
> today, but the point of this code *is* to be generic, and other users will
> arrive eventually.

Hmm, so it probably should use sg_miter_start/stop() too? Looking
at the flush routine, which works in PAGE_SIZE chunks on each
iteration, would it be possible to map and memset contiguous pages
together? Actually, the flush routine might also be optimized if
we can map contiguous pages.
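
A rough sketch of that, using the sg_mapping_iter helpers (the
iterator kmaps each chunk, so highmem pages would be covered too):

	struct sg_mapping_iter miter;

	/* Zero the buffer chunk by chunk via temporary mappings */
	sg_miter_start(&miter, sgt.sgl, sgt.orig_nents, SG_MITER_TO_SG);
	while (sg_miter_next(&miter))
		memset(miter.addr, 0, miter.length);
	sg_miter_stop(&miter);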

Thank you
Nicolin


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-01 21:35 [PATCH] iommu/dma: Zero pages manually in a length of scatterlist Nicolin Chen
  2018-11-02 16:54 ` Robin Murphy
@ 2018-11-04 15:50 ` Christoph Hellwig
  2018-11-06 23:46   ` Nicolin Chen
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2018-11-04 15:50 UTC (permalink / raw)
  To: Nicolin Chen; +Cc: joro, iommu, linux-kernel

On Thu, Nov 01, 2018 at 02:35:00PM -0700, Nicolin Chen wrote:
> The __GFP_ZERO flag is passed down to the generic page allocation
> routine, which zeros everything page by page. This is safe as a
> generic approach, but it is not efficient for the iommu allocation
> path, which organizes contiguous pages using a scatterlist.
> 
> So this change drops __GFP_ZERO from the flags and adds a manual
> memset after the page/sg allocations, using the length of each
> scatterlist entry.
> 
> My test of a 2.5MB allocation shows that iommu_dma_alloc() takes
> 46% less time, reduced on average from 925 usec to 500 usec.

And in what case does dma_alloc_* performance even matter?


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-02 23:36   ` Nicolin Chen
@ 2018-11-05 14:58     ` Christoph Hellwig
  2018-11-06 14:39       ` Robin Murphy
  2018-11-06 18:27     ` Robin Murphy
  1 sibling, 1 reply; 10+ messages in thread
From: Christoph Hellwig @ 2018-11-05 14:58 UTC (permalink / raw)
  To: Nicolin Chen; +Cc: Robin Murphy, joro, iommu, linux-kernel

On Fri, Nov 02, 2018 at 04:36:13PM -0700, Nicolin Chen wrote:
> > What if the pages came from highmem? I know that doesn't happen on arm64
> > today, but the point of this code *is* to be generic, and other users will
> > arrive eventually.
> 
> Hmm, so it probably should use sg_miter_start/stop() too? Looking
> at the flush routine, which works in PAGE_SIZE chunks on each
> iteration, would it be possible to map and memset contiguous pages
> together? Actually, the flush routine might also be optimized if
> we can map contiguous pages.

FYI, I have patches I plan to submit soon that gets rid of the
struct scatterlist use in this code to simplify it:

http://git.infradead.org/users/hch/misc.git/commitdiff/84e837fc3248b513f73adde49e04e7c58f605113


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-05 14:58     ` Christoph Hellwig
@ 2018-11-06 14:39       ` Robin Murphy
  2018-11-09  7:45         ` Christoph Hellwig
  0 siblings, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2018-11-06 14:39 UTC (permalink / raw)
  To: Christoph Hellwig, Nicolin Chen; +Cc: joro, iommu, linux-kernel

On 05/11/2018 14:58, Christoph Hellwig wrote:
> On Fri, Nov 02, 2018 at 04:36:13PM -0700, Nicolin Chen wrote:
>>> What if the pages came from highmem? I know that doesn't happen on arm64
>>> today, but the point of this code *is* to be generic, and other users will
>>> arrive eventually.
>>
>> Hmm, so it probably should use sg_miter_start/stop() too? Looking
>> at the flush routine, which works in PAGE_SIZE chunks on each
>> iteration, would it be possible to map and memset contiguous pages
>> together? Actually, the flush routine might also be optimized if
>> we can map contiguous pages.
> 
> FYI, I have patches I plan to submit soon that gets rid of the
> struct scatterlist use in this code to simplify it:

...and I have some significant objections to that simplification which I 
plan to respond with ;)

(namely that it defeats the whole higher-order page allocation business,
which will have varying degrees of performance impact on certain cases)

Robin.


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-02 23:36   ` Nicolin Chen
  2018-11-05 14:58     ` Christoph Hellwig
@ 2018-11-06 18:27     ` Robin Murphy
  2018-11-07  0:11       ` Nicolin Chen
  1 sibling, 1 reply; 10+ messages in thread
From: Robin Murphy @ 2018-11-06 18:27 UTC (permalink / raw)
  To: Nicolin Chen; +Cc: joro, iommu, linux-kernel

On 02/11/2018 23:36, Nicolin Chen wrote:
> On Fri, Nov 02, 2018 at 04:54:07PM +0000, Robin Murphy wrote:
>> On 01/11/2018 21:35, Nicolin Chen wrote:
>>> The __GFP_ZERO flag is passed down to the generic page allocation
>>> routine, which zeros everything page by page. This is safe as a
>>> generic approach, but it is not efficient for the iommu allocation
>>> path, which organizes contiguous pages using a scatterlist.
>>>
>>> So this change drops __GFP_ZERO from the flags and adds a manual
>>> memset after the page/sg allocations, using the length of each
>>> scatterlist entry.
>>>
>>> My test of a 2.5MB allocation shows that iommu_dma_alloc() takes
>>> 46% less time, reduced on average from 925 usec to 500 usec.
>>
>> Assuming this is for arm64, I'm somewhat surprised that memset() could be
>> that much faster than clear_page(), since they should effectively amount to
>> the same thing (a DC ZVA loop). What hardware is this on? Profiling to try
> 
> I am running with tegra186-p2771-0000.dtb so it's arm64 yes.
> 
>> and see exactly where the extra time goes would be interesting too.
> 
> I re-ran the test to get some accuracy within the function and got:
> 1) pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
>     // reduced from 422 usec to 56 usec == 366 usec less
> 2) if (!(prot & IOMMU_CACHE)) {...}	//flush routine
>     // reduced from 439 usec to 236 usec == 203 usec less
> Note: new memset takes about 164 usec, resulting in 400 usec diff
>        for the entire iommu_dma_alloc() function call.
> 
> It looks like this might be more than the diff between clear_page
> and memset, and might be related to mapping and cache. Any idea?

Hmm, I guess it might not be so much clear_page() itself as all the 
gubbins involved in getting there from prep_new_page(). I could perhaps 
make some vague guesses about how the A57 cores might get tickled by the 
different code patterns, but the Denver cores are well beyond my ability 
to reason about. Out of even further curiosity, how does the quick hack 
below compare?

>>> @@ -568,6 +571,15 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
>>>    	if (attrs & DMA_ATTR_ALLOC_SINGLE_PAGES)
>>>    		alloc_sizes = min_size;
>>> +	/*
>>> +	 * The generic zeroing in a length of one page size is slow,
>>> +	 * so do it manually in a length of scatterlist size instead
>>> +	 */
>>> +	if (gfp & __GFP_ZERO) {
>>> +		gfp &= ~__GFP_ZERO;
>>> +		gfp_zero = true;
>>> +	}
>>
>> Or just mask it out in __iommu_dma_alloc_pages()?
> 
> Yea, the change here would be neater then.
> 
>>> @@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
>>>    	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
>>>    		goto out_free_iova;
>>> +	if (gfp_zero) {
>>> +		/* Now zero all the pages in the scatterlist */
>>> +		for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
>>> +			memset(sg_virt(s), 0, s->length);
>>
>> What if the pages came from highmem? I know that doesn't happen on arm64
>> today, but the point of this code *is* to be generic, and other users will
>> arrive eventually.
> 
> Hmm, so it probably should use sg_miter_start/stop() too? Looking
> at the flush routine, which works in PAGE_SIZE chunks on each
> iteration, would it be possible to map and memset contiguous pages
> together? Actually, the flush routine might also be optimized if
> we can map contiguous pages.

I suppose the ideal point at which to do it would be after the remapping 
when we have the entire buffer contiguous in vmalloc space and can make 
best use of prefetchers etc. - DMA_ATTR_NO_KERNEL_MAPPING is a bit of a 
spanner in the works, but we could probably accommodate a special case 
for that. As Christoph points out, this isn't really the place to be 
looking for performance anyway (unless it's pathologically bad as per 
the DMA_ATTR_ALLOC_SINGLE_PAGES fun), but if we're looking at pulling 
the remapping out of the arch code, maybe we could aim to rework the 
zeroing completely as part of that.
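
For illustration only, assuming the common case where the pages do
get remapped (DMA_ATTR_NO_KERNEL_MAPPING aside) and with vmap()
standing in for whatever the remap ends up being, the zeroing could
then collapse into one linear pass over the buffer:

	void *vaddr = vmap(pages, count, VM_MAP, PAGE_KERNEL);

	if (vaddr)
		/* one contiguous pass, friendly to the prefetchers */
		memset(vaddr, 0, count << PAGE_SHIFT);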

Robin.

----->8-----
diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index d1b04753b204..7d28db3bf4bf 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -569,7 +569,7 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
  		alloc_sizes = min_size;

  	count = PAGE_ALIGN(size) >> PAGE_SHIFT;
-	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
+	pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp & ~__GFP_ZERO);
  	if (!pages)
  		return NULL;

@@ -581,15 +581,18 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
  	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
  		goto out_free_iova;

-	if (!(prot & IOMMU_CACHE)) {
+	{
  		struct sg_mapping_iter miter;
  		/*
  		 * The CPU-centric flushing implied by SG_MITER_TO_SG isn't
  		 * sufficient here, so skip it by using the "wrong" direction.
  		 */
  		sg_miter_start(&miter, sgt.sgl, sgt.orig_nents, SG_MITER_FROM_SG);
-		while (sg_miter_next(&miter))
+		while (sg_miter_next(&miter)) {
+			clear_page(miter.addr);
+			if (!(prot & IOMMU_CACHE))
  			flush_page(dev, miter.addr, page_to_phys(miter.page));
+		}
  		sg_miter_stop(&miter);
  	}



* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-04 15:50 ` Christoph Hellwig
@ 2018-11-06 23:46   ` Nicolin Chen
  0 siblings, 0 replies; 10+ messages in thread
From: Nicolin Chen @ 2018-11-06 23:46 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: joro, iommu, linux-kernel

Hi Christoph,

On Sun, Nov 04, 2018 at 07:50:01AM -0800, Christoph Hellwig wrote:
> On Thu, Nov 01, 2018 at 02:35:00PM -0700, Nicolin Chen wrote:
> > The __GFP_ZERO flag is passed down to the generic page allocation
> > routine, which zeros everything page by page. This is safe as a
> > generic approach, but it is not efficient for the iommu allocation
> > path, which organizes contiguous pages using a scatterlist.
> > 
> > So this change drops __GFP_ZERO from the flags and adds a manual
> > memset after the page/sg allocations, using the length of each
> > scatterlist entry.
> > 
> > My test of a 2.5MB allocation shows that iommu_dma_alloc() takes
> > 46% less time, reduced on average from 925 usec to 500 usec.
> 
> And in what case does dma_alloc_* performance even matter?

Honestly, this was amplified by running a local iommu benchmark
test. In practice dma_alloc/free() should not be that stressful,
but we cannot say the performance doesn't matter at all, right?
Though many device drivers pre-allocate memory for DMA usage, it
could matter for a driver that dynamically allocates and releases
buffers.

And actually I have a related question for you: I saw that
dma_direct_alloc() clears the __GFP_ZERO flag and does a manual
memset() after the allocation. Might that also be related to a
performance concern? I don't see any mention of performance in
that part of the code, and the memset() seems to have been there
from the beginning.
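
The pattern I mean is roughly this (simplified, not the exact
dma-direct code):

	/* allocate without __GFP_ZERO, then zero the buffer in one go */
	page = alloc_pages(gfp & ~__GFP_ZERO, get_order(size));
	if (page)
		memset(page_address(page), 0, size);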

Thanks
Nicolin


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-06 18:27     ` Robin Murphy
@ 2018-11-07  0:11       ` Nicolin Chen
  0 siblings, 0 replies; 10+ messages in thread
From: Nicolin Chen @ 2018-11-07  0:11 UTC (permalink / raw)
  To: Robin Murphy; +Cc: joro, iommu, linux-kernel

Hi Robin,

On Tue, Nov 06, 2018 at 06:27:39PM +0000, Robin Murphy wrote:
> > I re-ran the test to get some accuracy within the function and got:
> > 1) pages = __iommu_dma_alloc_pages(count, alloc_sizes >> PAGE_SHIFT, gfp);
> >     // reduced from 422 usec to 56 usec == 366 usec less
> > 2) if (!(prot & IOMMU_CACHE)) {...}	//flush routine
> >     // reduced from 439 usec to 236 usec == 203 usec less
> > Note: new memset takes about 164 usec, resulting in 400 usec diff
> >        for the entire iommu_dma_alloc() function call.
> > 
> > It looks like this might be more than the diff between clear_page
> > and memset, and might be related to mapping and cache. Any idea?
> 
> Hmm, I guess it might not be so much clear_page() itself as all the gubbins
> involved in getting there from prep_new_page(). I could perhaps make some
> vague guesses about how the A57 cores might get tickled by the different
> code patterns, but the Denver cores are well beyond my ability to reason
> about. Out of even further curiosity, how does the quick hack below compare?

I tried out that change, and the results are as follows:
a. Routine (1) reduced from 422 usec to 55 usec
b. Routine (2) increased from 441 usec to 833 usec
c. Overall, it seems to remain the same: 900+ usec

> > > > @@ -581,6 +593,12 @@ struct page **iommu_dma_alloc(struct device *dev, size_t size, gfp_t gfp,
> > > >    	if (sg_alloc_table_from_pages(&sgt, pages, count, 0, size, GFP_KERNEL))
> > > >    		goto out_free_iova;
> > > > +	if (gfp_zero) {
> > > > +		/* Now zero all the pages in the scatterlist */
> > > > +		for_each_sg(sgt.sgl, s, sgt.orig_nents, i)
> > > > +			memset(sg_virt(s), 0, s->length);
> > > 
> > > What if the pages came from highmem? I know that doesn't happen on arm64
> > > today, but the point of this code *is* to be generic, and other users will
> > > arrive eventually.
> > 
> > Hmm, so it probably should use sg_miter_start/stop() too? Looking
> > at the flush routine, which works in PAGE_SIZE chunks on each
> > iteration, would it be possible to map and memset contiguous pages
> > together? Actually, the flush routine might also be optimized if
> > we can map contiguous pages.
> 
> I suppose the ideal point at which to do it would be after the remapping
> when we have the entire buffer contiguous in vmalloc space and can make best
> use of prefetchers etc. - DMA_ATTR_NO_KERNEL_MAPPING is a bit of a spanner
> in the works, but we could probably accommodate a special case for that. As
> Christoph points out, this isn't really the place to be looking for
> performance anyway (unless it's pathologically bad as per the

I see the point. So it would probably be more convincing to make
this change if it showed up in some practical benchmark. I might
need to re-run the tests with heavier use cases.

> DMA_ATTR_ALLOC_SINGLE_PAGES fun), but if we're looking at pulling the
> remapping out of the arch code, maybe we could aim to rework the zeroing
> completely as part of that.

That'd be nice. I believe it'd be good to have.

Thanks
Nicolin


* Re: [PATCH] iommu/dma: Zero pages manually in a length of scatterlist
  2018-11-06 14:39       ` Robin Murphy
@ 2018-11-09  7:45         ` Christoph Hellwig
  0 siblings, 0 replies; 10+ messages in thread
From: Christoph Hellwig @ 2018-11-09  7:45 UTC (permalink / raw)
  To: Robin Murphy; +Cc: Christoph Hellwig, Nicolin Chen, joro, iommu, linux-kernel

On Tue, Nov 06, 2018 at 02:39:26PM +0000, Robin Murphy wrote:
> ...and I have some significant objections to that simplification which I
> plan to respond with ;)
> 
> (namely that it defeats the whole higher-order page allocation business,
> which will have varying degrees of performance impact on certain cases)

Well, please raise your objection there.  The behavior does match what
every other iommu-based dma ops implementation outside of arm/arm64 does,
so there is some precedent for it, to say the least.  But if the only
current user objects I'll surely find a way to accommodate it, though a
good rationale including numbers would be useful to document it.


end of thread, other threads:[~2018-11-09  7:45 UTC | newest]

Thread overview: 10+ messages
2018-11-01 21:35 [PATCH] iommu/dma: Zero pages manually in a length of scatterlist Nicolin Chen
2018-11-02 16:54 ` Robin Murphy
2018-11-02 23:36   ` Nicolin Chen
2018-11-05 14:58     ` Christoph Hellwig
2018-11-06 14:39       ` Robin Murphy
2018-11-09  7:45         ` Christoph Hellwig
2018-11-06 18:27     ` Robin Murphy
2018-11-07  0:11       ` Nicolin Chen
2018-11-04 15:50 ` Christoph Hellwig
2018-11-06 23:46   ` Nicolin Chen
