All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-17 20:30 ` Douglas Anderson
  0 siblings, 0 replies; 22+ messages in thread
From: Douglas Anderson @ 2015-12-17 20:30 UTC (permalink / raw)
  To: Russell King
  Cc: Tomasz Figa, Marek Szyprowski, Pawel Osciak, Dmitry Torokhov,
	Douglas Anderson, will.deacon, akpm, rientjes, carlo,
	laurent.pinchart+renesas, mike.looijmans, lorenx4,
	linux-arm-kernel, linux-kernel

The __iommu_alloc_buffer() is expected to be called to allocate pretty
sizeable buffers.  Upon simple tests of video I saw it trying to
allocate 4,194,304 bytes.  The function tries to be efficient about this
by starting out allocating large chunks and then moving to smaller and
smaller chunk sizes until it succeeds.

The current function is very, very slow.

One problem is the way it keeps trying and trying to allocate big
chunks.  Imagine a very fragmented memory that has 4M free but no
contiguous pages at all.  Further imagine allocating 4M (1024 pages).
We'll do the following memory allocations:
- For page 1:
  - Try to allocate order 10 (no retry)
  - Try to allocate order 9 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- For page 2:
  - Try to allocate order 9 (no retry)
  - Try to allocate order 8 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- ...
- ...

Total number of calls to alloc() calls for this case is:
  sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
  => 9228

The above is obviously worse case, but given how slow alloc can be we
really want to try to avoid even somewhat bad cases.  I timed the old
code with a device under memory pressure and it wasn't hard to see it
take more than 24 seconds to allocate 4 megs of memory (!!).

A second problem (and maybe even more important) is that allocating big
chunks when we don't need them is just not a good idea anyway.  The
first thing we do with these big chunks is break them into smaller
chunks!  If we allocate small chunks:
- The memory manager doesn't need to work so hard to give us big chunks.
- We can save the big chunks for those that really need them and this
  code can make great use of all the small chunks sitting around.

Let's simplify by just allocating one page at a time.  We may make more
total allocate calls but it works way better.  In real world tests that
used to sometimes see a 24 second allocation call I can now see at most
250 ms.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
 arch/arm/mm/dma-mapping.c | 38 ++++++--------------------------------
 1 file changed, 6 insertions(+), 32 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 492bf3efffab..7efeb2d4801b 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1160,39 +1160,13 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
 
 	while (count) {
-		int j, order;
-
-		for (order = __fls(count); order > 0; --order) {
-			/*
-			 * We do not want OOM killer to be invoked as long
-			 * as we can fall back to single pages, so we force
-			 * __GFP_NORETRY for orders higher than zero.
-			 */
-			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
-			if (pages[i])
-				break;
-		}
-
-		if (!pages[i]) {
-			/*
-			 * Fall back to single page allocation.
-			 * Might invoke OOM killer as last resort.
-			 */
-			pages[i] = alloc_pages(gfp, 0);
-			if (!pages[i])
-				goto error;
-		}
-
-		if (order) {
-			split_page(pages[i], order);
-			j = 1 << order;
-			while (--j)
-				pages[i + j] = pages[i] + j;
-		}
+		pages[i] = alloc_pages(gfp, 0);
+		if (!pages[i])
+			goto error;
 
-		__dma_clear_buffer(pages[i], PAGE_SIZE << order);
-		i += 1 << order;
-		count -= 1 << order;
+		__dma_clear_buffer(pages[i], PAGE_SIZE);
+		i += 1;
+		count -= 1;
 	}
 
 	return pages;
-- 
2.6.0.rc2.230.g3dd15c0


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-17 20:30 ` Douglas Anderson
  0 siblings, 0 replies; 22+ messages in thread
From: Douglas Anderson @ 2015-12-17 20:30 UTC (permalink / raw)
  To: linux-arm-kernel

The __iommu_alloc_buffer() is expected to be called to allocate pretty
sizeable buffers.  Upon simple tests of video I saw it trying to
allocate 4,194,304 bytes.  The function tries to be efficient about this
by starting out allocating large chunks and then moving to smaller and
smaller chunk sizes until it succeeds.

The current function is very, very slow.

One problem is the way it keeps trying and trying to allocate big
chunks.  Imagine a very fragmented memory that has 4M free but no
contiguous pages at all.  Further imagine allocating 4M (1024 pages).
We'll do the following memory allocations:
- For page 1:
  - Try to allocate order 10 (no retry)
  - Try to allocate order 9 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- For page 2:
  - Try to allocate order 9 (no retry)
  - Try to allocate order 8 (no retry)
  - ...
  - Try to allocate order 0 (with retry, but not needed)
- ...
- ...

Total number of calls to alloc() calls for this case is:
  sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
  => 9228

The above is obviously worse case, but given how slow alloc can be we
really want to try to avoid even somewhat bad cases.  I timed the old
code with a device under memory pressure and it wasn't hard to see it
take more than 24 seconds to allocate 4 megs of memory (!!).

A second problem (and maybe even more important) is that allocating big
chunks when we don't need them is just not a good idea anyway.  The
first thing we do with these big chunks is break them into smaller
chunks!  If we allocate small chunks:
- The memory manager doesn't need to work so hard to give us big chunks.
- We can save the big chunks for those that really need them and this
  code can make great use of all the small chunks sitting around.

Let's simplify by just allocating one page at a time.  We may make more
total allocate calls but it works way better.  In real world tests that
used to sometimes see a 24 second allocation call I can now see at most
250 ms.

Signed-off-by: Douglas Anderson <dianders@chromium.org>
---
 arch/arm/mm/dma-mapping.c | 38 ++++++--------------------------------
 1 file changed, 6 insertions(+), 32 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index 492bf3efffab..7efeb2d4801b 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1160,39 +1160,13 @@ static struct page **__iommu_alloc_buffer(struct device *dev, size_t size,
 	gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
 
 	while (count) {
-		int j, order;
-
-		for (order = __fls(count); order > 0; --order) {
-			/*
-			 * We do not want OOM killer to be invoked as long
-			 * as we can fall back to single pages, so we force
-			 * __GFP_NORETRY for orders higher than zero.
-			 */
-			pages[i] = alloc_pages(gfp | __GFP_NORETRY, order);
-			if (pages[i])
-				break;
-		}
-
-		if (!pages[i]) {
-			/*
-			 * Fall back to single page allocation.
-			 * Might invoke OOM killer as last resort.
-			 */
-			pages[i] = alloc_pages(gfp, 0);
-			if (!pages[i])
-				goto error;
-		}
-
-		if (order) {
-			split_page(pages[i], order);
-			j = 1 << order;
-			while (--j)
-				pages[i + j] = pages[i] + j;
-		}
+		pages[i] = alloc_pages(gfp, 0);
+		if (!pages[i])
+			goto error;
 
-		__dma_clear_buffer(pages[i], PAGE_SIZE << order);
-		i += 1 << order;
-		count -= 1 << order;
+		__dma_clear_buffer(pages[i], PAGE_SIZE);
+		i += 1;
+		count -= 1;
 	}
 
 	return pages;
-- 
2.6.0.rc2.230.g3dd15c0

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-17 20:30 ` Douglas Anderson
@ 2015-12-17 22:31   ` Doug Anderson
  -1 siblings, 0 replies; 22+ messages in thread
From: Doug Anderson @ 2015-12-17 22:31 UTC (permalink / raw)
  To: Russell King
  Cc: Tomasz Figa, Marek Szyprowski, Pawel Osciak, Dmitry Torokhov,
	Douglas Anderson, Will Deacon, Andrew Morton, David Rientjes,
	Carlo Caione, Laurent Pinchart, mike.looijmans, lorenx4,
	linux-arm-kernel, linux-kernel

Hi,

On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
<dianders@chromium.org> wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to be efficient about this
> by starting out allocating large chunks and then moving to smaller and
> smaller chunk sizes until it succeeds.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>   - Try to allocate order 10 (no retry)
>   - Try to allocate order 9 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>   - Try to allocate order 9 (no retry)
>   - Try to allocate order 8 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>   => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 24 seconds to allocate 4 megs of memory (!!).
>
> A second problem (and maybe even more important) is that allocating big
> chunks when we don't need them is just not a good idea anyway.  The
> first thing we do with these big chunks is break them into smaller
> chunks!  If we allocate small chunks:
> - The memory manager doesn't need to work so hard to give us big chunks.
> - We can save the big chunks for those that really need them and this
>   code can make great use of all the small chunks sitting around.
>
> Let's simplify by just allocating one page at a time.  We may make more
> total allocate calls but it works way better.  In real world tests that
> used to sometimes see a 24 second allocation call I can now see at most
> 250 ms.

Off-list I talked to Dmitry about this a little bit and he pointed out
that contiguous chunks actually give a benefit to the IOMMU.  I don't
think the benefit outweighs the cost in this case, but I'm happy to
hear what others have to say.  I did some quick printouts and it turns
out that even when requesting page at a time the memory manager
(unsurprisingly) can in many cases still give us pages that are
contiguous.

Also I'm happy to post up
<https://chromium-review.googlesource.com/#/c/319210/> which sorts the
array and could possibly give us larger chunks of contiguous memory.


-Doug

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-17 22:31   ` Doug Anderson
  0 siblings, 0 replies; 22+ messages in thread
From: Doug Anderson @ 2015-12-17 22:31 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
<dianders@chromium.org> wrote:
> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> sizeable buffers.  Upon simple tests of video I saw it trying to
> allocate 4,194,304 bytes.  The function tries to be efficient about this
> by starting out allocating large chunks and then moving to smaller and
> smaller chunk sizes until it succeeds.
>
> The current function is very, very slow.
>
> One problem is the way it keeps trying and trying to allocate big
> chunks.  Imagine a very fragmented memory that has 4M free but no
> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> We'll do the following memory allocations:
> - For page 1:
>   - Try to allocate order 10 (no retry)
>   - Try to allocate order 9 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - For page 2:
>   - Try to allocate order 9 (no retry)
>   - Try to allocate order 8 (no retry)
>   - ...
>   - Try to allocate order 0 (with retry, but not needed)
> - ...
> - ...
>
> Total number of calls to alloc() calls for this case is:
>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>   => 9228
>
> The above is obviously worse case, but given how slow alloc can be we
> really want to try to avoid even somewhat bad cases.  I timed the old
> code with a device under memory pressure and it wasn't hard to see it
> take more than 24 seconds to allocate 4 megs of memory (!!).
>
> A second problem (and maybe even more important) is that allocating big
> chunks when we don't need them is just not a good idea anyway.  The
> first thing we do with these big chunks is break them into smaller
> chunks!  If we allocate small chunks:
> - The memory manager doesn't need to work so hard to give us big chunks.
> - We can save the big chunks for those that really need them and this
>   code can make great use of all the small chunks sitting around.
>
> Let's simplify by just allocating one page at a time.  We may make more
> total allocate calls but it works way better.  In real world tests that
> used to sometimes see a 24 second allocation call I can now see at most
> 250 ms.

Off-list I talked to Dmitry about this a little bit and he pointed out
that contiguous chunks actually give a benefit to the IOMMU.  I don't
think the benefit outweighs the cost in this case, but I'm happy to
hear what others have to say.  I did some quick printouts and it turns
out that even when requesting page at a time the memory manager
(unsurprisingly) can in many cases still give us pages that are
contiguous.

Also I'm happy to post up
<https://chromium-review.googlesource.com/#/c/319210/> which sorts the
array and could possibly give us larger chunks of contiguous memory.


-Doug

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-17 22:31   ` Doug Anderson
@ 2015-12-18  6:05     ` Tomasz Figa
  -1 siblings, 0 replies; 22+ messages in thread
From: Tomasz Figa @ 2015-12-18  6:05 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Russell King, Marek Szyprowski, Pawel Osciak, Dmitry Torokhov,
	Will Deacon, Andrew Morton, David Rientjes, Carlo Caione,
	Laurent Pinchart, mike.looijmans, lorenx4, linux-arm-kernel,
	linux-kernel

On Fri, Dec 18, 2015 at 7:31 AM, Doug Anderson <dianders@chromium.org> wrote:
> Hi,
>
> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
> <dianders@chromium.org> wrote:
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>> by starting out allocating large chunks and then moving to smaller and
>> smaller chunk sizes until it succeeds.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>   - Try to allocate order 10 (no retry)
>>   - Try to allocate order 9 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>   - Try to allocate order 9 (no retry)
>>   - Try to allocate order 8 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() calls for this case is:
>>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>   => 9228
>>
>> The above is obviously worse case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases.  I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>
>> A second problem (and maybe even more important) is that allocating big
>> chunks when we don't need them is just not a good idea anyway.  The
>> first thing we do with these big chunks is break them into smaller
>> chunks!  If we allocate small chunks:
>> - The memory manager doesn't need to work so hard to give us big chunks.
>> - We can save the big chunks for those that really need them and this
>>   code can make great use of all the small chunks sitting around.
>>
>> Let's simplify by just allocating one page at a time.  We may make more
>> total allocate calls but it works way better.  In real world tests that
>> used to sometimes see a 24 second allocation call I can now see at most
>> 250 ms.
>
> Off-list I talked to Dmitry about this a little bit and he pointed out
> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> think the benefit outweighs the cost in this case, but I'm happy to
> hear what others have to say.

Yeah, I'd like to see some discussion about the effect of allocating
bigger chunks on IOMMU performance. Dmitry (on CC), could you
elaborate a bit on what Doug mentioned?

As for my own understanding, some IOMMUs can map memory using big
pages, which should improve TLB efficiency and so look-up speed.
However AFAICT current implementation of allocating function doesn't
allocate the chunks properly, because there is no guarantee that
particular chunks are aligned on big page boundary. For example, it
might happen that we allocate first chunk of order 0, then second
chunk of order 4 (64KiB - typical big page), then we won't be able to
map the second chunk using a big page, because the IOVA at that point
will not be aligned properly.

Is there any other case when bigger physically contiguous chunks can
help the IOMMU?

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-18  6:05     ` Tomasz Figa
  0 siblings, 0 replies; 22+ messages in thread
From: Tomasz Figa @ 2015-12-18  6:05 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 18, 2015 at 7:31 AM, Doug Anderson <dianders@chromium.org> wrote:
> Hi,
>
> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
> <dianders@chromium.org> wrote:
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>> by starting out allocating large chunks and then moving to smaller and
>> smaller chunk sizes until it succeeds.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>   - Try to allocate order 10 (no retry)
>>   - Try to allocate order 9 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>   - Try to allocate order 9 (no retry)
>>   - Try to allocate order 8 (no retry)
>>   - ...
>>   - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() calls for this case is:
>>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>   => 9228
>>
>> The above is obviously worse case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases.  I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>
>> A second problem (and maybe even more important) is that allocating big
>> chunks when we don't need them is just not a good idea anyway.  The
>> first thing we do with these big chunks is break them into smaller
>> chunks!  If we allocate small chunks:
>> - The memory manager doesn't need to work so hard to give us big chunks.
>> - We can save the big chunks for those that really need them and this
>>   code can make great use of all the small chunks sitting around.
>>
>> Let's simplify by just allocating one page at a time.  We may make more
>> total allocate calls but it works way better.  In real world tests that
>> used to sometimes see a 24 second allocation call I can now see at most
>> 250 ms.
>
> Off-list I talked to Dmitry about this a little bit and he pointed out
> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> think the benefit outweighs the cost in this case, but I'm happy to
> hear what others have to say.

Yeah, I'd like to see some discussion about the effect of allocating
bigger chunks on IOMMU performance. Dmitry (on CC), could you
elaborate a bit on what Doug mentioned?

As for my own understanding, some IOMMUs can map memory using big
pages, which should improve TLB efficiency and so look-up speed.
However AFAICT current implementation of allocating function doesn't
allocate the chunks properly, because there is no guarantee that
particular chunks are aligned on big page boundary. For example, it
might happen that we allocate first chunk of order 0, then second
chunk of order 4 (64KiB - typical big page), then we won't be able to
map the second chunk using a big page, because the IOVA at that point
will not be aligned properly.

Is there any other case when bigger physically contiguous chunks can
help the IOMMU?

Best regards,
Tomasz

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-17 22:31   ` Doug Anderson
@ 2015-12-18 12:41     ` Robin Murphy
  -1 siblings, 0 replies; 22+ messages in thread
From: Robin Murphy @ 2015-12-18 12:41 UTC (permalink / raw)
  To: Doug Anderson, Russell King
  Cc: Laurent Pinchart, linux-kernel, Pawel Osciak, mike.looijmans,
	lorenx4, Dmitry Torokhov, Will Deacon, Tomasz Figa,
	David Rientjes, Carlo Caione, Andrew Morton, linux-arm-kernel,
	Marek Szyprowski

Hi Doug,

On 17/12/15 22:31, Doug Anderson wrote:
> Hi,
>
> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
> <dianders@chromium.org> wrote:
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>> by starting out allocating large chunks and then moving to smaller and
>> smaller chunk sizes until it succeeds.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>    - Try to allocate order 10 (no retry)
>>    - Try to allocate order 9 (no retry)
>>    - ...
>>    - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>    - Try to allocate order 9 (no retry)
>>    - Try to allocate order 8 (no retry)
>>    - ...
>>    - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() calls for this case is:
>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>    => 9228
>>
>> The above is obviously worse case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases.  I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>
>> A second problem (and maybe even more important) is that allocating big
>> chunks when we don't need them is just not a good idea anyway.  The
>> first thing we do with these big chunks is break them into smaller
>> chunks!  If we allocate small chunks:
>> - The memory manager doesn't need to work so hard to give us big chunks.
>> - We can save the big chunks for those that really need them and this
>>    code can make great use of all the small chunks sitting around.
>>
>> Let's simplify by just allocating one page at a time.  We may make more
>> total allocate calls but it works way better.  In real world tests that
>> used to sometimes see a 24 second allocation call I can now see at most
>> 250 ms.
>
> Off-list I talked to Dmitry about this a little bit and he pointed out
> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> think the benefit outweighs the cost in this case, but I'm happy to
> hear what others have to say.  I did some quick printouts and it turns
> out that even when requesting page at a time the memory manager
> (unsurprisingly) can in many cases still give us pages that are
> contiguous.
>
> Also I'm happy to post up
> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
> array and could possibly give us larger chunks of contiguous memory.

I think sorting individually-allocated pages really isn't worth the 
effort - I'm not aware of anything that's going to be capable of using 
larger page/section mappings without also having the necessary physical 
alignment, and if you _can_ cobble together, say, 2MB worth of 
contiguous pages *at 2MB alignment*, then you would have been far better 
off just asking the slab allocator for that in the first place.

That's the key point of the higher-order allocation - not that you get 
some contiguous pages, but that the region you get is also naturally 
aligned to its size physically. That we break up the CPU page tables for 
that region into individual pages is just an inconsequential 
implementation detail from the IOMMU side. When you _do_ have plenty of 
unfragmented free memory it can really be a big win - here's an 
instrumented example of what happens on my Juno with the ARM HDLCD/SMMU 
combo setting up a framebuffer at boot time:


   iommu_dma_alloc: alloc size 0x753000, 1875 pages
   __iommu_dma_alloc_pages: allocated at order 10
   __iommu_dma_alloc_pages: allocated at order 9
   __iommu_dma_alloc_pages: allocated at order 8
   __iommu_dma_alloc_pages: allocated at order 6
   __iommu_dma_alloc_pages: allocated at order 4
   __iommu_dma_alloc_pages: allocated at order 1
   __iommu_dma_alloc_pages: allocated at order 0
   iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
   iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
   iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
   iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
   iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
   iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
   iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
   iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
   iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
   iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
   ...

Since the IOVA region itself is aligned to 8MB (for the total size) and 
the physical regions come out in optimal decreasing order, we're able to 
map over 80% of the whole buffer with just 3 section entries, with a 
corresponding saving on TLB pressure, page table maintenance (cache 
flushing), etc.

That said, unless you're in the middle of some crazy allocator-thrashing 
race, then it's probably safe to assume that once allocation fails at a 
given order that's going to remain the case in the near future - would 
you mind taking the following diff for a spin under your test conditions 
to see how it compares?

Robin.

----->8-----
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index dfb5001..95e75c4 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct 
device *dev, size_t size,
         int count = size >> PAGE_SHIFT;
         int array_size = count * sizeof(struct page *);
         int i = 0;
+       unsigned int order = MAX_ORDER;

         if (array_size <= PAGE_SIZE)
                 pages = kzalloc(array_size, GFP_KERNEL);
@@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct 
device *dev, size_t size,
         gfp |= __GFP_NOWARN | __GFP_HIGHMEM;

         while (count) {
-               int j, order;
+               int j;

-               for (order = __fls(count); order > 0; --order) {
+               for (order = min_t(unsigned int, order, __fls(count));
+                    order > 0; --order) {
                         /*
                          * We do not want OOM killer to be invoked as long
                          * as we can fall back to single pages, so we force


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-18 12:41     ` Robin Murphy
  0 siblings, 0 replies; 22+ messages in thread
From: Robin Murphy @ 2015-12-18 12:41 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Doug,

On 17/12/15 22:31, Doug Anderson wrote:
> Hi,
>
> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
> <dianders@chromium.org> wrote:
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>> by starting out allocating large chunks and then moving to smaller and
>> smaller chunk sizes until it succeeds.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>    - Try to allocate order 10 (no retry)
>>    - Try to allocate order 9 (no retry)
>>    - ...
>>    - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>    - Try to allocate order 9 (no retry)
>>    - Try to allocate order 8 (no retry)
>>    - ...
>>    - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() calls for this case is:
>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>    => 9228
>>
>> The above is obviously worse case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases.  I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>
>> A second problem (and maybe even more important) is that allocating big
>> chunks when we don't need them is just not a good idea anyway.  The
>> first thing we do with these big chunks is break them into smaller
>> chunks!  If we allocate small chunks:
>> - The memory manager doesn't need to work so hard to give us big chunks.
>> - We can save the big chunks for those that really need them and this
>>    code can make great use of all the small chunks sitting around.
>>
>> Let's simplify by just allocating one page at a time.  We may make more
>> total allocate calls but it works way better.  In real world tests that
>> used to sometimes see a 24 second allocation call I can now see at most
>> 250 ms.
>
> Off-list I talked to Dmitry about this a little bit and he pointed out
> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> think the benefit outweighs the cost in this case, but I'm happy to
> hear what others have to say.  I did some quick printouts and it turns
> out that even when requesting page at a time the memory manager
> (unsurprisingly) can in many cases still give us pages that are
> contiguous.
>
> Also I'm happy to post up
> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
> array and could possibly give us larger chunks of contiguous memory.

I think sorting individually-allocated pages really isn't worth the 
effort - I'm not aware of anything that's going to be capable of using 
larger page/section mappings without also having the necessary physical 
alignment, and if you _can_ cobble together, say, 2MB worth of 
contiguous pages *at 2MB alignment*, then you would have been far better 
off just asking the slab allocator for that in the first place.

That's the key point of the higher-order allocation - not that you get 
some contiguous pages, but that the region you get is also naturally 
aligned to its size physically. That we break up the CPU page tables for 
that region into individual pages is just an inconsequential 
implementation detail from the IOMMU side. When you _do_ have plenty of 
unfragmented free memory it can really be a big win - here's an 
instrumented example of what happens on my Juno with the ARM HDLCD/SMMU 
combo setting up a framebuffer at boot time:


   iommu_dma_alloc: alloc size 0x753000, 1875 pages
   __iommu_dma_alloc_pages: allocated at order 10
   __iommu_dma_alloc_pages: allocated at order 9
   __iommu_dma_alloc_pages: allocated at order 8
   __iommu_dma_alloc_pages: allocated at order 6
   __iommu_dma_alloc_pages: allocated at order 4
   __iommu_dma_alloc_pages: allocated at order 1
   __iommu_dma_alloc_pages: allocated at order 0
   iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
   iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
   iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
   iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
   iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
   iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
   iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
   iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
   iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
   iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
   ...

Since the IOVA region itself is aligned to 8MB (for the total size) and 
the physical regions come out in optimal decreasing order, we're able to 
map over 80% of the whole buffer with just 3 section entries, with a 
corresponding saving on TLB pressure, page table maintenance (cache 
flushing), etc.

That said, unless you're in the middle of some crazy allocator-thrashing 
race, then it's probably safe to assume that once allocation fails at a 
given order that's going to remain the case in the near future - would 
you mind taking the following diff for a spin under your test conditions 
to see how it compares?

Robin.

----->8-----
diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index dfb5001..95e75c4 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct 
device *dev, size_t size,
         int count = size >> PAGE_SHIFT;
         int array_size = count * sizeof(struct page *);
         int i = 0;
+       unsigned int order = MAX_ORDER;

         if (array_size <= PAGE_SIZE)
                 pages = kzalloc(array_size, GFP_KERNEL);
@@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct 
device *dev, size_t size,
         gfp |= __GFP_NOWARN | __GFP_HIGHMEM;

         while (count) {
-               int j, order;
+               int j;

-               for (order = __fls(count); order > 0; --order) {
+               for (order = min_t(unsigned int, order, __fls(count));
+                    order > 0; --order) {
                         /*
                          * We do not want OOM killer to be invoked as long
                          * as we can fall back to single pages, so we force

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-18 12:41     ` Robin Murphy
@ 2015-12-18 14:38       ` Marek Szyprowski
  -1 siblings, 0 replies; 22+ messages in thread
From: Marek Szyprowski @ 2015-12-18 14:38 UTC (permalink / raw)
  To: Robin Murphy, Doug Anderson, Russell King
  Cc: Laurent Pinchart, linux-kernel, Pawel Osciak, mike.looijmans,
	lorenx4, Dmitry Torokhov, Will Deacon, Tomasz Figa,
	David Rientjes, Carlo Caione, Andrew Morton, linux-arm-kernel

Hello,

On 2015-12-18 13:41, Robin Murphy wrote:
> Hi Doug,
>
> On 17/12/15 22:31, Doug Anderson wrote:
>> Hi,
>>
>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
>> <dianders@chromium.org> wrote:
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to be efficient about 
>>> this
>>> by starting out allocating large chunks and then moving to smaller and
>>> smaller chunk sizes until it succeeds.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>    - Try to allocate order 10 (no retry)
>>>    - Try to allocate order 9 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>    - Try to allocate order 9 (no retry)
>>>    - Try to allocate order 8 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() calls for this case is:
>>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>    => 9228
>>>
>>> The above is obviously worse case, but given how slow alloc can be we
>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>> code with a device under memory pressure and it wasn't hard to see it
>>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>>
>>> A second problem (and maybe even more important) is that allocating big
>>> chunks when we don't need them is just not a good idea anyway.  The
>>> first thing we do with these big chunks is break them into smaller
>>> chunks!  If we allocate small chunks:
>>> - The memory manager doesn't need to work so hard to give us big 
>>> chunks.
>>> - We can save the big chunks for those that really need them and this
>>>    code can make great use of all the small chunks sitting around.
>>>
>>> Let's simplify by just allocating one page at a time.  We may make more
>>> total allocate calls but it works way better.  In real world tests that
>>> used to sometimes see a 24 second allocation call I can now see at most
>>> 250 ms.
>>
>> Off-list I talked to Dmitry about this a little bit and he pointed out
>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
>> think the benefit outweighs the cost in this case, but I'm happy to
>> hear what others have to say.  I did some quick printouts and it turns
>> out that even when requesting page at a time the memory manager
>> (unsurprisingly) can in many cases still give us pages that are
>> contiguous.
>>
>> Also I'm happy to post up
>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
>> array and could possibly give us larger chunks of contiguous memory.
>
> I think sorting individually-allocated pages really isn't worth the 
> effort - I'm not aware of anything that's going to be capable of using 
> larger page/section mappings without also having the necessary 
> physical alignment, and if you _can_ cobble together, say, 2MB worth 
> of contiguous pages *at 2MB alignment*, then you would have been far 
> better off just asking the slab allocator for that in the first place.
>
> That's the key point of the higher-order allocation - not that you get 
> some contiguous pages, but that the region you get is also naturally 
> aligned to its size physically. That we break up the CPU page tables 
> for that region into individual pages is just an inconsequential 
> implementation detail from the IOMMU side. When you _do_ have plenty 
> of unfragmented free memory it can really be a big win - here's an 
> instrumented example of what happens on my Juno with the ARM 
> HDLCD/SMMU combo setting up a framebuffer at boot time:
>
>
>   iommu_dma_alloc: alloc size 0x753000, 1875 pages
>   __iommu_dma_alloc_pages: allocated at order 10
>   __iommu_dma_alloc_pages: allocated at order 9
>   __iommu_dma_alloc_pages: allocated at order 8
>   __iommu_dma_alloc_pages: allocated at order 6
>   __iommu_dma_alloc_pages: allocated at order 4
>   __iommu_dma_alloc_pages: allocated at order 1
>   __iommu_dma_alloc_pages: allocated at order 0
>   iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
>   iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
>   iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
>   iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
>   iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
>   iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
>   iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
>   iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
>   iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
>   iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
>   ...
>
> Since the IOVA region itself is aligned to 8MB (for the total size) 
> and the physical regions come out in optimal decreasing order, we're 
> able to map over 80% of the whole buffer with just 3 section entries, 
> with a corresponding saving on TLB pressure, page table maintenance 
> (cache flushing), etc.
>
> That said, unless you're in the middle of some crazy 
> allocator-thrashing race, then it's probably safe to assume that once 
> allocation fails at a given order that's going to remain the case in 
> the near future

Right, this is reasonable improvement. It should reduce the tries of higher
order.

Please note that mapping memory with larger pages significantly improves
performance, especially when IOMMU has a little TLB cache. This can be 
easily
observed when multimedia devices do processing of RGB data with 90/270 
degree
rotation. However I didn't notice much improvement between 2MiB and 64KiB
mappings.

Maybe it will be enough start trying from 64KiB instead of MAX_ORDER?


> - would you mind taking the following diff for a spin under your test 
> conditions to see how it compares?
>
> Robin.
>
> ----->8-----
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index dfb5001..95e75c4 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct 
> device *dev, size_t size,
>         int count = size >> PAGE_SHIFT;
>         int array_size = count * sizeof(struct page *);
>         int i = 0;
> +       unsigned int order = MAX_ORDER;
>
>         if (array_size <= PAGE_SIZE)
>                 pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1160,9 +1161,10 @@ static struct page 
> **__iommu_alloc_buffer(struct device *dev, size_t size,
>         gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>
>         while (count) {
> -               int j, order;
> +               int j;
>
> -               for (order = __fls(count); order > 0; --order) {
> +               for (order = min_t(unsigned int, order, __fls(count));
> +                    order > 0; --order) {
>                         /*
>                          * We do not want OOM killer to be invoked as 
> long
>                          * as we can fall back to single pages, so we 
> force
>
>
>

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-18 14:38       ` Marek Szyprowski
  0 siblings, 0 replies; 22+ messages in thread
From: Marek Szyprowski @ 2015-12-18 14:38 UTC (permalink / raw)
  To: linux-arm-kernel

Hello,

On 2015-12-18 13:41, Robin Murphy wrote:
> Hi Doug,
>
> On 17/12/15 22:31, Doug Anderson wrote:
>> Hi,
>>
>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
>> <dianders@chromium.org> wrote:
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to be efficient about 
>>> this
>>> by starting out allocating large chunks and then moving to smaller and
>>> smaller chunk sizes until it succeeds.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>    - Try to allocate order 10 (no retry)
>>>    - Try to allocate order 9 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>    - Try to allocate order 9 (no retry)
>>>    - Try to allocate order 8 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() calls for this case is:
>>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>    => 9228
>>>
>>> The above is obviously worse case, but given how slow alloc can be we
>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>> code with a device under memory pressure and it wasn't hard to see it
>>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>>
>>> A second problem (and maybe even more important) is that allocating big
>>> chunks when we don't need them is just not a good idea anyway.  The
>>> first thing we do with these big chunks is break them into smaller
>>> chunks!  If we allocate small chunks:
>>> - The memory manager doesn't need to work so hard to give us big 
>>> chunks.
>>> - We can save the big chunks for those that really need them and this
>>>    code can make great use of all the small chunks sitting around.
>>>
>>> Let's simplify by just allocating one page at a time.  We may make more
>>> total allocate calls but it works way better.  In real world tests that
>>> used to sometimes see a 24 second allocation call I can now see at most
>>> 250 ms.
>>
>> Off-list I talked to Dmitry about this a little bit and he pointed out
>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
>> think the benefit outweighs the cost in this case, but I'm happy to
>> hear what others have to say.  I did some quick printouts and it turns
>> out that even when requesting page at a time the memory manager
>> (unsurprisingly) can in many cases still give us pages that are
>> contiguous.
>>
>> Also I'm happy to post up
>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
>> array and could possibly give us larger chunks of contiguous memory.
>
> I think sorting individually-allocated pages really isn't worth the 
> effort - I'm not aware of anything that's going to be capable of using 
> larger page/section mappings without also having the necessary 
> physical alignment, and if you _can_ cobble together, say, 2MB worth 
> of contiguous pages *at 2MB alignment*, then you would have been far 
> better off just asking the slab allocator for that in the first place.
>
> That's the key point of the higher-order allocation - not that you get 
> some contiguous pages, but that the region you get is also naturally 
> aligned to its size physically. That we break up the CPU page tables 
> for that region into individual pages is just an inconsequential 
> implementation detail from the IOMMU side. When you _do_ have plenty 
> of unfragmented free memory it can really be a big win - here's an 
> instrumented example of what happens on my Juno with the ARM 
> HDLCD/SMMU combo setting up a framebuffer at boot time:
>
>
>   iommu_dma_alloc: alloc size 0x753000, 1875 pages
>   __iommu_dma_alloc_pages: allocated at order 10
>   __iommu_dma_alloc_pages: allocated at order 9
>   __iommu_dma_alloc_pages: allocated at order 8
>   __iommu_dma_alloc_pages: allocated at order 6
>   __iommu_dma_alloc_pages: allocated at order 4
>   __iommu_dma_alloc_pages: allocated at order 1
>   __iommu_dma_alloc_pages: allocated at order 0
>   iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
>   iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
>   iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
>   iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
>   iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
>   iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
>   iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
>   iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
>   iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
>   iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
>   ...
>
> Since the IOVA region itself is aligned to 8MB (for the total size) 
> and the physical regions come out in optimal decreasing order, we're 
> able to map over 80% of the whole buffer with just 3 section entries, 
> with a corresponding saving on TLB pressure, page table maintenance 
> (cache flushing), etc.
>
> That said, unless you're in the middle of some crazy 
> allocator-thrashing race, then it's probably safe to assume that once 
> allocation fails at a given order that's going to remain the case in 
> the near future

Right, this is reasonable improvement. It should reduce the tries of higher
order.

Please note that mapping memory with larger pages significantly improves
performance, especially when IOMMU has a little TLB cache. This can be 
easily
observed when multimedia devices do processing of RGB data with 90/270 
degree
rotation. However I didn't notice much improvement between 2MiB and 64KiB
mappings.

Maybe it will be enough start trying from 64KiB instead of MAX_ORDER?


> - would you mind taking the following diff for a spin under your test 
> conditions to see how it compares?
>
> Robin.
>
> ----->8-----
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index dfb5001..95e75c4 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct 
> device *dev, size_t size,
>         int count = size >> PAGE_SHIFT;
>         int array_size = count * sizeof(struct page *);
>         int i = 0;
> +       unsigned int order = MAX_ORDER;
>
>         if (array_size <= PAGE_SIZE)
>                 pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1160,9 +1161,10 @@ static struct page 
> **__iommu_alloc_buffer(struct device *dev, size_t size,
>         gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>
>         while (count) {
> -               int j, order;
> +               int j;
>
> -               for (order = __fls(count); order > 0; --order) {
> +               for (order = min_t(unsigned int, order, __fls(count));
> +                    order > 0; --order) {
>                         /*
>                          * We do not want OOM killer to be invoked as 
> long
>                          * as we can fall back to single pages, so we 
> force
>
>
>

Best regards
-- 
Marek Szyprowski, PhD
Samsung R&D Institute Poland

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-18 12:41     ` Robin Murphy
@ 2015-12-18 18:55       ` Doug Anderson
  -1 siblings, 0 replies; 22+ messages in thread
From: Doug Anderson @ 2015-12-18 18:55 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Russell King, Laurent Pinchart, linux-kernel, Pawel Osciak,
	mike.looijmans, Lorenzo Nava, Dmitry Torokhov, Will Deacon,
	Tomasz Figa, David Rientjes, Carlo Caione, Andrew Morton,
	linux-arm-kernel, Marek Szyprowski

Robin,

On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Doug,
>
>
> On 17/12/15 22:31, Doug Anderson wrote:
>>
>> Hi,
>>
>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
>> <dianders@chromium.org> wrote:
>>>
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>>> by starting out allocating large chunks and then moving to smaller and
>>> smaller chunk sizes until it succeeds.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>    - Try to allocate order 10 (no retry)
>>>    - Try to allocate order 9 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>    - Try to allocate order 9 (no retry)
>>>    - Try to allocate order 8 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() calls for this case is:
>>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>    => 9228
>>>
>>> The above is obviously worse case, but given how slow alloc can be we
>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>> code with a device under memory pressure and it wasn't hard to see it
>>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>>
>>> A second problem (and maybe even more important) is that allocating big
>>> chunks when we don't need them is just not a good idea anyway.  The
>>> first thing we do with these big chunks is break them into smaller
>>> chunks!  If we allocate small chunks:
>>> - The memory manager doesn't need to work so hard to give us big chunks.
>>> - We can save the big chunks for those that really need them and this
>>>    code can make great use of all the small chunks sitting around.
>>>
>>> Let's simplify by just allocating one page at a time.  We may make more
>>> total allocate calls but it works way better.  In real world tests that
>>> used to sometimes see a 24 second allocation call I can now see at most
>>> 250 ms.

One thing to note is that testing yesterday I actually managed to
reproduce an allocation taking 120 seconds (!) with the old code.


>> Off-list I talked to Dmitry about this a little bit and he pointed out
>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
>> think the benefit outweighs the cost in this case, but I'm happy to
>> hear what others have to say.  I did some quick printouts and it turns
>> out that even when requesting page at a time the memory manager
>> (unsurprisingly) can in many cases still give us pages that are
>> contiguous.
>>
>> Also I'm happy to post up
>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
>> array and could possibly give us larger chunks of contiguous memory.
>
>
> I think sorting individually-allocated pages really isn't worth the effort -
> I'm not aware of anything that's going to be capable of using larger
> page/section mappings without also having the necessary physical alignment,
> and if you _can_ cobble together, say, 2MB worth of contiguous pages *at 2MB
> alignment*, then you would have been far better off just asking the slab
> allocator for that in the first place.
>
> That's the key point of the higher-order allocation - not that you get some
> contiguous pages, but that the region you get is also naturally aligned to
> its size physically. That we break up the CPU page tables for that region
> into individual pages is just an inconsequential implementation detail from
> the IOMMU side. When you _do_ have plenty of unfragmented free memory it can
> really be a big win - here's an instrumented example of what happens on my
> Juno with the ARM HDLCD/SMMU combo setting up a framebuffer at boot time:
>
>
>   iommu_dma_alloc: alloc size 0x753000, 1875 pages
>   __iommu_dma_alloc_pages: allocated at order 10
>   __iommu_dma_alloc_pages: allocated at order 9
>   __iommu_dma_alloc_pages: allocated at order 8
>   __iommu_dma_alloc_pages: allocated at order 6
>   __iommu_dma_alloc_pages: allocated at order 4
>   __iommu_dma_alloc_pages: allocated at order 1
>   __iommu_dma_alloc_pages: allocated at order 0
>   iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
>   iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
>   iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
>   iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
>   iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
>   iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
>   iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
>   iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
>   iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
>   iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
>   ...
>
> Since the IOVA region itself is aligned to 8MB (for the total size) and the
> physical regions come out in optimal decreasing order, we're able to map
> over 80% of the whole buffer with just 3 section entries, with a
> corresponding saving on TLB pressure, page table maintenance (cache
> flushing), etc.
>
> That said, unless you're in the middle of some crazy allocator-thrashing
> race, then it's probably safe to assume that once allocation fails at a
> given order that's going to remain the case in the near future - would you
> mind taking the following diff for a spin under your test conditions to see
> how it compares?
>
> Robin.
>
> ----->8-----
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index dfb5001..95e75c4 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct
> device *dev, size_t size,
>         int count = size >> PAGE_SHIFT;
>         int array_size = count * sizeof(struct page *);
>         int i = 0;
> +       unsigned int order = MAX_ORDER;
>
>         if (array_size <= PAGE_SIZE)
>                 pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct
> device *dev, size_t size,
>         gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>
>         while (count) {
> -               int j, order;
> +               int j;
>
> -               for (order = __fls(count); order > 0; --order) {
> +               for (order = min_t(unsigned int, order, __fls(count));
> +                    order > 0; --order) {

Yeah, I'd been playing with things like that, though not that exact patch.

I just tried it now.  As should be obvious, it certainly makes a
DRASTIC improvement in things but it still has some downsides as
compared to my patch.

1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
seconds.  I can no longer reproduce the 24 second or 120 second
allocation call, but I still see things like "alloc 4194304 bytes:
3208093877 ns" (AKA an allocation taking > 3 seconds).  That's
compared with 250 ms max with my patch.

2. We still have the same problem that we're taking away all the
contiguous memory that other users may want.  I've got a dwc2 USB
controller in my system and it needs to allocate bounce buffers for
its DMA.  While looking at cat videos on Facebook and running a
program to simulate memory pressure (4 userspace programs each walking
through 350 Megs of memory over and over) I start seeing lots of order
3 allocation failures in dwc2.  It's true that the USB/network stack
is resilient against these allocation failures (other than spamming my
log), but performance will decrease.  When I switch to WiFi I suddenly
start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
performance.



I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
we don't try for > 64K chunks.  This is might be a reasonable
compromise.  My cat video test still reproduces "alloc 4194304 bytes:
674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
all the large chunks of memory that everyone else would like to have.


Oh, or how about this: we start allocating of order 4.  Upon the first
failure we jump to order 1.  AKA: if there's no memory pressure we're
golden.  The moment we have the first bit of memory pressure we fold.
That's basically just a slight optimization on Marek's suggestion.  I
still see 450 ms for an allocation, but that's not too bad.  It can
still take away large chunks from other users, but maybe that's OK?


Anyway, I'll plan to send that patch up.  I'll also do a quick test to
see if my "sort()" actually helps anything.

-Doug

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-18 18:55       ` Doug Anderson
  0 siblings, 0 replies; 22+ messages in thread
From: Doug Anderson @ 2015-12-18 18:55 UTC (permalink / raw)
  To: linux-arm-kernel

Robin,

On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Doug,
>
>
> On 17/12/15 22:31, Doug Anderson wrote:
>>
>> Hi,
>>
>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
>> <dianders@chromium.org> wrote:
>>>
>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>>> by starting out allocating large chunks and then moving to smaller and
>>> smaller chunk sizes until it succeeds.
>>>
>>> The current function is very, very slow.
>>>
>>> One problem is the way it keeps trying and trying to allocate big
>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>> We'll do the following memory allocations:
>>> - For page 1:
>>>    - Try to allocate order 10 (no retry)
>>>    - Try to allocate order 9 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - For page 2:
>>>    - Try to allocate order 9 (no retry)
>>>    - Try to allocate order 8 (no retry)
>>>    - ...
>>>    - Try to allocate order 0 (with retry, but not needed)
>>> - ...
>>> - ...
>>>
>>> Total number of calls to alloc() calls for this case is:
>>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>    => 9228
>>>
>>> The above is obviously worse case, but given how slow alloc can be we
>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>> code with a device under memory pressure and it wasn't hard to see it
>>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>>
>>> A second problem (and maybe even more important) is that allocating big
>>> chunks when we don't need them is just not a good idea anyway.  The
>>> first thing we do with these big chunks is break them into smaller
>>> chunks!  If we allocate small chunks:
>>> - The memory manager doesn't need to work so hard to give us big chunks.
>>> - We can save the big chunks for those that really need them and this
>>>    code can make great use of all the small chunks sitting around.
>>>
>>> Let's simplify by just allocating one page at a time.  We may make more
>>> total allocate calls but it works way better.  In real world tests that
>>> used to sometimes see a 24 second allocation call I can now see at most
>>> 250 ms.

One thing to note is that testing yesterday I actually managed to
reproduce an allocation taking 120 seconds (!) with the old code.


>> Off-list I talked to Dmitry about this a little bit and he pointed out
>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
>> think the benefit outweighs the cost in this case, but I'm happy to
>> hear what others have to say.  I did some quick printouts and it turns
>> out that even when requesting page at a time the memory manager
>> (unsurprisingly) can in many cases still give us pages that are
>> contiguous.
>>
>> Also I'm happy to post up
>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
>> array and could possibly give us larger chunks of contiguous memory.
>
>
> I think sorting individually-allocated pages really isn't worth the effort -
> I'm not aware of anything that's going to be capable of using larger
> page/section mappings without also having the necessary physical alignment,
> and if you _can_ cobble together, say, 2MB worth of contiguous pages *at 2MB
> alignment*, then you would have been far better off just asking the slab
> allocator for that in the first place.
>
> That's the key point of the higher-order allocation - not that you get some
> contiguous pages, but that the region you get is also naturally aligned to
> its size physically. That we break up the CPU page tables for that region
> into individual pages is just an inconsequential implementation detail from
> the IOMMU side. When you _do_ have plenty of unfragmented free memory it can
> really be a big win - here's an instrumented example of what happens on my
> Juno with the ARM HDLCD/SMMU combo setting up a framebuffer at boot time:
>
>
>   iommu_dma_alloc: alloc size 0x753000, 1875 pages
>   __iommu_dma_alloc_pages: allocated at order 10
>   __iommu_dma_alloc_pages: allocated at order 9
>   __iommu_dma_alloc_pages: allocated at order 8
>   __iommu_dma_alloc_pages: allocated at order 6
>   __iommu_dma_alloc_pages: allocated at order 4
>   __iommu_dma_alloc_pages: allocated at order 1
>   __iommu_dma_alloc_pages: allocated at order 0
>   iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
>   iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
>   iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
>   iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
>   iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
>   iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
>   iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
>   iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
>   iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
>   iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
>   ...
>
> Since the IOVA region itself is aligned to 8MB (for the total size) and the
> physical regions come out in optimal decreasing order, we're able to map
> over 80% of the whole buffer with just 3 section entries, with a
> corresponding saving on TLB pressure, page table maintenance (cache
> flushing), etc.
>
> That said, unless you're in the middle of some crazy allocator-thrashing
> race, then it's probably safe to assume that once allocation fails at a
> given order that's going to remain the case in the near future - would you
> mind taking the following diff for a spin under your test conditions to see
> how it compares?
>
> Robin.
>
> ----->8-----
> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> index dfb5001..95e75c4 100644
> --- a/arch/arm/mm/dma-mapping.c
> +++ b/arch/arm/mm/dma-mapping.c
> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct
> device *dev, size_t size,
>         int count = size >> PAGE_SHIFT;
>         int array_size = count * sizeof(struct page *);
>         int i = 0;
> +       unsigned int order = MAX_ORDER;
>
>         if (array_size <= PAGE_SIZE)
>                 pages = kzalloc(array_size, GFP_KERNEL);
> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct
> device *dev, size_t size,
>         gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>
>         while (count) {
> -               int j, order;
> +               int j;
>
> -               for (order = __fls(count); order > 0; --order) {
> +               for (order = min_t(unsigned int, order, __fls(count));
> +                    order > 0; --order) {

Yeah, I'd been playing with things like that, though not that exact patch.

I just tried it now.  As should be obvious, it certainly makes a
DRASTIC improvement in things but it still has some downsides as
compared to my patch.

1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
seconds.  I can no longer reproduce the 24 second or 120 second
allocation call, but I still see things like "alloc 4194304 bytes:
3208093877 ns" (AKA an allocation taking > 3 seconds).  That's
compared with 250 ms max with my patch.

2. We still have the same problem that we're taking away all the
contiguous memory that other users may want.  I've got a dwc2 USB
controller in my system and it needs to allocate bounce buffers for
its DMA.  While looking at cat videos on Facebook and running a
program to simulate memory pressure (4 userspace programs each walking
through 350 Megs of memory over and over) I start seeing lots of order
3 allocation failures in dwc2.  It's true that the USB/network stack
is resilient against these allocation failures (other than spamming my
log), but performance will decrease.  When I switch to WiFi I suddenly
start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
performance.



I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
we don't try for > 64K chunks.  This is might be a reasonable
compromise.  My cat video test still reproduces "alloc 4194304 bytes:
674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
all the large chunks of memory that everyone else would like to have.


Oh, or how about this: we start allocating of order 4.  Upon the first
failure we jump to order 1.  AKA: if there's no memory pressure we're
golden.  The moment we have the first bit of memory pressure we fold.
That's basically just a slight optimization on Marek's suggestion.  I
still see 450 ms for an allocation, but that's not too bad.  It can
still take away large chunks from other users, but maybe that's OK?


Anyway, I'll plan to send that patch up.  I'll also do a quick test to
see if my "sort()" actually helps anything.

-Doug

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-18 18:55       ` Doug Anderson
@ 2015-12-18 20:20         ` Robin Murphy
  -1 siblings, 0 replies; 22+ messages in thread
From: Robin Murphy @ 2015-12-18 20:20 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Russell King, Laurent Pinchart, linux-kernel, Pawel Osciak,
	mike.looijmans, Lorenzo Nava, Dmitry Torokhov, Will Deacon,
	Tomasz Figa, David Rientjes, Carlo Caione, Andrew Morton,
	linux-arm-kernel, Marek Szyprowski

On 18/12/15 18:55, Doug Anderson wrote:
> Robin,
>
> On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi Doug,
>>
>>
>> On 17/12/15 22:31, Doug Anderson wrote:
>>>
>>> Hi,
>>>
>>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
>>> <dianders@chromium.org> wrote:
>>>>
>>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>>>> by starting out allocating large chunks and then moving to smaller and
>>>> smaller chunk sizes until it succeeds.
>>>>
>>>> The current function is very, very slow.
>>>>
>>>> One problem is the way it keeps trying and trying to allocate big
>>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>>> We'll do the following memory allocations:
>>>> - For page 1:
>>>>     - Try to allocate order 10 (no retry)
>>>>     - Try to allocate order 9 (no retry)
>>>>     - ...
>>>>     - Try to allocate order 0 (with retry, but not needed)
>>>> - For page 2:
>>>>     - Try to allocate order 9 (no retry)
>>>>     - Try to allocate order 8 (no retry)
>>>>     - ...
>>>>     - Try to allocate order 0 (with retry, but not needed)
>>>> - ...
>>>> - ...
>>>>
>>>> Total number of calls to alloc() calls for this case is:
>>>>     sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>>     => 9228
>>>>
>>>> The above is obviously worse case, but given how slow alloc can be we
>>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>>> code with a device under memory pressure and it wasn't hard to see it
>>>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>>>
>>>> A second problem (and maybe even more important) is that allocating big
>>>> chunks when we don't need them is just not a good idea anyway.  The
>>>> first thing we do with these big chunks is break them into smaller
>>>> chunks!  If we allocate small chunks:
>>>> - The memory manager doesn't need to work so hard to give us big chunks.
>>>> - We can save the big chunks for those that really need them and this
>>>>     code can make great use of all the small chunks sitting around.
>>>>
>>>> Let's simplify by just allocating one page at a time.  We may make more
>>>> total allocate calls but it works way better.  In real world tests that
>>>> used to sometimes see a 24 second allocation call I can now see at most
>>>> 250 ms.
>
> One thing to note is that testing yesterday I actually managed to
> reproduce an allocation taking 120 seconds (!) with the old code.

Yikes! That really is worth avoiding...

>>> Off-list I talked to Dmitry about this a little bit and he pointed out
>>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
>>> think the benefit outweighs the cost in this case, but I'm happy to
>>> hear what others have to say.  I did some quick printouts and it turns
>>> out that even when requesting page at a time the memory manager
>>> (unsurprisingly) can in many cases still give us pages that are
>>> contiguous.
>>>
>>> Also I'm happy to post up
>>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
>>> array and could possibly give us larger chunks of contiguous memory.
>>
>>
>> I think sorting individually-allocated pages really isn't worth the effort -
>> I'm not aware of anything that's going to be capable of using larger
>> page/section mappings without also having the necessary physical alignment,
>> and if you _can_ cobble together, say, 2MB worth of contiguous pages *at 2MB
>> alignment*, then you would have been far better off just asking the slab
>> allocator for that in the first place.
>>
>> That's the key point of the higher-order allocation - not that you get some
>> contiguous pages, but that the region you get is also naturally aligned to
>> its size physically. That we break up the CPU page tables for that region
>> into individual pages is just an inconsequential implementation detail from
>> the IOMMU side. When you _do_ have plenty of unfragmented free memory it can
>> really be a big win - here's an instrumented example of what happens on my
>> Juno with the ARM HDLCD/SMMU combo setting up a framebuffer at boot time:
>>
>>
>>    iommu_dma_alloc: alloc size 0x753000, 1875 pages
>>    __iommu_dma_alloc_pages: allocated at order 10
>>    __iommu_dma_alloc_pages: allocated at order 9
>>    __iommu_dma_alloc_pages: allocated at order 8
>>    __iommu_dma_alloc_pages: allocated at order 6
>>    __iommu_dma_alloc_pages: allocated at order 4
>>    __iommu_dma_alloc_pages: allocated at order 1
>>    __iommu_dma_alloc_pages: allocated at order 0
>>    iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
>>    iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
>>    iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
>>    iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
>>    iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
>>    iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
>>    iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
>>    iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
>>    iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
>>    iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
>>    ...
>>
>> Since the IOVA region itself is aligned to 8MB (for the total size) and the
>> physical regions come out in optimal decreasing order, we're able to map
>> over 80% of the whole buffer with just 3 section entries, with a
>> corresponding saving on TLB pressure, page table maintenance (cache
>> flushing), etc.
>>
>> That said, unless you're in the middle of some crazy allocator-thrashing
>> race, then it's probably safe to assume that once allocation fails at a
>> given order that's going to remain the case in the near future - would you
>> mind taking the following diff for a spin under your test conditions to see
>> how it compares?
>>
>> Robin.
>>
>> ----->8-----
>> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
>> index dfb5001..95e75c4 100644
>> --- a/arch/arm/mm/dma-mapping.c
>> +++ b/arch/arm/mm/dma-mapping.c
>> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct
>> device *dev, size_t size,
>>          int count = size >> PAGE_SHIFT;
>>          int array_size = count * sizeof(struct page *);
>>          int i = 0;
>> +       unsigned int order = MAX_ORDER;
>>
>>          if (array_size <= PAGE_SIZE)
>>                  pages = kzalloc(array_size, GFP_KERNEL);
>> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct
>> device *dev, size_t size,
>>          gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>>
>>          while (count) {
>> -               int j, order;
>> +               int j;
>>
>> -               for (order = __fls(count); order > 0; --order) {
>> +               for (order = min_t(unsigned int, order, __fls(count));
>> +                    order > 0; --order) {
>
> Yeah, I'd been playing with things like that, though not that exact patch.
>
> I just tried it now.  As should be obvious, it certainly makes a
> DRASTIC improvement in things but it still has some downsides as
> compared to my patch.
>
> 1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
> seconds.  I can no longer reproduce the 24 second or 120 second
> allocation call, but I still see things like "alloc 4194304 bytes:
> 3208093877 ns" (AKA an allocation taking > 3 seconds).  That's
> compared with 250 ms max with my patch.

Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps 
instead of just __GFP_NORETRY we should go all the way to clearing 
__GFP_RECLAIM for the opportunistic calls so they really fail fast?

> 2. We still have the same problem that we're taking away all the
> contiguous memory that other users may want.  I've got a dwc2 USB
> controller in my system and it needs to allocate bounce buffers for
> its DMA.  While looking at cat videos on Facebook and running a
> program to simulate memory pressure (4 userspace programs each walking
> through 350 Megs of memory over and over) I start seeing lots of order
> 3 allocation failures in dwc2.  It's true that the USB/network stack
> is resilient against these allocation failures (other than spamming my
> log), but performance will decrease.  When I switch to WiFi I suddenly
> start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
> drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
> performance.
>
>
>
> I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
> we don't try for > 64K chunks.  This is might be a reasonable
> compromise.  My cat video test still reproduces "alloc 4194304 bytes:
> 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
> all the large chunks of memory that everyone else would like to have.
>
>
> Oh, or how about this: we start allocating of order 4.  Upon the first
> failure we jump to order 1.  AKA: if there's no memory pressure we're
> golden.  The moment we have the first bit of memory pressure we fold.
> That's basically just a slight optimization on Marek's suggestion.  I
> still see 450 ms for an allocation, but that's not too bad.  It can
> still take away large chunks from other users, but maybe that's OK?

That makes sense - there's really no benefit to be had from trying 
orders which don't correspond to our relevant IOMMU page sizes - I'm not 
sure off-hand how many contortions you'd have to go through to actually 
get at those from here, although it might be another argument in favour 
of moving the pgsize_bitmap into the iommu_domain as Will proposed some 
time ago. In lieu of an actual lookup, my general inclination would be 
to go 2MB->1MB->64K->4K to cover all the common page sizes, but Marek's 
probably right that the larger two are less relevant in the context of 
mobile graphics stuff, which lets face it is the prime concern for 
IOMMUs on 32-bit ARM.

> Anyway, I'll plan to send that patch up.  I'll also do a quick test to
> see if my "sort()" actually helps anything.

Sounds good. I'm about to disappear off for holidays, but it'll be good 
to see how much you've improved everything when I get back :D

Thanks,
Robin.

>
> -Doug
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-18 20:20         ` Robin Murphy
  0 siblings, 0 replies; 22+ messages in thread
From: Robin Murphy @ 2015-12-18 20:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 18/12/15 18:55, Doug Anderson wrote:
> Robin,
>
> On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy <robin.murphy@arm.com> wrote:
>> Hi Doug,
>>
>>
>> On 17/12/15 22:31, Doug Anderson wrote:
>>>
>>> Hi,
>>>
>>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson
>>> <dianders@chromium.org> wrote:
>>>>
>>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>>>> sizeable buffers.  Upon simple tests of video I saw it trying to
>>>> allocate 4,194,304 bytes.  The function tries to be efficient about this
>>>> by starting out allocating large chunks and then moving to smaller and
>>>> smaller chunk sizes until it succeeds.
>>>>
>>>> The current function is very, very slow.
>>>>
>>>> One problem is the way it keeps trying and trying to allocate big
>>>> chunks.  Imagine a very fragmented memory that has 4M free but no
>>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>>>> We'll do the following memory allocations:
>>>> - For page 1:
>>>>     - Try to allocate order 10 (no retry)
>>>>     - Try to allocate order 9 (no retry)
>>>>     - ...
>>>>     - Try to allocate order 0 (with retry, but not needed)
>>>> - For page 2:
>>>>     - Try to allocate order 9 (no retry)
>>>>     - Try to allocate order 8 (no retry)
>>>>     - ...
>>>>     - Try to allocate order 0 (with retry, but not needed)
>>>> - ...
>>>> - ...
>>>>
>>>> Total number of calls to alloc() calls for this case is:
>>>>     sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>>>     => 9228
>>>>
>>>> The above is obviously worse case, but given how slow alloc can be we
>>>> really want to try to avoid even somewhat bad cases.  I timed the old
>>>> code with a device under memory pressure and it wasn't hard to see it
>>>> take more than 24 seconds to allocate 4 megs of memory (!!).
>>>>
>>>> A second problem (and maybe even more important) is that allocating big
>>>> chunks when we don't need them is just not a good idea anyway.  The
>>>> first thing we do with these big chunks is break them into smaller
>>>> chunks!  If we allocate small chunks:
>>>> - The memory manager doesn't need to work so hard to give us big chunks.
>>>> - We can save the big chunks for those that really need them and this
>>>>     code can make great use of all the small chunks sitting around.
>>>>
>>>> Let's simplify by just allocating one page at a time.  We may make more
>>>> total allocate calls but it works way better.  In real world tests that
>>>> used to sometimes see a 24 second allocation call I can now see at most
>>>> 250 ms.
>
> One thing to note is that testing yesterday I actually managed to
> reproduce an allocation taking 120 seconds (!) with the old code.

Yikes! That really is worth avoiding...

>>> Off-list I talked to Dmitry about this a little bit and he pointed out
>>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
>>> think the benefit outweighs the cost in this case, but I'm happy to
>>> hear what others have to say.  I did some quick printouts and it turns
>>> out that even when requesting page at a time the memory manager
>>> (unsurprisingly) can in many cases still give us pages that are
>>> contiguous.
>>>
>>> Also I'm happy to post up
>>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
>>> array and could possibly give us larger chunks of contiguous memory.
>>
>>
>> I think sorting individually-allocated pages really isn't worth the effort -
>> I'm not aware of anything that's going to be capable of using larger
>> page/section mappings without also having the necessary physical alignment,
>> and if you _can_ cobble together, say, 2MB worth of contiguous pages *at 2MB
>> alignment*, then you would have been far better off just asking the slab
>> allocator for that in the first place.
>>
>> That's the key point of the higher-order allocation - not that you get some
>> contiguous pages, but that the region you get is also naturally aligned to
>> its size physically. That we break up the CPU page tables for that region
>> into individual pages is just an inconsequential implementation detail from
>> the IOMMU side. When you _do_ have plenty of unfragmented free memory it can
>> really be a big win - here's an instrumented example of what happens on my
>> Juno with the ARM HDLCD/SMMU combo setting up a framebuffer at boot time:
>>
>>
>>    iommu_dma_alloc: alloc size 0x753000, 1875 pages
>>    __iommu_dma_alloc_pages: allocated at order 10
>>    __iommu_dma_alloc_pages: allocated at order 9
>>    __iommu_dma_alloc_pages: allocated at order 8
>>    __iommu_dma_alloc_pages: allocated at order 6
>>    __iommu_dma_alloc_pages: allocated at order 4
>>    __iommu_dma_alloc_pages: allocated at order 1
>>    __iommu_dma_alloc_pages: allocated at order 0
>>    iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
>>    iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
>>    iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
>>    iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
>>    iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
>>    iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
>>    iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
>>    iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
>>    iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
>>    iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
>>    ...
>>
>> Since the IOVA region itself is aligned to 8MB (for the total size) and the
>> physical regions come out in optimal decreasing order, we're able to map
>> over 80% of the whole buffer with just 3 section entries, with a
>> corresponding saving on TLB pressure, page table maintenance (cache
>> flushing), etc.
>>
>> That said, unless you're in the middle of some crazy allocator-thrashing
>> race, then it's probably safe to assume that once allocation fails at a
>> given order that's going to remain the case in the near future - would you
>> mind taking the following diff for a spin under your test conditions to see
>> how it compares?
>>
>> Robin.
>>
>> ----->8-----
>> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
>> index dfb5001..95e75c4 100644
>> --- a/arch/arm/mm/dma-mapping.c
>> +++ b/arch/arm/mm/dma-mapping.c
>> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct
>> device *dev, size_t size,
>>          int count = size >> PAGE_SHIFT;
>>          int array_size = count * sizeof(struct page *);
>>          int i = 0;
>> +       unsigned int order = MAX_ORDER;
>>
>>          if (array_size <= PAGE_SIZE)
>>                  pages = kzalloc(array_size, GFP_KERNEL);
>> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct
>> device *dev, size_t size,
>>          gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
>>
>>          while (count) {
>> -               int j, order;
>> +               int j;
>>
>> -               for (order = __fls(count); order > 0; --order) {
>> +               for (order = min_t(unsigned int, order, __fls(count));
>> +                    order > 0; --order) {
>
> Yeah, I'd been playing with things like that, though not that exact patch.
>
> I just tried it now.  As should be obvious, it certainly makes a
> DRASTIC improvement in things but it still has some downsides as
> compared to my patch.
>
> 1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
> seconds.  I can no longer reproduce the 24 second or 120 second
> allocation call, but I still see things like "alloc 4194304 bytes:
> 3208093877 ns" (AKA an allocation taking > 3 seconds).  That's
> compared with 250 ms max with my patch.

Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps 
instead of just __GFP_NORETRY we should go all the way to clearing 
__GFP_RECLAIM for the opportunistic calls so they really fail fast?

> 2. We still have the same problem that we're taking away all the
> contiguous memory that other users may want.  I've got a dwc2 USB
> controller in my system and it needs to allocate bounce buffers for
> its DMA.  While looking at cat videos on Facebook and running a
> program to simulate memory pressure (4 userspace programs each walking
> through 350 Megs of memory over and over) I start seeing lots of order
> 3 allocation failures in dwc2.  It's true that the USB/network stack
> is resilient against these allocation failures (other than spamming my
> log), but performance will decrease.  When I switch to WiFi I suddenly
> start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
> drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
> performance.
>
>
>
> I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
> we don't try for > 64K chunks.  This is might be a reasonable
> compromise.  My cat video test still reproduces "alloc 4194304 bytes:
> 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
> all the large chunks of memory that everyone else would like to have.
>
>
> Oh, or how about this: we start allocating of order 4.  Upon the first
> failure we jump to order 1.  AKA: if there's no memory pressure we're
> golden.  The moment we have the first bit of memory pressure we fold.
> That's basically just a slight optimization on Marek's suggestion.  I
> still see 450 ms for an allocation, but that's not too bad.  It can
> still take away large chunks from other users, but maybe that's OK?

That makes sense - there's really no benefit to be had from trying 
orders which don't correspond to our relevant IOMMU page sizes - I'm not 
sure off-hand how many contortions you'd have to go through to actually 
get at those from here, although it might be another argument in favour 
of moving the pgsize_bitmap into the iommu_domain as Will proposed some 
time ago. In lieu of an actual lookup, my general inclination would be 
to go 2MB->1MB->64K->4K to cover all the common page sizes, but Marek's 
probably right that the larger two are less relevant in the context of 
mobile graphics stuff, which lets face it is the prime concern for 
IOMMUs on 32-bit ARM.

> Anyway, I'll plan to send that patch up.  I'll also do a quick test to
> see if my "sort()" actually helps anything.

Sounds good. I'm about to disappear off for holidays, but it'll be good 
to see how much you've improved everything when I get back :D

Thanks,
Robin.

>
> -Doug
>

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-18 20:20         ` Robin Murphy
@ 2015-12-18 22:05           ` Doug Anderson
  -1 siblings, 0 replies; 22+ messages in thread
From: Doug Anderson @ 2015-12-18 22:05 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Russell King, Laurent Pinchart, linux-kernel, Pawel Osciak,
	mike.looijmans, Lorenzo Nava, Dmitry Torokhov, Will Deacon,
	Tomasz Figa, David Rientjes, Carlo Caione, Andrew Morton,
	linux-arm-kernel, Marek Szyprowski

Hi,

On Fri, Dec 18, 2015 at 12:20 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps instead
> of just __GFP_NORETRY we should go all the way to clearing __GFP_RECLAIM for
> the opportunistic calls so they really fail fast?

Ah, interesting.

Hrmm, I thought I mentioned somewhere that I'm testing on 3.14, but
looking back it looks like I didn't.  :(  It is entirely possible that
the memory management has improved in newer versions of Linux so that
things aren't so bad even without my patch.  Since I'm doing a full
system test it's pretty hard for me to bump up to a new version of
Linux and test (it looks like the accelerated video hasn't landed
there).

On 3.14 thre's no "__GFP_RECLAIM".

Since many ARM users are running old versions of Linux and are
interested in nice backportable patches, it probably makes sense to
post up and use __GFP_NORETRY and someone in the future can try poking
it to clearing __GFP_RECLAIM instead?


>> 2. We still have the same problem that we're taking away all the
>> contiguous memory that other users may want.  I've got a dwc2 USB
>> controller in my system and it needs to allocate bounce buffers for
>> its DMA.  While looking at cat videos on Facebook and running a
>> program to simulate memory pressure (4 userspace programs each walking
>> through 350 Megs of memory over and over) I start seeing lots of order
>> 3 allocation failures in dwc2.  It's true that the USB/network stack
>> is resilient against these allocation failures (other than spamming my
>> log), but performance will decrease.  When I switch to WiFi I suddenly
>> start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
>> drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
>> performance.
>>
>>
>>
>> I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
>> we don't try for > 64K chunks.  This is might be a reasonable
>> compromise.  My cat video test still reproduces "alloc 4194304 bytes:
>> 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
>> all the large chunks of memory that everyone else would like to have.
>>
>>
>> Oh, or how about this: we start allocating of order 4.  Upon the first
>> failure we jump to order 1.  AKA: if there's no memory pressure we're
>> golden.  The moment we have the first bit of memory pressure we fold.
>> That's basically just a slight optimization on Marek's suggestion.  I
>> still see 450 ms for an allocation, but that's not too bad.  It can
>> still take away large chunks from other users, but maybe that's OK?
>
>
> That makes sense - there's really no benefit to be had from trying orders
> which don't correspond to our relevant IOMMU page sizes - I'm not sure
> off-hand how many contortions you'd have to go through to actually get at
> those from here, although it might be another argument in favour of moving
> the pgsize_bitmap into the iommu_domain as Will proposed some time ago. In
> lieu of an actual lookup, my general inclination would be to go
> 2MB->1MB->64K->4K to cover all the common page sizes, but Marek's probably
> right that the larger two are less relevant in the context of mobile
> graphics stuff, which lets face it is the prime concern for IOMMUs on 32-bit
> ARM.

OK, adding 1MB in the mix isn't too hard and doesn't seem to affect
performance too negatively compared to just trying 64K too.  2MB
either.

Note that I re-tested trying 64K vs. just always allocation 1 page at
a time.  In my usage model (Facebook cat videos under memory pressure)
performance was visually better with page at a time.  That's because:

1. I'm on a system that uses IOMMU for video decoding.  I don't think
TLB optimization in this case is very critical since video decoding is
a linear optimization.  Also these IOMMU mappings are allocated once
per video, so every time I scroll down and a new video starts playing
it allocates a new buffer.  The time it takes is critical.

2. The page at a time definitely affects my other peripherals, so by
eating up large pages I kill my network connectivity.  :(

It might make sense to choose peripheral by peripheral.


>> Anyway, I'll plan to send that patch up.  I'll also do a quick test to
>> see if my "sort()" actually helps anything.

The sort actually did help in some cases.  I'll throw it up as a
separate patch and people can see if they want it.


> Sounds good. I'm about to disappear off for holidays, but it'll be good to
> see how much you've improved everything when I get back :D

I'm going to be out for a while too.


Anyway, I'm about out of time.  I'll send up what I have and people
can debate about it if they want...  Unless there's something truly
terribly about it maybe it would be good to land since the newest
patch shouldn't cause any major regressions but should massively
improve performance for some cases.

-Doug

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-18 22:05           ` Doug Anderson
  0 siblings, 0 replies; 22+ messages in thread
From: Doug Anderson @ 2015-12-18 22:05 UTC (permalink / raw)
  To: linux-arm-kernel

Hi,

On Fri, Dec 18, 2015 at 12:20 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps instead
> of just __GFP_NORETRY we should go all the way to clearing __GFP_RECLAIM for
> the opportunistic calls so they really fail fast?

Ah, interesting.

Hrmm, I thought I mentioned somewhere that I'm testing on 3.14, but
looking back it looks like I didn't.  :(  It is entirely possible that
the memory management has improved in newer versions of Linux so that
things aren't so bad even without my patch.  Since I'm doing a full
system test it's pretty hard for me to bump up to a new version of
Linux and test (it looks like the accelerated video hasn't landed
there).

On 3.14 thre's no "__GFP_RECLAIM".

Since many ARM users are running old versions of Linux and are
interested in nice backportable patches, it probably makes sense to
post up and use __GFP_NORETRY and someone in the future can try poking
it to clearing __GFP_RECLAIM instead?


>> 2. We still have the same problem that we're taking away all the
>> contiguous memory that other users may want.  I've got a dwc2 USB
>> controller in my system and it needs to allocate bounce buffers for
>> its DMA.  While looking at cat videos on Facebook and running a
>> program to simulate memory pressure (4 userspace programs each walking
>> through 350 Megs of memory over and over) I start seeing lots of order
>> 3 allocation failures in dwc2.  It's true that the USB/network stack
>> is resilient against these allocation failures (other than spamming my
>> log), but performance will decrease.  When I switch to WiFi I suddenly
>> start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
>> drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
>> performance.
>>
>>
>>
>> I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
>> we don't try for > 64K chunks.  This is might be a reasonable
>> compromise.  My cat video test still reproduces "alloc 4194304 bytes:
>> 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
>> all the large chunks of memory that everyone else would like to have.
>>
>>
>> Oh, or how about this: we start allocating of order 4.  Upon the first
>> failure we jump to order 1.  AKA: if there's no memory pressure we're
>> golden.  The moment we have the first bit of memory pressure we fold.
>> That's basically just a slight optimization on Marek's suggestion.  I
>> still see 450 ms for an allocation, but that's not too bad.  It can
>> still take away large chunks from other users, but maybe that's OK?
>
>
> That makes sense - there's really no benefit to be had from trying orders
> which don't correspond to our relevant IOMMU page sizes - I'm not sure
> off-hand how many contortions you'd have to go through to actually get at
> those from here, although it might be another argument in favour of moving
> the pgsize_bitmap into the iommu_domain as Will proposed some time ago. In
> lieu of an actual lookup, my general inclination would be to go
> 2MB->1MB->64K->4K to cover all the common page sizes, but Marek's probably
> right that the larger two are less relevant in the context of mobile
> graphics stuff, which lets face it is the prime concern for IOMMUs on 32-bit
> ARM.

OK, adding 1MB in the mix isn't too hard and doesn't seem to affect
performance too negatively compared to just trying 64K too.  2MB
either.

Note that I re-tested trying 64K vs. just always allocation 1 page at
a time.  In my usage model (Facebook cat videos under memory pressure)
performance was visually better with page at a time.  That's because:

1. I'm on a system that uses IOMMU for video decoding.  I don't think
TLB optimization in this case is very critical since video decoding is
a linear optimization.  Also these IOMMU mappings are allocated once
per video, so every time I scroll down and a new video starts playing
it allocates a new buffer.  The time it takes is critical.

2. The page at a time definitely affects my other peripherals, so by
eating up large pages I kill my network connectivity.  :(

It might make sense to choose peripheral by peripheral.


>> Anyway, I'll plan to send that patch up.  I'll also do a quick test to
>> see if my "sort()" actually helps anything.

The sort actually did help in some cases.  I'll throw it up as a
separate patch and people can see if they want it.


> Sounds good. I'm about to disappear off for holidays, but it'll be good to
> see how much you've improved everything when I get back :D

I'm going to be out for a while too.


Anyway, I'm about out of time.  I'll send up what I have and people
can debate about it if they want...  Unless there's something truly
terribly about it maybe it would be good to land since the newest
patch shouldn't cause any major regressions but should massively
improve performance for some cases.

-Doug

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-18  6:05     ` Tomasz Figa
@ 2015-12-21  1:12       ` Laurent Pinchart
  -1 siblings, 0 replies; 22+ messages in thread
From: Laurent Pinchart @ 2015-12-21  1:12 UTC (permalink / raw)
  To: Tomasz Figa
  Cc: Doug Anderson, Russell King, Marek Szyprowski, Pawel Osciak,
	Dmitry Torokhov, Will Deacon, Andrew Morton, David Rientjes,
	Carlo Caione, Laurent Pinchart, mike.looijmans, lorenx4,
	linux-arm-kernel, linux-kernel

Hi Tomasz,

On Friday 18 December 2015 15:05:45 Tomasz Figa wrote:
> On Fri, Dec 18, 2015 at 7:31 AM, Doug Anderson wrote:
> > On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson wrote:
> >> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> >> sizeable buffers.  Upon simple tests of video I saw it trying to
> >> allocate 4,194,304 bytes.  The function tries to be efficient about this
> >> by starting out allocating large chunks and then moving to smaller and
> >> smaller chunk sizes until it succeeds.
> >> 
> >> The current function is very, very slow.
> >> 
> >> One problem is the way it keeps trying and trying to allocate big
> >> chunks.  Imagine a very fragmented memory that has 4M free but no
> >> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> >> We'll do the following memory allocations:
> >> 
> >> - For page 1:
> >>   - Try to allocate order 10 (no retry)
> >>   - Try to allocate order 9 (no retry)
> >>   - ...
> >>   - Try to allocate order 0 (with retry, but not needed)
> >> 
> >> - For page 2:
> >>   - Try to allocate order 9 (no retry)
> >>   - Try to allocate order 8 (no retry)
> >>   - ...
> >>   - Try to allocate order 0 (with retry, but not needed)
> >> 
> >> - ...
> >> - ...
> >> 
> >> Total number of calls to alloc() calls for this case is:
> >>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
> >>   => 9228
> >> 
> >> The above is obviously worse case, but given how slow alloc can be we
> >> really want to try to avoid even somewhat bad cases.  I timed the old
> >> code with a device under memory pressure and it wasn't hard to see it
> >> take more than 24 seconds to allocate 4 megs of memory (!!).
> >> 
> >> A second problem (and maybe even more important) is that allocating big
> >> chunks when we don't need them is just not a good idea anyway.  The
> >> first thing we do with these big chunks is break them into smaller
> >> chunks!  If we allocate small chunks:
> >> - The memory manager doesn't need to work so hard to give us big chunks.
> >> - We can save the big chunks for those that really need them and this
> >> 
> >>   code can make great use of all the small chunks sitting around.
> >> 
> >> Let's simplify by just allocating one page at a time.  We may make more
> >> total allocate calls but it works way better.  In real world tests that
> >> used to sometimes see a 24 second allocation call I can now see at most
> >> 250 ms.
> > 
> > Off-list I talked to Dmitry about this a little bit and he pointed out
> > that contiguous chunks actually give a benefit to the IOMMU.  I don't
> > think the benefit outweighs the cost in this case, but I'm happy to
> > hear what others have to say.
> 
> Yeah, I'd like to see some discussion about the effect of allocating
> bigger chunks on IOMMU performance. Dmitry (on CC), could you
> elaborate a bit on what Doug mentioned?
> 
> As for my own understanding, some IOMMUs can map memory using big
> pages, which should improve TLB efficiency and so look-up speed.
> However AFAICT current implementation of allocating function doesn't
> allocate the chunks properly, because there is no guarantee that
> particular chunks are aligned on big page boundary. For example, it
> might happen that we allocate first chunk of order 0, then second
> chunk of order 4 (64KiB - typical big page), then we won't be able to
> map the second chunk using a big page, because the IOVA at that point
> will not be aligned properly.

That might be true of the current implementations, but there's nothing that 
would stop an IOMMU driver to map the start of the buffer at an IOVA address 
aligned to 64kB minus 4kB in the example you mentioned. This would move to a 
trade-off between allocation complexity and runtime performances.

> Is there any other case when bigger physically contiguous chunks can
> help the IOMMU?

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-21  1:12       ` Laurent Pinchart
  0 siblings, 0 replies; 22+ messages in thread
From: Laurent Pinchart @ 2015-12-21  1:12 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Tomasz,

On Friday 18 December 2015 15:05:45 Tomasz Figa wrote:
> On Fri, Dec 18, 2015 at 7:31 AM, Doug Anderson wrote:
> > On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson wrote:
> >> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> >> sizeable buffers.  Upon simple tests of video I saw it trying to
> >> allocate 4,194,304 bytes.  The function tries to be efficient about this
> >> by starting out allocating large chunks and then moving to smaller and
> >> smaller chunk sizes until it succeeds.
> >> 
> >> The current function is very, very slow.
> >> 
> >> One problem is the way it keeps trying and trying to allocate big
> >> chunks.  Imagine a very fragmented memory that has 4M free but no
> >> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> >> We'll do the following memory allocations:
> >> 
> >> - For page 1:
> >>   - Try to allocate order 10 (no retry)
> >>   - Try to allocate order 9 (no retry)
> >>   - ...
> >>   - Try to allocate order 0 (with retry, but not needed)
> >> 
> >> - For page 2:
> >>   - Try to allocate order 9 (no retry)
> >>   - Try to allocate order 8 (no retry)
> >>   - ...
> >>   - Try to allocate order 0 (with retry, but not needed)
> >> 
> >> - ...
> >> - ...
> >> 
> >> Total number of calls to alloc() calls for this case is:
> >>   sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
> >>   => 9228
> >> 
> >> The above is obviously worse case, but given how slow alloc can be we
> >> really want to try to avoid even somewhat bad cases.  I timed the old
> >> code with a device under memory pressure and it wasn't hard to see it
> >> take more than 24 seconds to allocate 4 megs of memory (!!).
> >> 
> >> A second problem (and maybe even more important) is that allocating big
> >> chunks when we don't need them is just not a good idea anyway.  The
> >> first thing we do with these big chunks is break them into smaller
> >> chunks!  If we allocate small chunks:
> >> - The memory manager doesn't need to work so hard to give us big chunks.
> >> - We can save the big chunks for those that really need them and this
> >> 
> >>   code can make great use of all the small chunks sitting around.
> >> 
> >> Let's simplify by just allocating one page at a time.  We may make more
> >> total allocate calls but it works way better.  In real world tests that
> >> used to sometimes see a 24 second allocation call I can now see at most
> >> 250 ms.
> > 
> > Off-list I talked to Dmitry about this a little bit and he pointed out
> > that contiguous chunks actually give a benefit to the IOMMU.  I don't
> > think the benefit outweighs the cost in this case, but I'm happy to
> > hear what others have to say.
> 
> Yeah, I'd like to see some discussion about the effect of allocating
> bigger chunks on IOMMU performance. Dmitry (on CC), could you
> elaborate a bit on what Doug mentioned?
> 
> As for my own understanding, some IOMMUs can map memory using big
> pages, which should improve TLB efficiency and so look-up speed.
> However AFAICT current implementation of allocating function doesn't
> allocate the chunks properly, because there is no guarantee that
> particular chunks are aligned on big page boundary. For example, it
> might happen that we allocate first chunk of order 0, then second
> chunk of order 4 (64KiB - typical big page), then we won't be able to
> map the second chunk using a big page, because the IOVA at that point
> will not be aligned properly.

That might be true of the current implementations, but there's nothing that 
would stop an IOMMU driver to map the start of the buffer at an IOVA address 
aligned to 64kB minus 4kB in the example you mentioned. This would move to a 
trade-off between allocation complexity and runtime performances.

> Is there any other case when bigger physically contiguous chunks can
> help the IOMMU?

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-18 20:20         ` Robin Murphy
@ 2015-12-21  1:26           ` Laurent Pinchart
  -1 siblings, 0 replies; 22+ messages in thread
From: Laurent Pinchart @ 2015-12-21  1:26 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Doug Anderson, Russell King, Laurent Pinchart, linux-kernel,
	Pawel Osciak, mike.looijmans, Lorenzo Nava, Dmitry Torokhov,
	Will Deacon, Tomasz Figa, David Rientjes, Carlo Caione,
	Andrew Morton, linux-arm-kernel, Marek Szyprowski

Hi Robin,

On Friday 18 December 2015 20:20:56 Robin Murphy wrote:
> On 18/12/15 18:55, Doug Anderson wrote:
> > On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy wrote:
> >> On 17/12/15 22:31, Doug Anderson wrote:
> >>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson wrote:
> >>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> >>>> sizeable buffers.  Upon simple tests of video I saw it trying to
> >>>> allocate 4,194,304 bytes.  The function tries to be efficient about
> >>>> this by starting out allocating large chunks and then moving to smaller
> >>>> and smaller chunk sizes until it succeeds.
> >>>> 
> >>>> The current function is very, very slow.
> >>>> 
> >>>> One problem is the way it keeps trying and trying to allocate big
> >>>> chunks.  Imagine a very fragmented memory that has 4M free but no
> >>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> >>>> We'll do the following memory allocations:
> >>>> 
> >>>> - For page 1:
> >>>>     - Try to allocate order 10 (no retry)
> >>>>     - Try to allocate order 9 (no retry)
> >>>>     - ...
> >>>>     - Try to allocate order 0 (with retry, but not needed)
> >>>> 
> >>>> - For page 2:
> >>>>     - Try to allocate order 9 (no retry)
> >>>>     - Try to allocate order 8 (no retry)
> >>>>     - ...
> >>>>     - Try to allocate order 0 (with retry, but not needed)
> >>>> 
> >>>> - ...
> >>>> - ...
> >>>> 
> >>>> Total number of calls to alloc() calls for this case is:
> >>>>     sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
> >>>>     => 9228
> >>>> 
> >>>> The above is obviously worse case, but given how slow alloc can be we
> >>>> really want to try to avoid even somewhat bad cases.  I timed the old
> >>>> code with a device under memory pressure and it wasn't hard to see it
> >>>> take more than 24 seconds to allocate 4 megs of memory (!!).
> >>>> 
> >>>> A second problem (and maybe even more important) is that allocating big
> >>>> chunks when we don't need them is just not a good idea anyway.  The
> >>>> first thing we do with these big chunks is break them into smaller
> >>>> chunks!  If we allocate small chunks:
> >>>> - The memory manager doesn't need to work so hard to give us big
> >>>>   chunks.
> >>>> - We can save the big chunks for those that really need them and this
> >>>>   code can make great use of all the small chunks sitting around.
> >>>> 
> >>>> Let's simplify by just allocating one page at a time.  We may make more
> >>>> total allocate calls but it works way better.  In real world tests that
> >>>> used to sometimes see a 24 second allocation call I can now see at most
> >>>> 250 ms.
> > 
> > One thing to note is that testing yesterday I actually managed to
> > reproduce an allocation taking 120 seconds (!) with the old code.
> 
> Yikes! That really is worth avoiding...
> 
> >>> Off-list I talked to Dmitry about this a little bit and he pointed out
> >>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> >>> think the benefit outweighs the cost in this case, but I'm happy to
> >>> hear what others have to say.  I did some quick printouts and it turns
> >>> out that even when requesting page at a time the memory manager
> >>> (unsurprisingly) can in many cases still give us pages that are
> >>> contiguous.
> >>> 
> >>> Also I'm happy to post up
> >>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
> >>> array and could possibly give us larger chunks of contiguous memory.
> >> 
> >> I think sorting individually-allocated pages really isn't worth the
> >> effort - I'm not aware of anything that's going to be capable of using
> >> larger page/section mappings without also having the necessary physical
> >> alignment, and if you _can_ cobble together, say, 2MB worth of
> >> contiguous pages *at 2MB alignment*, then you would have been far better
> >> off just asking the slab allocator for that in the first place.
> >> 
> >> That's the key point of the higher-order allocation - not that you get
> >> some contiguous pages, but that the region you get is also naturally
> >> aligned to its size physically. That we break up the CPU page tables for
> >> that region into individual pages is just an inconsequential
> >> implementation detail from the IOMMU side. When you _do_ have plenty of
> >> unfragmented free memory it can really be a big win - here's an
> >> instrumented example of what happens on my Juno with the ARM HDLCD/SMMU
> >> combo setting up a framebuffer at boot time:
> >>    iommu_dma_alloc: alloc size 0x753000, 1875 pages
> >>    __iommu_dma_alloc_pages: allocated at order 10
> >>    __iommu_dma_alloc_pages: allocated at order 9
> >>    __iommu_dma_alloc_pages: allocated at order 8
> >>    __iommu_dma_alloc_pages: allocated at order 6
> >>    __iommu_dma_alloc_pages: allocated at order 4
> >>    __iommu_dma_alloc_pages: allocated at order 1
> >>    __iommu_dma_alloc_pages: allocated at order 0
> >>    iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
> >>    iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
> >>    iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
> >>    iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
> >>    iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
> >>    iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
> >>    iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
> >>    iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
> >>    iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
> >>    iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
> >>    ...
> >> 
> >> Since the IOVA region itself is aligned to 8MB (for the total size) and
> >> the physical regions come out in optimal decreasing order, we're able to
> >> map over 80% of the whole buffer with just 3 section entries, with a
> >> corresponding saving on TLB pressure, page table maintenance (cache
> >> flushing), etc.
> >> 
> >> That said, unless you're in the middle of some crazy allocator-thrashing
> >> race, then it's probably safe to assume that once allocation fails at a
> >> given order that's going to remain the case in the near future - would
> >> you mind taking the following diff for a spin under your test conditions
> >> to see how it compares?
> >> 
> >> Robin.
> >> 
> >> ----->8-----
> >> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> >> index dfb5001..95e75c4 100644
> >> --- a/arch/arm/mm/dma-mapping.c
> >> +++ b/arch/arm/mm/dma-mapping.c
> >> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct
> >> device *dev, size_t size,
> >>          int count = size >> PAGE_SHIFT;
> >>          int array_size = count * sizeof(struct page *);
> >>          int i = 0;
> >> +        unsigned int order = MAX_ORDER;
> >> 
> >>          if (array_size <= PAGE_SIZE)
> >>                  pages = kzalloc(array_size, GFP_KERNEL);
> >> 
> >> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct
> >> device *dev, size_t size,
> >>          gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
> >>          
> >>          while (count) {
> >> -               int j, order;
> >> +               int j;
> >> 
> >> -               for (order = __fls(count); order > 0; --order) {
> >> +               for (order = min_t(unsigned int, order, __fls(count));
> >> +                    order > 0; --order) {
> > 
> > Yeah, I'd been playing with things like that, though not that exact patch.
> > 
> > I just tried it now.  As should be obvious, it certainly makes a
> > DRASTIC improvement in things but it still has some downsides as
> > compared to my patch.
> > 
> > 1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
> > seconds.  I can no longer reproduce the 24 second or 120 second
> > allocation call, but I still see things like "alloc 4194304 bytes:
> > 3208093877 ns" (AKA an allocation taking > 3 seconds).  That's
> > compared with 250 ms max with my patch.
> 
> Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps
> instead of just __GFP_NORETRY we should go all the way to clearing
> __GFP_RECLAIM for the opportunistic calls so they really fail fast?
> 
> > 2. We still have the same problem that we're taking away all the
> > contiguous memory that other users may want.  I've got a dwc2 USB
> > controller in my system and it needs to allocate bounce buffers for
> > its DMA.  While looking at cat videos on Facebook and running a
> > program to simulate memory pressure (4 userspace programs each walking
> > through 350 Megs of memory over and over) I start seeing lots of order
> > 3 allocation failures in dwc2.  It's true that the USB/network stack
> > is resilient against these allocation failures (other than spamming my
> > log), but performance will decrease.  When I switch to WiFi I suddenly
> > start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
> > drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
> > performance.
> > 
> > I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
> > we don't try for > 64K chunks.  This is might be a reasonable
> > compromise.  My cat video test still reproduces "alloc 4194304 bytes:
> > 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
> > all the large chunks of memory that everyone else would like to have.
> > 
> > Oh, or how about this: we start allocating of order 4.  Upon the first
> > failure we jump to order 1.  AKA: if there's no memory pressure we're
> > golden.  The moment we have the first bit of memory pressure we fold.
> > That's basically just a slight optimization on Marek's suggestion.  I
> > still see 450 ms for an allocation, but that's not too bad.  It can
> > still take away large chunks from other users, but maybe that's OK?
> 
> That makes sense - there's really no benefit to be had from trying
> orders which don't correspond to our relevant IOMMU page sizes - I'm not
> sure off-hand how many contortions you'd have to go through to actually
> get at those from here, although it might be another argument in favour
> of moving the pgsize_bitmap into the iommu_domain as Will proposed some
> time ago.

What's the status of that ? Do we just need a volunteer to do the job ?

> In lieu of an actual lookup, my general inclination would be to go 2MB->1MB-
> >64K->4K to cover all the common page sizes, but Marek's probably right that
> >the larger two are less relevant in the context of mobile graphics stuff,
> >which lets face it is the prime concern for IOMMUs on 32-bit ARM.
> 
> > Anyway, I'll plan to send that patch up.  I'll also do a quick test to
> > see if my "sort()" actually helps anything.
> 
> Sounds good. I'm about to disappear off for holidays, but it'll be good
> to see how much you've improved everything when I get back :D

-- 
Regards,

Laurent Pinchart


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-21  1:26           ` Laurent Pinchart
  0 siblings, 0 replies; 22+ messages in thread
From: Laurent Pinchart @ 2015-12-21  1:26 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Robin,

On Friday 18 December 2015 20:20:56 Robin Murphy wrote:
> On 18/12/15 18:55, Doug Anderson wrote:
> > On Fri, Dec 18, 2015 at 4:41 AM, Robin Murphy wrote:
> >> On 17/12/15 22:31, Doug Anderson wrote:
> >>> On Thu, Dec 17, 2015 at 12:30 PM, Douglas Anderson wrote:
> >>>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
> >>>> sizeable buffers.  Upon simple tests of video I saw it trying to
> >>>> allocate 4,194,304 bytes.  The function tries to be efficient about
> >>>> this by starting out allocating large chunks and then moving to smaller
> >>>> and smaller chunk sizes until it succeeds.
> >>>> 
> >>>> The current function is very, very slow.
> >>>> 
> >>>> One problem is the way it keeps trying and trying to allocate big
> >>>> chunks.  Imagine a very fragmented memory that has 4M free but no
> >>>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
> >>>> We'll do the following memory allocations:
> >>>> 
> >>>> - For page 1:
> >>>>     - Try to allocate order 10 (no retry)
> >>>>     - Try to allocate order 9 (no retry)
> >>>>     - ...
> >>>>     - Try to allocate order 0 (with retry, but not needed)
> >>>> 
> >>>> - For page 2:
> >>>>     - Try to allocate order 9 (no retry)
> >>>>     - Try to allocate order 8 (no retry)
> >>>>     - ...
> >>>>     - Try to allocate order 0 (with retry, but not needed)
> >>>> 
> >>>> - ...
> >>>> - ...
> >>>> 
> >>>> Total number of calls to alloc() calls for this case is:
> >>>>     sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
> >>>>     => 9228
> >>>> 
> >>>> The above is obviously worse case, but given how slow alloc can be we
> >>>> really want to try to avoid even somewhat bad cases.  I timed the old
> >>>> code with a device under memory pressure and it wasn't hard to see it
> >>>> take more than 24 seconds to allocate 4 megs of memory (!!).
> >>>> 
> >>>> A second problem (and maybe even more important) is that allocating big
> >>>> chunks when we don't need them is just not a good idea anyway.  The
> >>>> first thing we do with these big chunks is break them into smaller
> >>>> chunks!  If we allocate small chunks:
> >>>> - The memory manager doesn't need to work so hard to give us big
> >>>>   chunks.
> >>>> - We can save the big chunks for those that really need them and this
> >>>>   code can make great use of all the small chunks sitting around.
> >>>> 
> >>>> Let's simplify by just allocating one page at a time.  We may make more
> >>>> total allocate calls but it works way better.  In real world tests that
> >>>> used to sometimes see a 24 second allocation call I can now see at most
> >>>> 250 ms.
> > 
> > One thing to note is that testing yesterday I actually managed to
> > reproduce an allocation taking 120 seconds (!) with the old code.
> 
> Yikes! That really is worth avoiding...
> 
> >>> Off-list I talked to Dmitry about this a little bit and he pointed out
> >>> that contiguous chunks actually give a benefit to the IOMMU.  I don't
> >>> think the benefit outweighs the cost in this case, but I'm happy to
> >>> hear what others have to say.  I did some quick printouts and it turns
> >>> out that even when requesting page at a time the memory manager
> >>> (unsurprisingly) can in many cases still give us pages that are
> >>> contiguous.
> >>> 
> >>> Also I'm happy to post up
> >>> <https://chromium-review.googlesource.com/#/c/319210/> which sorts the
> >>> array and could possibly give us larger chunks of contiguous memory.
> >> 
> >> I think sorting individually-allocated pages really isn't worth the
> >> effort - I'm not aware of anything that's going to be capable of using
> >> larger page/section mappings without also having the necessary physical
> >> alignment, and if you _can_ cobble together, say, 2MB worth of
> >> contiguous pages *at 2MB alignment*, then you would have been far better
> >> off just asking the slab allocator for that in the first place.
> >> 
> >> That's the key point of the higher-order allocation - not that you get
> >> some contiguous pages, but that the region you get is also naturally
> >> aligned to its size physically. That we break up the CPU page tables for
> >> that region into individual pages is just an inconsequential
> >> implementation detail from the IOMMU side. When you _do_ have plenty of
> >> unfragmented free memory it can really be a big win - here's an
> >> instrumented example of what happens on my Juno with the ARM HDLCD/SMMU
> >> combo setting up a framebuffer at boot time:
> >>    iommu_dma_alloc: alloc size 0x753000, 1875 pages
> >>    __iommu_dma_alloc_pages: allocated at order 10
> >>    __iommu_dma_alloc_pages: allocated at order 9
> >>    __iommu_dma_alloc_pages: allocated at order 8
> >>    __iommu_dma_alloc_pages: allocated at order 6
> >>    __iommu_dma_alloc_pages: allocated at order 4
> >>    __iommu_dma_alloc_pages: allocated at order 1
> >>    __iommu_dma_alloc_pages: allocated at order 0
> >>    iommu: map: iova 0xff800000 pa 0x00000009f5400000 size 0x400000
> >>    iommu: mapping: iova 0xff800000 pa 0x00000009f5400000 pgsize 0x200000
> >>    iommu: mapping: iova 0xffa00000 pa 0x00000009f5600000 pgsize 0x200000
> >>    iommu: map: iova 0xffc00000 pa 0x00000000fa200000 size 0x200000
> >>    iommu: mapping: iova 0xffc00000 pa 0x00000000fa200000 pgsize 0x200000
> >>    iommu: map: iova 0xffe00000 pa 0x00000009f5a00000 size 0x100000
> >>    iommu: mapping: iova 0xffe00000 pa 0x00000009f5a00000 pgsize 0x1000
> >>    iommu: mapping: iova 0xffe01000 pa 0x00000009f5a01000 pgsize 0x1000
> >>    iommu: mapping: iova 0xffe02000 pa 0x00000009f5a02000 pgsize 0x1000
> >>    iommu: mapping: iova 0xffe03000 pa 0x00000009f5a03000 pgsize 0x1000
> >>    ...
> >> 
> >> Since the IOVA region itself is aligned to 8MB (for the total size) and
> >> the physical regions come out in optimal decreasing order, we're able to
> >> map over 80% of the whole buffer with just 3 section entries, with a
> >> corresponding saving on TLB pressure, page table maintenance (cache
> >> flushing), etc.
> >> 
> >> That said, unless you're in the middle of some crazy allocator-thrashing
> >> race, then it's probably safe to assume that once allocation fails at a
> >> given order that's going to remain the case in the near future - would
> >> you mind taking the following diff for a spin under your test conditions
> >> to see how it compares?
> >> 
> >> Robin.
> >> 
> >> ----->8-----
> >> diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
> >> index dfb5001..95e75c4 100644
> >> --- a/arch/arm/mm/dma-mapping.c
> >> +++ b/arch/arm/mm/dma-mapping.c
> >> @@ -1129,6 +1129,7 @@ static struct page **__iommu_alloc_buffer(struct
> >> device *dev, size_t size,
> >>          int count = size >> PAGE_SHIFT;
> >>          int array_size = count * sizeof(struct page *);
> >>          int i = 0;
> >> +        unsigned int order = MAX_ORDER;
> >> 
> >>          if (array_size <= PAGE_SIZE)
> >>                  pages = kzalloc(array_size, GFP_KERNEL);
> >> 
> >> @@ -1160,9 +1161,10 @@ static struct page **__iommu_alloc_buffer(struct
> >> device *dev, size_t size,
> >>          gfp |= __GFP_NOWARN | __GFP_HIGHMEM;
> >>          
> >>          while (count) {
> >> -               int j, order;
> >> +               int j;
> >> 
> >> -               for (order = __fls(count); order > 0; --order) {
> >> +               for (order = min_t(unsigned int, order, __fls(count));
> >> +                    order > 0; --order) {
> > 
> > Yeah, I'd been playing with things like that, though not that exact patch.
> > 
> > I just tried it now.  As should be obvious, it certainly makes a
> > DRASTIC improvement in things but it still has some downsides as
> > compared to my patch.
> > 
> > 1. It's still pretty easy for arm_iommu_alloc_attrs() to take many
> > seconds.  I can no longer reproduce the 24 second or 120 second
> > allocation call, but I still see things like "alloc 4194304 bytes:
> > 3208093877 ns" (AKA an allocation taking > 3 seconds).  That's
> > compared with 250 ms max with my patch.
> 
> Hmm, I'm no mm expert, but from a look at the flags in gfp.h perhaps
> instead of just __GFP_NORETRY we should go all the way to clearing
> __GFP_RECLAIM for the opportunistic calls so they really fail fast?
> 
> > 2. We still have the same problem that we're taking away all the
> > contiguous memory that other users may want.  I've got a dwc2 USB
> > controller in my system and it needs to allocate bounce buffers for
> > its DMA.  While looking at cat videos on Facebook and running a
> > program to simulate memory pressure (4 userspace programs each walking
> > through 350 Megs of memory over and over) I start seeing lots of order
> > 3 allocation failures in dwc2.  It's true that the USB/network stack
> > is resilient against these allocation failures (other than spamming my
> > log), but performance will decrease.  When I switch to WiFi I suddenly
> > start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
> > drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
> > performance.
> > 
> > I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
> > we don't try for > 64K chunks.  This is might be a reasonable
> > compromise.  My cat video test still reproduces "alloc 4194304 bytes:
> > 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
> > all the large chunks of memory that everyone else would like to have.
> > 
> > Oh, or how about this: we start allocating of order 4.  Upon the first
> > failure we jump to order 1.  AKA: if there's no memory pressure we're
> > golden.  The moment we have the first bit of memory pressure we fold.
> > That's basically just a slight optimization on Marek's suggestion.  I
> > still see 450 ms for an allocation, but that's not too bad.  It can
> > still take away large chunks from other users, but maybe that's OK?
> 
> That makes sense - there's really no benefit to be had from trying
> orders which don't correspond to our relevant IOMMU page sizes - I'm not
> sure off-hand how many contortions you'd have to go through to actually
> get at those from here, although it might be another argument in favour
> of moving the pgsize_bitmap into the iommu_domain as Will proposed some
> time ago.

What's the status of that ? Do we just need a volunteer to do the job ?

> In lieu of an actual lookup, my general inclination would be to go 2MB->1MB-
> >64K->4K to cover all the common page sizes, but Marek's probably right that
> >the larger two are less relevant in the context of mobile graphics stuff,
> >which lets face it is the prime concern for IOMMUs on 32-bit ARM.
> 
> > Anyway, I'll plan to send that patch up.  I'll also do a quick test to
> > see if my "sort()" actually helps anything.
> 
> Sounds good. I'm about to disappear off for holidays, but it'll be good
> to see how much you've improved everything when I get back :D

-- 
Regards,

Laurent Pinchart

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
  2015-12-21  1:26           ` Laurent Pinchart
@ 2015-12-21 10:15             ` Will Deacon
  -1 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2015-12-21 10:15 UTC (permalink / raw)
  To: Laurent Pinchart
  Cc: Robin Murphy, Doug Anderson, Russell King, Laurent Pinchart,
	linux-kernel, Pawel Osciak, mike.looijmans, Lorenzo Nava,
	Dmitry Torokhov, Tomasz Figa, David Rientjes, Carlo Caione,
	Andrew Morton, linux-arm-kernel, Marek Szyprowski

On Mon, Dec 21, 2015 at 03:26:27AM +0200, Laurent Pinchart wrote:
> On Friday 18 December 2015 20:20:56 Robin Murphy wrote:
> > On 18/12/15 18:55, Doug Anderson wrote:
> > > 2. We still have the same problem that we're taking away all the
> > > contiguous memory that other users may want.  I've got a dwc2 USB
> > > controller in my system and it needs to allocate bounce buffers for
> > > its DMA.  While looking at cat videos on Facebook and running a
> > > program to simulate memory pressure (4 userspace programs each walking
> > > through 350 Megs of memory over and over) I start seeing lots of order
> > > 3 allocation failures in dwc2.  It's true that the USB/network stack
> > > is resilient against these allocation failures (other than spamming my
> > > log), but performance will decrease.  When I switch to WiFi I suddenly
> > > start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
> > > drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
> > > performance.
> > > 
> > > I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
> > > we don't try for > 64K chunks.  This is might be a reasonable
> > > compromise.  My cat video test still reproduces "alloc 4194304 bytes:
> > > 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
> > > all the large chunks of memory that everyone else would like to have.
> > > 
> > > Oh, or how about this: we start allocating of order 4.  Upon the first
> > > failure we jump to order 1.  AKA: if there's no memory pressure we're
> > > golden.  The moment we have the first bit of memory pressure we fold.
> > > That's basically just a slight optimization on Marek's suggestion.  I
> > > still see 450 ms for an allocation, but that's not too bad.  It can
> > > still take away large chunks from other users, but maybe that's OK?
> > 
> > That makes sense - there's really no benefit to be had from trying
> > orders which don't correspond to our relevant IOMMU page sizes - I'm not
> > sure off-hand how many contortions you'd have to go through to actually
> > get at those from here, although it might be another argument in favour
> > of moving the pgsize_bitmap into the iommu_domain as Will proposed some
> > time ago.
> 
> What's the status of that ? Do we just need a volunteer to do the job ?

The pgsize_bitmap per domain stuff? It got a bunch of Acks, but Joerg
didn't like it :(

  http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/334729.html

The idea being that you should be able to attach arbitrary devices to
arbitrary domains, something that I still don't think works in practice.

One way forward would be to do what dwmw2 suggested here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2015-April/335023.html

by extending the page table code to iterate and therefore support all
page sizes. At that point, the pgsize_bitmap can be removed, although
we will run into similar issues expressing the minimum supported page
size.

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH] ARM: dma-mapping: Just allocate one chunk at a time
@ 2015-12-21 10:15             ` Will Deacon
  0 siblings, 0 replies; 22+ messages in thread
From: Will Deacon @ 2015-12-21 10:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 21, 2015 at 03:26:27AM +0200, Laurent Pinchart wrote:
> On Friday 18 December 2015 20:20:56 Robin Murphy wrote:
> > On 18/12/15 18:55, Doug Anderson wrote:
> > > 2. We still have the same problem that we're taking away all the
> > > contiguous memory that other users may want.  I've got a dwc2 USB
> > > controller in my system and it needs to allocate bounce buffers for
> > > its DMA.  While looking at cat videos on Facebook and running a
> > > program to simulate memory pressure (4 userspace programs each walking
> > > through 350 Megs of memory over and over) I start seeing lots of order
> > > 3 allocation failures in dwc2.  It's true that the USB/network stack
> > > is resilient against these allocation failures (other than spamming my
> > > log), but performance will decrease.  When I switch to WiFi I suddenly
> > > start seeing "mwifiex_sdio mmc2:0001:1: single skb allocated fail,
> > > drop pkt port=28 len=33024".  Again, it's robust, but you're affecting
> > > performance.
> > > 
> > > I also tried using "4" instead of "MAX_ORDER" (as per Marek) so that
> > > we don't try for > 64K chunks.  This is might be a reasonable
> > > compromise.  My cat video test still reproduces "alloc 4194304 bytes:
> > > 674318751 ns", but maybe ~700 ms is an OK?  Of course, this still eats
> > > all the large chunks of memory that everyone else would like to have.
> > > 
> > > Oh, or how about this: we start allocating of order 4.  Upon the first
> > > failure we jump to order 1.  AKA: if there's no memory pressure we're
> > > golden.  The moment we have the first bit of memory pressure we fold.
> > > That's basically just a slight optimization on Marek's suggestion.  I
> > > still see 450 ms for an allocation, but that's not too bad.  It can
> > > still take away large chunks from other users, but maybe that's OK?
> > 
> > That makes sense - there's really no benefit to be had from trying
> > orders which don't correspond to our relevant IOMMU page sizes - I'm not
> > sure off-hand how many contortions you'd have to go through to actually
> > get at those from here, although it might be another argument in favour
> > of moving the pgsize_bitmap into the iommu_domain as Will proposed some
> > time ago.
> 
> What's the status of that ? Do we just need a volunteer to do the job ?

The pgsize_bitmap per domain stuff? It got a bunch of Acks, but Joerg
didn't like it :(

  http://lists.infradead.org/pipermail/linux-arm-kernel/2015-March/334729.html

The idea being that you should be able to attach arbitrary devices to
arbitrary domains, something that I still don't think works in practice.

One way forward would be to do what dwmw2 suggested here:

  http://lists.infradead.org/pipermail/linux-arm-kernel/2015-April/335023.html

by extending the page table code to iterate and therefore support all
page sizes. At that point, the pgsize_bitmap can be removed, although
we will run into similar issues expressing the minimum supported page
size.

Will

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2015-12-21 10:15 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-12-17 20:30 [PATCH] ARM: dma-mapping: Just allocate one chunk at a time Douglas Anderson
2015-12-17 20:30 ` Douglas Anderson
2015-12-17 22:31 ` Doug Anderson
2015-12-17 22:31   ` Doug Anderson
2015-12-18  6:05   ` Tomasz Figa
2015-12-18  6:05     ` Tomasz Figa
2015-12-21  1:12     ` Laurent Pinchart
2015-12-21  1:12       ` Laurent Pinchart
2015-12-18 12:41   ` Robin Murphy
2015-12-18 12:41     ` Robin Murphy
2015-12-18 14:38     ` Marek Szyprowski
2015-12-18 14:38       ` Marek Szyprowski
2015-12-18 18:55     ` Doug Anderson
2015-12-18 18:55       ` Doug Anderson
2015-12-18 20:20       ` Robin Murphy
2015-12-18 20:20         ` Robin Murphy
2015-12-18 22:05         ` Doug Anderson
2015-12-18 22:05           ` Doug Anderson
2015-12-21  1:26         ` Laurent Pinchart
2015-12-21  1:26           ` Laurent Pinchart
2015-12-21 10:15           ` Will Deacon
2015-12-21 10:15             ` Will Deacon

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.