From: Tomasz Figa <tfiga@chromium.org>
To: Robin Murphy <robin.murphy@arm.com>
Cc: Douglas Anderson <dianders@chromium.org>,
	Russell King <linux@arm.linux.org.uk>,
	Mauro Carvalho Chehab <mchehab@osg.samsung.com>,
	Marek Szyprowski <m.szyprowski@samsung.com>,
	Pawel Osciak <pawel@osciak.com>,
	Dmitry Torokhov <dmitry.torokhov@gmail.com>,
	Will Deacon <will.deacon@arm.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	dan.j.williams@intel.com, Carlo Caione <carlo@caione.org>,
	Laurent Pinchart <laurent.pinchart+renesas@ideasonboard.com>,
	mike.looijmans@topic.nl, Lorenzo Nava <lorenx4@gmail.com>,
	"linux-arm-kernel@lists.infradead.org" 
	<linux-arm-kernel@lists.infradead.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v5 1/5] ARM: dma-mapping: Optimize allocation
Date: Thu, 14 Jan 2016 02:33:00 +0900	[thread overview]
Message-ID: <CAAFQd5C8asfo8wSa=jKvp4Vmg6A83R-vG7vQXtsHyOTADLo+9g@mail.gmail.com> (raw)
In-Reply-To: <56964070.1020303@arm.com>

On Wed, Jan 13, 2016 at 9:17 PM, Robin Murphy <robin.murphy@arm.com> wrote:
> Hi Doug,
>
>
> On 08/01/16 23:05, Douglas Anderson wrote:
>>
>> The __iommu_alloc_buffer() is expected to be called to allocate pretty
>> sizeable buffers.  Upon simple tests of video I saw it trying to
>> allocate 4,194,304 bytes.  The function tries to allocate large chunks
>> in order to optimize IOMMU TLB usage.
>>
>> The current function is very, very slow.
>>
>> One problem is the way it keeps trying and trying to allocate big
>> chunks.  Imagine a very fragmented memory that has 4M free but no
>> contiguous pages at all.  Further imagine allocating 4M (1024 pages).
>> We'll do the following memory allocations:
>> - For page 1:
>>    - Try to allocate order 10 (no retry)
>>    - Try to allocate order 9 (no retry)
>>    - ...
>>    - Try to allocate order 0 (with retry, but not needed)
>> - For page 2:
>>    - Try to allocate order 9 (no retry)
>>    - Try to allocate order 8 (no retry)
>>    - ...
>>    - Try to allocate order 0 (with retry, but not needed)
>> - ...
>> - ...
>>
>> Total number of calls to alloc() for this case is:
>>    sum(int(math.log(i, 2)) + 1 for i in range(1, 1025))
>>    => 9228
>>
>> The above is obviously the worst case, but given how slow alloc can be we
>> really want to try to avoid even somewhat bad cases.  I timed the old
>> code with a device under memory pressure and it wasn't hard to see it
>> take more than 120 seconds to allocate 4 megs of memory! (NOTE: testing
>> was done on kernel 3.14, so possibly mainline would behave
>> differently).
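
As a sanity check on that count, here is a small Python model of the
loop described above. It is only a sketch of my reading of the
description, not the kernel code: the function name and the
max_free_order parameter (the largest order the allocator can still
satisfy) are made up for illustration.

```python
def count_alloc_attempts(total_pages, max_free_order=0):
    """Count allocator calls when every chunk restarts from the largest
    order that still fits in the remaining page count."""
    attempts = 0
    remaining = total_pages
    while remaining:
        # Largest order that fits in what is left, i.e. floor(log2(remaining)).
        order = remaining.bit_length() - 1
        while True:
            attempts += 1
            if order <= max_free_order:
                break  # this attempt succeeds
            order -= 1  # failed, retry one order lower
        remaining -= 1 << order
    return attempts


print(count_alloc_attempts(1024))      # fully fragmented memory -> 9228
print(count_alloc_attempts(1024, 10))  # an order-10 chunk is free -> 1
```

Using int.bit_length() instead of math.log() avoids any floating-point
rounding at exact powers of two and gives the same 9228.
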
>>
>> A second problem is that allocating big chunks under memory pressure
>> is just not a great idea unless we really need them.  We can make do
>> pretty well with smaller chunks, so it's probably wise to leave bigger
>> chunks for other users once memory pressure is on.
>>
>> Let's adjust the allocation like this:
>>
>> 1. If a big chunk fails, stop trying too hard and bump down to lower
>>     order allocations.
>> 2. Don't try useless orders.  The whole point of big chunks is to
>>     optimize the TLB and it can really only make use of 2M, 1M, 64K and
>>     4K sizes.
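
To make the effect of these two rules concrete, here is a rough Python
model of the adjusted strategy. It is a sketch under assumptions, not
the patch itself: the order list (9, 8, 4, 0) is my mapping of 2M, 1M,
64K and 4K onto page orders with 4K pages, and the function name and
the max_free_order parameter are invented for illustration.

```python
IOMMU_ORDERS = (9, 8, 4, 0)  # assumed: 2M, 1M, 64K, 4K with 4K pages


def count_attempts_limited(total_pages, max_free_order=0, orders=IOMMU_ORDERS):
    """Count allocator calls when only useful orders are tried and a
    failure moves permanently down to the next smaller order."""
    attempts = 0
    remaining = total_pages
    idx = 0
    while remaining:
        order = orders[idx]
        if remaining.bit_length() - 1 < order:
            idx += 1  # chunk no longer fits in what is left
            continue
        attempts += 1
        if order > max_free_order:
            idx += 1  # rule 1: first failure, stop trying this order
            continue
        remaining -= 1 << order
    return attempts


print(count_attempts_limited(1024))      # fully fragmented memory -> 1027
print(count_attempts_limited(1024, 10))  # 2M chunks are free -> 2
```

In the fully fragmented case from the commit message this model makes
three failed high-order attempts plus 1024 order-0 allocations, so
1027 calls instead of 9228.
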
>>
>> We'll still tend to eat up a bunch of big chunks, but that might be the
>> right answer for some users.  A future patch could possibly add a new
>> DMA_ATTR that would let the caller decide that TLB optimization isn't
>> important and that we should use smaller chunks.  Presumably this would
>> be a sane strategy for some callers.
>
>
> Now that I've had time to think about it properly:
>
> Reviewed-by: Robin Murphy <robin.murphy@arm.com>
>
> I just had an absolutely disgusting idea of how to get the same progression
> with just a single variable and no static array, but I'll keep that firmly
> to myself as it's almost IOCCC-grade WTF :D

Just out of curiosity, a bitmap and loop with fls() and clearing bit
on failure or something more freaky? :)
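
To spell out what I mean, a minimal Python sketch of the bitmap idea.
This is only my guess, not necessarily what Robin has in mind: the
mask value assumes the useful orders are 9, 8, 4 and 0, and fls() here
is a stand-in that mimics the kernel helper (1-based index of the
highest set bit, 0 for an empty mask).

```python
ORDER_MASK = (1 << 9) | (1 << 8) | (1 << 4) | (1 << 0)  # assumed orders


def fls(x):
    """Stand-in for the kernel's fls(): 1-based position of the highest
    set bit, or 0 if no bit is set."""
    return x.bit_length()


def order_progression(mask=ORDER_MASK):
    """Yield the orders that would be tried if every attempt failed."""
    while mask:
        order = fls(mask) - 1
        yield order
        mask &= ~(1 << order)  # clear the bit on failure


print(list(order_progression()))  # -> [9, 8, 4, 0]
```

The single mask variable carries both the set of candidate orders and
the current position, so no separate array index is needed.
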

Anyway:

Reviewed-by: Tomasz Figa <tfiga@chromium.org>

Best regards,
Tomasz
