From: Pauli Nieminen
Subject: Re: [RFC] drm/ttm: add pool wc/uc page allocator
Date: Wed, 3 Mar 2010 17:46:43 +0200
Message-ID: <548cdfc21003030746v5bfab478g1b6e3203b0e6f8d1@mail.gmail.com>
In-Reply-To: <20100303145046.GD2078@barney.localdomain>
References: <1267392603-8915-1-git-send-email-suokkos@gmail.com>
 <548cdfc21003011432o3c6103dbj3e9d29d8f42609c3@mail.gmail.com>
 <20100303133300.GB2078@barney.localdomain>
 <548cdfc21003030623t2ec356bbob6609d8b86fd13c0@mail.gmail.com>
 <20100303145046.GD2078@barney.localdomain>
To: Jerome Glisse
Cc: Jerome Glisse, dri-devel@lists.sourceforge.net
List-Id: dri-devel@lists.freedesktop.org

On Wed, Mar 3, 2010 at 4:50 PM, Jerome Glisse wrote:
> On Wed, Mar 03, 2010 at 04:23:08PM +0200, Pauli Nieminen wrote:
>> On Wed, Mar 3, 2010 at 3:33 PM, Jerome Glisse wrote:
>> > On Tue, Mar 02, 2010 at 12:32:54AM +0200, Pauli Nieminen wrote:
>> >> On Sun, Feb 28, 2010 at 11:30 PM, Pauli Nieminen wrote:
>> >> > On AGP systems we might routinely allocate and free uncached or wc
>> >> > memory. Changing a page from cached (wb) to uc or wc is very
>> >> > expensive and involves a lot of flushing. To improve performance
>> >> > this allocator uses a pool of uc/wc pages.
>> >> >
>> >> > Pools are linked lists of pages, ordered so that the first entry is
>> >> > the latest addition to the pool and the last is the oldest page.
>> >> > Old pages are periodically freed from the pools to keep memory use
>> >> > at a minimum.
>> >> >
>> >> > Pools are protected with spinlocks to allow multiple threads to
>> >> > allocate pages simultaneously. Expensive operations must not hold
>> >> > the spinlock, to maximise performance for multiple threads.
>> >> >
>> >> > Based on Jerome Glisse's and Dave Airlie's pool allocator.
>> >> >
>> >> > Signed-off-by: Jerome Glisse
>> >> > Signed-off-by: Dave Airlie
>> >> > Signed-off-by: Pauli Nieminen
>> >
>> > I think it's overdesigned; by trying to be too clever we often end up
>> > with code that is more complex and less efficient in the end. While
>> > the idea of freeing pages only after some amount of time is the way
>> > to go, I think a simpler approach will do a better job.
>> > For instance:
>> >
>> > pool {
>> >     npages // number of pages in the pool
>> >     page *pages[enoughtohold16Morsomethingconfigurable]
>> >     unsigned long next_free_time;
>> >     lock
>> > }
>> >
>> > filllocked(pool, npages)
>> > {
>> >     if (npages > MAXPAGESPERPOOL)
>> >         npages = MAXPAGESPERPOOL;
>> >     for i, npages : allocpage
>> > }
>> >
>> > alloc(apages, npages)
>> > {
>> >     do {
>> >         lock(pool)
>> >         pool->next_free_time = jiffies + timeout;
>> >         if (npages > pool->npages) {
>> >             npages -= pool->npages;
>> >             copypageptr(apages, pool->pages, pool->npages)
>> >             pool->npages = 0;
>> >             fillpool(pool, npages)
>> >         } else {
>> >             pool->npages -= npages;
>> >             copypageptr(apages, pool->pages, npages)
>> >             npages = 0;
>> >         }
>> >         unlock(pool)
>> >     } while (npages)
>> > }
>> >
>> > poolshrinkcallbacktimer()
>> > {
>> >     for i in npools {
>> >         if (trylock(pool[i].lock)) {
>> >             if (timeafter(jiffies, pool[i].nextfreetime)) {
>> >                 free a bunch of pages, for instance 1M at a time
>> >             }
>> >             unlock(pool[i].lock)
>> >         }
>> >     }
>> > }
>>
>> I would need to use a split buffer for arrays that big. My aim was to
>> have a FILO queue of pages so that knowing which pages can be freed
>> now is simple, because the oldest page is always at the head of the
>> list. I know that a FILO is easy to implement on top of a deque, but
>> the requirement to split the underlying array sounded more complex
>> than using a linked list. But as I noted on IRC, requesting N pages
>> from the linked list is doomed to O(N) performance, which is bad. So
>> a better implementation would switch to an array.
>
> The design I outlined above drops the concept of recording a time per
> page, so there is no need for a FILO; you just fill the pool and you
> empty it 1M or less at a time every N msec. This is a lot simpler, with
> no complex splitting or list walking. One simple optimization is to
> make the number of pages we free depend on how full the pool is: if
> there is 16M of memory in the pool, free 1M at a time, but if there is
> only 64K then free one page at a time.
>

I think we should take into account the amount of dynamic memory that
the currently active users require. What if there is ever a case where
the dynamic memory size is 20M? A list-based implementation is easier
to scale to what user space is doing. 20M isn't even much if you think
about how many dynamic allocations would be done by a multiseat system
(driving multiple GPUs at the same time).

I also thought about the array implementation, and it has that O(N)
cost anyway. But improvements that I think would help here:
- Add a shrinker callback instead of using our own timer (rough sketch
  below).
- Simplify the implementation a lot.
- Maybe having a pool for cached pages too would make it a lot simpler,
  since all allocations would use the same code path.
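Roughly what I have in mind for the shrinker hookup is something like
the sketch below. The pool struct, the ttm_page_pool_free() signature
and the shrink callback prototype are only placeholders here; the exact
shrinker interface depends on the kernel version, so treat this as
pseudo-code rather than a finished patch:

#include <linux/mm.h>		/* struct shrinker, DEFAULT_SEEKS */
#include <linux/spinlock.h>
#include <linux/list.h>

#define NUM_POOLS 4

struct ttm_page_pool {
	spinlock_t	lock;
	unsigned	npages;		/* pages currently cached in this pool */
	struct list_head pages;
};

static struct ttm_page_pool pools[NUM_POOLS];

/* From the patch under review; assumed here to free at most nr_free
 * pages from the pool and to return how many it actually freed. */
extern unsigned ttm_page_pool_free(struct ttm_page_pool *pool,
				   unsigned nr_free);

/*
 * Shrinker callback: the VM tells us how many objects it wants dropped.
 * Free that many pages, spread over the pools, and report how many
 * pages are still cached so the VM can call us again if needed.
 */
static int ttm_pool_shrink(int nr_to_scan, gfp_t gfp_mask)
{
	unsigned i, remaining = 0;

	for (i = 0; i < NUM_POOLS; ++i) {
		struct ttm_page_pool *pool = &pools[i];

		if (nr_to_scan > 0)
			nr_to_scan -= ttm_page_pool_free(pool, nr_to_scan);

		spin_lock(&pool->lock);
		remaining += pool->npages;
		spin_unlock(&pool->lock);
	}
	return remaining;
}

static struct shrinker ttm_pool_shrinker = {
	.shrink = ttm_pool_shrink,
	.seeks	= DEFAULT_SEEKS,
};

/* register_shrinker(&ttm_pool_shrinker) when the pools are set up,
 * unregister_shrinker(&ttm_pool_shrinker) when they are torn down. */

The amount freed per call could still be scaled by how full the pool
is, like you suggested, by clamping the count inside ttm_page_pool_free.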
I did originally choose the list implementation for easier scalability,
and I think that is a good reason not to switch to an array-based
implementation. Another point against an array is that it takes some
extra memory.

>> >
>> > General remark: I don't think you handle jiffies wrap around
>> > properly.
>>
>> The only place where jiffies is read is ttm_page_pool_free. There the
>> difference is calculated with a time_after call, so I think it should
>> handle the wrapping.
>>
>
> No, that doesn't work. For instance (assuming jiffies to be 16 bits) a
> page gets its private field set to 0xFFF0; now in ttm_page_pool_free
> jiffies wrap around and are at 0x0, the test will evaluate to false and
> the page will be freed prematurely. I think it would have been better
> to do page->private = jiffies + timeout in ttm_put_pages and
> time_after(jiffies, page->private) in ttm_page_pool_free; that would
> have avoided the wrap around.
>

Yes, now I see the bug. I had the + timeout in the wrong function.

A simpler solution for determining the number of pages to free based on
demand:
- Set an nlowpages variable in the pool to the number of pages in the
  pool.
- After returning pages from the pool, always check if npages is less
  than nlowpages and update nlowpages if needed.

shrinkcallback()
{
	foreach pool in pools {
		lock(pool)
		if (time_elapsed() > min_shrinker_interval) {
			free_pages(nlowpages)
			nlowpages = npages
			last_shrinker_call = jiffies;
		}
		unlock(pool)
	}
}

Guaranteeing a minimum interval is important, or we might free too many
pages, which would have a very bad impact on performance. Maybe
something like 500ms to 1 second would be a good minimum interval.

> Cheers,
> Jerome
>
>> > Also the whole lock dropping in the middle of list traversal looks
>> > very fragile to me. Right now we don't have much concurrency in
>> > radeon, so I don't think you have really tested the effect of using
>> > the allocator concurrently. Anyway, see below for more comments on
>> > the code.
>> >
>> > Btw, thanks for looking into this; this isn't a sexy area but it
>> > definitely needs love.
>> >
>> > Cheers,
>> > Jerome
>> >
>> > page_to_alloc might be negative if pool->npages > count; I think it
>> > can happen because there is lock dropping in the page alloc function
>> > below. Also pages_to_alloc*sizeof(*page) might be too big for
>> > kmalloc and we might need to use vmalloc (this is why I think a
>> > design with a statically allocated table is better).
>>
>> page_to_alloc can't be negative because the lock is held until later
>> in the pool allocator.
>>
>> Locks are only dropped:
>> - when allocation succeeds,
>> - when returning failure,
>> - after entering ttm_pages_pool_fill_locked. Before dropping the lock
>>   the code reserves the pages from the pool on a private list and
>>   sets the pool to the empty state.
>>
>> ttm_pages_pool_fill_locked had two errors in the first version. One of
>> them was a wrong count parameter to kmalloc; the second version fixes
>> it so that we never allocate more than PAGE_SIZE of memory for it. The
>> second error was that pool->npages wasn't set to zero when taking the
>> pages from the pool.
>>
>> Basically fill_pool operates only on private variables while it is
>> outside of the pool locks, which makes the unlocked part reentrant.
>>
>> >
>> > This function is quite complex, especially the failing pool-fill
>> > path, which looks suspicious. The idea of splitting the work is not
>> > to have to do the work in the other function, so if pool filling
>> > fails the alloc should fail and shouldn't try to fill the pool on
>> > its own; this kind of code duplication is a recipe for failure I
>> > think.
>>
>> Yes.
>> Failing path should just return -ENOMEM. I will fix that.
>>
>> >
>> >> While there are no comments yet, I did notice huge errors in
>> >> ttm_pages_pool_fill_locked. But I would still like comments on the
>> >> overall design.
>> >>
>> >> Is it correct to use page->private in the pool allocator?
>> >>
>> >> bo allocation is still too expensive an operation for dma buffers
>> >> in classic Mesa. It takes quite a lot of cpu time in bind memory
>> >> (agp system) and in the ttm_mem_global functions. Would it be
>> >> possible to move some parts of those expensive operations into the
>> >> pool fill and pool free code?
>> >>
>> >> Also, using a single call to allocate all pages for a buffer would
>> >> improve the situation when the pool is not large enough. I can
>> >> change ttm_tt_populate to allocate all pages with a single call if
>> >> that sounds ok.
>> >>
>> >> I will still look at how to integrate the pool allocator with the
>> >> shrinker.
>> >>
>> >> Attached is the fixed full patch and here is the fixed version of
>> >> ttm_pages_pool_fill_locked:
>> >>
>> >