From: Jerome Glisse <glisse@freedesktop.org>
To: Pauli Nieminen <suokkos@gmail.com>
Cc: Jerome Glisse <jglisse@redhat.com>, dri-devel@lists.sourceforge.net
Subject: Re: [RFC] drm/ttm: add pool wc/uc page allocator
Date: Wed, 3 Mar 2010 17:02:02 +0100
Message-ID: <20100303160202.GE2078@barney.localdomain>
In-Reply-To: <548cdfc21003030746v5bfab478g1b6e3203b0e6f8d1@mail.gmail.com>

On Wed, Mar 03, 2010 at 05:46:43PM +0200, Pauli Nieminen wrote:
> On Wed, Mar 3, 2010 at 4:50 PM, Jerome Glisse <glisse@freedesktop.org> wrote:
> > On Wed, Mar 03, 2010 at 04:23:08PM +0200, Pauli Nieminen wrote:
> >> On Wed, Mar 3, 2010 at 3:33 PM, Jerome Glisse <glisse@freedesktop.org> wrote:
> >> > On Tue, Mar 02, 2010 at 12:32:54AM +0200, Pauli Nieminen wrote:
> >> >> On Sun, Feb 28, 2010 at 11:30 PM, Pauli Nieminen <suokkos@gmail.com> wrote:
> >> >> > On AGP systems we might routinely allocate/free uncached or wc memory.
> >> >> > Changing a page from cached (wb) to uc or wc is very expensive and involves
> >> >> > a lot of flushing. To improve performance this allocator uses a pool
> >> >> > of uc/wc pages.
> >> >> >
> >> >> > Pools are linked lists of pages, ordered so that the first entry is the latest
> >> >> > addition to the pool and the last is the oldest page. Old pages are periodically
> >> >> > freed from the pools to keep memory use at a minimum.
> >> >> >
> >> >> > Pools are protected with spinlocks to allow multiple threads to allocate pages
> >> >> > simultaneously. Expensive operations must not hold the spinlock, to maximise
> >> >> > performance for multiple threads.
> >> >> >
> >> >> > Based on Jerome Glisse's and Dave Airlie's pool allocator.
> >> >> >
> >> >> > Signed-off-by: Jerome Glisse <jglisse@redhat.com>
> >> >> > Signed-off-by: Dave Airlie <airlied@redhat.com>
> >> >> > Signed-off-by: Pauli Nieminen <suokkos@gmail.com>
> >> >
> >> > I think it's overdesigned; by trying to be too clever we often end up
> >> > with code that is more complex and less efficient in the end. While the idea
> >> > of freeing pages only after some amount of time is the way to go, I think
> >> > a simpler approach will do a better job. For instance:
> >> >
> >> > pool {
> >> >  npages // npages in pool
> >> >  page *pages[enoughttohold16Morsomethingconfigurable]
> >> >  unsigned long next_free_time;
> >> >  lock
> >> > }
> >> > filllocked(pool,npages)
> >> > {
> >> >        if (npages > MAXPAGESPERPOOL)
> >> >                npages = MAXPAGESPERPOOL;
> >> >        for i, npages : allocpage
> >> > }
> >> > alloc(apages, npages)
> >> > {
> >> >        do {
> >> >        lock(pool)
> >> >        pool->next_free_time = jiffies + timeout;
> >> >        if (npages > pool->npages) {
> >> >                npages -= pool->npages;
> >> >                copypageptr(apages, pool->pages, pool->npages)
> >> >                pool->npages = 0;
> >> >                fillpool(pool, npages)
> >> >        } else {
> >> >                pool->npages -= npages;
> >> >                copypageptr(apages, pool->pages, npages)
> >> >                npages = 0;
> >> >        }
> >> >        unlock(pool)
> >> >        } while (npages)
> >> > }
> >> > poolshrinkcallbacktimer()
> >> > {
> >> >        for i in npools {
> >> >                if (trylock(pool[i].lock) {
> >> >                        if (timeafter(jiffies, pool[i].nextfreetime)) {
> >> >                                free bunch of pages for instance 1M at a time
> >> >                        }
> >> >                        unlock(pool[i].lock)
> >> >                }
> >> >        }
> >> > }
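
For illustration only, a minimal C sketch of such a fixed-array pool could look like the following; the names, sizes and GFP flags here are made up, not from the patch, and the wc/uc caching-attribute handling is only hinted at in comments:

#include <linux/errno.h>
#include <linux/gfp.h>
#include <linux/jiffies.h>
#include <linux/mm.h>
#include <linux/spinlock.h>

/* Illustrative sizes: room for 16M of 4K pages, half-second timeout. */
#define SIMPLE_POOL_MAX_PAGES  4096
#define SIMPLE_POOL_TIMEOUT    (HZ / 2)

struct simple_page_pool {
        spinlock_t lock;
        unsigned npages;              /* pages currently cached in the array */
        unsigned long next_free_time; /* earliest time the shrinker may free */
        struct page *pages[SIMPLE_POOL_MAX_PAGES];
};

/* Hand out npages pages, taking as many as possible from the pool. */
static int simple_pool_alloc(struct simple_page_pool *pool,
                             struct page **apages, unsigned npages)
{
        unsigned got = 0;

        spin_lock(&pool->lock);
        pool->next_free_time = jiffies + SIMPLE_POOL_TIMEOUT;
        while (got < npages && pool->npages)
                apages[got++] = pool->pages[--pool->npages];
        spin_unlock(&pool->lock);

        /* Allocate whatever the pool could not provide outside the lock,
         * since alloc_page() may sleep. A real version would set the
         * caching attributes (wc/uc) here and hand pages back to the
         * pool on failure instead of just bailing out. */
        while (got < npages) {
                struct page *p = alloc_page(GFP_HIGHUSER);

                if (!p)
                        return -ENOMEM;
                apages[got++] = p;
        }
        return 0;
}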
> >>
> >> I would need to use a split buffer for arrays that big. My aim was to
> >> have a FILO queue of pages so that knowing which pages can be freed now is
> >> simple, because the oldest page is always at the head of the list. I
> >> know that a FILO is easy to implement on top of a deque, but the requirement
> >> to split the underlying array sounded more complex than using a linked
> >> list. But as I noted on IRC, requesting N pages from the linked
> >> list is doomed to O(N) performance, which is bad. So a better
> >> implementation would be to switch to an array.
> >
> > The design I outlined above drops the concept of recording a time per
> > page, so there is no need for a FILO; you just fill the pool and you empty
> > it 1M or less at a time every N msec. This is a lot simpler: no complex
> > splitting or list walking. One simple optimization is to make the
> > amount of pages we free depend on how full the pool is, so
> > if there is 16M of memory in the pool, free 1M at a time, but if there
> > is 64K, then free one page at a time.
> >
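
Continuing the illustrative sketch from above, freeing an amount proportional to how full the pool is could look roughly like this (the 1/16 ratio is arbitrary):

#include <linux/kernel.h>       /* for max() */

/* Called periodically (timer or shrinker); trylock so the allocation
 * paths are never stalled. Frees more when the pool is fuller. */
static void simple_pool_shrink(struct simple_page_pool *pool)
{
        unsigned nfree;

        if (!spin_trylock(&pool->lock))
                return;
        if (time_after(jiffies, pool->next_free_time)) {
                /* free roughly 1/16th of the cached pages, at least one */
                nfree = max(pool->npages / 16, 1u);
                while (nfree-- && pool->npages) {
                        /* a real version would switch the page back to
                         * cached (wb) before returning it to the system */
                        __free_page(pool->pages[--pool->npages]);
                }
        }
        spin_unlock(&pool->lock);
}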
> 
> I think we should take into account the amount of dynamic memory that
> the current user activity requires. What if there is ever a case where the
> dynamic memory size is 20M? A list-based implementation is easier to
> scale to what user space is doing.
> 20M isn't even much if you think about how many dynamic allocations would be
> done by a multiseat system (driving multiple GPUs at the same time).
> 

You mean an app constantly allocating and freeing 20M? Even with
multiple GPUs this sounds like broken userspace; note that I don't think
there are dual-AGP motherboards out there, but I tend to forget about such
evil things. Also, I picked 32M randomly; I think the best choice is making
the pool as big as the AGP aperture. Anyway, the bottom line is that we should
not try to think of every possible use case, but rather worry about the most
common sensible ones.

> I also thought about the array implementation, and there is that
> O(N) cost there anyway.
> 
> But improvements that I think would help here:
> - Add the shrinker callback and don't use our own timer.
> - Simplify the implementation a lot.
> - Maybe having a pool for cached pages too would make it a lot simpler,
> since all allocations would use the same code path.
> 
> I originally chose the list implementation for easier scalability and
> I think that is a good reason not to switch to an array-based
> implementation. Another point against an array is that it takes some
> extra memory.

I don't think we want scalability; I think having a static table storing
the pages means we won't steal more memory from the system than the
pool can handle (we are talking about a pool here, not an ocean ;)).
Anyway, if you can provide a simpler implementation with a list, why not,
but the current one really looks too complex and worries me a lot about
corner cases.

Cheers,
Jerome

> >> >
> >> > A general remark: I don't think you handle the jiffies wrap-around properly.
> >>
> >> The only place where jiffies are read is ttm_page_pool_free. There the
> >> difference is calculated with a time_after call, so I think it should
> >> handle the wrapping.
> >>
> >
> > No, it doesn't work. For instance (assuming jiffies were 16 bits), a
> > page gets its private field set to 0xFFF0; now in ttm_page_pool_free the
> > jiffies have wrapped around to 0x0, the test will evaluate to false and the
> > page will be freed prematurely. I think it would have been better to do
> > page->private = jiffies + timeout in ttm_put_pages and
> > time_after(jiffies, page->private) in ttm_page_pool_free; that would
> > have avoided the wrap-around.
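
To make the wrap-safe pattern concrete, something along these lines (the helper names are invented, not from the patch):

#include <linux/jiffies.h>
#include <linux/mm_types.h>

/* Stamp the page with its expiry time when it enters the pool ... */
static inline void pool_page_stamp(struct page *page, unsigned long timeout)
{
        page->private = jiffies + timeout;
}

/* ... and test it with time_after(), which stays correct across a
 * jiffies wrap-around, unlike comparing raw values directly. */
static inline int pool_page_expired(struct page *page)
{
        return time_after(jiffies, page->private);
}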
> >
> 
> Yes, now I see the bug. I had the + timeout in the wrong function.
> 
> A simpler solution for determining the number of pages to free based on demand:
> - Set an nlowpages variable in the pool to the number of pages in the pool.
> - After returning pages from the pool, always check whether npages is less than
> nlowpages and update nlowpages if needed.
> shrinkcallback()
> {
>  foreach pool in pools {
>   lock(pool)
>   if (time_elapsed() > min_shrinker_interval) {
>    free_pages(nlowpages)
>    nlowpages = npages
>    last_shrinker_call = jiffies;
>   }
>  unlock(pool)
>  }
> }
> 
> Guaranteeing a minimum interval is important, or we might free too many
> pages, which would have a very bad impact on performance. Maybe something
> like 500 ms to 1 second would be a good minimum interval.
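
A rough C sketch of that idea, assuming the pool from the earlier sketch grows nlowpages and last_shrink fields (the 500 ms constant is just the suggested lower bound):

#define POOL_MIN_SHRINK_INTERVAL (HZ / 2)       /* ~500 ms, as suggested */

/* nlowpages is the smallest pool size seen since the last shrink pass
 * (updated by the allocation path); last_shrink is when that pass ran. */
static void pool_shrink_cb(struct simple_page_pool *pool)
{
        spin_lock(&pool->lock);
        if (time_after(jiffies, pool->last_shrink + POOL_MIN_SHRINK_INTERVAL)) {
                unsigned nfree = pool->nlowpages; /* pages nobody needed lately */

                while (nfree-- && pool->npages)
                        __free_page(pool->pages[--pool->npages]);
                pool->nlowpages = pool->npages;
                pool->last_shrink = jiffies;
        }
        spin_unlock(&pool->lock);
}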
> 
> > Cheers,
> > Jerome
> >
> >> > Also the whole lock dropping in the middle of list traversal looks very
> >> > fragile to me; right now we don't have much concurrency in radeon, so
> >> > I don't think you have really tested the effect of using the allocator
> >> > concurrently. Anyway, see below for more comments on the code.
> >> >
> >> > Btw, thanks for looking into this; this isn't a sexy area but it
> >> > definitely needs love.
> >> >
> >> > Cheers,
> >> > Jerome
> >> >
> >> >
> >> >
> >> > page_to_alloc might be negative if pool->npages > count; I think it can
> >> > happen because of the lock dropping in the page alloc function below.
> >> > Also, pages_to_alloc*sizeof(*page) might be too big for kmalloc and we
> >> > might need to use vmalloc (this is why I think a design with a statically
> >> > allocated table is better).
> >>
> >> page_to_alloc can't be negative because the lock is held until later in the
> >> pool allocator.
> >>
> >> Locks are only dropped:
> >> - when the allocation succeeds,
> >> - when returning failure,
> >> - after entering ttm_pages_pool_fill_locked; before dropping the lock
> >> there, the code reserves pages from the pool onto a private list and sets
> >> the pool to the empty state.
> >>
> >> ttm_pages_pool_fill_locked had two errors in the first version. One of
> >> them was a wrong count parameter to kmalloc; the second version fixes that so we
> >> never allocate more than PAGE_SIZE of memory. The other error was that
> >> pool->npages wasn't set to zero when taking the pages from the pool.
> >>
> >> Basically fill_pool operates only on private variables when it is
> >> outside of the pool lock, which makes the unlocked part reentrant.
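
Schematically, the fill pattern being described looks something like this; the struct and field names are guesses for illustration, and the real patch may differ:

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical list-based pool, pages chained through page->lru. */
struct list_page_pool {
        spinlock_t lock;
        unsigned npages;
        struct list_head pages;
};

/* Fill path, entered with pool->lock held. */
static void list_pool_fill_locked(struct list_page_pool *pool)
{
        LIST_HEAD(reserved);    /* private list, visible only to this thread */

        /* Reserve everything the pool has and leave it in a consistent,
         * empty state before dropping the lock. */
        list_splice_init(&pool->pages, &reserved);
        pool->npages = 0;
        spin_unlock(&pool->lock);

        /* The expensive work (alloc_page, wc/uc attribute changes) runs
         * here without the lock, touching only the private list. */

        spin_lock(&pool->lock);
        list_splice(&reserved, &pool->pages);
        /* pool->npages would be recomputed here for the merged list. */
}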
> >>
> >> >
> >> >
> >> > This function is quite complex, especially the failing pool-fill path,
> >> > which looks suspicious. The idea of splitting the work is not to have to
> >> > redo it in another function, so if pool filling fails the alloc should fail
> >> > and shouldn't try to fill the pool on its own; this kind of code
> >> > duplication is a recipe for failure, I think.
> >>
> >> Yes, the failing path should just return -ENOMEM. I will fix that.
> >>
> >> >
> >> >>
> >> >> While there are no comments yet, I did notice huge errors in
> >> >> ttm_pages_pool_fill_locked. But I would still like comments on the
> >> >> overall design.
> >> >>
> >> >> Is it correct to use page->private in the pool allocator?
> >> >>
> >> >> bo allocation is still too expensive an operation for dma buffers in
> >> >> classic mesa. It takes quite a lot of cpu time in bind memory (agp
> >> >> systems) and in the ttm_mem_global functions. Would it be possible to move
> >> >> some parts of those expensive operations into the pool fill and pool free
> >> >> code?
> >> >>
> >> >> Also, using a single call to allocate all pages for a buffer would improve
> >> >> the situation when the pool is not large enough. I can change ttm_tt_populate
> >> >> to allocate all pages with a single call if that sounds ok.
> >> >>
> >> >> I will still look at how to integrate the pool allocator with the shrinker.
> >> >>
> >> >>
> >> >> Attached is the fixed full patch, and here is the fixed version of
> >> >> ttm_pages_pool_fill_locked:
> >> >>
> >> >
> 
