* TTM page pool allocator
@ 2009-06-25 12:01 Jerome Glisse
  2009-06-25 15:53 ` Thomas Hellström
  2009-06-26  0:00 ` Dave Airlie
  0 siblings, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2009-06-25 12:01 UTC (permalink / raw)
  To: thomas, Dave Airlie; +Cc: linux-kernel, dri-devel

[-- Attachment #1: Type: text/plain, Size: 233 bytes --]

Hi,

Thomas, I attach a reworked page pool allocator based on Dave's work;
this one should be OK with TTM cache status tracking. It definitely
helps on AGP systems; now the bottleneck is in Mesa's vertex DMA
allocation.

Cheers,
Jerome

[-- Attachment #2: 0001-ttm-add-pool-wc-uc-page-allocator.patch --]
[-- Type: text/x-patch, Size: 14672 bytes --]

From d2581e2a97e9c694fd1238f1a5652e1f385b5636 Mon Sep 17 00:00:00 2001
From: Dave Airlie <airlied@redhat.com>
Date: Wed, 24 Jun 2009 16:31:43 +1000
Subject: [PATCH] ttm: add pool wc/uc page allocator

On AGP systems we routinely allocate and free uncached (UC) or
write-combined (WC) memory. Changing a page from cached (WB) to UC or
WC is very expensive and involves a lot of flushing. To improve
performance this allocator uses a pool of UC/WC pages.

Currently each pool (WC, UC) is 256 pages big; an improvement would be
to tweak this according to memory pressure so we can give memory back
to the system.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/ttm/Makefile         |    2 +-
 drivers/gpu/drm/ttm/ttm_memory.c     |    3 +
 drivers/gpu/drm/ttm/ttm_page_alloc.c |  255 ++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/ttm/ttm_page_alloc.h |   43 ++++++
 drivers/gpu/drm/ttm/ttm_tt.c         |   42 +++++-
 include/drm/ttm/ttm_bo_driver.h      |    6 +
 6 files changed, 342 insertions(+), 9 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_page_alloc.c
 create mode 100644 drivers/gpu/drm/ttm/ttm_page_alloc.h

diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
index b0a9de7..93e002c 100644
--- a/drivers/gpu/drm/ttm/Makefile
+++ b/drivers/gpu/drm/ttm/Makefile
@@ -3,6 +3,6 @@
 
 ccflags-y := -Iinclude/drm
 ttm-y := ttm_agp_backend.o ttm_memory.o ttm_tt.o ttm_bo.o \
-	ttm_bo_util.o ttm_bo_vm.o ttm_module.o ttm_global.o
+	ttm_bo_util.o ttm_bo_vm.o ttm_module.o ttm_global.o ttm_page_alloc.o
 
 obj-$(CONFIG_DRM_TTM) += ttm.o
diff --git a/drivers/gpu/drm/ttm/ttm_memory.c b/drivers/gpu/drm/ttm/ttm_memory.c
index 87323d4..6da4a08 100644
--- a/drivers/gpu/drm/ttm/ttm_memory.c
+++ b/drivers/gpu/drm/ttm/ttm_memory.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 
+#include "ttm_page_alloc.h"
 #define TTM_PFX "[TTM] "
 #define TTM_MEMORY_ALLOC_RETRIES 4
 
@@ -124,6 +125,7 @@ int ttm_mem_global_init(struct ttm_mem_global *glob)
 	printk(KERN_INFO TTM_PFX "TTM available object memory: %llu MiB\n",
 	       glob->max_memory >> 20);
 
+	ttm_page_alloc_init();
 	return 0;
 }
 EXPORT_SYMBOL(ttm_mem_global_init);
@@ -135,6 +137,7 @@ void ttm_mem_global_release(struct ttm_mem_global *glob)
 	flush_workqueue(glob->swap_queue);
 	destroy_workqueue(glob->swap_queue);
 	glob->swap_queue = NULL;
+	ttm_page_alloc_fini();
 }
 EXPORT_SYMBOL(ttm_mem_global_release);
 
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
new file mode 100644
index 0000000..f841637
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -0,0 +1,255 @@
+/*
+ * Copyright (c) Red Hat Inc.
+
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sub license,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Dave Airlie <airlied@redhat.com>
+ */
+
+/* simple list based uncached page allocator
+ * - Add chunks of 1MB to the allocator at a time.
+ * - Use page->lru to keep a free list
+ * - doesn't track currently in use pages
+ *
+ *  TODO: Add shrinker support
+ */
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/mm_types.h>
+
+#include <asm/agp.h>
+#include "ttm/ttm_bo_driver.h"
+#include "ttm_page_alloc.h"
+
+static struct list_head wc_free_list;
+static struct list_head uc_free_list;
+
+static struct mutex page_alloc_mutex;
+static int page_alloc_inited;
+
+struct ttm_page_alloc_usage ttm_page_alloc_data;
+
+/* add 1MB at a time */
+#define NUM_PAGES_TO_ADD 256
+
+static inline void ttm_page_put(struct page *page, bool setwb)
+{
+#ifdef CONFIG_X86
+	if (setwb && !PageHighMem(page)) {
+		set_memory_wb((unsigned long)page_address(page), 1);
+	}
+#else
+	if (setwb) {
+		/* This is a generic interface on non x86 to handle
+		 * wc/uc page */
+		unmap_page_from_agp(page);
+	}
+#endif
+	put_page(page);
+	__free_page(page);
+}
+
+static int ttm_add_pages_locked(int num_pages, int flags)
+{
+	struct page *page;
+	int gfp_flags = GFP_HIGHUSER;
+	int i, cflag;
+
+	cflag = (flags & TTM_PAGE_FLAG_CACHE_MASK) >> TTM_PAGE_FLAG_CACHE_SHIFT;
+	if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)
+		gfp_flags |= __GFP_ZERO;
+	if (flags & TTM_PAGE_FLAG_DMA32)
+		gfp_flags |= GFP_DMA32;
+	switch (cflag) {
+	case TTM_PAGE_FLAG_CACHE_UC:
+		for (i = 0; i < num_pages; i++) {
+			page = alloc_page(gfp_flags);
+			if (!page) {
+				printk(KERN_ERR "unable to get page %d\n", i);
+				return i;
+			}
+			get_page(page);
+#ifdef CONFIG_X86
+			if (!PageHighMem(page))
+				set_memory_uc((unsigned long)page_address(page), 1);
+#else
+			map_page_into_agp(page);
+#endif
+			ttm_tt_cache_flush(&page, 1);
+			list_add(&page->lru, &uc_free_list);
+			ttm_page_alloc_data.total_uc_pages++;
+			ttm_page_alloc_data.uc_pages_in_list++;
+		}
+		break;
+	case TTM_PAGE_FLAG_CACHE_WC:
+		for (i = 0; i < num_pages; i++) {
+			page = alloc_page(gfp_flags);
+			if (!page) {
+				printk(KERN_ERR "unable to get page %d\n", i);
+				return i;
+			}
+			get_page(page);
+#ifdef CONFIG_X86
+			if (!PageHighMem(page))
+				set_memory_wc((unsigned long)page_address(page), 1);
+#else
+			map_page_into_agp(page);
+#endif
+			ttm_tt_cache_flush(&page, 1);
+			list_add(&page->lru, &wc_free_list);
+			ttm_page_alloc_data.total_wc_pages++;
+			ttm_page_alloc_data.wc_pages_in_list++;
+		}
+		break;
+	default:
+		printk(KERN_ERR "Wrong caching flags %d\n", cflag);
+		return 0;
+	}
+	return i;
+}
+
+struct page *ttm_get_page(int flags)
+{
+	struct page *page = NULL;
+	int ret;
+	struct list_head *free_list;
+	int *pages_in_list;
+	int cflag;
+	int gfp_flags = GFP_HIGHUSER;
+
+	if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)
+		gfp_flags |= __GFP_ZERO;
+	if (flags & TTM_PAGE_FLAG_DMA32)
+		gfp_flags |= GFP_DMA32;
+	cflag = (flags & TTM_PAGE_FLAG_CACHE_MASK) >> TTM_PAGE_FLAG_CACHE_SHIFT;
+	switch (cflag) {
+	case TTM_PAGE_FLAG_CACHE_UC:
+		free_list = &uc_free_list;
+		pages_in_list = &ttm_page_alloc_data.uc_pages_in_list;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WC:
+		free_list = &wc_free_list;
+		pages_in_list = &ttm_page_alloc_data.wc_pages_in_list;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WB:
+	default:
+		page = alloc_page(gfp_flags);
+		if (!page) {
+			return NULL;
+		}
+		get_page(page);
+		return page;
+	}
+
+	mutex_lock(&page_alloc_mutex);
+	if (list_empty(free_list)) {
+		ret = ttm_add_pages_locked(NUM_PAGES_TO_ADD, flags);
+		if (ret == 0) {
+			mutex_unlock(&page_alloc_mutex);
+			return NULL;
+		}
+	}
+
+	page = list_first_entry(free_list, struct page, lru);
+	list_del(&page->lru);
+	(*pages_in_list)--;
+	mutex_unlock(&page_alloc_mutex);
+
+	return page;
+}
+
+void ttm_put_page(struct page *page, int flags)
+{
+	struct list_head *free_list;
+	int *pages_in_list;
+	int cflag;
+	bool setwb;
+
+	cflag = (flags & TTM_PAGE_FLAG_CACHE_MASK) >> TTM_PAGE_FLAG_CACHE_SHIFT;
+	switch (cflag) {
+	case TTM_PAGE_FLAG_CACHE_UC:
+		free_list = &uc_free_list;
+		pages_in_list = &ttm_page_alloc_data.uc_pages_in_list;
+		setwb = true;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WC:
+		free_list = &wc_free_list;
+		pages_in_list = &ttm_page_alloc_data.wc_pages_in_list;
+		setwb = true;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WB:
+	default:
+		put_page(page);
+		__free_page(page);
+		return;
+	}
+
+	mutex_lock(&page_alloc_mutex);
+	if ((*pages_in_list) > NUM_PAGES_TO_ADD) {
+		ttm_page_put(page, setwb);
+		mutex_unlock(&page_alloc_mutex);
+		return;
+	}
+	list_add(&page->lru, free_list);
+	(*pages_in_list)++;
+	mutex_unlock(&page_alloc_mutex);
+}
+
+void ttm_release_all_pages(struct list_head *free_list, bool setwb)
+{
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, free_list, lru) {
+		list_del(&page->lru);
+		ttm_page_put(page, setwb);
+	}
+}
+
+int ttm_page_alloc_init(void)
+{
+	if (page_alloc_inited)
+		return 0;
+
+	INIT_LIST_HEAD(&wc_free_list);
+	INIT_LIST_HEAD(&uc_free_list);
+	ttm_page_alloc_data.total_uc_pages = 0;
+	ttm_page_alloc_data.total_wc_pages = 0;
+	ttm_page_alloc_data.uc_pages_in_list = 0;
+	ttm_page_alloc_data.wc_pages_in_list = 0;
+
+	mutex_init(&page_alloc_mutex);
+	page_alloc_inited = 1;
+	return 0;
+}
+
+void ttm_page_alloc_fini(void)
+{
+	if (!page_alloc_inited)
+		return;
+
+	ttm_release_all_pages(&wc_free_list, true);
+	ttm_release_all_pages(&uc_free_list, true);
+	ttm_page_alloc_data.total_uc_pages = 0;
+	ttm_page_alloc_data.total_wc_pages = 0;
+	ttm_page_alloc_data.uc_pages_in_list = 0;
+	ttm_page_alloc_data.wc_pages_in_list = 0;
+	page_alloc_inited = 0;
+}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.h b/drivers/gpu/drm/ttm/ttm_page_alloc.h
new file mode 100644
index 0000000..1e103ba
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.h
@@ -0,0 +1,43 @@
+/*
+ * Copyright (c) Red Hat Inc.
+
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sub license,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Dave Airlie <airlied@redhat.com>
+ */
+
+#ifndef TTM_PAGE_ALLOC
+#define TTM_PAGE_ALLOC
+
+struct ttm_page_alloc_usage {
+	int total_uc_pages;
+	int total_wc_pages;
+	int uc_pages_in_list;
+	int wc_pages_in_list;
+};
+
+extern struct ttm_page_alloc_usage ttm_page_alloc_data;
+
+void ttm_put_page(struct page *page, int flags);
+struct page *ttm_get_page(int flags);
+int ttm_page_alloc_init(void);
+void ttm_page_alloc_fini(void);
+
+#endif
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 0331fa7..0df3fa3 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -38,6 +38,7 @@
 #include "ttm/ttm_module.h"
 #include "ttm/ttm_bo_driver.h"
 #include "ttm/ttm_placement.h"
+#include "ttm_page_alloc.h"
 
 static int ttm_tt_swapin(struct ttm_tt *ttm);
 
@@ -132,10 +133,10 @@ static void ttm_tt_free_page_directory(struct ttm_tt *ttm)
 
 static struct page *ttm_tt_alloc_page(unsigned page_flags)
 {
-	if (page_flags & TTM_PAGE_FLAG_ZERO_ALLOC)
-		return alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+	struct page *page;
 
-	return alloc_page(GFP_HIGHUSER);
+	page = ttm_get_page(page_flags);
+	return page;
 }
 
 static void ttm_tt_free_user_pages(struct ttm_tt *ttm)
@@ -180,10 +181,23 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)
 	struct page *p;
 	struct ttm_bo_device *bdev = ttm->bdev;
 	struct ttm_mem_global *mem_glob = bdev->mem_glob;
+	unsigned cache_flag;
 	int ret;
 
+	ttm->page_flags &= ~TTM_PAGE_FLAG_CACHE_MASK; 
+	switch (ttm->caching_state) {
+	case tt_uncached:
+		cache_flag = TTM_PAGE_FLAG_CACHE_UC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	case tt_wc:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	default:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WB << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	}
 	while (NULL == (p = ttm->pages[index])) {
-		p = ttm_tt_alloc_page(ttm->page_flags);
+		p = ttm_tt_alloc_page(ttm->page_flags | cache_flag);
 
 		if (!p)
 			return NULL;
@@ -344,21 +358,33 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm)
 	int i;
 	struct page *cur_page;
 	struct ttm_backend *be = ttm->be;
+	unsigned cache_flag;
 
 	if (be)
 		be->func->clear(be);
-	(void)ttm_tt_set_caching(ttm, tt_cached);
+	switch (ttm->caching_state) {
+	case tt_uncached:
+		cache_flag = TTM_PAGE_FLAG_CACHE_UC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	case tt_wc:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	default:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WB << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	}
 	for (i = 0; i < ttm->num_pages; ++i) {
 		cur_page = ttm->pages[i];
 		ttm->pages[i] = NULL;
 		if (cur_page) {
-			if (page_count(cur_page) != 1)
+			if (page_count(cur_page) != 2)
 				printk(KERN_ERR TTM_PFX
 				       "Erroneous page count. "
-				       "Leaking pages.\n");
+				       "Leaking pages (%d).\n",
+				       page_count(cur_page));
+			ttm_put_page(cur_page, cache_flag);
 			ttm_mem_global_free(ttm->bdev->mem_glob, PAGE_SIZE,
 					    PageHighMem(cur_page));
-			__free_page(cur_page);
 		}
 	}
 	ttm->state = tt_unpopulated;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 62ed733..0dac8a8 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -121,6 +121,12 @@ struct ttm_backend {
 #define TTM_PAGE_FLAG_SWAPPED         (1 << 4)
 #define TTM_PAGE_FLAG_PERSISTANT_SWAP (1 << 5)
 #define TTM_PAGE_FLAG_ZERO_ALLOC      (1 << 6)
+#define TTM_PAGE_FLAG_DMA32           (1 << 7)
+#define TTM_PAGE_FLAG_CACHE_MASK      (3 << 8)
+#define TTM_PAGE_FLAG_CACHE_SHIFT     8
+#define TTM_PAGE_FLAG_CACHE_UC        0
+#define TTM_PAGE_FLAG_CACHE_WC        1
+#define TTM_PAGE_FLAG_CACHE_WB        2
 
 enum ttm_caching_state {
 	tt_uncached,
-- 
1.6.2.2
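The get/put pair in the patch above boils down to a simple batched
free-list policy. A user-space sketch of that policy (all names here
are illustrative, malloc() stands in for alloc_page(), and no
cache-attribute changes or locking are modeled):

```c
/* User-space sketch of the patch's pool policy: a free list refilled in
 * batches of POOL_BATCH when empty, and capped at POOL_BATCH on free.
 * Illustrative only -- not the kernel API. */
#include <assert.h>
#include <stdlib.h>

#define POOL_BATCH 256          /* mirrors NUM_PAGES_TO_ADD */

struct fake_page {
    struct fake_page *next;     /* stands in for page->lru linkage */
};

struct pool {
    struct fake_page *head;
    int in_list;                /* pages currently on the free list */
    int total;                  /* pages ever converted for this pool */
};

static int pool_refill(struct pool *p, int n)
{
    int i;

    for (i = 0; i < n; i++) {
        struct fake_page *pg = malloc(sizeof(*pg));
        if (!pg)
            return i;           /* partial refill, like ttm_add_pages_locked */
        pg->next = p->head;
        p->head = pg;
        p->in_list++;
        p->total++;
    }
    return i;
}

static struct fake_page *pool_get(struct pool *p)
{
    struct fake_page *pg;

    if (!p->head && pool_refill(p, POOL_BATCH) == 0)
        return NULL;            /* could not add any page at all */
    pg = p->head;
    p->head = pg->next;
    p->in_list--;
    return pg;
}

static void pool_put(struct pool *p, struct fake_page *pg)
{
    if (p->in_list > POOL_BATCH) {  /* pool full: really free it */
        free(pg);
        return;
    }
    pg->next = p->head;
    p->head = pg;
    p->in_list++;
}
```

The key property is that pages cross the expensive cached/uncached
boundary only when a batch is added, never per allocation.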



* Re: TTM page pool allocator
  2009-06-25 12:01 TTM page pool allocator Jerome Glisse
@ 2009-06-25 15:53 ` Thomas Hellström
  2009-07-21 17:34   ` Jerome Glisse
  2009-06-26  0:00 ` Dave Airlie
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-06-25 15:53 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Dave Airlie, linux-kernel, dri-devel

Jerome Glisse wrote:
> Hi,
>
> Thomas i attach a reworked page pool allocator based on Dave works,
> this one should be ok with ttm cache status tracking. It definitely
> helps on AGP system, now the bottleneck is in mesa vertex's dma
> allocation.
>
> Cheers,
> Jerome
>   
Hi, Jerome!
In general it looks very good. Some things that need fixing:

1) We must have a way to hand back pages. I still don't quite
understand how the shrink callbacks work and whether they are
applicable. Another scheme would be to free, say, 1MB when we have at
least 2MB available.

2) We should avoid including AGP headers if AGP is not configured.
Either reimplement unmap_page_from_agp() and map_page_into_agp() or
move them out of the AGP headers. We've had complaints before from
people with AGP-free systems that the code doesn't compile.

3) Since we're allocating (and freeing) in batches we should use the 
set_pages_array() interface to avoid a global tlb flush per page.

4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
it will always allocate cached pages and then transition them.

5) Use TTM_PFX in printouts.

/Thomas
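The hand-back scheme proposed in point 1 (free about 1MB once at least
2MB sits in a pool) amounts to a simple high-water check. A sketch with
made-up thresholds, assuming 4KB pages:

```c
/* Sketch of a hand-back policy: once a pool holds at least HIGH_WATER
 * pages (~2MB), tell the caller to release SHRINK_COUNT pages (~1MB)
 * back to the system. Thresholds are illustrative, not from the patch. */
#include <assert.h>

#define PAGE_SZ      4096
#define HIGH_WATER   (2 * 1024 * 1024 / PAGE_SZ)   /* 512 pages ~ 2MB */
#define SHRINK_COUNT (1 * 1024 * 1024 / PAGE_SZ)   /* 256 pages ~ 1MB */

/* Returns how many pages the caller should free from the pool. */
static int pool_pages_to_shrink(int pages_in_list)
{
    if (pages_in_list >= HIGH_WATER)
        return SHRINK_COUNT;
    return 0;
}
```

A real shrinker callback would additionally react to system-wide memory
pressure rather than only to the pool's own size.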





* Re: TTM page pool allocator
  2009-06-25 12:01 TTM page pool allocator Jerome Glisse
  2009-06-25 15:53 ` Thomas Hellström
@ 2009-06-26  0:00 ` Dave Airlie
  2009-06-26  6:31   ` Thomas Hellström
                     ` (2 more replies)
  1 sibling, 3 replies; 29+ messages in thread
From: Dave Airlie @ 2009-06-26  0:00 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: thomas, linux-kernel, dri-devel

On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org> wrote:
> Hi,
>
> Thomas i attach a reworked page pool allocator based on Dave works,
> this one should be ok with ttm cache status tracking. It definitely
> helps on AGP system, now the bottleneck is in mesa vertex's dma
> allocation.
>

My original version kept a list of WB pages as well; this proved to be
quite a useful optimisation on my test systems when I implemented it.
Without it I was spending ~20% of my CPU getting free pages, granted I
always used WB pages on PCIE/IGP systems.

Another optimisation I made at the time was around the populate call
(not sure if this is still what happens):

Allocate a 64K local BO for DMA object.
Write into the first 5 pages from userspace - get WB pages.
Bind to GART, swap those 5 pages to WC + flush.
Then populate the rest with WC pages from the list.

Granted I think allocating WC in the first place from the pool might
work just as well since most of the DMA buffers are write only.

Dave.
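The populate flow Dave describes can be modeled as a per-page state
flip at bind time: only the pages userspace touched while still WB need
a cache transition, the rest come from the WC pool already converted. A
toy model (states and counts are illustrative only):

```c
/* Toy model of the lazy populate flow for a 64KB BO (16 pages): pages
 * written before the GART bind start out WB and must be flipped to WC
 * at bind time; pages taken from the WC pool need no transition. The
 * flip stands in for set_memory_wc() plus a cache flush. */
#include <assert.h>

enum cache_state { WB, WC };

/* Returns the number of per-page cache transitions the bind performs. */
static int bind_to_gart(enum cache_state pages[], int n)
{
    int flips = 0, i;

    for (i = 0; i < n; i++) {
        if (pages[i] == WB) {   /* touched from userspace before bind */
            pages[i] = WC;      /* models set_memory_wc() + flush */
            flips++;
        }
    }
    return flips;
}
```

With 5 of 16 pages written before binding, only 5 transitions happen at
bind time instead of 16 up front; allocating WC from the pool in the
first place drops even those.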


* Re: TTM page pool allocator
  2009-06-26  0:00 ` Dave Airlie
@ 2009-06-26  6:31   ` Thomas Hellström
  2009-06-26  7:33     ` Jerome Glisse
  2009-06-26  7:31   ` Jerome Glisse
  2009-06-26 13:59   ` Jerome Glisse
  2 siblings, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-06-26  6:31 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Jerome Glisse, linux-kernel, dri-devel

Dave Airlie wrote:
> On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org> wrote:
>   
>> Hi,
>>
>> Thomas i attach a reworked page pool allocator based on Dave works,
>> this one should be ok with ttm cache status tracking. It definitely
>> helps on AGP system, now the bottleneck is in mesa vertex's dma
>> allocation.
>>
>>     
>
> My original version kept a list of wb pages as well, this proved to be
> quite a useful
> optimisation on my test systems when I implemented it, without it I
> was spending ~20%
> of my CPU in getting free pages, granted I always used WB pages on
> PCIE/IGP systems.
>
> Another optimisation I made at the time was around the populate call,
> (not sure if this
> is what still happens):
>
> Allocate a 64K local BO for DMA object.
> Write into the first 5 pages from userspace - get WB pages.
> Bind to GART, swap those 5 pages to WC + flush.
> Then populate the rest with WC pages from the list.
> Granted I think allocating WC in the first place from the pool might
> work just as well since most of the DMA buffers are write only.
>   

Yes, I think in the latter case the user-space driver should take care 
to specify WC from the beginning, when the BO is allocated.

BTW is there any DMA buffer verification taking place on WC buffers on 
Radeon?

> Dave.
>   

/Thomas



* Re: TTM page pool allocator
  2009-06-26  0:00 ` Dave Airlie
  2009-06-26  6:31   ` Thomas Hellström
@ 2009-06-26  7:31   ` Jerome Glisse
  2009-06-26  7:38     ` Dave Airlie
  2009-06-26 13:59   ` Jerome Glisse
  2 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-06-26  7:31 UTC (permalink / raw)
  To: Dave Airlie; +Cc: thomas, linux-kernel, dri-devel

On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
> On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org> wrote:
> > Hi,
> >
> > Thomas i attach a reworked page pool allocator based on Dave works,
> > this one should be ok with ttm cache status tracking. It definitely
> > helps on AGP system, now the bottleneck is in mesa vertex's dma
> > allocation.
> >
> 
> My original version kept a list of wb pages as well, this proved to be
> quite a useful
> optimisation on my test systems when I implemented it, without it I
> was spending ~20%
> of my CPU in getting free pages, granted I always used WB pages on
> PCIE/IGP systems.
> 
> Another optimisation I made at the time was around the populate call,
> (not sure if this
> is what still happens):
> 
> Allocate a 64K local BO for DMA object.
> Write into the first 5 pages from userspace - get WB pages.
> Bind to GART, swap those 5 pages to WC + flush.
> Then populate the rest with WC pages from the list.
> 
> Granted I think allocating WC in the first place from the pool might
> work just as well since most of the DMA buffers are write only.
> 
> Dave.
> 

I think it's better to fix userspace not to allocate as many buffers
per frame as it does now rather than having a pool of WB pages. I
removed it because on my 64M box memory is getting tight; we need to
compute the number of pages we keep based on available memory. Also I
think it's OK to assume that page allocation is fast enough.

I am reworking the patch with Thomas's latest comments; I will post a
new one after a bit of testing.

Cheers,
Jerome
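Sizing the pool from available memory, as suggested above, could look
like the following sketch; the 1/64 fraction and both clamps are
invented for illustration, not taken from any posted patch:

```c
/* Hypothetical pool sizing: cap the pool at a small fraction of total
 * RAM, clamped to a sane range, instead of a fixed 256 pages. */
#include <assert.h>

#define PAGE_SZ   4096UL
#define POOL_MIN  64UL          /* never shrink below 256KB */
#define POOL_MAX  1024UL        /* never grow past 4MB */

static unsigned long pool_size_pages(unsigned long total_ram_bytes)
{
    unsigned long pages = total_ram_bytes / 64 / PAGE_SZ;

    if (pages < POOL_MIN)
        return POOL_MIN;
    if (pages > POOL_MAX)
        return POOL_MAX;
    return pages;
}
```

On a 64MB box this fraction happens to give the patch's 256 pages,
while a 1GB box would be clamped to the 4MB ceiling.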



* Re: TTM page pool allocator
  2009-06-26  6:31   ` Thomas Hellström
@ 2009-06-26  7:33     ` Jerome Glisse
  0 siblings, 0 replies; 29+ messages in thread
From: Jerome Glisse @ 2009-06-26  7:33 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: Dave Airlie, linux-kernel, dri-devel

On Fri, 2009-06-26 at 08:31 +0200, Thomas Hellström wrote:
Dave Airlie wrote:
> > On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org> wrote:
> >   
> >> Hi,
> >>
> >> Thomas i attach a reworked page pool allocator based on Dave works,
> >> this one should be ok with ttm cache status tracking. It definitely
> >> helps on AGP system, now the bottleneck is in mesa vertex's dma
> >> allocation.
> >>
> >>     
> >
> > My original version kept a list of wb pages as well, this proved to be
> > quite a useful
> > optimisation on my test systems when I implemented it, without it I
> > was spending ~20%
> > of my CPU in getting free pages, granted I always used WB pages on
> > PCIE/IGP systems.
> >
> > Another optimisation I made at the time was around the populate call,
> > (not sure if this
> > is what still happens):
> >
> > Allocate a 64K local BO for DMA object.
> > Write into the first 5 pages from userspace - get WB pages.
> > Bind to GART, swap those 5 pages to WC + flush.
> > Then populate the rest with WC pages from the list.
> > Granted I think allocating WC in the first place from the pool might
> > work just as well since most of the DMA buffers are write only.
> >   
> 
> Yes, I think in the latter case the user-space driver should take care 
> to specify WC from the beginning, when the BO is allocated.
> 
> BTW is there any DMA buffer verification taking place on WC buffers on 
> Radeon?


Command buffers submitted from userspace are never BO objects; they are
normal memory, so they are cached. The kernel copies their contents
into a BO which has the proper cache status.

Cheers,
Jerome



* Re: TTM page pool allocator
  2009-06-26  7:31   ` Jerome Glisse
@ 2009-06-26  7:38     ` Dave Airlie
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Airlie @ 2009-06-26  7:38 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Dave Airlie, linux-kernel, dri-devel

> 
> I think it's better to fix userspace to not allocate as much buffer per
> frame as it does now rather than having a pool of wb pages, i removed
> it because on my 64M box memory is getting tight, we need to compute
> the number of page we still based on memory. Also i think it's ok
> to assume that page allocation is fast enough.

Yup, we could try decreasing the DMA buffer size. I think the DMA
buffer would be a lot better suited to suballocation; doing individual
BO allocs like the code used to was a major hit. Assuming page alloc is
fast enough isn't valid, since I've already proved it was a major CPU
user when I fixed it the first time :)

The problem with making the DMA buffer smaller is having a buffer
mapped and sent to the kernel for relocs.

Dave.
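The suballocation Dave describes is typically a bump allocator carving
small aligned chunks out of one mapped BO instead of allocating a BO
per draw. A minimal sketch (hypothetical names; GPU fencing before
reset is not shown):

```c
/* Minimal bump suballocator over one backing buffer. Illustrative
 * only: a real one must wait for the GPU before recycling the range. */
#include <assert.h>
#include <stddef.h>

struct suballoc {
    size_t size;     /* total bytes in the backing BO */
    size_t offset;   /* current bump pointer */
};

/* Returns the offset of an aligned chunk, or (size_t)-1 when full.
 * align must be a power of two. */
static size_t suballoc_alloc(struct suballoc *sa, size_t bytes, size_t align)
{
    size_t off = (sa->offset + align - 1) & ~(align - 1);

    if (off + bytes > sa->size)
        return (size_t)-1;
    sa->offset = off + bytes;
    return off;
}

static void suballoc_reset(struct suballoc *sa)
{
    sa->offset = 0;  /* safe only once the GPU has consumed the buffer */
}
```

This keeps one mapping and one reloc target alive across many small
allocations, which is the cost Dave wants to avoid paying per BO.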


* Re: TTM page pool allocator
  2009-06-26  0:00 ` Dave Airlie
  2009-06-26  6:31   ` Thomas Hellström
  2009-06-26  7:31   ` Jerome Glisse
@ 2009-06-26 13:59   ` Jerome Glisse
  2009-06-29 21:12     ` Thomas Hellström
  2 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-06-26 13:59 UTC (permalink / raw)
  To: Dave Airlie; +Cc: thomas, linux-kernel, dri-devel

[-- Attachment #1: Type: text/plain, Size: 1228 bytes --]

On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
> On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org> wrote:
> > Hi,
> >
> > Thomas i attach a reworked page pool allocator based on Dave works,
> > this one should be ok with ttm cache status tracking. It definitely
> > helps on AGP system, now the bottleneck is in mesa vertex's dma
> > allocation.
> >
> 
> My original version kept a list of wb pages as well, this proved to be
> quite a useful
> optimisation on my test systems when I implemented it, without it I
> was spending ~20%
> of my CPU in getting free pages, granted I always used WB pages on
> PCIE/IGP systems.
> 
> Another optimisation I made at the time was around the populate call,
> (not sure if this
> is what still happens):
> 
> Allocate a 64K local BO for DMA object.
> Write into the first 5 pages from userspace - get WB pages.
> Bind to GART, swap those 5 pages to WC + flush.
> Then populate the rest with WC pages from the list.
> 
> Granted I think allocating WC in the first place from the pool might
> work just as well since most of the DMA buffers are write only.
> 
> Dave.
> --

Attached is a new version of the patch, which integrates the changes
discussed.

Cheers,
Jerome

[-- Attachment #2: 0001-ttm-add-pool-wc-uc-page-allocator.patch --]
[-- Type: text/x-patch, Size: 17578 bytes --]

From 9083e5466a7b81edcc989ac22d6ccdebef63116d Mon Sep 17 00:00:00 2001
From: Dave Airlie <airlied@redhat.com>
Date: Wed, 24 Jun 2009 16:31:43 +1000
Subject: [PATCH] ttm: add pool wc/uc page allocator

On AGP systems we routinely allocate and free uncached (UC) or
write-combined (WC) memory. Changing a page from cached (WB) to UC or
WC is very expensive and involves a lot of flushing. To improve
performance this allocator uses a pool of UC/WC pages.

Currently each pool (WC, UC) is 256 pages big; an improvement would be
to tweak this according to memory pressure so we can give memory back
to the system.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/ttm/Makefile         |    2 +-
 drivers/gpu/drm/ttm/ttm_memory.c     |    3 +
 drivers/gpu/drm/ttm/ttm_page_alloc.c |  352 ++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/ttm/ttm_page_alloc.h |   34 ++++
 drivers/gpu/drm/ttm/ttm_tt.c         |   50 ++++--
 include/drm/ttm/ttm_bo_driver.h      |    6 +
 6 files changed, 434 insertions(+), 13 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_page_alloc.c
 create mode 100644 drivers/gpu/drm/ttm/ttm_page_alloc.h

diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
index b0a9de7..93e002c 100644
--- a/drivers/gpu/drm/ttm/Makefile
+++ b/drivers/gpu/drm/ttm/Makefile
@@ -3,6 +3,6 @@
 
 ccflags-y := -Iinclude/drm
 ttm-y := ttm_agp_backend.o ttm_memory.o ttm_tt.o ttm_bo.o \
-	ttm_bo_util.o ttm_bo_vm.o ttm_module.o ttm_global.o
+	ttm_bo_util.o ttm_bo_vm.o ttm_module.o ttm_global.o ttm_page_alloc.o
 
 obj-$(CONFIG_DRM_TTM) += ttm.o
diff --git a/drivers/gpu/drm/ttm/ttm_memory.c b/drivers/gpu/drm/ttm/ttm_memory.c
index 87323d4..6da4a08 100644
--- a/drivers/gpu/drm/ttm/ttm_memory.c
+++ b/drivers/gpu/drm/ttm/ttm_memory.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 
+#include "ttm_page_alloc.h"
 #define TTM_PFX "[TTM] "
 #define TTM_MEMORY_ALLOC_RETRIES 4
 
@@ -124,6 +125,7 @@ int ttm_mem_global_init(struct ttm_mem_global *glob)
 	printk(KERN_INFO TTM_PFX "TTM available object memory: %llu MiB\n",
 	       glob->max_memory >> 20);
 
+	ttm_page_alloc_init();
 	return 0;
 }
 EXPORT_SYMBOL(ttm_mem_global_init);
@@ -135,6 +137,7 @@ void ttm_mem_global_release(struct ttm_mem_global *glob)
 	flush_workqueue(glob->swap_queue);
 	destroy_workqueue(glob->swap_queue);
 	glob->swap_queue = NULL;
+	ttm_page_alloc_fini();
 }
 EXPORT_SYMBOL(ttm_mem_global_release);
 
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
new file mode 100644
index 0000000..860541f
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -0,0 +1,352 @@
+/*
+ * Copyright (c) Red Hat Inc.
+
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sub license,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Dave Airlie <airlied@redhat.com>
+ */
+
+/* simple list based uncached page allocator
+ * - Add chunks of 1MB to the allocator at a time.
+ * - Use page->lru to keep a free list
+ * - doesn't track currently in use pages
+ *
+ *  TODO: Add shrinker support
+ */
+
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/mm_types.h>
+#include <asm/cacheflush.h>
+#include <ttm/ttm_module.h>
+#include <ttm/ttm_bo_driver.h>
+#include "ttm_page_alloc.h"
+
+#ifdef TTM_HAS_AGP
+#include <asm/agp.h>
+#endif
+
+#define NUM_PAGES_TO_ADD 128
+
+
+struct ttm_page_alloc_usage {
+	int total_uc_pages;
+	int total_wc_pages;
+	int total_wb_pages;
+	int total_hm_pages;
+	int count_uc_pages;
+	int count_wc_pages;
+	int count_wb_pages;
+	int count_hm_pages;
+};
+
+static struct page *wb_pages[NUM_PAGES_TO_ADD];
+static struct page *wc_pages[NUM_PAGES_TO_ADD];
+static struct page *uc_pages[NUM_PAGES_TO_ADD];
+static struct page *hm_pages[NUM_PAGES_TO_ADD];
+static struct mutex page_alloc_mutex;
+static int page_alloc_inited;
+struct ttm_page_alloc_usage ttm_page_alloc;
+
+
+static inline void ttm_page_put(struct page *page, bool setwb)
+{
+#ifdef CONFIG_X86
+	if (setwb && !PageHighMem(page)) {
+		set_memory_wb((unsigned long)page_address(page), 1);
+	}
+#else
+#ifdef TTM_HAS_AGP
+	if (setwb) {
+		/* This is a generic interface on non-x86 to handle
+		 * wc/uc pages. */
+		unmap_page_from_agp(page);
+	}
+#endif
+#endif
+	put_page(page);
+	__free_page(page);
+}
+
+static void ttm_release_all_pages(struct page **pages, int count, bool setwb)
+{
+	int i;
+
+	for (i = 0; i < count; i++) {
+		ttm_page_put(pages[i], setwb);
+		pages[i] = NULL;
+	}
+}
+
+static int ttm_add_pages_locked(int flags)
+{
+	struct page *page;
+	int gfp_flags = GFP_HIGHUSER;
+	int i, cflag, r;
+
+	cflag = (flags & TTM_PAGE_FLAG_CACHE_MASK) >> TTM_PAGE_FLAG_CACHE_SHIFT;
+	if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)
+		gfp_flags |= __GFP_ZERO;
+	if (flags & TTM_PAGE_FLAG_DMA32)
+		gfp_flags |= GFP_DMA32;
+	switch (cflag) {
+	case TTM_PAGE_FLAG_CACHE_UC:
+		for (i = ttm_page_alloc.count_uc_pages; i < NUM_PAGES_TO_ADD; i++) {
+			page = alloc_page(gfp_flags);
+			if (!page) {
+				printk(KERN_ERR TTM_PFX "unable to get page %d\n", i);
+				return i;
+			}
+			get_page(page);
+#ifdef CONFIG_X86
+			if (!PageHighMem(page)) {
+				uc_pages[ttm_page_alloc.count_uc_pages++] = page;
+				ttm_page_alloc.total_uc_pages++;
+			} else {
+				if (ttm_page_alloc.count_hm_pages < NUM_PAGES_TO_ADD) {
+					hm_pages[ttm_page_alloc.count_hm_pages++] = page;
+					ttm_page_alloc.total_hm_pages++;
+				} else {
+					put_page(page);
+					__free_page(page);
+				}
+			}
+#else
+#ifdef TTM_HAS_AGP
+			map_page_into_agp(page);
+#endif
+#endif
+		}
+#ifdef CONFIG_X86
+		r = set_pages_array_uc(uc_pages, ttm_page_alloc.count_uc_pages);
+		if (r) {
+			/* On error we don't need to set the pages back to WB.
+			 * Or should we do that to be extra safe?
+			 */
+			ttm_release_all_pages(uc_pages, ttm_page_alloc.count_uc_pages, false);
+			return r;
+		}
+#endif
+		ttm_tt_cache_flush(uc_pages, ttm_page_alloc.count_uc_pages);
+		break;
+	case TTM_PAGE_FLAG_CACHE_WC:
+		for (i = ttm_page_alloc.count_wc_pages; i < NUM_PAGES_TO_ADD; i++) {
+			page = alloc_page(gfp_flags);
+			if (!page) {
+				printk(KERN_ERR TTM_PFX "unable to get page %d\n", i);
+				return i;
+			}
+			get_page(page);
+#ifdef CONFIG_X86
+			if (!PageHighMem(page)) {
+				wc_pages[ttm_page_alloc.count_wc_pages++] = page;
+				set_memory_wc((unsigned long)page_address(page), 1);
+				ttm_page_alloc.total_wc_pages++;
+			} else {
+				if (ttm_page_alloc.count_hm_pages < NUM_PAGES_TO_ADD) {
+					hm_pages[ttm_page_alloc.count_hm_pages++] = page;
+					ttm_page_alloc.total_hm_pages++;
+				} else {
+					put_page(page);
+					__free_page(page);
+				}
+			}
+#else
+#ifdef TTM_HAS_AGP
+			map_page_into_agp(page);
+#endif
+#endif
+		}
+		ttm_tt_cache_flush(wc_pages, ttm_page_alloc.count_wc_pages);
+		break;
+	case TTM_PAGE_FLAG_CACHE_WB:
+		for (i = ttm_page_alloc.count_wb_pages; i < NUM_PAGES_TO_ADD; i++) {
+			page = alloc_page(gfp_flags);
+			if (!page) {
+				printk(KERN_ERR TTM_PFX "unable to get page %d\n", i);
+				return i;
+			}
+			get_page(page);
+#ifdef CONFIG_X86
+			if (!PageHighMem(page)) {
+				wb_pages[ttm_page_alloc.count_wb_pages++] = page;
+				ttm_page_alloc.total_wb_pages++;
+			} else {
+				if (ttm_page_alloc.count_hm_pages < NUM_PAGES_TO_ADD) {
+					hm_pages[ttm_page_alloc.count_hm_pages++] = page;
+					ttm_page_alloc.total_hm_pages++;
+				} else {
+					put_page(page);
+					__free_page(page);
+				}
+			}
+#endif
+		}
+		break;
+	default:
+		printk(KERN_ERR TTM_PFX "Wrong caching flags %d\n", cflag);
+		return 0;
+	}
+	return i;
+}
+
+struct page *ttm_get_page(int flags)
+{
+	struct page *page = NULL;
+	int ret;
+	struct page **pages;
+	int *count_pages;
+	int cflag;
+	int gfp_flags = GFP_HIGHUSER;
+
+	if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)
+		gfp_flags |= __GFP_ZERO;
+	if (flags & TTM_PAGE_FLAG_DMA32)
+		gfp_flags |= GFP_DMA32;
+	cflag = (flags & TTM_PAGE_FLAG_CACHE_MASK) >> TTM_PAGE_FLAG_CACHE_SHIFT;
+	switch (cflag) {
+	case TTM_PAGE_FLAG_CACHE_UC:
+		pages = uc_pages;
+		count_pages = &ttm_page_alloc.count_uc_pages;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WC:
+		pages = wc_pages;
+		count_pages = &ttm_page_alloc.count_wc_pages;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WB:
+	default:
+		pages = wb_pages;
+		count_pages = &ttm_page_alloc.count_wb_pages;
+		break;
+	}
+
+	mutex_lock(&page_alloc_mutex);
+	if (!(*count_pages) && !ttm_page_alloc.count_hm_pages) {
+		ret = ttm_add_pages_locked(flags);
+		if (ret == 0) {
+			mutex_unlock(&page_alloc_mutex);
+			return NULL;
+		}
+	}
+	if (ttm_page_alloc.count_hm_pages) {
+		page = hm_pages[--ttm_page_alloc.count_hm_pages];
+	} else {
+		page = pages[--(*count_pages)];
+	}
+	mutex_unlock(&page_alloc_mutex);
+	return page;
+}
+
+void ttm_put_page(struct page *page, int flags)
+{
+	struct page **pages;
+	int *count_pages;
+	int cflag;
+	bool setwb;
+
+	if (PageHighMem(page)) {
+		mutex_lock(&page_alloc_mutex);
+		if (ttm_page_alloc.count_hm_pages < NUM_PAGES_TO_ADD) {
+			hm_pages[ttm_page_alloc.count_hm_pages++] = page;
+			ttm_page_alloc.total_hm_pages++;
+			mutex_unlock(&page_alloc_mutex);
+			return;
+		} else {
+			put_page(page);
+			__free_page(page);
+			mutex_unlock(&page_alloc_mutex);
+			return;
+		}
+	}
+
+	cflag = (flags & TTM_PAGE_FLAG_CACHE_MASK) >> TTM_PAGE_FLAG_CACHE_SHIFT;
+	switch (cflag) {
+	case TTM_PAGE_FLAG_CACHE_UC:
+		pages = uc_pages;
+		count_pages = &ttm_page_alloc.count_uc_pages;
+		setwb = true;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WC:
+		pages = wc_pages;
+		count_pages = &ttm_page_alloc.count_wc_pages;
+		setwb = true;
+		break;
+	case TTM_PAGE_FLAG_CACHE_WB:
+	default:
+		pages = wb_pages;
+		count_pages = &ttm_page_alloc.count_wb_pages;
+		setwb = false;
+		break;
+	}
+
+	mutex_lock(&page_alloc_mutex);
+	if ((*count_pages) >= NUM_PAGES_TO_ADD) {
+		ttm_page_put(page, setwb);
+		mutex_unlock(&page_alloc_mutex);
+		return;
+	}
+	pages[(*count_pages)++] = page;
+	mutex_unlock(&page_alloc_mutex);
+}
+
+int ttm_page_alloc_init(void)
+{
+	int i;
+
+	if (page_alloc_inited)
+		return 0;
+
+	for (i = 0; i < NUM_PAGES_TO_ADD; i++) {
+		wb_pages[i] = NULL;
+		wc_pages[i] = NULL;
+		uc_pages[i] = NULL;
+		hm_pages[i] = NULL;
+	}
+	ttm_page_alloc.total_uc_pages = 0;
+	ttm_page_alloc.total_wc_pages = 0;
+	ttm_page_alloc.total_wb_pages = 0;
+	ttm_page_alloc.total_hm_pages = 0;
+	ttm_page_alloc.count_uc_pages = 0;
+	ttm_page_alloc.count_wc_pages = 0;
+	ttm_page_alloc.count_wb_pages = 0;
+	ttm_page_alloc.count_hm_pages = 0;
+	mutex_init(&page_alloc_mutex);
+	page_alloc_inited = 1;
+	return 0;
+}
+
+void ttm_page_alloc_fini(void)
+{
+	if (!page_alloc_inited)
+		return;
+
+	ttm_release_all_pages(wc_pages, ttm_page_alloc.total_wc_pages, true);
+	ttm_release_all_pages(wb_pages, ttm_page_alloc.total_wb_pages, true);
+	ttm_release_all_pages(uc_pages, ttm_page_alloc.total_uc_pages, true);
+	ttm_page_alloc.total_uc_pages = 0;
+	ttm_page_alloc.total_wc_pages = 0;
+	ttm_page_alloc.total_wb_pages = 0;
+	ttm_page_alloc.total_hm_pages = 0;
+	ttm_page_alloc.count_uc_pages = 0;
+	ttm_page_alloc.count_wc_pages = 0;
+	ttm_page_alloc.count_wb_pages = 0;
+	ttm_page_alloc.count_hm_pages = 0;
+	page_alloc_inited = 0;
+}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.h b/drivers/gpu/drm/ttm/ttm_page_alloc.h
new file mode 100644
index 0000000..3d079a8
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.h
@@ -0,0 +1,34 @@
+/*
+ * Copyright (c) Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sub license,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Dave Airlie <airlied@redhat.com>
+ */
+
+#ifndef TTM_PAGE_ALLOC
+#define TTM_PAGE_ALLOC
+
+void ttm_put_page(struct page *page, int flags);
+struct page *ttm_get_page(int flags);
+int ttm_page_alloc_init(void);
+void ttm_page_alloc_fini(void);
+
+#endif
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 75dc8bd..6402ec9 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -37,6 +37,7 @@
 #include "ttm/ttm_module.h"
 #include "ttm/ttm_bo_driver.h"
 #include "ttm/ttm_placement.h"
+#include "ttm_page_alloc.h"
 
 static int ttm_tt_swapin(struct ttm_tt *ttm);
 
@@ -131,10 +132,10 @@ static void ttm_tt_free_page_directory(struct ttm_tt *ttm)
 
 static struct page *ttm_tt_alloc_page(unsigned page_flags)
 {
-	if (page_flags & TTM_PAGE_FLAG_ZERO_ALLOC)
-		return alloc_page(GFP_HIGHUSER | __GFP_ZERO);
+	struct page *page;
 
-	return alloc_page(GFP_HIGHUSER);
+	page = ttm_get_page(page_flags);
+	return page;
 }
 
 static void ttm_tt_free_user_pages(struct ttm_tt *ttm)
@@ -179,10 +180,23 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)
 	struct page *p;
 	struct ttm_bo_device *bdev = ttm->bdev;
 	struct ttm_mem_global *mem_glob = bdev->mem_glob;
+	unsigned cache_flag;
 	int ret;
 
+	ttm->page_flags &= ~TTM_PAGE_FLAG_CACHE_MASK;
+	switch (ttm->caching_state) {
+	case tt_uncached:
+		cache_flag = TTM_PAGE_FLAG_CACHE_UC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	case tt_wc:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	default:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WB << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	}
 	while (NULL == (p = ttm->pages[index])) {
-		p = ttm_tt_alloc_page(ttm->page_flags);
+		p = ttm_tt_alloc_page(ttm->page_flags | cache_flag);
 
 		if (!p)
 			return NULL;
@@ -290,10 +304,10 @@ static int ttm_tt_set_caching(struct ttm_tt *ttm,
 	if (ttm->caching_state == c_state)
 		return 0;
 
-	if (c_state != tt_cached) {
-		ret = ttm_tt_populate(ttm);
-		if (unlikely(ret != 0))
-			return ret;
+	if (ttm->state == tt_unpopulated) {
+		/* Change caching but don't populate */
+		ttm->caching_state = c_state;
+		return 0;
 	}
 
 	if (ttm->caching_state == tt_cached)
@@ -343,21 +357,33 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm)
 	int i;
 	struct page *cur_page;
 	struct ttm_backend *be = ttm->be;
+	unsigned cache_flag;
 
 	if (be)
 		be->func->clear(be);
-	(void)ttm_tt_set_caching(ttm, tt_cached);
+	switch (ttm->caching_state) {
+	case tt_uncached:
+		cache_flag = TTM_PAGE_FLAG_CACHE_UC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	case tt_wc:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WC << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	default:
+		cache_flag = TTM_PAGE_FLAG_CACHE_WB << TTM_PAGE_FLAG_CACHE_SHIFT;
+		break;
+	}
 	for (i = 0; i < ttm->num_pages; ++i) {
 		cur_page = ttm->pages[i];
 		ttm->pages[i] = NULL;
 		if (cur_page) {
-			if (page_count(cur_page) != 1)
+			if (page_count(cur_page) != 2)
 				printk(KERN_ERR TTM_PFX
 				       "Erroneous page count. "
-				       "Leaking pages.\n");
+				       "Leaking pages (%d).\n",
+				       page_count(cur_page));
+			ttm_put_page(cur_page, cache_flag);
 			ttm_mem_global_free(ttm->bdev->mem_glob, PAGE_SIZE,
 					    PageHighMem(cur_page));
-			__free_page(cur_page);
 		}
 	}
 	ttm->state = tt_unpopulated;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 62ed733..0dac8a8 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -121,6 +121,12 @@ struct ttm_backend {
 #define TTM_PAGE_FLAG_SWAPPED         (1 << 4)
 #define TTM_PAGE_FLAG_PERSISTANT_SWAP (1 << 5)
 #define TTM_PAGE_FLAG_ZERO_ALLOC      (1 << 6)
+#define TTM_PAGE_FLAG_DMA32           (1 << 7)
+#define TTM_PAGE_FLAG_CACHE_MASK      (3 << 8)
+#define TTM_PAGE_FLAG_CACHE_SHIFT     8
+#define TTM_PAGE_FLAG_CACHE_UC        0
+#define TTM_PAGE_FLAG_CACHE_WC        1
+#define TTM_PAGE_FLAG_CACHE_WB        2
 
 enum ttm_caching_state {
 	tt_uncached,
-- 
1.6.2.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-06-26 13:59   ` Jerome Glisse
@ 2009-06-29 21:12     ` Thomas Hellström
  2009-07-09  6:06       ` Dave Airlie
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-06-29 21:12 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Dave Airlie, linux-kernel, dri-devel

Jerome Glisse skrev:
> On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
>   
>> On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org> wrote:
>>     
>>> Hi,
>>>
>>> Thomas i attach a reworked page pool allocator based on Dave works,
>>> this one should be ok with ttm cache status tracking. It definitely
>>> helps on AGP system, now the bottleneck is in mesa vertex's dma
>>> allocation.
>>>
>>>       
>> My original version kept a list of wb pages as well, this proved to be
>> quite a useful
>> optimisation on my test systems when I implemented it, without it I
>> was spending ~20%
>> of my CPU in getting free pages, granted I always used WB pages on
>> PCIE/IGP systems.
>>
>> Another optimisation I made at the time was around the populate call,
>> (not sure if this
>> is what still happens):
>>
>> Allocate a 64K local BO for DMA object.
>> Write into the first 5 pages from userspace - get WB pages.
>> Bind to GART, swap those 5 pages to WC + flush.
>> Then populate the rest with WC pages from the list.
>>
>> Granted I think allocating WC in the first place from the pool might
>> work just as well since most of the DMA buffers are write only.
>>
>> Dave.
>> --
>>     
>
> Attached a new version of the patch, which integrate changes discussed.
>
> Cheers,
> Jerome
>   
Hi, Jerome!
Still some outstanding things:

1) The AGP protection fixes compilation errors when AGP is not enabled, 
but what about architectures that need the map_page_into_agp() semantics 
for TTM even when AGP is not enabled? At the very least TTM should be 
disabled on those architectures. The best option would be to make those 
calls non-agp specific.

2) Why is the page refcount upped with get_page() after an alloc_page()?

3) It seems like pages are cache-transitioned one-by-one when freed. 
Again, this is a global TLB flush per page. Can't we free a large chunk 
of pages at once?

/Thanks,
Thomas







^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-06-29 21:12     ` Thomas Hellström
@ 2009-07-09  6:06       ` Dave Airlie
  2009-07-09  8:48         ` Michel Dänzer
  0 siblings, 1 reply; 29+ messages in thread
From: Dave Airlie @ 2009-07-09  6:06 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: Jerome Glisse, linux-kernel, dri-devel

2009/6/30 Thomas Hellström <thomas@shipmail.org>:
> Jerome Glisse skrev:
>>
>> On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
>>
>>>
>>> On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org>
>>> wrote:
>>>
>>>>
>>>> Hi,
>>>>
>>>> Thomas i attach a reworked page pool allocator based on Dave works,
>>>> this one should be ok with ttm cache status tracking. It definitely
>>>> helps on AGP system, now the bottleneck is in mesa vertex's dma
>>>> allocation.
>>>>
>>>>
>>>
>>> My original version kept a list of wb pages as well, this proved to be
>>> quite a useful
>>> optimisation on my test systems when I implemented it, without it I
>>> was spending ~20%
>>> of my CPU in getting free pages, granted I always used WB pages on
>>> PCIE/IGP systems.
>>>
>>> Another optimisation I made at the time was around the populate call,
>>> (not sure if this
>>> is what still happens):
>>>
>>> Allocate a 64K local BO for DMA object.
>>> Write into the first 5 pages from userspace - get WB pages.
>>> Bind to GART, swap those 5 pages to WC + flush.
>>> Then populate the rest with WC pages from the list.
>>>
>>> Granted I think allocating WC in the first place from the pool might
>>> work just as well since most of the DMA buffers are write only.
>>>
>>> Dave.
>>> --
>>>
>>
>> Attached a new version of the patch, which integrate changes discussed.
>>
>> Cheers,
>> Jerome
>>
>
> Hi, Jerome!
> Still some outstanding things:
>
> 1) The AGP protection fixes compilation errors when AGP is not enabled, but
> what about architectures that need the map_page_into_agp() semantics for TTM
> even when AGP is not enabled? At the very least TTM should be disabled on
> those architectures. The best option would be to make those calls non-agp
> specific.
>
> 2) Why is the page refcount upped with get_page() after an alloc_page()?
>
> 3) It seems like pages are cache-transitioned one-by-one when freed. Again,
> this is a global TLB flush per page. Can't we free a large chunk of pages at
> once?
>

Jerome,

have we addressed these?

I'd really like to push this soon, as I'd like to fix up the 32 vs 36
bit dma masks if possible, which relies on us being able to tell the
allocator to use GFP_DMA32 on some hardware (mainly 32-bit PAE with a
PCI card).

Dave.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-09  6:06       ` Dave Airlie
@ 2009-07-09  8:48         ` Michel Dänzer
  0 siblings, 0 replies; 29+ messages in thread
From: Michel Dänzer @ 2009-07-09  8:48 UTC (permalink / raw)
  To: Dave Airlie; +Cc: Thomas Hellström, linux-kernel, dri-devel

On Thu, 2009-07-09 at 16:06 +1000, Dave Airlie wrote:
> 2009/6/30 Thomas Hellström <thomas@shipmail.org>:
> > Jerome Glisse skrev:
> >>
> >> On Fri, 2009-06-26 at 10:00 +1000, Dave Airlie wrote:
> >>
> >>>
> >>> On Thu, Jun 25, 2009 at 10:01 PM, Jerome Glisse<glisse@freedesktop.org>
> >>> wrote:
> >>>
> >>>>
> >>>> Hi,
> >>>>
> >>>> Thomas i attach a reworked page pool allocator based on Dave works,
> >>>> this one should be ok with ttm cache status tracking. It definitely
> >>>> helps on AGP system, now the bottleneck is in mesa vertex's dma
> >>>> allocation.
> >>>>
> >>>>
> >>>
> >>> My original version kept a list of wb pages as well, this proved to be
> >>> quite a useful
> >>> optimisation on my test systems when I implemented it, without it I
> >>> was spending ~20%
> >>> of my CPU in getting free pages, granted I always used WB pages on
> >>> PCIE/IGP systems.
> >>>
> >>> Another optimisation I made at the time was around the populate call,
> >>> (not sure if this
> >>> is what still happens):
> >>>
> >>> Allocate a 64K local BO for DMA object.
> >>> Write into the first 5 pages from userspace - get WB pages.
> >>> Bind to GART, swap those 5 pages to WC + flush.
> >>> Then populate the rest with WC pages from the list.
> >>>
> >>> Granted I think allocating WC in the first place from the pool might
> >>> work just as well since most of the DMA buffers are write only.
> >>>
> >>> Dave.
> >>> --
> >>>
> >>
> >> Attached a new version of the patch, which integrate changes discussed.
> >>
> >> Cheers,
> >> Jerome
> >>
> >
> > Hi, Jerome!
> > Still some outstanding things:
> >
> > 1) The AGP protection fixes compilation errors when AGP is not enabled, but
> > what about architectures that need the map_page_into_agp() semantics for TTM
> > even when AGP is not enabled? At the very least TTM should be disabled on
> > those architectures. The best option would be to make those calls non-agp
> > specific.
> >
> > 2) Why is the page refcount upped with get_page() after an alloc_page()?
> >
> > 3) It seems like pages are cache-transitioned one-by-one when freed. Again,
> > this is a global TLB flush per page. Can't we free a large chunk of pages at
> > once?
> >
> 
> Jerome,
> 
> have we addressed these?
> 
> I'd really like to push this soon, as I'd like to fix up the 32 vs 36
> bit dma masks if possible
> which relies on us being able to tell the allocator to use GFP_DMA32 on some hw
> (32-bit PAE mainly with a PCI card).

FWIW, I tried this patch on my PowerBook, and it didn't go too well:

With AGP enabled, the kernel panics before there's any KMS display.

With AGP disabled, I get a KMS display, but it shows a failure to
allocate the ring buffer, and then it stops updating.


-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-06-25 15:53 ` Thomas Hellström
@ 2009-07-21 17:34   ` Jerome Glisse
  2009-07-21 18:00     ` Jerome Glisse
  2009-07-22  8:27     ` Thomas Hellström
  0 siblings, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2009-07-21 17:34 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: Dave Airlie, linux-kernel, dri-devel

On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> Jerome Glisse skrev:
> > Hi,
> >
> > Thomas i attach a reworked page pool allocator based on Dave works,
> > this one should be ok with ttm cache status tracking. It definitely
> > helps on AGP system, now the bottleneck is in mesa vertex's dma
> > allocation.
> >
> > Cheers,
> > Jerome
> >   
> Hi, Jerome!
> In general it looks very good. Some things that need fixing:
> 
> 1)  We must have a way to hand back pages. I still don't quite understand 
> how the shrink callbacks work and whether they are applicable. Another 
> scheme would be to free, say, 1MB when we have at least 2MB available.
> 
> 2) We should avoid including AGP headers if AGP is not configured. 
> Either reimplement unmap_page_from_agp or map_page_into_agp or move them 
> out from the AGP headers. We've had complaints before from people with 
> AGP-free systems that the code doesn't compile.
> 
> 3) Since we're allocating (and freeing) in batches we should use the 
> set_pages_array() interface to avoid a global tlb flush per page.
> 
> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> it will always allocate cached pages and then transition them.
> 

Okay, 4) is bad. Here is what happens (my brain is a bit melted down, so
I might be wrong):
1 - bo gets allocated, tt->state = unpopulated
2 - bo is mapped, a few pages are faulted, tt->state = unpopulated
3 - bo is cache transitioned while tt->state == unpopulated, but
    there are pages which have been touched by the cpu, so we need
    to clflush them and transition them; this never happens if
    we don't call ttm_tt_populate and proceed with the rest
    of the cache transitioning functions

As a workaround I will try to go through the page tables and
transition existing pages. Do you have any idea for a better
plan?

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-21 17:34   ` Jerome Glisse
@ 2009-07-21 18:00     ` Jerome Glisse
  2009-07-21 19:22       ` Jerome Glisse
  2009-07-22  8:27     ` Thomas Hellström
  1 sibling, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-07-21 18:00 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: linux-kernel, dri-devel

On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> > 
> > 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> > it will always allocate cached pages and then transition them.
> > 
> 
> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> wrong) :
> 1 - bo get allocated tt->state = unpopulated
> 2 - bo is mapped few page are faulted tt->state = unpopulated
> 3 - bo is cache transitioned but tt->state == unpopulated but
>     they are page which have been touch by the cpu so we need
>     to clflush them and transition them, this never happen if
>     we don't call ttm_tt_populate and proceed with the remaining
>     of the cache transitioning functions
> 
> As a workaround i will try to go through the pages tables and
> transition existing pages. Do you have any idea for a better
> plan ?
> 
> Cheers,
> Jerome

My workaround ruins the whole idea of pool allocation: what happens
is that most bos get cache transitioned page by page. My thinking
is that we should do the following:
	- if there is at least one page allocated, then fully populate
	the object and do the cache transition on all the pages.
	- otherwise, update caching_state and leave the object unpopulated

This requires that we somehow reflect the fact that there is at least
one page allocated; I am thinking of adding a new state for that:
ttm_partialy_populated

Thomas, what do you think about that?

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-21 18:00     ` Jerome Glisse
@ 2009-07-21 19:22       ` Jerome Glisse
  2009-07-22  8:37         ` Thomas Hellström
  2009-07-22 13:16         ` TTM page pool allocator Michel Dänzer
  0 siblings, 2 replies; 29+ messages in thread
From: Jerome Glisse @ 2009-07-21 19:22 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: linux-kernel, dri-devel

[-- Attachment #1: Type: text/plain, Size: 2031 bytes --]

On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
> On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> > On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> > > 
> > > 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> > > it will always allocate cached pages and then transition them.
> > > 
> > 
> > Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> > wrong) :
> > 1 - bo get allocated tt->state = unpopulated
> > 2 - bo is mapped few page are faulted tt->state = unpopulated
> > 3 - bo is cache transitioned but tt->state == unpopulated but
> >     they are page which have been touch by the cpu so we need
> >     to clflush them and transition them, this never happen if
> >     we don't call ttm_tt_populate and proceed with the remaining
> >     of the cache transitioning functions
> > 
> > As a workaround i will try to go through the pages tables and
> > transition existing pages. Do you have any idea for a better
> > plan ?
> > 
> > Cheers,
> > Jerome
> 
> My workaround ruin the whole idea of pool allocation what happens
> is that most bo get cache transition page per page. My thinking
> is that we should do the following:
> 	- is there is a least one page allocated then fully populate
> 	the object and do cache transition on all the pages.
> 	- otherwise update caching_state and leaves object unpopulated
> 
> This needs that we some how reflect the fact that there is at least
> one page allocated, i am thinking to adding a new state for that :
> ttm_partialy_populated
> 
> Thomas what do you think about that ?
> 
> Cheers,
> Jerome

Attached is an updated patch; it doesn't introduce ttm_partialy_populated
but keeps the populate call in the cache transition. So far it seems to
work properly on AGP platforms and helps quite a lot with performance.
I wonder if I should rather allocate some memory to store the pool
structure in ttm_page_pool_init instead of having quite a lot of
static variables? Has anyone thought about that?

Cheers,
Jerome

[-- Attachment #2: 0001-ttm-add-pool-wc-uc-page-allocator.patch --]
[-- Type: text/x-patch, Size: 16081 bytes --]

>From 35cbb15ca10e2ed8ba215298268f2ead8d13e759 Mon Sep 17 00:00:00 2001
From: Jerome Glisse <jglisse@redhat.com>
Date: Thu, 16 Jul 2009 18:12:22 +0200
Subject: [PATCH] ttm: add pool wc/uc page allocator

On AGP system we might allocate/free routinely uncached or wc memory,
changing page from cached (wb) to uc or wc is very expensive and involves
a lot of flushing. To improve performance this allocator use a pool
of uc,wc pages.

Currently each pool (wc, uc) is 256 pages big, improvement would be
to tweak this according to memory pressure so we can give back memory
to system.

Signed-off-by: Dave Airlie <airlied@redhat.com>
Signed-off-by: Jerome Glisse <jglisse@redhat.com>
---
 drivers/gpu/drm/ttm/Makefile         |    2 +-
 drivers/gpu/drm/ttm/ttm_memory.c     |    3 +
 drivers/gpu/drm/ttm/ttm_page_alloc.c |  354 ++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/ttm/ttm_page_alloc.h |   36 ++++
 drivers/gpu/drm/ttm/ttm_tt.c         |   37 ++---
 5 files changed, 407 insertions(+), 25 deletions(-)
 create mode 100644 drivers/gpu/drm/ttm/ttm_page_alloc.c
 create mode 100644 drivers/gpu/drm/ttm/ttm_page_alloc.h

diff --git a/drivers/gpu/drm/ttm/Makefile b/drivers/gpu/drm/ttm/Makefile
index b0a9de7..93e002c 100644
--- a/drivers/gpu/drm/ttm/Makefile
+++ b/drivers/gpu/drm/ttm/Makefile
@@ -3,6 +3,6 @@
 
 ccflags-y := -Iinclude/drm
 ttm-y := ttm_agp_backend.o ttm_memory.o ttm_tt.o ttm_bo.o \
-	ttm_bo_util.o ttm_bo_vm.o ttm_module.o ttm_global.o
+	ttm_bo_util.o ttm_bo_vm.o ttm_module.o ttm_global.o ttm_page_alloc.o
 
 obj-$(CONFIG_DRM_TTM) += ttm.o
diff --git a/drivers/gpu/drm/ttm/ttm_memory.c b/drivers/gpu/drm/ttm/ttm_memory.c
index 87323d4..6da4a08 100644
--- a/drivers/gpu/drm/ttm/ttm_memory.c
+++ b/drivers/gpu/drm/ttm/ttm_memory.c
@@ -32,6 +32,7 @@
 #include <linux/mm.h>
 #include <linux/module.h>
 
+#include "ttm_page_alloc.h"
 #define TTM_PFX "[TTM] "
 #define TTM_MEMORY_ALLOC_RETRIES 4
 
@@ -124,6 +125,7 @@ int ttm_mem_global_init(struct ttm_mem_global *glob)
 	printk(KERN_INFO TTM_PFX "TTM available object memory: %llu MiB\n",
 	       glob->max_memory >> 20);
 
+	ttm_page_alloc_init();
 	return 0;
 }
 EXPORT_SYMBOL(ttm_mem_global_init);
@@ -135,6 +137,7 @@ void ttm_mem_global_release(struct ttm_mem_global *glob)
 	flush_workqueue(glob->swap_queue);
 	destroy_workqueue(glob->swap_queue);
 	glob->swap_queue = NULL;
+	ttm_page_alloc_fini();
 }
 EXPORT_SYMBOL(ttm_mem_global_release);
 
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
new file mode 100644
index 0000000..8079693
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -0,0 +1,354 @@
+/*
+ * Copyright (c) Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sub license,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Dave Airlie <airlied@redhat.com>
+ *          Jerome Glisse <jglisse@redhat.com>
+ */
+
+/* simple list based uncached page allocator
+ * - Add chunks of 1MB to the allocator at a time.
+ * - Use page->lru to keep a free list
+ * - doesn't track currently in use pages
+ */
+#include <linux/list.h>
+#include <linux/mutex.h>
+#include <linux/mm_types.h>
+
+#include <asm/agp.h>
+#include "drm_cache.h"
+#include "ttm/ttm_bo_driver.h"
+#include "ttm_page_alloc.h"
+
+/* add 1MB at a time */
+#define NUM_PAGES_TO_ADD 256
+
+struct page_pool {
+	struct list_head	list;
+	int			npages;
+	int			nzeropages;
+};
+
+static struct page	*_pages[NUM_PAGES_TO_ADD];
+static int		_npages_to_free;
+static struct page_pool	_wc_pool;
+static struct page_pool	_uc_pool;
+static struct page_pool	_wc_pool_dma32;
+static struct page_pool	_uc_pool_dma32;
+static struct mutex	page_alloc_mutex;
+static bool		page_alloc_inited = false;
+
+
+#ifdef CONFIG_X86
+/* TODO: add a batched set_pages_array_wc() to x86 like set_pages_array_uc();
+ * this per-page version is inefficient */
+static int set_pages_array_wc(struct page **pages, int addrinarray)
+{
+	int i;
+
+	for (i = 0; i < addrinarray; i++) {
+		set_memory_wc((unsigned long)page_address(pages[i]), 1);
+	}
+	return 0;
+}
+#else
+static int set_pages_array_wb(struct page **pages, int addrinarray)
+{
+#ifdef TTM_HAS_AGP
+	int i;
+
+	for (i = 0; i < addrinarray; i++) {
+		unmap_page_from_agp(pages[i]);
+	}
+#endif
+	return 0;
+}
+
+static int set_pages_array_wc(struct page **pages, int addrinarray)
+{
+#ifdef TTM_HAS_AGP
+	int i;
+
+	for (i = 0; i < addrinarray; i++) {
+		map_page_into_agp(pages[i]);
+	}
+#endif
+	return 0;
+}
+
+static int set_pages_array_uc(struct page **pages, int addrinarray)
+{
+#ifdef TTM_HAS_AGP
+	int i;
+
+	for (i = 0; i < addrinarray; i++) {
+		map_page_into_agp(pages[i]);
+	}
+#endif
+	return 0;
+}
+#endif
+
+
+static void pages_free_locked(void)
+{
+	int i;
+
+	set_pages_array_wb(_pages, _npages_to_free);
+	for (i = 0; i < _npages_to_free; i++) {
+		__free_page(_pages[i]);
+	}
+	_npages_to_free = 0;
+}
+
+static void ttm_page_pool_init_locked(struct page_pool *pool)
+{
+	INIT_LIST_HEAD(&pool->list);
+	pool->npages = 0;
+	pool->nzeropages = 0;
+}
+
+static int page_pool_fill_locked(struct page_pool *pool, gfp_t gfp_flags,
+				enum ttm_caching_state cstate)
+{
+	struct page *page;
+	int i, cpages, npages;
+
+	/* We need the _pages table to change page caching status, so empty it first */
+	if (cstate != tt_cached && _npages_to_free)
+		pages_free_locked();
+
+	npages = NUM_PAGES_TO_ADD - pool->npages - pool->nzeropages;
+	for (i = 0, cpages = 0; i < npages; i++) {
+		page = alloc_page(gfp_flags);
+		if (!page) {
+			printk(KERN_ERR "unable to get page %d\n", i);
+			return -ENOMEM;
+		}
+		if (gfp_flags & __GFP_ZERO) {
+			list_add(&page->lru, &pool->list);
+			pool->nzeropages++;
+		} else {
+			list_add_tail(&page->lru, &pool->list);
+			pool->npages++;
+		}
+		if (!PageHighMem(page) && cstate != tt_cached) {
+			_pages[cpages++] = page;
+		}
+	}
+	switch (cstate) {
+	case tt_uncached:
+		drm_clflush_pages(_pages, cpages);
+		set_pages_array_uc(_pages, cpages);
+		break;
+	case tt_wc:
+		drm_clflush_pages(_pages, cpages);
+		set_pages_array_wc(_pages, cpages);
+		break;
+	case tt_cached:
+	default:
+		break;
+	}
+	return 0;
+}
+
+static inline void ttm_page_put_locked(struct page *page)
+{
+	if (_npages_to_free >= NUM_PAGES_TO_ADD)
+		pages_free_locked();
+	_pages[_npages_to_free++] = page;
+}
+
+static void ttm_page_pool_empty_locked(struct page_pool *pool)
+{
+	struct page *page, *tmp;
+
+	list_for_each_entry_safe(page, tmp, &pool->list, lru) {
+		list_del(&page->lru);
+#ifdef CONFIG_X86
+		if (PageHighMem(page)) {
+			__free_page(page);
+		} else
+#endif
+		{
+			ttm_page_put_locked(page);
+		}
+	}
+	pool->npages = 0;
+	pool->nzeropages = 0;
+}
+
+static struct page *ttm_page_pool_get_locked(struct page_pool *pool, gfp_t flags)
+{
+	struct page *page = NULL;
+
+	if (flags & __GFP_ZERO) {
+		if (pool->nzeropages) {
+			page = list_first_entry(&pool->list, struct page, lru);
+			list_del(&page->lru);
+			pool->nzeropages--;
+			return page;
+		} else {
+			page = list_first_entry(&pool->list, struct page, lru);
+			list_del(&page->lru);
+			pool->npages--;
+			if (PageHighMem(page)) {
+				/* This should never happen! */
+				__free_page(page);
+				return NULL;
+			}
+			clear_page(page_address(page));
+			return page;
+		}
+	}
+	if (pool->npages) {
+		page = list_entry(pool->list.prev, struct page, lru);
+		list_del(&page->lru);
+		pool->npages--;
+		return page;
+	}
+	page = list_first_entry(&pool->list, struct page, lru);
+	list_del(&page->lru);
+	pool->nzeropages--;
+	return page;
+}
+
+
+struct page *ttm_get_page(int flags, enum ttm_caching_state cstate)
+{
+	struct page_pool *pool;
+	struct page *page = NULL;
+	gfp_t gfp_flags = GFP_USER;
+	int r;
+
+	if (flags & TTM_PAGE_FLAG_ZERO_ALLOC)
+		gfp_flags |= __GFP_ZERO;
+	if (flags & TTM_PAGE_FLAG_DMA32) {
+		gfp_flags |= __GFP_DMA32;
+	} else {
+		gfp_flags |= __GFP_HIGHMEM;
+	}
+	switch (cstate) {
+	case tt_uncached:
+		if (gfp_flags & __GFP_DMA32)
+			pool = &_uc_pool_dma32;
+		else
+			pool = &_uc_pool;
+		break;
+	case tt_wc:
+		if (gfp_flags & __GFP_DMA32)
+			pool = &_wc_pool_dma32;
+		else
+			pool = &_wc_pool;
+		break;
+	case tt_cached:
+	default:
+		return alloc_page(gfp_flags);
+	}
+	mutex_lock(&page_alloc_mutex);
+	if (!pool->npages && !pool->nzeropages) {
+		/* Force zeroed allocations so that, when pages are freed
+		 * back and only non-highmem pages are added to the pool,
+		 * we know we will be able to clear them.
+		 */
+		r = page_pool_fill_locked(pool, gfp_flags | __GFP_ZERO, cstate);
+		if (r) {
+			mutex_unlock(&page_alloc_mutex);
+			return NULL;
+		}
+	}
+	page = ttm_page_pool_get_locked(pool, gfp_flags);
+	mutex_unlock(&page_alloc_mutex);
+	return page;
+}
+
+void ttm_put_page(struct page *page, int flags, enum ttm_caching_state cstate)
+{
+	struct page_pool *pool;
+
+	switch (cstate) {
+	case tt_uncached:
+		if (flags & TTM_PAGE_FLAG_DMA32)
+			pool = &_uc_pool_dma32;
+		else
+			pool = &_uc_pool;
+		break;
+	case tt_wc:
+		if (flags & TTM_PAGE_FLAG_DMA32)
+			pool = &_wc_pool_dma32;
+		else
+			pool = &_wc_pool;
+		break;
+	case tt_cached:
+	default:
+		__free_page(page);
+		return;
+	}
+	mutex_lock(&page_alloc_mutex);
+#ifdef CONFIG_X86
+	if (PageHighMem(page)) {
+		/* Free highmem pages instead of pooling them: we cannot
+		 * easily zero them later, since we lack a kernel mapping
+		 * with the proper caching attributes (i.e. we are missing
+		 * kmap_wc_atomic/kmap_uc_atomic). This way any non-zeroed
+		 * page in the pool is guaranteed clearable via clear_page().
+		 */
+		__free_page(page);
+	} else
+#endif
+	{
+		if (pool->npages >= NUM_PAGES_TO_ADD) {
+			ttm_page_put_locked(page);
+		} else {
+			list_add_tail(&page->lru, &pool->list);
+			pool->npages++;
+		}
+	}
+	mutex_unlock(&page_alloc_mutex);
+}
+
+int ttm_page_alloc_init(void)
+{
+	if (page_alloc_inited)
+		return 0;
+
+	mutex_init(&page_alloc_mutex);
+	ttm_page_pool_init_locked(&_wc_pool);
+	ttm_page_pool_init_locked(&_uc_pool);
+	ttm_page_pool_init_locked(&_wc_pool_dma32);
+	ttm_page_pool_init_locked(&_uc_pool_dma32);
+	_npages_to_free = 0;
+	page_alloc_inited = true;
+	return 0;
+}
+
+void ttm_page_alloc_fini(void)
+{
+	if (!page_alloc_inited)
+		return;
+	mutex_lock(&page_alloc_mutex);
+	ttm_page_pool_empty_locked(&_wc_pool);
+	ttm_page_pool_empty_locked(&_uc_pool);
+	ttm_page_pool_empty_locked(&_wc_pool_dma32);
+	ttm_page_pool_empty_locked(&_uc_pool_dma32);
+	pages_free_locked();
+	page_alloc_inited = false;
+	mutex_unlock(&page_alloc_mutex);
+}
diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.h b/drivers/gpu/drm/ttm/ttm_page_alloc.h
new file mode 100644
index 0000000..9212554
--- /dev/null
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.h
@@ -0,0 +1,36 @@
+/*
+ * Copyright (c) Red Hat Inc.
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a
+ * copy of this software and associated documentation files (the "Software"),
+ * to deal in the Software without restriction, including without limitation
+ * the rights to use, copy, modify, merge, publish, distribute, sub license,
+ * and/or sell copies of the Software, and to permit persons to whom the
+ * Software is furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice (including the
+ * next paragraph) shall be included in all copies or substantial portions
+ * of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NON-INFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
+ * Authors: Dave Airlie <airlied@redhat.com>
+ *          Jerome Glisse <jglisse@redhat.com>
+ */
+#ifndef TTM_PAGE_ALLOC
+#define TTM_PAGE_ALLOC
+
+#include "ttm/ttm_bo_driver.h"
+
+void ttm_put_page(struct page *page, int flags, enum ttm_caching_state cstate);
+struct page *ttm_get_page(int flags, enum ttm_caching_state cstate);
+int ttm_page_alloc_init(void);
+void ttm_page_alloc_fini(void);
+
+#endif
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index b106b1d..00ced23 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -38,6 +38,7 @@
 #include "ttm/ttm_module.h"
 #include "ttm/ttm_bo_driver.h"
 #include "ttm/ttm_placement.h"
+#include "ttm_page_alloc.h"
 
 static int ttm_tt_swapin(struct ttm_tt *ttm);
 
@@ -72,21 +73,6 @@ static void ttm_tt_free_page_directory(struct ttm_tt *ttm)
 	ttm->pages = NULL;
 }
 
-static struct page *ttm_tt_alloc_page(unsigned page_flags)
-{
-	gfp_t gfp_flags = GFP_USER;
-
-	if (page_flags & TTM_PAGE_FLAG_ZERO_ALLOC)
-		gfp_flags |= __GFP_ZERO;
-
-	if (page_flags & TTM_PAGE_FLAG_DMA32)
-		gfp_flags |= __GFP_DMA32;
-	else
-		gfp_flags |= __GFP_HIGHMEM;
-
-	return alloc_page(gfp_flags);
-}
-
 static void ttm_tt_free_user_pages(struct ttm_tt *ttm)
 {
 	int write;
@@ -132,7 +118,7 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)
 	int ret;
 
 	while (NULL == (p = ttm->pages[index])) {
-		p = ttm_tt_alloc_page(ttm->page_flags);
+		p = ttm_get_page(ttm->page_flags, ttm->caching_state);
 
 		if (!p)
 			return NULL;
@@ -155,7 +141,8 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)
 	}
 	return p;
 out_err:
-	put_page(p);
+	if (p)
+		ttm_put_page(p, ttm->page_flags, ttm->caching_state);
 	return NULL;
 }
 
@@ -229,7 +216,6 @@ static inline int ttm_tt_set_page_caching(struct page *p,
  * Change caching policy for the linear kernel map
  * for range of pages in a ttm.
  */
-
 static int ttm_tt_set_caching(struct ttm_tt *ttm,
 			      enum ttm_caching_state c_state)
 {
@@ -246,6 +232,12 @@ static int ttm_tt_set_caching(struct ttm_tt *ttm,
 			return ret;
 	}
 
+	if (ttm->state == tt_unpopulated) {
+		/* Change caching but don't populate */
+		ttm->caching_state = c_state;
+		return 0;
+	}
+
 	if (ttm->caching_state == tt_cached)
 		drm_clflush_pages(ttm->pages, ttm->num_pages);
 
@@ -257,11 +249,8 @@ static int ttm_tt_set_caching(struct ttm_tt *ttm,
 				goto out_err;
 		}
 	}
-
 	ttm->caching_state = c_state;
-
 	return 0;
-
 out_err:
 	for (j = 0; j < i; ++j) {
 		cur_page = ttm->pages[j];
@@ -296,7 +285,6 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm)
 
 	if (be)
 		be->func->clear(be);
-	(void)ttm_tt_set_caching(ttm, tt_cached);
 	for (i = 0; i < ttm->num_pages; ++i) {
 		cur_page = ttm->pages[i];
 		ttm->pages[i] = NULL;
@@ -304,10 +292,11 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm)
 			if (page_count(cur_page) != 1)
 				printk(KERN_ERR TTM_PFX
 				       "Erroneous page count. "
-				       "Leaking pages.\n");
+				       "Leaking pages (%d).\n",
+				       page_count(cur_page));
 			ttm_mem_global_free(ttm->bdev->mem_glob, PAGE_SIZE,
 					    PageHighMem(cur_page));
-			__free_page(cur_page);
+			ttm_put_page(cur_page, ttm->page_flags, ttm->caching_state);
 		}
 	}
 	ttm->state = tt_unpopulated;
-- 
1.6.2.2


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-21 17:34   ` Jerome Glisse
  2009-07-21 18:00     ` Jerome Glisse
@ 2009-07-22  8:27     ` Thomas Hellström
  2009-07-22 12:12       ` Jerome Glisse
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-07-22  8:27 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Dave Airlie, linux-kernel, dri-devel

Jerome Glisse wrote:
> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
>   
>> Jerome Glisse skrev:
>>     
>>> Hi,
>>>
>>> Thomas i attach a reworked page pool allocator based on Dave works,
>>> this one should be ok with ttm cache status tracking. It definitely
>>> helps on AGP system, now the bottleneck is in mesa vertex's dma
>>> allocation.
>>>
>>> Cheers,
>>> Jerome
>>>   
>>>       
>> Hi, Jerome!
>> In general it looks very good. Some things that need fixing:
>>
>> 1)  We must have a way to hand back pages. I still don't quite understand 
>> how the shrink callbacks work and whether they are applicable. Another 
>> scheme would be to free, say, 1MB when we have at least 2MB available.
>>
>> 2) We should avoid including AGP headers if AGP is not configured. 
>> Either reimplement unmap_page_from_agp or map_page_into_agp or move them 
>> out from the AGP headers. We've had complaints before from people with 
>> AGP-free systems that the code doesn't compile.
>>
>> 3) Since we're allocating (and freeing) in batches we should use the 
>> set_pages_array() interface to avoid a global tlb flush per page.
>>
>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
>> it will always allocate cached pages and then transition them.
>>
>>     
>
> Okay, 4) is bad. What happens (my brain is a bit melted down, so I might
> be wrong):
> 1 - bo gets allocated, tt->state = unpopulated
> 2 - bo is mapped, a few pages are faulted, tt->state = unpopulated
> 3 - bo is cache transitioned, but tt->state == unpopulated; yet there
>     are pages which have been touched by the CPU, so we need
>     to clflush them and transition them, 

Right.

> this never happens if
>     we don't call ttm_tt_populate and proceed with the remainder
>     of the cache transitioning functions
>   

Why can't we just skip ttm_tt_populate and proceed with the remainder of 
the cache transitioning functions? Then all pages currently allocated to 
the TTM will be transitioned.

/Thomas




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-21 19:22       ` Jerome Glisse
@ 2009-07-22  8:37         ` Thomas Hellström
  2009-07-28 16:48           ` ttm_mem_global Jerome Glisse
  2009-07-22 13:16         ` TTM page pool allocator Michel Dänzer
  1 sibling, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-07-22  8:37 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-kernel, dri-devel

Jerome Glisse wrote:
> On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
>   
>> On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
>>     
>>> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
>>>       
>>>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
>>>> it will always allocate cached pages and then transition them.
>>>>
>>>>         
>>> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
>>> wrong) :
>>> 1 - bo get allocated tt->state = unpopulated
>>> 2 - bo is mapped few page are faulted tt->state = unpopulated
>>> 3 - bo is cache transitioned but tt->state == unpopulated but
>>>     they are page which have been touch by the cpu so we need
>>>     to clflush them and transition them, this never happen if
>>>     we don't call ttm_tt_populate and proceed with the remaining
>>>     of the cache transitioning functions
>>>
>>> As a workaround i will try to go through the pages tables and
>>> transition existing pages. Do you have any idea for a better
>>> plan ?
>>>
>>> Cheers,
>>> Jerome
>>>       
>> My workaround ruins the whole idea of pool allocation: what happens
>> is that most bos get cache transitioned page by page. My thinking
>> is that we should do the following:
>> 	- if there is at least one page allocated, then fully populate
>> 	the object and do the cache transition on all the pages.
>> 	- otherwise, update caching_state and leave the object unpopulated
>>
>> This requires that we somehow reflect the fact that there is at least
>> one page allocated; I am thinking of adding a new state for that:
>> ttm_partialy_populated
>>
>> Thomas what do you think about that ?
>>
>> Cheers,
>> Jerome
>>     
>
> Attached is an updated patch; it doesn't introduce ttm_partialy_populated
> but keeps the populate call in the cache transition. So far it seems to
> work properly on AGP platforms and helps quite a lot with performance.
> I wonder if I should rather allocate some memory to store the pool
> structure in ttm_page_pool_init instead of having quite a lot of
> static variables? Has anyone thought about that?
>
>   
Jerome,

TTM has a device struct per device and an optional global struct that is 
common for all devices and intended to be per subsystem.

The only subsystem currently having a global structure is the memory 
accounting subsystem:
struct ttm_mem_global

You can either put the global stuff there or create and register a 
separate global struct for the page pool.

I'll probably also later add a global structure for the bo subsystem, 
since we need a common buffer object memory footprint shrinker for 
multiple cards.

/Thomas



> Cheers,
> Jerome
>   




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-22  8:27     ` Thomas Hellström
@ 2009-07-22 12:12       ` Jerome Glisse
  2009-07-22 19:10         ` Thomas Hellström
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-07-22 12:12 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: Dave Airlie, linux-kernel, dri-devel

On Wed, 2009-07-22 at 10:27 +0200, Thomas Hellström wrote:
> Jerome Glisse wrote:
> > On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> >   
> >> Jerome Glisse skrev:
> >>     
> >>> Hi,
> >>>
> >>> Thomas i attach a reworked page pool allocator based on Dave works,
> >>> this one should be ok with ttm cache status tracking. It definitely
> >>> helps on AGP system, now the bottleneck is in mesa vertex's dma
> >>> allocation.
> >>>
> >>> Cheers,
> >>> Jerome
> >>>   
> >>>       
> >> Hi, Jerome!
> >> In general it looks very good. Some things that need fixing:
> >>
> >> 1)  We must have a way to hand back pages. I still not quite understand 
> >> how the shrink callbacks work and whether they are applicable. Another 
> >> scheme would be to free, say 1MB when we have at least 2MB available.
> >>
> >> 2) We should avoid including AGP headers if AGP is not configured. 
> >> Either reimplement unmap_page_from_agp or map_page_into_agp or move them 
> >> out from the AGP headers. We've hade complaints before from people with 
> >> AGP free systems that the code doesn't compile.
> >>
> >> 3) Since we're allocating (and freeing) in batches we should use the 
> >> set_pages_array() interface to avoid a global tlb flush per page.
> >>
> >> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> >> it will always allocate cached pages and then transition them.
> >>
> >>     
> >
> > Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> > wrong) :
> > 1 - bo get allocated tt->state = unpopulated
> > 2 - bo is mapped few page are faulted tt->state = unpopulated
> > 3 - bo is cache transitioned but tt->state == unpopulated but
> >     they are page which have been touch by the cpu so we need
> >     to clflush them and transition them, 
> 
> Right.
> 
> > this never happen if
> >     we don't call ttm_tt_populate and proceed with the remaining
> >     of the cache transitioning functions
> >   
> 
> Why can't we just skip ttm_tt_populate and proceed with the remaining of 
> the cache transitioning functions? Then all pages currently allocated to 
> the TTM will be transitioned.
> 

Because not all tt->pages entries have a valid page (some are NULL), we
have to go through the whole table and transition page by page. I did
that, and a quick benchmark showed that it's faster to fully populate
and fully transition.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-21 19:22       ` Jerome Glisse
  2009-07-22  8:37         ` Thomas Hellström
@ 2009-07-22 13:16         ` Michel Dänzer
  2009-07-22 13:31           ` Jerome Glisse
  1 sibling, 1 reply; 29+ messages in thread
From: Michel Dänzer @ 2009-07-22 13:16 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Thomas Hellström, linux-kernel, dri-devel

On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
> On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
> > On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> > > On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> > > > 
> > > > 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> > > > it will always allocate cached pages and then transition them.
> > > > 
> > > 
> > > Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> > > wrong) :
> > > 1 - bo get allocated tt->state = unpopulated
> > > 2 - bo is mapped few page are faulted tt->state = unpopulated
> > > 3 - bo is cache transitioned but tt->state == unpopulated but
> > >     they are page which have been touch by the cpu so we need
> > >     to clflush them and transition them, this never happen if
> > >     we don't call ttm_tt_populate and proceed with the remaining
> > >     of the cache transitioning functions
> > > 
> > > As a workaround i will try to go through the pages tables and
> > > transition existing pages. Do you have any idea for a better
> > > plan ?
> > > 
> > > Cheers,
> > > Jerome
> > 
> > My workaround ruin the whole idea of pool allocation what happens
> > is that most bo get cache transition page per page. My thinking
> > is that we should do the following:
> > 	- is there is a least one page allocated then fully populate
> > 	the object and do cache transition on all the pages.
> > 	- otherwise update caching_state and leaves object unpopulated
> > 
> > This needs that we some how reflect the fact that there is at least
> > one page allocated, i am thinking to adding a new state for that :
> > ttm_partialy_populated
> > 
> > Thomas what do you think about that ?
> > 
> > Cheers,
> > Jerome
> 
> Attached updated patch it doesn't introduce ttm_partialy_populated
> but keep the populate call in cache transition. So far it seems to
> work properly on AGP platform

Yeah, this one works for me as well.

> and helps quite a lot with performances.

Can't say I've noticed that however. How did you measure?


-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-22 13:16         ` TTM page pool allocator Michel Dänzer
@ 2009-07-22 13:31           ` Jerome Glisse
  2009-07-22 19:13             ` Thomas Hellström
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-07-22 13:31 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: Thomas Hellström, linux-kernel, dri-devel

On Wed, 2009-07-22 at 15:16 +0200, Michel Dänzer wrote:
> On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
> > On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
> > > On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> > > > On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> > > > > 
> > > > > 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> > > > > it will always allocate cached pages and then transition them.
> > > > > 
> > > > 
> > > > Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> > > > wrong) :
> > > > 1 - bo get allocated tt->state = unpopulated
> > > > 2 - bo is mapped few page are faulted tt->state = unpopulated
> > > > 3 - bo is cache transitioned but tt->state == unpopulated but
> > > >     they are page which have been touch by the cpu so we need
> > > >     to clflush them and transition them, this never happen if
> > > >     we don't call ttm_tt_populate and proceed with the remaining
> > > >     of the cache transitioning functions
> > > > 
> > > > As a workaround i will try to go through the pages tables and
> > > > transition existing pages. Do you have any idea for a better
> > > > plan ?
> > > > 
> > > > Cheers,
> > > > Jerome
> > > 
> > > My workaround ruin the whole idea of pool allocation what happens
> > > is that most bo get cache transition page per page. My thinking
> > > is that we should do the following:
> > > 	- is there is a least one page allocated then fully populate
> > > 	the object and do cache transition on all the pages.
> > > 	- otherwise update caching_state and leaves object unpopulated
> > > 
> > > This needs that we some how reflect the fact that there is at least
> > > one page allocated, i am thinking to adding a new state for that :
> > > ttm_partialy_populated
> > > 
> > > Thomas what do you think about that ?
> > > 
> > > Cheers,
> > > Jerome
> > 
> > Attached updated patch it doesn't introduce ttm_partialy_populated
> > but keep the populate call in cache transition. So far it seems to
> > work properly on AGP platform
> 
> Yeah, this one works for me as well.
> 
> > and helps quite a lot with performances.
> 
> Can't say I've noticed that however. How did you measure?

gears & quake3, but I haven't yet done any X 2D benchmarks;
2D is dead slow with KMS, at least that is the feeling I have.

Cheers,
Jerome


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-22 12:12       ` Jerome Glisse
@ 2009-07-22 19:10         ` Thomas Hellström
  0 siblings, 0 replies; 29+ messages in thread
From: Thomas Hellström @ 2009-07-22 19:10 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Dave Airlie, linux-kernel, dri-devel

Jerome Glisse wrote:
> On Wed, 2009-07-22 at 10:27 +0200, Thomas Hellström wrote:
>   
>> Jerome Glisse wrote:
>>     
>>> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
>>>   
>>>       
>>>> Jerome Glisse skrev:
>>>>     
>>>>         
>>>>> Hi,
>>>>>
>>>>> Thomas i attach a reworked page pool allocator based on Dave works,
>>>>> this one should be ok with ttm cache status tracking. It definitely
>>>>> helps on AGP system, now the bottleneck is in mesa vertex's dma
>>>>> allocation.
>>>>>
>>>>> Cheers,
>>>>> Jerome
>>>>>   
>>>>>       
>>>>>           
>>>> Hi, Jerome!
>>>> In general it looks very good. Some things that need fixing:
>>>>
>>>> 1)  We must have a way to hand back pages. I still not quite understand 
>>>> how the shrink callbacks work and whether they are applicable. Another 
>>>> scheme would be to free, say 1MB when we have at least 2MB available.
>>>>
>>>> 2) We should avoid including AGP headers if AGP is not configured. 
>>>> Either reimplement unmap_page_from_agp or map_page_into_agp or move them 
>>>> out from the AGP headers. We've hade complaints before from people with 
>>>> AGP free systems that the code doesn't compile.
>>>>
>>>> 3) Since we're allocating (and freeing) in batches we should use the 
>>>> set_pages_array() interface to avoid a global tlb flush per page.
>>>>
>>>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
>>>> it will always allocate cached pages and then transition them.
>>>>
>>>>     
>>>>         
>>> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
>>> wrong) :
>>> 1 - bo get allocated tt->state = unpopulated
>>> 2 - bo is mapped few page are faulted tt->state = unpopulated
>>> 3 - bo is cache transitioned but tt->state == unpopulated but
>>>     they are page which have been touch by the cpu so we need
>>>     to clflush them and transition them, 
>>>       
>> Right.
>>
>>     
>>> this never happen if
>>>     we don't call ttm_tt_populate and proceed with the remaining
>>>     of the cache transitioning functions
>>>   
>>>       
>> Why can't we just skip ttm_tt_populate and proceed with the remaining of 
>> the cache transitioning functions? Then all pages currently allocated to 
>> the TTM will be transitioned.
>>
>>     
>
> Because not all tt->pages entry have valid page some are null and thus
> we have to go through the whole table and transition page per page. I
> did that and quick benchmark showed that it's faster to fully populate
> and fully transition.
>
>   
OK.
That's probably sufficient for now.

/Thomas


> Cheers,
> Jerome
>
>   




^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-22 13:31           ` Jerome Glisse
@ 2009-07-22 19:13             ` Thomas Hellström
  2009-07-22 22:35               ` Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-07-22 19:13 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: Michel Dänzer, linux-kernel, dri-devel

Jerome Glisse wrote:
> On Wed, 2009-07-22 at 15:16 +0200, Michel Dänzer wrote:
>   
>> On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
>>     
>>> On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
>>>       
>>>> On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
>>>>         
>>>>> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
>>>>>           
>>>>>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
>>>>>> it will always allocate cached pages and then transition them.
>>>>>>
>>>>>>             
>>>>> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
>>>>> wrong) :
>>>>> 1 - bo get allocated tt->state = unpopulated
>>>>> 2 - bo is mapped few page are faulted tt->state = unpopulated
>>>>> 3 - bo is cache transitioned but tt->state == unpopulated but
>>>>>     they are page which have been touch by the cpu so we need
>>>>>     to clflush them and transition them, this never happen if
>>>>>     we don't call ttm_tt_populate and proceed with the remaining
>>>>>     of the cache transitioning functions
>>>>>
>>>>> As a workaround i will try to go through the pages tables and
>>>>> transition existing pages. Do you have any idea for a better
>>>>> plan ?
>>>>>
>>>>> Cheers,
>>>>> Jerome
>>>>>           
>>>> My workaround ruin the whole idea of pool allocation what happens
>>>> is that most bo get cache transition page per page. My thinking
>>>> is that we should do the following:
>>>> 	- is there is a least one page allocated then fully populate
>>>> 	the object and do cache transition on all the pages.
>>>> 	- otherwise update caching_state and leaves object unpopulated
>>>>
>>>> This needs that we some how reflect the fact that there is at least
>>>> one page allocated, i am thinking to adding a new state for that :
>>>> ttm_partialy_populated
>>>>
>>>> Thomas what do you think about that ?
>>>>
>>>> Cheers,
>>>> Jerome
>>>>         
>>> Attached updated patch it doesn't introduce ttm_partialy_populated
>>> but keep the populate call in cache transition. So far it seems to
>>> work properly on AGP platform
>>>       
>> Yeah, this one works for me as well.
>>
>>     
>>> and helps quite a lot with performances.
>>>       
>> Can't say I've noticed that however. How did you measure?
>>     
>
> gears 
Hmm,
In gears there shouldn't really be any buffer allocation/freeing going
on at all once the display lists are set up, and gears should really be
GPU bound in most cases.

What's the source of the buffer allocations/frees when gears is run?

/Thomas






^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: TTM page pool allocator
  2009-07-22 19:13             ` Thomas Hellström
@ 2009-07-22 22:35               ` Jerome Glisse
  2009-07-22 23:24                 ` Keith Whitwell
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-07-22 22:35 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: Michel Dänzer, linux-kernel, dri-devel

On Wed, 2009-07-22 at 21:13 +0200, Thomas Hellström wrote:
> Jerome Glisse wrote:
> > On Wed, 2009-07-22 at 15:16 +0200, Michel Dänzer wrote:
> >   
> >> On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
> >>     
> >>> On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
> >>>       
> >>>> On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> >>>>         
> >>>>> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> >>>>>           
> >>>>>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> >>>>>> it will always allocate cached pages and then transition them.
> >>>>>>
> >>>>>>             
> >>>>> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> >>>>> wrong) :
> >>>>> 1 - bo get allocated tt->state = unpopulated
> >>>>> 2 - bo is mapped few page are faulted tt->state = unpopulated
> >>>>> 3 - bo is cache transitioned but tt->state == unpopulated but
> >>>>>     they are page which have been touch by the cpu so we need
> >>>>>     to clflush them and transition them, this never happen if
> >>>>>     we don't call ttm_tt_populate and proceed with the remaining
> >>>>>     of the cache transitioning functions
> >>>>>
> >>>>> As a workaround i will try to go through the pages tables and
> >>>>> transition existing pages. Do you have any idea for a better
> >>>>> plan ?
> >>>>>
> >>>>> Cheers,
> >>>>> Jerome
> >>>>>           
> >>>> My workaround ruin the whole idea of pool allocation what happens
> >>>> is that most bo get cache transition page per page. My thinking
> >>>> is that we should do the following:
> >>>> 	- is there is a least one page allocated then fully populate
> >>>> 	the object and do cache transition on all the pages.
> >>>> 	- otherwise update caching_state and leaves object unpopulated
> >>>>
> >>>> This needs that we some how reflect the fact that there is at least
> >>>> one page allocated, i am thinking to adding a new state for that :
> >>>> ttm_partialy_populated
> >>>>
> >>>> Thomas what do you think about that ?
> >>>>
> >>>> Cheers,
> >>>> Jerome
> >>>>         
> >>> Attached updated patch it doesn't introduce ttm_partialy_populated
> >>> but keep the populate call in cache transition. So far it seems to
> >>> work properly on AGP platform
> >>>       
> >> Yeah, this one works for me as well.
> >>
> >>     
> >>> and helps quite a lot with performances.
> >>>       
> >> Can't say I've noticed that however. How did you measure?
> >>     
> >
> > gears 
> Hmm,
> In gears there shouldn't really be any buffer allocation / freeing going 
> on at all once the display lists are set up, and gears should really be 
> gpu bound in most cases.
> 
> what's the source of the buffer allocations / frees when gears is run?
> 
> /Thomas

We free and reallocate the vertex buffer each frame, IIRC.

Cheers,
Jerome



* Re: TTM page pool allocator
  2009-07-22 22:35               ` Jerome Glisse
@ 2009-07-22 23:24                 ` Keith Whitwell
  2009-07-22 23:27                   ` Dave Airlie
  0 siblings, 1 reply; 29+ messages in thread
From: Keith Whitwell @ 2009-07-22 23:24 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Thomas Hellström, Michel Dänzer, linux-kernel, dri-devel

On Wed, 2009-07-22 at 15:35 -0700, Jerome Glisse wrote:
> On Wed, 2009-07-22 at 21:13 +0200, Thomas Hellström wrote:
> > Jerome Glisse wrote:
> > > On Wed, 2009-07-22 at 15:16 +0200, Michel Dänzer wrote:
> > >   
> > >> On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
> > >>     
> > >>> On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
> > >>>       
> > >>>> On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
> > >>>>         
> > >>>>> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
> > >>>>>           
> > >>>>>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since 
> > >>>>>> it will always allocate cached pages and then transition them.
> > >>>>>>
> > >>>>>>             
> > >>>>> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
> > >>>>> wrong) :
> > >>>>> 1 - bo get allocated tt->state = unpopulated
> > >>>>> 2 - bo is mapped few page are faulted tt->state = unpopulated
> > >>>>> 3 - bo is cache transitioned but tt->state == unpopulated but
> > >>>>>     they are page which have been touch by the cpu so we need
> > >>>>>     to clflush them and transition them, this never happen if
> > >>>>>     we don't call ttm_tt_populate and proceed with the remaining
> > >>>>>     of the cache transitioning functions
> > >>>>>
> > >>>>> As a workaround i will try to go through the pages tables and
> > >>>>> transition existing pages. Do you have any idea for a better
> > >>>>> plan ?
> > >>>>>
> > >>>>> Cheers,
> > >>>>> Jerome
> > >>>>>           
> > >>>> My workaround ruin the whole idea of pool allocation what happens
> > >>>> is that most bo get cache transition page per page. My thinking
> > >>>> is that we should do the following:
> > >>>> 	- is there is a least one page allocated then fully populate
> > >>>> 	the object and do cache transition on all the pages.
> > >>>> 	- otherwise update caching_state and leaves object unpopulated
> > >>>>
> > >>>> This needs that we some how reflect the fact that there is at least
> > >>>> one page allocated, i am thinking to adding a new state for that :
> > >>>> ttm_partialy_populated
> > >>>>
> > >>>> Thomas what do you think about that ?
> > >>>>
> > >>>> Cheers,
> > >>>> Jerome
> > >>>>         
> > >>> Attached updated patch it doesn't introduce ttm_partialy_populated
> > >>> but keep the populate call in cache transition. So far it seems to
> > >>> work properly on AGP platform
> > >>>       
> > >> Yeah, this one works for me as well.
> > >>
> > >>     
> > >>> and helps quite a lot with performances.
> > >>>       
> > >> Can't say I've noticed that however. How did you measure?
> > >>     
> > >
> > > gears 
> > Hmm,
> > In gears there shouldn't really be any buffer allocation / freeing going 
> > on at all once the display lists are set up, and gears should really be 
> > gpu bound in most cases.
> > 
> > what's the source of the buffer allocations / frees when gears is run?
> > 
> > /Thomas
> 
> We free reallocate vertex buffer each frame iirc.

Gears does everything in display lists, which means the geometry is held
in VBOs retained for the life of the application.  Once the first frame
is rendered there shouldn't be any more uploads.

Keith



* Re: TTM page pool allocator
  2009-07-22 23:24                 ` Keith Whitwell
@ 2009-07-22 23:27                   ` Dave Airlie
  0 siblings, 0 replies; 29+ messages in thread
From: Dave Airlie @ 2009-07-22 23:27 UTC (permalink / raw)
  To: Keith Whitwell
  Cc: Jerome Glisse, Thomas Hellström, Michel Dänzer,
	linux-kernel, dri-devel

On Thu, Jul 23, 2009 at 9:24 AM, Keith Whitwell<keithw@vmware.com> wrote:
> On Wed, 2009-07-22 at 15:35 -0700, Jerome Glisse wrote:
>> On Wed, 2009-07-22 at 21:13 +0200, Thomas Hellström wrote:
>> > Jerome Glisse wrote:
>> > > On Wed, 2009-07-22 at 15:16 +0200, Michel Dänzer wrote:
>> > >
>> > >> On Tue, 2009-07-21 at 21:22 +0200, Jerome Glisse wrote:
>> > >>
>> > >>> On Tue, 2009-07-21 at 20:00 +0200, Jerome Glisse wrote:
>> > >>>
>> > >>>> On Tue, 2009-07-21 at 19:34 +0200, Jerome Glisse wrote:
>> > >>>>
>> > >>>>> On Thu, 2009-06-25 at 17:53 +0200, Thomas Hellström wrote:
>> > >>>>>
>> > >>>>>> 4) We could now skip the ttm_tt_populate() in ttm_tt_set_caching, since
>> > >>>>>> it will always allocate cached pages and then transition them.
>> > >>>>>>
>> > >>>>>>
>> > >>>>> Okay 4) is bad, what happens (my brain is a bit meltdown so i might be
>> > >>>>> wrong) :
>> > >>>>> 1 - bo get allocated tt->state = unpopulated
>> > >>>>> 2 - bo is mapped few page are faulted tt->state = unpopulated
>> > >>>>> 3 - bo is cache transitioned but tt->state == unpopulated but
>> > >>>>>     they are page which have been touch by the cpu so we need
>> > >>>>>     to clflush them and transition them, this never happen if
>> > >>>>>     we don't call ttm_tt_populate and proceed with the remaining
>> > >>>>>     of the cache transitioning functions
>> > >>>>>
>> > >>>>> As a workaround i will try to go through the pages tables and
>> > >>>>> transition existing pages. Do you have any idea for a better
>> > >>>>> plan ?
>> > >>>>>
>> > >>>>> Cheers,
>> > >>>>> Jerome
>> > >>>>>
>> > >>>> My workaround ruin the whole idea of pool allocation what happens
>> > >>>> is that most bo get cache transition page per page. My thinking
>> > >>>> is that we should do the following:
>> > >>>>        - is there is a least one page allocated then fully populate
>> > >>>>        the object and do cache transition on all the pages.
>> > >>>>        - otherwise update caching_state and leaves object unpopulated
>> > >>>>
>> > >>>> This needs that we some how reflect the fact that there is at least
>> > >>>> one page allocated, i am thinking to adding a new state for that :
>> > >>>> ttm_partialy_populated
>> > >>>>
>> > >>>> Thomas what do you think about that ?
>> > >>>>
>> > >>>> Cheers,
>> > >>>> Jerome
>> > >>>>
>> > >>> Attached updated patch it doesn't introduce ttm_partialy_populated
>> > >>> but keep the populate call in cache transition. So far it seems to
>> > >>> work properly on AGP platform
>> > >>>
>> > >> Yeah, this one works for me as well.
>> > >>
>> > >>
>> > >>> and helps quite a lot with performances.
>> > >>>
>> > >> Can't say I've noticed that however. How did you measure?
>> > >>
>> > >
>> > > gears
>> > Hmm,
>> > In gears there shouldn't really be any buffer allocation / freeing going
>> > on at all once the display lists are set up, and gears should really be
>> > gpu bound in most cases.
>> >
>> > what's the source of the buffer allocations / frees when gears is run?
>> >
>> > /Thomas
>>
>> We free reallocate vertex buffer each frame iirc.
>
> Gears does everything in display lists which means geometry is held in
> VBOs retained for the life of the application.  Once the first frame is
> rendered there shouldn't be any more uploads.
>

I don't think the r300 driver does proper VBOs yet.

Dave.


* ttm_mem_global
  2009-07-22  8:37         ` Thomas Hellström
@ 2009-07-28 16:48           ` Jerome Glisse
  2009-07-28 18:55             ` ttm_mem_global Thomas Hellström
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-07-28 16:48 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: linux-kernel, dri-devel

On Wed, 2009-07-22 at 10:37 +0200, Thomas Hellström wrote:
> TTM has a device struct per device and an optional global struct that is 
> common for all devices and intended to be per subsystem.
> 
> The only subsystem currently having a global structure is the memory 
> accounting subsystem:
> struct ttm_mem_global

Thomas, I don't think the way we initialize ttm_mem_global today
makes it follow the "one struct ttm_mem_global for everyone" model.
I think it should be initialized and refcounted by the device struct.

So on first device creation a ttm_mem_global is created, and
any time a new device is created the refcount of ttm_mem_global
is increased. This would mean some static global refcount inside
ttm_memory.c; maybe there is something similar to a singleton in
the Linux toolbox.

Thoughts? Ideas?

Cheers,
Jerome



* Re: ttm_mem_global
  2009-07-28 16:48           ` ttm_mem_global Jerome Glisse
@ 2009-07-28 18:55             ` Thomas Hellström
  2009-07-29  8:59               ` ttm_mem_global Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-07-28 18:55 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-kernel, dri-devel

Jerome Glisse skrev:
> On Wed, 2009-07-22 at 10:37 +0200, Thomas Hellström wrote:
>   
>> TTM has a device struct per device and an optional global struct that is 
>> common for all devices and intended to be per subsystem.
>>
>> The only subsystem currently having a global structure is the memory 
>> accounting subsystem:
>> struct ttm_mem_global
>>     
>
> Thomas i don't think the way we init ttm_mem_global today make
> it follow the 1 struct ttm_mem_global for everyone. I think it
> should be initialized and refcounted by device struct.
>
> So on first device creation a ttm_mem_global is created and
> then anytime a new device is created the refcount of ttm_mem_global
> is increased. 
Jerome,
This is exactly what the current code intends to do.

Are you seeing something different?

/Thomas




* Re: ttm_mem_global
  2009-07-28 18:55             ` ttm_mem_global Thomas Hellström
@ 2009-07-29  8:59               ` Jerome Glisse
  2009-07-29  9:39                 ` ttm_mem_global Thomas Hellström
  0 siblings, 1 reply; 29+ messages in thread
From: Jerome Glisse @ 2009-07-29  8:59 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: linux-kernel, dri-devel

On Tue, 2009-07-28 at 20:55 +0200, Thomas Hellström wrote:
> Jerome Glisse skrev:
> > On Wed, 2009-07-22 at 10:37 +0200, Thomas Hellström wrote:
> >   
> >> TTM has a device struct per device and an optional global struct that is 
> >> common for all devices and intended to be per subsystem.
> >>
> >> The only subsystem currently having a global structure is the memory 
> >> accounting subsystem:
> >> struct ttm_mem_global
> >>     
> >
> > Thomas i don't think the way we init ttm_mem_global today make
> > it follow the 1 struct ttm_mem_global for everyone. I think it
> > should be initialized and refcounted by device struct.
> >
> > So on first device creation a ttm_mem_global is created and
> > then anytime a new device is created the refcount of ttm_mem_global
> > is increased. 
> Jerome,
> This is exactly what the current code intends to do.
> 
> Are you seeing something different?

I definitely don't see that :) In radeon we create a structure
which holds the ttm_mem_global struct, so it's not shared at all;
it gets initialized & destroyed along with the driver. This is why
I think it's better to remove the driver-side initialization and let
the bo_device init path take care of initializing one and only one
object which can be shared by multiple drivers.

So what I propose is: remove the mem_glob parameter from
ttm_bo_device_init, add a call to ttm_mem_global_init in
ttm_bo_device_init, and add a static refcount in ttm_memory.c.
If refcount == 0 then ttm_mem_global_init creates a ttm_mem_global
struct and initializes things; if refcount > 0 then it gives
back the already-initialized ttm_mem_global.

Of course we unref with ttm_mem_global_release and destroy
once the refcount reaches 0.

Cheers,
Jerome



* Re: ttm_mem_global
  2009-07-29  8:59               ` ttm_mem_global Jerome Glisse
@ 2009-07-29  9:39                 ` Thomas Hellström
  2009-07-29 13:04                   ` ttm_mem_global Jerome Glisse
  0 siblings, 1 reply; 29+ messages in thread
From: Thomas Hellström @ 2009-07-29  9:39 UTC (permalink / raw)
  To: Jerome Glisse; +Cc: linux-kernel, dri-devel

Jerome Glisse wrote:
> On Tue, 2009-07-28 at 20:55 +0200, Thomas Hellström wrote:
>   
>> Jerome Glisse skrev:
>>     
>>> On Wed, 2009-07-22 at 10:37 +0200, Thomas Hellström wrote:
>>>   
>>>       
>>>> TTM has a device struct per device and an optional global struct that is 
>>>> common for all devices and intended to be per subsystem.
>>>>
>>>> The only subsystem currently having a global structure is the memory 
>>>> accounting subsystem:
>>>> struct ttm_mem_global
>>>>     
>>>>         
>>> Thomas i don't think the way we init ttm_mem_global today make
>>> it follow the 1 struct ttm_mem_global for everyone. I think it
>>> should be initialized and refcounted by device struct.
>>>
>>> So on first device creation a ttm_mem_global is created and
>>> then anytime a new device is created the refcount of ttm_mem_global
>>> is increased. 
>>>       
>> Jerome,
>> This is exactly what the current code intends to do.
>>
>> Are you seeing something different?
>>     
>
> I definitly don't see that :) In radeon we do create a structure
> which hold the ttm_mem_global struct so it's not shared at all
> it got inited & destroyed along the driver. This is why i think
> it's better to remove the driver initialization and let bo_device
> init path take care of initializing one and only one object which
> can be shared by multiple driverttm_mem_global_inits.
>
>   
Which radeon struct is holding the ttm_mem_global struct?

The radeon code looks very similar to the openchrome code in which the 
struct ttm_mem_global is allocated at ttm_global.c, line 74 and freed at 
ttm_global.c, line 108 when its refcount has reached zero.

So the device holds a struct ttm_global_reference that *only points* to 
the global item, and which is destroyed on device takedown. If there is 
more than one device pointing to the mem_global object, it won't get 
destroyed.

So the code should be working perfectly fine unless there is a bug.
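The reference-object pattern described here can be sketched in user-space C; the names loosely mirror ttm_global.c but are illustrative, not the real API:

```c
#include <assert.h>
#include <stdlib.h>

/* The shared, lazily created item (stand-in for ttm_mem_global). */
struct global_item {
    void *object;   /* the shared object itself                */
    int refcount;   /* how many devices currently reference it */
};

static struct global_item item;  /* one static slot, as in ttm_global.c */

/* Per-device handle: *only points* to the global item. */
struct global_ref {
    void *object;
};

/* Take a reference; allocate the shared object on first use. */
static int global_ref_init(struct global_ref *ref)
{
    if (item.refcount++ == 0) {
        item.object = calloc(1, 64);  /* placeholder payload */
        if (!item.object) {
            item.refcount = 0;
            return -1;
        }
    }
    ref->object = item.object;
    return 0;
}

/* Drop the reference on device takedown; the shared object is
 * freed only when no device points at it any more. */
static void global_ref_release(struct global_ref *ref)
{
    ref->object = NULL;
    if (--item.refcount == 0) {
        free(item.object);
        item.object = NULL;
    }
}
```

With two "devices" holding references, releasing the first leaves the shared object intact; only the second release frees it, which is why the existing code already gives one shared mem_global object across devices.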

> So what i propose is remove mem_glob parameter from :
> ttm_bo_device_init, add a call to ttm_mem_global_init in
> ttm_bo_device_init 

Nope. The ttm_mem_global object is used by other TTM subsystems
(fencing, user-space objects), so that can't be done.

> and add some static refcount in ttm_memory.c
> if refcount = 0 then ttm_mem_global_init create a ttm_mem_global
> struct and initialize things, if refcount > 0 then it gives
> back the already initialized ttm_mem_global.
>
>   

This is exactly what ttm_global was created to do, and what it hopefully 
does. If you create two radeon devices the ttm_mem_global object should 
be the same, even though the global references pointing to it are of 
course different. Have you actually tried this?

/Thomas

> Of course we unref with ttm_mem_global_release and destroy
> once refcount reach 0.
>
> Cheers,
> Jerome
>
>   





* Re: ttm_mem_global
  2009-07-29  9:39                 ` ttm_mem_global Thomas Hellström
@ 2009-07-29 13:04                   ` Jerome Glisse
  0 siblings, 0 replies; 29+ messages in thread
From: Jerome Glisse @ 2009-07-29 13:04 UTC (permalink / raw)
  To: Thomas Hellström; +Cc: linux-kernel, dri-devel

On Wed, 2009-07-29 at 11:39 +0200, Thomas Hellström wrote:
> Jerome Glisse wrote:
> > On Tue, 2009-07-28 at 20:55 +0200, Thomas Hellström wrote:
> >   
> >> Jerome Glisse skrev:
> >>     
> >>> On Wed, 2009-07-22 at 10:37 +0200, Thomas Hellström wrote:
> >>>   
> >>>       
> >>>> TTM has a device struct per device and an optional global struct that is 
> >>>> common for all devices and intended to be per subsystem.
> >>>>
> >>>> The only subsystem currently having a global structure is the memory 
> >>>> accounting subsystem:
> >>>> struct ttm_mem_global
> >>>>     
> >>>>         
> >>> Thomas i don't think the way we init ttm_mem_global today make
> >>> it follow the 1 struct ttm_mem_global for everyone. I think it
> >>> should be initialized and refcounted by device struct.
> >>>
> >>> So on first device creation a ttm_mem_global is created and
> >>> then anytime a new device is created the refcount of ttm_mem_global
> >>> is increased. 
> >>>       
> >> Jerome,
> >> This is exactly what the current code intends to do.
> >>
> >> Are you seeing something different?
> >>     
> >
> > I definitly don't see that :) In radeon we do create a structure
> > which hold the ttm_mem_global struct so it's not shared at all
> > it got inited & destroyed along the driver. This is why i think
> > it's better to remove the driver initialization and let bo_device
> > init path take care of initializing one and only one object which
> > can be shared by multiple driverttm_mem_global_inits.
> >
> >   
> Which radeon struct is holding the ttm_mem_global struct?
> 
> The radeon code looks very similar to the openchrome code in which the 
> struct ttm_mem_global is allocated at ttm_global.c, line 74 and freed at 
> ttm_global.c, line 108 when its refcount has reached zero.
> 
> So the device holds a struct ttm_global_reference that *only points* to 
> the global item, and which is destroyed on device takedown. If there are 
> more than one device pointing to the mem_global object, it won't get 
> destroyed.
> 
> So the code should be working perfectly fine unless there is a bug.
> 
> > So what i propose is remove mem_glob parameter from :
> > ttm_bo_device_init, add a call to ttm_mem_global_init in
> > ttm_bo_device_init 
> 
> Nope, The ttm_mem_global object is used by other  ttm subsystems 
> (fencing, user-space objects),
> so that can't be done.
> 
> > and add some static refcount in ttm_memory.c
> > if refcount = 0 then ttm_mem_global_init create a ttm_mem_global
> > struct and initialize things, if refcount > 0 then it gives
> > back the already initialized ttm_mem_global.
> >
> >   
> 
> This is exactly what ttm_global was created to do, and what it hopefully 
> does. If you create two radeon devices the ttm_mem_global object should 
> be the same, even though the global references pointing to it are of 
> course different. Have you actually tried this?
> 
> /Thomas
> 

OK, the code wasn't clear to me until I read ttm_global.c.

Cheers,
Jerome



end of thread, other threads:[~2009-07-29 13:06 UTC | newest]

Thread overview: 29+ messages
-- links below jump to the message on this page --
2009-06-25 12:01 TTM page pool allocator Jerome Glisse
2009-06-25 15:53 ` Thomas Hellström
2009-07-21 17:34   ` Jerome Glisse
2009-07-21 18:00     ` Jerome Glisse
2009-07-21 19:22       ` Jerome Glisse
2009-07-22  8:37         ` Thomas Hellström
2009-07-28 16:48           ` ttm_mem_global Jerome Glisse
2009-07-28 18:55             ` ttm_mem_global Thomas Hellström
2009-07-29  8:59               ` ttm_mem_global Jerome Glisse
2009-07-29  9:39                 ` ttm_mem_global Thomas Hellström
2009-07-29 13:04                   ` ttm_mem_global Jerome Glisse
2009-07-22 13:16         ` TTM page pool allocator Michel Dänzer
2009-07-22 13:31           ` Jerome Glisse
2009-07-22 19:13             ` Thomas Hellström
2009-07-22 22:35               ` Jerome Glisse
2009-07-22 23:24                 ` Keith Whitwell
2009-07-22 23:27                   ` Dave Airlie
2009-07-22  8:27     ` Thomas Hellström
2009-07-22 12:12       ` Jerome Glisse
2009-07-22 19:10         ` Thomas Hellström
2009-06-26  0:00 ` Dave Airlie
2009-06-26  6:31   ` Thomas Hellström
2009-06-26  7:33     ` Jerome Glisse
2009-06-26  7:31   ` Jerome Glisse
2009-06-26  7:38     ` Dave Airlie
2009-06-26 13:59   ` Jerome Glisse
2009-06-29 21:12     ` Thomas Hellström
2009-07-09  6:06       ` Dave Airlie
2009-07-09  8:48         ` Michel Dänzer
