linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2] Utilize the PCI API in the TTM framework.
@ 2011-01-07 17:11 Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 1/5] ttm: Introduce a placeholder for DMA (bus) addresses Konrad Rzeszutek Wilk
                   ` (8 more replies)
  0 siblings, 9 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-07 17:11 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel; +Cc: konrad

Attached is a set of patches that make it possible for drivers using the TTM API
(nouveau and radeon graphics drivers) to work under Xen. The explanation
is a bit complex and I am not sure if I am explaining it that well, so if
something is unclear please do ping me.

Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
 - Cleaned up commit message (forgot to add my SoB).

Short explanation of problem: What we are hitting under Xen is that instead
of programming the GART with the physical DMA address of the TTM page, we
end up programming the bounce buffer DMA (bus) address!

Long explanation:
The reason we end up doing this is that:

 1). alloc_page with GFP_DMA32 does not allocate "real" pages (Machine Frame
     Numbers - MFNs) under 4GB in Xen. That is b/c if Xen actually made the
     pages underneath 4GB available to the Linux page allocator we would not
     be able to give those to other guest devices. This would mean that if we
     tried to pass in a USB device to one guest while running the Xorg server
     in another, we wouldn't be able to do so as we would run out of pages
     under 4GB. So pages that we get from alloc_page have a PFN that is under
     4GB, but in reality the real physical address (MFN) is above 4GB. Ugh..

 2). The backends for "struct ttm_backend_func" utilize the PCI API. When
     they get a page allocated via alloc_page, they use 'pci_map_page' and
     program the DMA (bus) address in the GART - which is correct. But then
     the calls that kick off the graphics driver to process the pages do not
     use the pci_dma_sync_* calls. If the physical address of the page
     is the same as the DMA bus address returned from pci_map_page then there
     is no trouble. But if they are different:
	virt_to_phys(page_address(p)) != pci_map_page(p,..)
     then the graphics card fetches data from the DMA (bus) address (so the
     value returned from pci_map_page). The data however that the user wrote
     to (the page p) ends up being untouched. You are probably saying:
     "What? Nonsense, we stitch the page in ttm_bo_vm_fault using the PFN
     and .. and even if the virt_to_phys(page_address(p)) != pci_map_page(p)
     the GART ends up with the bus (DMA) address of the PFN!" That is true.
     But if you combine this with 1) where you end up with a page that is
     above the dma_mask (even if you called it with GFP_DMA32) and then
     make a call on pci_map_page you would end up with a bounce buffer!
     

The problem above can be easily reproduced on bare-metal if you pass in
"swiotlb=force iommu=soft". 

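To make it concrete, here is a minimal sketch of the mismatch check
(illustrative only, not part of the patches; 'pdev' stands for the graphics
card's struct pci_dev):

	struct page *p = alloc_page(GFP_DMA32);
	dma_addr_t bus_addr = pci_map_page(pdev, p, 0, PAGE_SIZE,
					   PCI_DMA_BIDIRECTIONAL);
	/* Under Xen (or with "swiotlb=force iommu=soft") these differ and
	 * bus_addr points at a bounce buffer; the GART gets bus_addr while
	 * the CPU writes land in page p - and are never copied over. */
	if (virt_to_phys(page_address(p)) != bus_addr)
		printk(KERN_INFO "this page goes through a bounce buffer\n");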

There are three ways of fixing this:
 
 1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is a
     struct pci_dev present), instead of alloc_page for GFP_DMA32. The
     'dma_alloc_coherent' guarantees that the allocated page fits
     within the device's dma_mask (or uses the default DMA32 if no device
     is passed in). This also guarantees that any subsequent call
     to the PCI API for this page will return the same DMA (bus) address
     as the first call (so pci_alloc_consistent, and then pci_map_page
     will give the same DMA bus address). A short sketch follows this list.

 2). Use the pci_dma_sync_* calls after sending a page to the graphics
     engine. If the bounce buffer is used then we end up copying the
     pages.

 3). This one I really don't want to think about, but I should mention
     it. Alter alloc_page and its backend to know about Xen.
     The pushback from the MM guys will probably be: "why not use the PCI API.."
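For reference, option 1) boils down to this (a sketch of what the patches
below do; note the 'NULL' struct device, which I come back to):

	dma_addr_t dma_address;
	void *addr = dma_alloc_coherent(NULL, PAGE_SIZE, &dma_address,
					GFP_DMA32);
	if (addr == NULL)
		return -ENOMEM;
	p = virt_to_page(addr);
	/* dma_address now holds the DMA (bus) address to program into the
	 * GART; it must be kept for dma_free_coherent at teardown time. */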
    

So attached is a draft set of patches that use solution #1. I've tested
this on baremetal (and Xen) on PCIe Radeon and nouveau cards with success.

One outstanding issue: the 'NULL' passed to dma_alloc_coherent is unsightly.
I was toying with modifying the TTM API to pass in a 'struct device' or
'struct pci_dev', but figured it would make sense to first get your input
before heading down that route.

This patch-set is also available at:

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v3

Konrad Rzeszutek Wilk (5):
      ttm: Introduce a placeholder for DMA (bus) addresses.
      ttm: Utilize the dma_addr_t array for pages that are to be in the DMA32 pool.
      ttm: Expand (*populate) to support an array of DMA addresses.
      radeon/ttm/PCIe: Use dma_addr if TTM has set it.
      nouveau/ttm/PCIe: Use dma_addr if TTM has set it.

And diffstat:

 drivers/gpu/drm/nouveau/nouveau_sgdma.c |   31 +++++++++++++++++++-------
 drivers/gpu/drm/radeon/radeon.h         |    4 ++-
 drivers/gpu/drm/radeon/radeon_gart.c    |   36 ++++++++++++++++++++++--------
 drivers/gpu/drm/radeon/radeon_ttm.c     |    8 +++++-
 drivers/gpu/drm/ttm/ttm_agp_backend.c   |    3 +-
 drivers/gpu/drm/ttm/ttm_page_alloc.c    |   35 +++++++++++++++++++++++++-----
 drivers/gpu/drm/ttm/ttm_tt.c            |   12 +++++++--
 drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c  |    3 +-
 include/drm/ttm/ttm_bo_driver.h         |    6 ++++-
 include/drm/ttm/ttm_page_alloc.h        |    8 +++++-
 10 files changed, 111 insertions(+), 35 deletions(-)




* [PATCH 1/5] ttm: Introduce a placeholder for DMA (bus) addresses.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
@ 2011-01-07 17:11 ` Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 2/5] ttm: Utilize the dma_addr_t array for pages that are to be in the DMA32 pool Konrad Rzeszutek Wilk
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-07 17:11 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel; +Cc: konrad, Konrad Rzeszutek Wilk

This is right now limited to only non-pool constructs.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c |    9 ++++++---
 drivers/gpu/drm/ttm/ttm_tt.c         |   10 ++++++++--
 include/drm/ttm/ttm_bo_driver.h      |    2 ++
 include/drm/ttm/ttm_page_alloc.h     |    8 ++++++--
 4 files changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index b1e02ff..6859288 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -38,6 +38,7 @@
 #include <linux/mm.h>
 #include <linux/seq_file.h> /* for seq_printf */
 #include <linux/slab.h>
+#include <linux/dma-mapping.h>
 
 #include <asm/atomic.h>
 
@@ -662,7 +663,8 @@ out:
  * cached pages.
  */
 int ttm_get_pages(struct list_head *pages, int flags,
-		enum ttm_caching_state cstate, unsigned count)
+		enum ttm_caching_state cstate, unsigned count,
+		dma_addr_t *dma_address)
 {
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
 	struct page *p = NULL;
@@ -720,7 +722,7 @@ int ttm_get_pages(struct list_head *pages, int flags,
 			printk(KERN_ERR TTM_PFX
 			       "Failed to allocate extra pages "
 			       "for large request.");
-			ttm_put_pages(pages, 0, flags, cstate);
+			ttm_put_pages(pages, 0, flags, cstate, NULL);
 			return r;
 		}
 	}
@@ -731,7 +733,8 @@ int ttm_get_pages(struct list_head *pages, int flags,
 
 /* Put all pages in pages list to correct pool to wait for reuse */
 void ttm_put_pages(struct list_head *pages, unsigned page_count, int flags,
-		enum ttm_caching_state cstate)
+		enum ttm_caching_state cstate,
+		dma_addr_t *dma_address)
 {
 	unsigned long irq_flags;
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index af789dc..0d39001 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -49,12 +49,16 @@ static int ttm_tt_swapin(struct ttm_tt *ttm);
 static void ttm_tt_alloc_page_directory(struct ttm_tt *ttm)
 {
 	ttm->pages = drm_calloc_large(ttm->num_pages, sizeof(*ttm->pages));
+	ttm->dma_address = drm_calloc_large(ttm->num_pages,
+					    sizeof(*ttm->dma_address));
 }
 
 static void ttm_tt_free_page_directory(struct ttm_tt *ttm)
 {
 	drm_free_large(ttm->pages);
 	ttm->pages = NULL;
+	drm_free_large(ttm->dma_address);
+	ttm->dma_address = NULL;
 }
 
 static void ttm_tt_free_user_pages(struct ttm_tt *ttm)
@@ -105,7 +109,8 @@ static struct page *__ttm_tt_get_page(struct ttm_tt *ttm, int index)
 
 		INIT_LIST_HEAD(&h);
 
-		ret = ttm_get_pages(&h, ttm->page_flags, ttm->caching_state, 1);
+		ret = ttm_get_pages(&h, ttm->page_flags, ttm->caching_state, 1,
+				    &ttm->dma_address[index]);
 
 		if (ret != 0)
 			return NULL;
@@ -298,7 +303,8 @@ static void ttm_tt_free_alloced_pages(struct ttm_tt *ttm)
 			count++;
 		}
 	}
-	ttm_put_pages(&h, count, ttm->page_flags, ttm->caching_state);
+	ttm_put_pages(&h, count, ttm->page_flags, ttm->caching_state,
+		      ttm->dma_address);
 	ttm->state = tt_unpopulated;
 	ttm->first_himem_page = ttm->num_pages;
 	ttm->last_lomem_page = -1;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 8e0c848..6dc4fcc 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -149,6 +149,7 @@ enum ttm_caching_state {
  * @swap_storage: Pointer to shmem struct file for swap storage.
  * @caching_state: The current caching state of the pages.
  * @state: The current binding state of the pages.
+ * @dma_address: The DMA (bus) addresses of the pages (if TTM_PAGE_FLAG_DMA32)
  *
  * This is a structure holding the pages, caching- and aperture binding
  * status for a buffer object that isn't backed by fixed (VRAM / AGP)
@@ -173,6 +174,7 @@ struct ttm_tt {
 		tt_unbound,
 		tt_unpopulated,
 	} state;
+	dma_addr_t *dma_address;
 };
 
 #define TTM_MEMTYPE_FLAG_FIXED         (1 << 0)	/* Fixed (on-card) PCI memory */
diff --git a/include/drm/ttm/ttm_page_alloc.h b/include/drm/ttm/ttm_page_alloc.h
index 1168214..8062890 100644
--- a/include/drm/ttm/ttm_page_alloc.h
+++ b/include/drm/ttm/ttm_page_alloc.h
@@ -36,11 +36,13 @@
  * @flags: ttm flags for page allocation.
  * @cstate: ttm caching state for the page.
  * @count: number of pages to allocate.
+ * @dma_address: The DMA (bus) address of pages (if TTM_PAGE_FLAG_DMA32 set).
  */
 int ttm_get_pages(struct list_head *pages,
 		  int flags,
 		  enum ttm_caching_state cstate,
-		  unsigned count);
+		  unsigned count,
+		  dma_addr_t *dma_address);
 /**
  * Put linked list of pages to pool.
  *
@@ -49,11 +51,13 @@ int ttm_get_pages(struct list_head *pages,
  * count.
  * @flags: ttm flags for page allocation.
  * @cstate: ttm caching state.
+ * @dma_address: The DMA (bus) address of pages (if TTM_PAGE_FLAG_DMA32 set).
  */
 void ttm_put_pages(struct list_head *pages,
 		   unsigned page_count,
 		   int flags,
-		   enum ttm_caching_state cstate);
+		   enum ttm_caching_state cstate,
+		   dma_addr_t *dma_address);
 /**
  * Initialize pool allocator.
  */
-- 
1.7.1



* [PATCH 2/5] ttm: Utilize the dma_addr_t array for pages that are to be in the DMA32 pool.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 1/5] ttm: Introduce a placeholder for DMA (bus) addresses Konrad Rzeszutek Wilk
@ 2011-01-07 17:11 ` Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 3/5] ttm: Expand (*populate) to support an array of DMA addresses Konrad Rzeszutek Wilk
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-07 17:11 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel; +Cc: konrad, Konrad Rzeszutek Wilk

We only use the "if (pool == NULL)" path for right now.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc.c |   26 +++++++++++++++++++++++---
 1 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc.c b/drivers/gpu/drm/ttm/ttm_page_alloc.c
index 6859288..5d09677 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc.c
@@ -683,14 +683,22 @@ int ttm_get_pages(struct list_head *pages, int flags,
 			gfp_flags |= GFP_HIGHUSER;
 
 		for (r = 0; r < count; ++r) {
-			p = alloc_page(gfp_flags);
+			if ((flags & TTM_PAGE_FLAG_DMA32) && dma_address) {
+				void *addr;
+				addr = dma_alloc_coherent(NULL, PAGE_SIZE,
+							&dma_address[r],
+							gfp_flags);
+				if (addr == NULL)
+					return -ENOMEM;
+				p = virt_to_page(addr);
+			} else
+				p = alloc_page(gfp_flags);
 			if (!p) {
 
 				printk(KERN_ERR TTM_PFX
 				       "Unable to allocate page.");
 				return -ENOMEM;
 			}
-
 			list_add(&p->lru, pages);
 		}
 		return 0;
@@ -739,12 +747,24 @@ void ttm_put_pages(struct list_head *pages, unsigned page_count, int flags,
 	unsigned long irq_flags;
 	struct ttm_page_pool *pool = ttm_get_pool(flags, cstate);
 	struct page *p, *tmp;
+	unsigned r;
 
 	if (pool == NULL) {
 		/* No pool for this memory type so free the pages */
 
+		r = page_count-1;
 		list_for_each_entry_safe(p, tmp, pages, lru) {
-			__free_page(p);
+			if ((flags & TTM_PAGE_FLAG_DMA32) && dma_address) {
+				void *addr = page_address(p);
+				WARN_ON(!addr || !dma_address[r]);
+				if (addr)
+					dma_free_coherent(NULL, PAGE_SIZE,
+							addr,
+							dma_address[r]);
+				dma_address[r] = 0;
+			} else
+				__free_page(p);
+			r--;
 		}
 		/* Make the pages list empty */
 		INIT_LIST_HEAD(pages);
-- 
1.7.1



* [PATCH 3/5] ttm: Expand (*populate) to support an array of DMA addresses.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 1/5] ttm: Introduce a placeholder for DMA (bus) addresses Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 2/5] ttm: Utilize the dma_addr_t array for pages that are to be in the DMA32 pool Konrad Rzeszutek Wilk
@ 2011-01-07 17:11 ` Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it Konrad Rzeszutek Wilk
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-07 17:11 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel; +Cc: konrad, Konrad Rzeszutek Wilk

We pass in the array of ttm pages to be populated in the GART/MM
of the card (or AGP). The patch titled "ttm: Utilize the dma_addr_t array
for pages that are to be in the DMA32 pool." uses the DMA API to make those
pages have proper DMA addresses (in the situation where
page_to_phys or virt_to_phys do not give us the DMA (bus) address).

Since we are using the DMA API on those pages, we should pass in the
DMA address to this function so it can save it in its proper fields
(later patches use it).

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/nouveau/nouveau_sgdma.c |    3 ++-
 drivers/gpu/drm/radeon/radeon_ttm.c     |    3 ++-
 drivers/gpu/drm/ttm/ttm_agp_backend.c   |    3 ++-
 drivers/gpu/drm/ttm/ttm_tt.c            |    2 +-
 drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c  |    3 ++-
 include/drm/ttm/ttm_bo_driver.h         |    4 +++-
 6 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
index 288baca..edc140a 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
@@ -20,7 +20,8 @@ struct nouveau_sgdma_be {
 
 static int
 nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages,
-		       struct page **pages, struct page *dummy_read_page)
+		       struct page **pages, struct page *dummy_read_page,
+		       dma_addr_t *dma_addrs)
 {
 	struct nouveau_sgdma_be *nvbe = (struct nouveau_sgdma_be *)be;
 	struct drm_device *dev = nvbe->dev;
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 01c2c73..6f156e9 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -655,7 +655,8 @@ struct radeon_ttm_backend {
 static int radeon_ttm_backend_populate(struct ttm_backend *backend,
 				       unsigned long num_pages,
 				       struct page **pages,
-				       struct page *dummy_read_page)
+				       struct page *dummy_read_page,
+				       dma_addr_t *dma_addrs)
 {
 	struct radeon_ttm_backend *gtt;
 
diff --git a/drivers/gpu/drm/ttm/ttm_agp_backend.c b/drivers/gpu/drm/ttm/ttm_agp_backend.c
index f999e36..1c4a72f 100644
--- a/drivers/gpu/drm/ttm/ttm_agp_backend.c
+++ b/drivers/gpu/drm/ttm/ttm_agp_backend.c
@@ -47,7 +47,8 @@ struct ttm_agp_backend {
 
 static int ttm_agp_populate(struct ttm_backend *backend,
 			    unsigned long num_pages, struct page **pages,
-			    struct page *dummy_read_page)
+			    struct page *dummy_read_page,
+			    dma_addr_t *dma_addrs)
 {
 	struct ttm_agp_backend *agp_be =
 	    container_of(backend, struct ttm_agp_backend, backend);
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 0d39001..86d5b17 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -169,7 +169,7 @@ int ttm_tt_populate(struct ttm_tt *ttm)
 	}
 
 	be->func->populate(be, ttm->num_pages, ttm->pages,
-			   ttm->dummy_read_page);
+			   ttm->dummy_read_page, ttm->dma_address);
 	ttm->state = tt_unbound;
 	return 0;
 }
diff --git a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c
index 80bc37b..87e43e0 100644
--- a/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c
+++ b/drivers/gpu/drm/vmwgfx/vmwgfx_buffer.c
@@ -102,7 +102,8 @@ struct vmw_ttm_backend {
 
 static int vmw_ttm_populate(struct ttm_backend *backend,
 			    unsigned long num_pages, struct page **pages,
-			    struct page *dummy_read_page)
+			    struct page *dummy_read_page,
+			    dma_addr_t *dma_addrs)
 {
 	struct vmw_ttm_backend *vmw_be =
 	    container_of(backend, struct vmw_ttm_backend, backend);
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index 6dc4fcc..ebcd3dd 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -50,13 +50,15 @@ struct ttm_backend_func {
 	 * @pages: Array of pointers to ttm pages.
 	 * @dummy_read_page: Page to be used instead of NULL pages in the
 	 * array @pages.
+	 * @dma_addrs: Array of DMA (bus) address of the ttm pages.
 	 *
 	 * Populate the backend with ttm pages. Depending on the backend,
 	 * it may or may not copy the @pages array.
 	 */
 	int (*populate) (struct ttm_backend *backend,
 			 unsigned long num_pages, struct page **pages,
-			 struct page *dummy_read_page);
+			 struct page *dummy_read_page,
+			 dma_addr_t *dma_addrs);
 	/**
 	 * struct ttm_backend_func member clear
 	 *
-- 
1.7.1



* [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
                   ` (2 preceding siblings ...)
  2011-01-07 17:11 ` [PATCH 3/5] ttm: Expand (*populate) to support an array of DMA addresses Konrad Rzeszutek Wilk
@ 2011-01-07 17:11 ` Konrad Rzeszutek Wilk
  2011-01-27 21:20   ` Konrad Rzeszutek Wilk
  2011-01-07 17:11 ` [PATCH 5/5] nouveau/ttm/PCIe: " Konrad Rzeszutek Wilk
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-07 17:11 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel; +Cc: konrad, Konrad Rzeszutek Wilk

If the TTM layer has used the DMA API to set up pages that are
TTM_PAGE_FLAG_DMA32 (look at the patch titled: "ttm: Utilize the dma_addr_t
array for pages that are to be in the DMA32 pool."), let's use it
when programming the GART in the PCIe type cards.

This patch skips doing the pci_map_page (and pci_unmap_page) if
there is a DMA address passed in for that page. If the dma_address
is zero (or DMA_ERROR_CODE), then we continue on with our old
behaviour.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/radeon/radeon.h      |    4 ++-
 drivers/gpu/drm/radeon/radeon_gart.c |   36 ++++++++++++++++++++++++---------
 drivers/gpu/drm/radeon/radeon_ttm.c  |    5 +++-
 3 files changed, 33 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 73f600d..c9bbab9 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -317,6 +317,7 @@ struct radeon_gart {
 	union radeon_gart_table		table;
 	struct page			**pages;
 	dma_addr_t			*pages_addr;
+	bool				*ttm_alloced;
 	bool				ready;
 };
 
@@ -329,7 +330,8 @@ void radeon_gart_fini(struct radeon_device *rdev);
 void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset,
 			int pages);
 int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
-		     int pages, struct page **pagelist);
+		     int pages, struct page **pagelist,
+		     dma_addr_t *dma_addr);
 
 
 /*
diff --git a/drivers/gpu/drm/radeon/radeon_gart.c b/drivers/gpu/drm/radeon/radeon_gart.c
index e65b903..4a5ac4b 100644
--- a/drivers/gpu/drm/radeon/radeon_gart.c
+++ b/drivers/gpu/drm/radeon/radeon_gart.c
@@ -149,8 +149,9 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset,
 	p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
 	for (i = 0; i < pages; i++, p++) {
 		if (rdev->gart.pages[p]) {
-			pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p],
-				       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
+			if (!rdev->gart.ttm_alloced[p])
+				pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p],
+				       		PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 			rdev->gart.pages[p] = NULL;
 			rdev->gart.pages_addr[p] = rdev->dummy_page.addr;
 			page_base = rdev->gart.pages_addr[p];
@@ -165,7 +166,7 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset,
 }
 
 int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
-		     int pages, struct page **pagelist)
+		     int pages, struct page **pagelist, dma_addr_t *dma_addr)
 {
 	unsigned t;
 	unsigned p;
@@ -180,15 +181,22 @@ int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
 	p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
 
 	for (i = 0; i < pages; i++, p++) {
-		/* we need to support large memory configurations */
-		/* assume that unbind have already been call on the range */
-		rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i],
+		/* On TTM path, we only use the DMA API if TTM_PAGE_FLAG_DMA32
+		 * is requested. */
+		if (dma_addr[i] != DMA_ERROR_CODE) {
+			rdev->gart.ttm_alloced[p] = true;
+			rdev->gart.pages_addr[p] = dma_addr[i];
+		} else {
+			/* we need to support large memory configurations */
+			/* assume that unbind have already been call on the range */
+			rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i],
 							0, PAGE_SIZE,
 							PCI_DMA_BIDIRECTIONAL);
-		if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) {
-			/* FIXME: failed to map page (return -ENOMEM?) */
-			radeon_gart_unbind(rdev, offset, pages);
-			return -ENOMEM;
+			if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) {
+				/* FIXME: failed to map page (return -ENOMEM?) */
+				radeon_gart_unbind(rdev, offset, pages);
+				return -ENOMEM;
+			}
 		}
 		rdev->gart.pages[p] = pagelist[i];
 		page_base = rdev->gart.pages_addr[p];
@@ -251,6 +259,12 @@ int radeon_gart_init(struct radeon_device *rdev)
 		radeon_gart_fini(rdev);
 		return -ENOMEM;
 	}
+	rdev->gart.ttm_alloced = kzalloc(sizeof(bool) *
+					rdev->gart.num_cpu_pages, GFP_KERNEL);
+	if (rdev->gart.ttm_alloced == NULL) {
+		radeon_gart_fini(rdev);
+		return -ENOMEM;
+	}
 	/* set GART entry to point to the dummy page by default */
 	for (i = 0; i < rdev->gart.num_cpu_pages; i++) {
 		rdev->gart.pages_addr[i] = rdev->dummy_page.addr;
@@ -267,6 +281,8 @@ void radeon_gart_fini(struct radeon_device *rdev)
 	rdev->gart.ready = false;
 	kfree(rdev->gart.pages);
 	kfree(rdev->gart.pages_addr);
+	kfree(rdev->gart.ttm_alloced);
 	rdev->gart.pages = NULL;
 	rdev->gart.pages_addr = NULL;
+	rdev->gart.ttm_alloced = NULL;
 }
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 6f156e9..ca04505 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -647,6 +647,7 @@ struct radeon_ttm_backend {
 	unsigned long			num_pages;
 	struct page			**pages;
 	struct page			*dummy_read_page;
+	dma_addr_t			*dma_addrs;
 	bool				populated;
 	bool				bound;
 	unsigned			offset;
@@ -662,6 +663,7 @@ static int radeon_ttm_backend_populate(struct ttm_backend *backend,
 
 	gtt = container_of(backend, struct radeon_ttm_backend, backend);
 	gtt->pages = pages;
+	gtt->dma_addrs = dma_addrs;
 	gtt->num_pages = num_pages;
 	gtt->dummy_read_page = dummy_read_page;
 	gtt->populated = true;
@@ -674,6 +676,7 @@ static void radeon_ttm_backend_clear(struct ttm_backend *backend)
 
 	gtt = container_of(backend, struct radeon_ttm_backend, backend);
 	gtt->pages = NULL;
+	gtt->dma_addrs = NULL;
 	gtt->num_pages = 0;
 	gtt->dummy_read_page = NULL;
 	gtt->populated = false;
@@ -694,7 +697,7 @@ static int radeon_ttm_backend_bind(struct ttm_backend *backend,
 		     gtt->num_pages, bo_mem, backend);
 	}
 	r = radeon_gart_bind(gtt->rdev, gtt->offset,
-			     gtt->num_pages, gtt->pages);
+			     gtt->num_pages, gtt->pages, gtt->dma_addrs);
 	if (r) {
 		DRM_ERROR("failed to bind %lu pages at 0x%08X\n",
 			  gtt->num_pages, gtt->offset);
-- 
1.7.1



* [PATCH 5/5] nouveau/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
                   ` (3 preceding siblings ...)
  2011-01-07 17:11 ` [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it Konrad Rzeszutek Wilk
@ 2011-01-07 17:11 ` Konrad Rzeszutek Wilk
  2011-01-27 21:22   ` Konrad Rzeszutek Wilk
  2011-01-07 22:21 ` [RFC PATCH v2] Utilize the PCI API in the TTM framework Ian Campbell
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-07 17:11 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel; +Cc: konrad, Konrad Rzeszutek Wilk

If the TTM layer has used the DMA API to set up pages that are
TTM_PAGE_FLAG_DMA32 (look at the patch titled: "ttm: Utilize the dma_addr_t
array for pages that are to be in the DMA32 pool."), let's use it
when programming the GART in the PCIe type cards.

This patch skips doing the pci_map_page (and pci_unmap_page) if
there is a DMA address passed in for that page. If the dma_address
is zero (or DMA_ERROR_CODE), then we continue on with our old
behaviour.

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 drivers/gpu/drm/nouveau/nouveau_sgdma.c |   28 +++++++++++++++++++++-------
 1 files changed, 21 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
index edc140a..bbdd982 100644
--- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c
+++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
@@ -12,6 +12,7 @@ struct nouveau_sgdma_be {
 	struct drm_device *dev;
 
 	dma_addr_t *pages;
+	bool *ttm_alloced;
 	unsigned nr_pages;
 
 	unsigned pte_start;
@@ -35,15 +36,25 @@ nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages,
 	if (!nvbe->pages)
 		return -ENOMEM;
 
+	nvbe->ttm_alloced = kmalloc(sizeof(bool) * num_pages, GFP_KERNEL);
+	if (!nvbe->ttm_alloced)
+		return -ENOMEM;
+
 	nvbe->nr_pages = 0;
 	while (num_pages--) {
-		nvbe->pages[nvbe->nr_pages] =
-			pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
+		if (dma_addrs[nvbe->nr_pages] != DMA_ERROR_CODE) {
+			nvbe->pages[nvbe->nr_pages] =
+					dma_addrs[nvbe->nr_pages];
+		 	nvbe->ttm_alloced[nvbe->nr_pages] = true;
+		} else {
+			nvbe->pages[nvbe->nr_pages] =
+				pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
 				     PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
-		if (pci_dma_mapping_error(dev->pdev,
-					  nvbe->pages[nvbe->nr_pages])) {
-			be->func->clear(be);
-			return -EFAULT;
+			if (pci_dma_mapping_error(dev->pdev,
+						  nvbe->pages[nvbe->nr_pages])) {
+				be->func->clear(be);
+				return -EFAULT;
+			}
 		}
 
 		nvbe->nr_pages++;
@@ -66,11 +77,14 @@ nouveau_sgdma_clear(struct ttm_backend *be)
 			be->func->unbind(be);
 
 		while (nvbe->nr_pages--) {
-			pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
+			if (!nvbe->ttm_alloced[nvbe->nr_pages])
+				pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
 				       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
 		}
 		kfree(nvbe->pages);
+		kfree(nvbe->ttm_alloced);
 		nvbe->pages = NULL;
+		nvbe->ttm_alloced = NULL;
 		nvbe->nr_pages = 0;
 	}
 }
-- 
1.7.1



* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
                   ` (4 preceding siblings ...)
  2011-01-07 17:11 ` [PATCH 5/5] nouveau/ttm/PCIe: " Konrad Rzeszutek Wilk
@ 2011-01-07 22:21 ` Ian Campbell
  2011-01-08 10:41 ` Thomas Hellstrom
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 33+ messages in thread
From: Ian Campbell @ 2011-01-07 22:21 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, thomas, airlied, linux-kernel, konrad

On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote: 
> Attached is a set of patches that make it possible for drivers using the TTM API
> (nouveau and radeon graphics drivers) to work under Xen. The explanation
> is a bit complex and I am not sure if I am explaining it that well, so if
> something is unclear please do ping me.
> 
> Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
>  - Cleaned up commit message (forgot to add my SoB).

v1 was:
Tested-by: Ian Campbell <ian.campbell@citrix.com>
(actually I tested a backport to the 2.6.32.x Debian Squeeze kernel
which uses 2.6.33.x era DRM code, others have also reported success with
this backport in http://bugs.debian.org/602418)

Ian.

-- 
Ian Campbell

Did you know that for the price of a 280-Z you can buy two Z-80's?
		-- P. J. Plauger



* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
                   ` (5 preceding siblings ...)
  2011-01-07 22:21 ` [RFC PATCH v2] Utilize the PCI API in the TTM framework Ian Campbell
@ 2011-01-08 10:41 ` Thomas Hellstrom
  2011-01-10 14:25 ` Thomas Hellstrom
  2011-03-21 13:11 ` Michel Dänzer
  8 siblings, 0 replies; 33+ messages in thread
From: Thomas Hellstrom @ 2011-01-08 10:41 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, airlied, linux-kernel, konrad

Hi, Konrad!

Just back from vacation. I'll try to review this patch set in the 
upcoming week!

Thanks,
Thomas





* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
                   ` (6 preceding siblings ...)
  2011-01-08 10:41 ` Thomas Hellstrom
@ 2011-01-10 14:25 ` Thomas Hellstrom
  2011-01-10 15:21   ` Konrad Rzeszutek Wilk
  2011-03-21 13:11 ` Michel Dänzer
  8 siblings, 1 reply; 33+ messages in thread
From: Thomas Hellstrom @ 2011-01-10 14:25 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, airlied, linux-kernel, konrad

Konrad,

Before looking further into the patch series, I need to make sure I've 
completely understood the problem and why you've chosen this solution: 
Please see inline.

On 01/07/2011 06:11 PM, Konrad Rzeszutek Wilk wrote:
> Attached is a set of patches that make it possible for drivers using the TTM API
> (nouveau and radeon graphics drivers) to work under Xen. The explanation
> is a bit complex and I am not sure if I am explaining it that well, so if
> something is unclear please do ping me.
>
> Changes since v1: [https://lkml.org/lkml/2010/12/6/516]
>   - Cleaned up commit message (forgot to add my SoB).
>
> Short explanation of problem: What we are hitting under Xen is that instead
> of programming the GART with the physical DMA address of the TTM page, we
> end up programming the bounce buffer DMA (bus) address!
>
> Long explanation:
> The reason we end up doing this is that:
>
>   1). alloc_page with GFP_DMA32 does not allocate "real" pages (Machine Frame
>       Numbers - MFNs) under 4GB in Xen. That is b/c if Xen actually made the
>       pages underneath 4GB available to the Linux page allocator we would not
>       be able to give those to other guest devices. This would mean that if we
>       tried to pass in a USB device to one guest while running the Xorg server
>       in another, we wouldn't be able to do so as we would run out of pages
>       under 4GB. So pages that we get from alloc_page have a PFN that is under
>       4GB, but in reality the real physical address (MFN) is above 4GB. Ugh..
>    
>   2). The backends for "struct ttm_backend_func" utilize the PCI API. When
>       they get a page allocated via alloc_page, they use 'pci_map_page' and
>       program the DMA (bus) address in the GART - which is correct. But then
>       the calls that kick off the graphics driver to process the pages do not
>       use the pci_dma_sync_* calls. If the physical address of the page
>       is the same as the DMA bus address returned from pci_map_page then there
>       is no trouble. But if they are different:
> 	virt_to_phys(page_address(p)) != pci_map_page(p,..)
>       then the graphics card fetches data from the DMA (bus) address (so the
>       value returned from pci_map_page). The data however that the user wrote
>       to (the page p) ends up being untouched. You are probably saying:
>       "What? Nonsense, we stitch the page in ttm_bo_vm_fault using the PFN
>       and .. and even if the virt_to_phys(page_address(p)) != pci_map_page(p)
>       the GART ends up with the bus (DMA) address of the PFN!" That is true.
>       But if you combine this with 1) where you end up with a page that is
>       above the dma_mask (even if you called it with GFP_DMA32) and then
>       make a call on pci_map_page you would end up with a bounce buffer!
>
>
> The problem above can be easily reproduced on bare-metal if you pass in
> "swiotlb=force iommu=soft".
>
>    

At a first glance, this would seem to be a driver error since the 
drivers are not calling pci_dma_sync_*(), however I understand that the 
TTM infrastructure and desire to avoid bounce buffers add more 
implications to this...

> There are three ways of fixing this:
>
>   1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is a
>       struct pci_dev present), instead of alloc_page for GFP_DMA32. The
>       'dma_alloc_coherent' guarantees that the allocated page fits
>       within the device's dma_mask (or uses the default DMA32 if no device
>       is passed in). This also guarantees that any subsequent call
>       to the PCI API for this page will return the same DMA (bus) address
>       as the first call (so pci_alloc_consistent, and then pci_map_page
>       will give the same DMA bus address).
>    


I guess dma_alloc_coherent() will allocate *real* DMA32 pages? That 
brings up a couple of questions:
1) Is it possible to change caching policy on pages allocated using 
dma_alloc_coherent?
2) What about accounting? In a *non-Xen* environment, will the number of 
coherent pages be less than the number of DMA32 pages, or will 
dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?
3) Same as above, but in a Xen environment, what will stop multiple 
guests from exhausting the coherent pages? It seems that the TTM accounting 
mechanisms will no longer be valid unless the number of available 
coherent pages is split across the guests?

>   2). Use the pci_dma_sync_* calls after sending a page to the graphics
>       engine. If the bounce buffer is used then we end up copying the
>       pages.
>    

Is the reason for choosing 1) instead of 2) purely a performance concern?


Finally, I wanted to ask why we need to pass / store the dma address of 
the TTM pages? Isn't it possible to just call into the DMA / PCI api to 
obtain it, and the coherent allocation will make sure it doesn't change?


Thanks,
Thomas



* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-10 14:25 ` Thomas Hellstrom
@ 2011-01-10 15:21   ` Konrad Rzeszutek Wilk
  2011-01-10 15:58     ` Thomas Hellstrom
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-10 15:21 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, airlied, linux-kernel, konrad

On Mon, Jan 10, 2011 at 03:25:55PM +0100, Thomas Hellstrom wrote:
> Konrad,
> 
> Before looking further into the patch series, I need to make sure
> I've completely understood the problem and why you've chosen this
> solution: Please see inline.

Of course.

.. snip ..
> >The problem above can be easily reproduced on bare-metal if you pass in
> >"swiotlb=force iommu=soft".
> >
> 
> At a first glance, this would seem to be a driver error since the
> drivers are not calling pci_dma_sync_*(), however I understand that
> the TTM infrastructure and desire to avoid bounce buffers add more
> implications to this...

<nods>
> 
> >There are three ways of fixing this:
> >
> >  1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is a
> >      struct pci_dev present), instead of alloc_page for GFP_DMA32. The
> >      'dma_alloc_coherent' guarantees that the allocated page fits
> >      within the device's dma_mask (or uses the default DMA32 if no device
> >      is passed in). This also guarantees that any subsequent call
> >      to the PCI API for this page will return the same DMA (bus) address
> >      as the first call (so pci_alloc_consistent, and then pci_map_page
> >      will give the same DMA bus address).
> 
> 
> I guess dma_alloc_coherent() will allocate *real* DMA32 pages? That
> brings up a couple of questions:
> 1) Is it possible to change caching policy on pages allocated using
> dma_alloc_coherent?

Yes. They are the same "form-factor" as any normal page, except
that the IOMMU makes extra efforts to set this page up.
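For example (an illustrative sketch, assuming x86 where the set_pages_*
helpers from <asm/cacheflush.h> work on such pages just like on ones from
alloc_page):

	#include <asm/cacheflush.h>

	set_pages_uc(p, 1);	/* switch one page to uncached */
	...
	set_pages_wb(p, 1);	/* and back to write-back */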

> 2) What about accounting? In a *non-Xen* environment, will the
> number of coherent pages be less than the number of DMA32 pages, or
> will dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?

The code in the IOMMUs ends up calling __get_free_pages, which ends up
in alloc_pages. So the call does end up in alloc_page(flags).


native SWIOTLB (so no IOMMU): GFP_DMA32
GART (AMD's old IOMMU): GFP_DMA32

For the hardware IOMMUs:

AMD VI: if it is in Passthrough mode, it calls it with GFP_DMA32.
   If it is in DMA translation mode (normal mode) it allocates a page
   with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately
   translates the bus address.

The flags change a bit:
VT-d: if there is no identity mapping, nor the PCI device is one of the special ones
   (GFX, Azalia), then it will pass it with GFP_DMA32.
   If it is in identity mapping state, and the device is a GFX or Azalia sound
   card, then it will apply ~(__GFP_DMA | __GFP_DMA32) and immediately translate
   the bus address.

However, the interesting thing is that I've passed in the 'NULL' as
the struct device (not intentionally - did not want to add more changes
to the API) so all of the IOMMUs end up doing GFP_DMA32.

But it does mess up the accounting with the AMD-VI and VT-D as they strip
the __GFP_DMA32 flag off. That is a big problem, I presume?

> 3) Same as above, but in a Xen environment, what will stop multiple
> guests from exhausting the coherent pages? It seems that the TTM
> accounting mechanisms will no longer be valid unless the number of
> available coherent pages is split across the guests?

Say I pass in four ATI Radeon cards (wherein each is a 32-bit card) to
four guests. Let's also assume that we are doing heavy operations in all
of the guests.  Since there is no communication between the TTM
accounting in each guest, you could end up eating all of the 4GB physical
memory that is available to each guest. It could end up that the first
guest gets the lion's share of the 4GB memory, while the other ones get
less.

And if one was to do that on baremetal, with four ATI Radeon cards, the
TTM accounting mechanism would realize it is nearing the watermark
and do.. something, right? What would it do actually?

I think the error path would be the same in both cases?

> 
> >  2). Use the pci_dma_sync_* calls after sending a page to the graphics
> >      engine. If the bounce buffer is used then we end up copying the
> >      pages.
> 
> Is the reason for choosing 1) instead of 2) purely a performance concern?

Yes, and also not understanding where I should insert the pci_dma_sync_*
calls in the drivers.

> 
> 
> Finally, I wanted to ask why we need to pass / store the dma address
> of the TTM pages? Isn't it possible to just call into the DMA / PCI
> api to obtain it, and the coherent allocation will make sure it
> doesn't change?

It won't change, but you need the dma address during de-allocation:
dma_free_coherent..
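There is no page -> dma_addr_t lookup in that API; the signature spells it out:

	void dma_free_coherent(struct device *dev, size_t size,
			       void *cpu_addr, dma_addr_t dma_handle);

so the dma_handle handed out at allocation time has to be stored somewhere,
which is what the dma_address array in this series is for.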


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-10 15:21   ` Konrad Rzeszutek Wilk
@ 2011-01-10 15:58     ` Thomas Hellstrom
  2011-01-10 16:45       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Hellstrom @ 2011-01-10 15:58 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, airlied, linux-kernel, konrad

On 01/10/2011 04:21 PM, Konrad Rzeszutek Wilk wrote:
> On Mon, Jan 10, 2011 at 03:25:55PM +0100, Thomas Hellstrom wrote:
>    
>> Konrad,
>>
>> Before looking further into the patch series, I need to make sure
>> I've completely understood the problem and why you've chosen this
>> solution: Please see inline.
>>      
> Of course.
>
> .. snip ..
>    
>>> The problem above can be easily reproduced on bare-metal if you pass in
>>> "swiotlb=force iommu=soft".
>>>
>>>        
>> At a first glance, this would seem to be a driver error since the
>> drivers are not calling pci_dma_sync_*(), however I understand that
>> the TTM infrastructure and desire to avoid bounce buffers add more
>> implications to this...
>>      
> <nods>
>    
>>      
>>> There are three ways of fixing this:
>>>
>>>   1). Use the 'dma_alloc_coherent' (or pci_alloc_consistent if there is a
>>>       struct pci_dev present), instead of alloc_page for GFP_DMA32. The
>>>       'dma_alloc_coherent' guarantees that the allocated page fits
>>>       within the device's dma_mask (or uses the default DMA32 if no device
>>>       is passed in). This also guarantees that any subsequent call
>>>       to the PCI API for this page will return the same DMA (bus) address
>>>       as the first call (so pci_alloc_consistent, and then pci_map_page
>>>       will give the same DMA bus address).
>>>        
>>
>> I guess dma_alloc_coherent() will allocate *real* DMA32 pages? That
>> brings up a couple of questions:
>> 1) Is it possible to change caching policy on pages allocated using
>> dma_alloc_coherent?
>>      
> Yes. They are the same "form-factor" as any normal page, except
> that the IOMMU makes extra efforts to set this page up.
>
>    
>> 2) What about accounting? In a *non-Xen* environment, will the
>> number of coherent pages be less than the number of DMA32 pages, or
>> will dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?
>>      
> The code in the IOMMUs ends up calling __get_free_pages, which ends up
> in alloc_pages. So the call does end up in alloc_page(flags).
>
>
> native SWIOTLB (so no IOMMU): GFP_DMA32
> GART (AMD's old IOMMU): GFP_DMA32
>
> For the hardware IOMMUs:
>
> AMD VI: if it is in Passthrough mode, it calls it with GFP_DMA32.
>     If it is in DMA translation mode (normal mode) it allocates a page
>     with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately
>     translates the bus address.
>
> The flags change a bit:
> VT-d: if there is no identity mapping, nor the PCI device is one of the special ones
>     (GFX, Azalia), then it will pass it with GFP_DMA32.
>     If it is in identity mapping state, and the device is a GFX or Azalia sound
>     card, then it will apply ~(__GFP_DMA | __GFP_DMA32) and immediately translate
>     the bus address.
>
> However, the interesting thing is that I've passed in the 'NULL' as
> the struct device (not intentionally - did not want to add more changes
> to the API) so all of the IOMMUs end up doing GFP_DMA32.
>
> But it does mess up the accounting with the AMD-VI and VT-D as they strip
> the __GFP_DMA32 flag off. That is a big problem, I presume?
>    

Actually, I don't think it's a big problem. TTM allows a small 
discrepancy between allocated pages and accounted pages to be able to 
account on actual allocation result. IIRC, this means that a DMA32 page 
will always be accounted as such, or at least we can make it behave that 
way. As long as the device can always handle the page, we should be fine.

>    
>> 3) Same as above, but in a Xen environment, what will stop multiple
>> guests from exhausting the coherent pages? It seems that the TTM
>> accounting mechanisms will no longer be valid unless the number of
>> available coherent pages is split across the guests?
>>      
> Say I pass in four ATI Radeon cards (wherein each is a 32-bit card) to
> >four guests. Let's also assume that we are doing heavy operations in all
> >of the guests.  Since there is no communication between the TTM
> >accounting in each guest, you could end up eating all of the 4GB physical
> >memory that is available to each guest. It could end up that the first
> >guest gets the lion's share of the 4GB memory, while the other ones get
> >less.
>
> And if one was to do that on baremetal, with four ATI Radeon cards, the
> TTM accounting mechanism would realize it is nearing the watermark
> and do.. something, right? What would it do actually?
>
> I think the error path would be the same in both cases?
>    

Not really. The really dangerous situation is if TTM is allowed to 
exhaust all GFP_KERNEL memory. Then any application or kernel task might 
fail with an OOM, so TTM doesn't really allow that to happen *). Within 
a Xen guest OS using this patch that won't happen either, but TTM itself 
may receive unexpected allocation failures, since the amount of 
GFP_DMA32 memory TTM thinks is available is larger than actually available.
It is possible to trigger such allocation failures on bare metal as 
well, but they'd be much less likely. Those errors should result in 
application OOM errors with a possible application crash.
Anyway it's possible to adjust TTM's memory limits using sysfs (even on 
the fly) so any advanced user should be able to do that.

What *might* be possible, however, is that the GFP_KERNEL memory on the 
host gets exhausted due to extensive TTM allocations in the guest, but I 
guess that's a problem for XEN to resolve, not TTM.

>    
>>      
>>>   2). Use the pci_dma_sync_* calls after sending a page to the graphics
>>>       engine. If the bounce buffer is used then we end up copying the
>>>       pages.
>>>        
>> Is the reason for choosing 1) instead of 2) purely a performance concern?
>>      
> Yes, and also not understanding where I should insert the pci_dma_sync_*
> calls in the drivers.
>
>    
>>
>> Finally, I wanted to ask why we need to pass / store the dma address
>> of the TTM pages? Isn't it possible to just call into the DMA / PCI
>> api to obtain it, and the coherent allocation will make sure it
>> doesn't change?
>>      
> It won't change, but you need the dma address during de-allocation:
> dma_free_coherent..
>    

Isn't there a quick way to determine the DMA address from the struct 
page pointer, or would that require an explicit dma_map() operation?

/Thomas

*) I think gem's flink still is vulnerable to this, though, so it 
affects Nvidia and Radeon.


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-10 15:58     ` Thomas Hellstrom
@ 2011-01-10 16:45       ` Konrad Rzeszutek Wilk
  2011-01-10 20:50         ` Thomas Hellstrom
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-10 16:45 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, airlied, linux-kernel, konrad

. snip ..
> >>2) What about accounting? In a *non-Xen* environment, will the
> >>number of coherent pages be less than the number of DMA32 pages, or
> >>will dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?
> >The code in the IOMMUs ends up calling __get_free_pages, which ends up
> >in alloc_pages. So the call does end up in alloc_page(flags).
> >
> >
> >native SWIOTLB (so no IOMMU): GFP_DMA32
> >GART (AMD's old IOMMU): GFP_DMA32
> >
> >For the hardware IOMMUs:
> >
> >AMD VI: if it is in Passthrough mode, it calls it with GFP_DMA32.
> >    If it is in DMA translation mode (normal mode) it allocates a page
> >    with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately
> >    translates the bus address.
> >
> >The flags change a bit:
> >VT-d: if there is no identity mapping, nor the PCI device is one of the special ones
> >    (GFX, Azalia), then it will pass it with GFP_DMA32.
> >    If it is in identity mapping state, and the device is a GFX or Azalia sound
> >    card, then it will apply ~(__GFP_DMA | __GFP_DMA32) and immediately translate
> >    the bus address.
> >
> >However, the interesting thing is that I've passed in the 'NULL' as
> >the struct device (not intentionally - did not want to add more changes
> >to the API) so all of the IOMMUs end up doing GFP_DMA32.
> >
> >But it does mess up the accounting with the AMD-VI and VT-D as they strip
> >the __GFP_DMA32 flag off. That is a big problem, I presume?
> 
> Actually, I don't think it's a big problem. TTM allows a small
> discrepancy between allocated pages and accounted pages to be able
> to account on actual allocation result. IIRC, this means that a
> DMA32 page will always be accounted as such, or at least we can make
> it behave that way. As long as the device can always handle the
> page, we should be fine.

Excellent.
> 
> >>3) Same as above, but in a Xen environment, what will stop multiple
> >>guests from exhausting the coherent pages? It seems that the TTM
> >>accounting mechanisms will no longer be valid unless the number of
> >>available coherent pages is split across the guests?
> >Say I pass in four ATI Radeon cards (wherein each is a 32-bit card) to
> >four guests. Let's also assume that we are doing heavy operations in all
> >of the guests.  Since there is no communication between the TTM
> >accounting in each guest, you could end up eating all of the 4GB physical
> >memory that is available to each guest. It could end up that the first
> >guest gets the lion's share of the 4GB memory, while the other ones get
> >less.
> >
> >And if one were to do that on baremetal, with four ATI Radeon cards, the
> >TTM accounting mechanism would realize it is nearing the watermark
> >and do.. something, right? What would it do actually?
> >
> >I think the error path would be the same in both cases?
> 
> Not really. The really dangerous situation is if TTM is allowed to
> exhaust all GFP_KERNEL memory. Then any application or kernel task

Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then
this should be OK?

> might fail with an OOM, so TTM doesn't really allow that to happen
> *). Within a Xen guest OS using this patch that won't happen either,
> but TTM itself may receive unexpected allocation failures, since the
> amount of GFP_DMA32 memory TTM thinks is available is larger than
> actually available.

Ooooh, perfect opportunity to test the error paths then :-)

> It is possible to trigger such allocation failures on bare metal as
> well, but they'd be much less likely. Those errors should result in
> application OOM errors with a possible application crash.
> Anyway it's possible to adjust TTM's memory limits using sysfs (even
> on the fly) so any advanced user should be able to do that.
> 
> What *might* be possible, however, is that the GFP_KERNEL memory on
> the host gets exhausted due to extensive TTM allocations in the
> guest, but I guess that's a problem for XEN to resolve, not TTM.

Hmm. I think I am missing something here. The GFP_KERNEL is any memory
and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start
using the PCI-API, what happens underneath (so under Linux) is that
"real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark
get swizzled in for the guest's PFNs (this is for the PCI devices
that have the dma_mask set to 32bit). However, that is a Xen MMU
accounting issue.

The GFP_KERNEL memory on the other hand does not get the same treatment,
so whichever MFNs were allocated for that memory are still the same.

The amount of memory in the guest remains the same during the treatment
that the PCI API does when running under Xen. The PFNs, the zones, etc.
are all the same. It is just that when you program the PTEs or
pass in the (DMA) bus address to the devices, the numbers are different
from what a 'virt_to_phys' call would return.
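
As a minimal illustration of that difference (a sketch, not code from this
patchset; 'pdev' is an assumed pointer to the card's struct pci_dev):

  struct page *p = alloc_page(GFP_DMA32);
  phys_addr_t phys = virt_to_phys(page_address(p)); /* guest-visible address */
  dma_addr_t bus = pci_map_page(pdev, p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
  /* On bare metal phys == bus here; under Xen (or with swiotlb=force) they
   * can differ, and the GART must be programmed with 'bus', not 'phys'. */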

.. snip..
> >>Finally, I wanted to ask why we need to pass / store the dma address
> >>of the TTM pages? Isn't it possible to just call into the DMA / PCI
> >>api to obtain it, and the coherent allocation will make sure it
> >>doesn't change?
> >It won't change, but you need the dma address during de-allocation:
> >dma_free_coherent..
> 
> Isn't there a quick way to determine the DMA address from the struct

Sadly no. You need to squirrel it away.
> page pointer, or would that require an explicit dma_map() operation?

<nods> The DMA API only offers two ways to get the (DMA) bus address.

The first is 'dma_alloc_coherent' (or pci_alloc_consistent) and the other
is 'dma_map_page' (or pci_map_page). Both calls return the DMA address
and there is no "translate this virtual address to a DMA address, please"
API call.

One way to potentially avoid carrying this dma address around is to do
this in the de-alloc path (a sketch, assuming a valid 'pdev' and page 'p'):

  /* Re-map the page solely to recover its bus address. */
  dma_addr_t _d = pci_map_page(pdev, p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
  pci_unmap_page(pdev, _d, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
  dma_free_coherent(&pdev->dev, PAGE_SIZE, page_address(p), _d);

which would avoid carrying around the dma_addr_t array.

> 
> /Thomas
> 
> *) I think gem's flink still is vulnerable to this, though, so it

Is there a good test-case for this?

> affects Nvidia and Radeon.


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-10 16:45       ` Konrad Rzeszutek Wilk
@ 2011-01-10 20:50         ` Thomas Hellstrom
  2011-01-11 15:55           ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 33+ messages in thread
From: Thomas Hellstrom @ 2011-01-10 20:50 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, airlied, linux-kernel, konrad

On 01/10/2011 05:45 PM, Konrad Rzeszutek Wilk wrote:
> . snip ..
>    
>>>> 2) What about accounting? In a *non-Xen* environment, will the
>>>> number of coherent pages be less than the number of DMA32 pages, or
>>>> will dma_alloc_coherent just translate into an alloc_page(GFP_DMA32)?
>>>>          
>>> The code in the IOMMUs ends up calling __get_free_pages, which ends up
>>> in alloc_pages. So the call does end up in alloc_page(flags).
>>>
>>>
>>> native SWIOTLB (so no IOMMU): GFP_DMA32
>>> GART (AMD's old IOMMU): GFP_DMA32
>>>
>>> For the hardware IOMMUs:
>>>
>>> AMD VI: if it is in Passthrough mode, it calls it with GFP_DMA32.
>>>     If it is in DMA translation mode (normal mode) it allocates a page
>>>     with GFP_ZERO | ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32) and immediately
>>>     translates the bus address.
>>>
>>> The flags change a bit:
>>> VT-d: if there is no identity mapping, and the PCI device is not one of the special ones
>>>     (GFX, Azalia), then it will pass it with GFP_DMA32.
>>>     If it is in identity mapping state, and the device is a GFX or Azalia sound
>>>     card, then it will use ~(__GFP_DMA | GFP_DMA32) and immediately translate
>>>     the bus address.
>>>
>>> However, the interesting thing is that I've passed in the 'NULL' as
>>> the struct device (not intentionally - did not want to add more changes
>>> to the API) so all of the IOMMUs end up doing GFP_DMA32.
>>>
>>> But it does mess up the accounting with the AMD-Vi and VT-d as they strip
>>> the __GFP_DMA32 flag off. That is a big problem, I presume?
>>>        
>> Actually, I don't think it's a big problem. TTM allows a small
>> discrepancy between allocated pages and accounted pages to be able
>> to account on actual allocation result. IIRC, this means that a
>> DMA32 page will always be accounted as such, or at least we can make
>> it behave that way. As long as the device can always handle the
>> page, we should be fine.
>>      
> Excellent.
>    
>>      
>>>> 3) Same as above, but in a Xen environment, what will stop multiple
>>>> guests from exhausting the coherent pages? It seems that the TTM
>>>> accounting mechanisms will no longer be valid unless the number of
>>>> available coherent pages are split across the guests?
>>>>          
>>> Say I pass in four ATI Radeon cards (each a 32-bit card) to
>>> four guests. Let's also assume that we are doing heavy operations in all
>>> of the guests.  Since there is no communication between the TTM
>>> accounting in each guest, you could end up eating all of the 4GB physical
>>> memory that is available to each guest. It could end up that the first
>>> guest gets the lion's share of the 4GB memory, while the other ones
>>> get less.
>>>
>>> And if one were to do that on baremetal, with four ATI Radeon cards, the
>>> TTM accounting mechanism would realize it is nearing the watermark
>>> and do.. something, right? What would it do actually?
>>>
>>> I think the error path would be the same in both cases?
>>>        
>> Not really. The really dangerous situation is if TTM is allowed to
>> exhaust all GFP_KERNEL memory. Then any application or kernel task
>>      
> Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then
> this should be OK?
>    

No. Unless I miss something, on a machine with 4GB or less, GFP_DMA32
and GFP_KERNEL are allocated from the same pool of pages?

>
>> What *might* be possible, however, is that the GFP_KERNEL memory on
>> the host gets exhausted due to extensive TTM allocations in the
>> guest, but I guess that's a problem for XEN to resolve, not TTM.
>>      
> Hmm. I think I am missing something here. The GFP_KERNEL is any memory
> and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start
> using the PCI-API, what happens underneath (so under Linux) is that
> "real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark
> get swizzled in for the guest's PFNs (this is for the PCI devices
> that have the dma_mask set to 32bit). However, that is a Xen MMU
> accounting issue.
>    


So I was under the impression that when you allocate coherent memory in 
the guest, the physical page comes from DMA32 memory in the host. On a 
4GB machine or less, that would be the same as kernel memory. Now, if 4 
guests think they can allocate 2GB of coherent memory each, you might 
run out of kernel memory on the host?


Another thing that I was thinking of is what happens if you have a huge 
gart and allocate a lot of coherent memory. Could that potentially 
exhaust IOMMU resources?

>> /Thomas
>>
>> *) I think gem's flink still is vulnerable to this, though, so it
>>      
> Is there a good test-case for this?
>    


Not put in code. What you can do (for example in an OpenGL app) is to
write some code that tries to flink with a guessed bo name until it 
succeeds. Then repeatedly from within the app, try to flink the same 
name until something crashes. I don't think the linux OOM killer can 
handle that situation. Should be fairly easy to put together.
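
A hedged sketch of that idea (user-space C; the device path, the header
location, and the brute-force loop are my assumptions, not a tested case):

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <drm/drm.h>

  int main(void)
  {
          int fd = open("/dev/dri/card0", O_RDWR);
          struct drm_gem_open req = { .name = 1 };

          /* Guess flink names until one opens successfully. */
          while (ioctl(fd, DRM_IOCTL_GEM_OPEN, &req) != 0)
                  req.name++;
          /* Re-open the same name forever; each open pins another handle
           * reference, so the bo can never be freed. */
          for (;;)
                  ioctl(fd, DRM_IOCTL_GEM_OPEN, &req);
  }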

/Thomas



* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-10 20:50         ` Thomas Hellstrom
@ 2011-01-11 15:55           ` Konrad Rzeszutek Wilk
  2011-01-11 16:21             ` Alex Deucher
  2011-01-12  9:12             ` Thomas Hellstrom
  0 siblings, 2 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-11 15:55 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, airlied, linux-kernel, konrad

. snip ..
> >>>I think the error path would be the same in both cases?
> >>Not really. The really dangerous situation is if TTM is allowed to
> >>exhaust all GFP_KERNEL memory. Then any application or kernel task
> >Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then
> >this should be OK?
> 
> No. Unless I miss something, on a machine with 4GB or less,
> GFP_DMA32 and GFP_KERNEL are allocated from the same pool of pages?

Yes. Depending on the E820 and where the PCI hole is present. More
details below.
> 
> >
> >>What *might* be possible, however, is that the GFP_KERNEL memory on
> >>the host gets exhausted due to extensive TTM allocations in the
> >>guest, but I guess that's a problem for XEN to resolve, not TTM.
> >Hmm. I think I am missing something here. The GFP_KERNEL is any memory
> >and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start
> >using the PCI-API, what happens underneath (so under Linux) is that
> >"real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark
> >get swizzled in for the guest's PFNs (this is for the PCI devices
> >that have the dma_mask set to 32bit). However, that is a Xen MMU
> >accounting issue.
> 
> 
> So I was under the impression that when you allocate coherent memory
> in the guest, the physical page comes from DMA32 memory in the host.

No. It comes from the DMA32 zone of the hypervisor pool. If, say, you have a machine
with 24GB, the first guest (Dom0) could allocate memory from 20->24GB
(so only 4GB allocated to it). It will then also fetch 64MB from the DMA32
zone for the SWIOTLB. Then the next guest, say 4GB (gets 16GB->20GB) - gets
64MB from DMA32. And so on.

So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from 
0->4GB. When you start allocating coherent memory from each guest
(and yeah, say we use 2GB each), we end up with the first guest getting
the 2GB, the second getting 1.7GB, and then the next two getting zil.

You still have GFP_KERNEL memory in each guest - the first one has 2GB left,
the second 2.3GB, and the next two each have 4GB.

From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so
is 8GB->24GB, but it still has 4GB->8GB free - so it can launch one more
guest (but without PCI passthrough devices).

> On a 4GB machine or less, that would be the same as kernel memory.
> Now, if 4 guests think they can allocate 2GB of coherent memory
> each, you might run out of kernel memory on the host?

So host in this case refers to the hypervisor, and it does not care
about DMA at all - it does not have any device drivers(*) or such.
The first guest (dom0) is the one that deals with the device drivers.

*: It has one: the serial port, but that is not really that important
for this discussion.
> 
> 
> Another thing that I was thinking of is what happens if you have a
> huge gart and allocate a lot of coherent memory. Could that
> potentially exhaust IOMMU resources?

<scratches his head>

So the GART is in the PCI space in one of the BARs of the device right?
(We are talking about the discrete card GART, not the poor man's AMD IOMMU?)
The PCI space is under the 4GB, so it would be considered coherent by
definition.

However, the PCI space with its BARs eats into the 4GB space, so if you
have a 1GB region from 0xC0000->0x100000, then you only have 3GB
left of the DMA32 zone.

If I think of this as an accounting exercise, and if the PCI space goes further
down (say to 0x80000, so from 2GB->4GB it is an E820 gap, and 0GB->2GB is System RAM
with 4GB->6GB being the other System RAM, for a cumulative 4GB
of memory in the machine), we would only have 2GB of DMA32 zone (the GFP_KERNEL
zone is 4GB, while the GFP_DMA32 zone is 2GB).

Then the answer is yes. However, wouldn't such a device be 64-bit? And
if it is 64-bit, then the TTM API wouldn't bother to allocate pages
from the 32-bit region, right?
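
To put that worked example in a picture (the layout is hypothetical):

  PFN 0x00000->0x80000   (0GB->2GB): System RAM - ZONE_DMA32 (2GB)
  PFN 0x80000->0x100000  (2GB->4GB): E820 gap / PCI space
  PFN 0x100000->0x180000 (4GB->6GB): System RAM - ZONE_NORMAL (2GB)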

> 
> >>/Thomas
> >>
> >>*) I think gem's flink still is vulnerable to this, though, so it
> >Is there a good test-case for this?
> 
> 
> Not put in code. What you can do (for example in an OpenGL app) is
> to write some code that tries to flink with a guessed bo name until
> it succeeds. Then repeatedly from within the app, try to flink the
> same name until something crashes. I don't think the linux OOM
> killer can handle that situation. Should be fairly easy to put
> together.

Uhhh, OK, you just flew over what I know about graphics. Let me
research this a bit more.

> 
> /Thomas


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-11 15:55           ` Konrad Rzeszutek Wilk
@ 2011-01-11 16:21             ` Alex Deucher
  2011-01-11 16:59               ` Konrad Rzeszutek Wilk
  2011-01-12  9:12             ` Thomas Hellstrom
  1 sibling, 1 reply; 33+ messages in thread
From: Alex Deucher @ 2011-01-11 16:21 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Thomas Hellstrom, konrad, linux-kernel, dri-devel

On Tue, Jan 11, 2011 at 10:55 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> . snip ..
>> >>>I think the error path would be the same in both cases?
>> >>Not really. The really dangerous situation is if TTM is allowed to
>> >>exhaust all GFP_KERNEL memory. Then any application or kernel task
>> >Ok, since GFP_KERNEL does not contain the GFP_DMA32 flag then
>> >this should be OK?
>>
>> No. Unless I miss something, on a machine with 4GB or less,
>> GFP_DMA32 and GFP_KERNEL are allocated from the same pool of pages?
>
> Yes. Depending on the E820 and where the PCI hole is present. More
> details below.
>>
>> >
>> >>What *might* be possible, however, is that the GFP_KERNEL memory on
>> >>the host gets exhausted due to extensive TTM allocations in the
>> >>guest, but I guess that's a problem for XEN to resolve, not TTM.
>> >Hmm. I think I am missing something here. The GFP_KERNEL is any memory
>> >and the GFP_DMA32 is memory from the ZONE_DMA32. When we do start
>> >using the PCI-API, what happens underneath (so under Linux) is that
>> >"real PFNs" (Machine Frame Numbers) which are above the 0x100000 mark
>> >get swizzled in for the guest's PFNs (this is for the PCI devices
>> >that have the dma_mask set to 32bit). However, that is a Xen MMU
>> >accounting issue.
>>
>>
>> So I was under the impression that when you allocate coherent memory
>> in the guest, the physical page comes from DMA32 memory in the host.
>
> No. It comes from the DMA32 zone of the hypervisor pool. If, say, you have a machine
> with 24GB, the first guest (Dom0) could allocate memory from 20->24GB
> (so only 4GB allocated to it). It will then also fetch 64MB from the DMA32
> zone for the SWIOTLB. Then the next guest, say 4GB (gets 16GB->20GB) - gets
> 64MB from DMA32. And so on.
>
> So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from
> 0->4GB. When you start allocating coherent memory from each guest
> (and yeah, say we use 2GB each), we end up with the first guest getting
> the 2GB, the second getting 1.7GB, and then the next two getting zil.
>
> You still have GFP_KERNEL memory in each guest - the first one has 2GB left,
> the second 2.3GB, and the next two each have 4GB.
>
> From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so
> is 8GB->24GB, but it still has 4GB->8GB free - so it can launch one more
> guest (but without PCI passthrough devices).
>
>> On a 4GB machine or less, that would be the same as kernel memory.
>> Now, if 4 guests think they can allocate 2GB of coherent memory
>> each, you might run out of kernel memory on the host?
>
> So host in this case refers to the hypervisor, and it does not care
> about DMA at all - it does not have any device drivers(*) or such.
> The first guest (dom0) is the one that deals with the device drivers.
>
> *: It has one: the serial port, but that is not really that important
> for this discussion.
>>
>>
>> Another thing that I was thinking of is what happens if you have a
>> huge gart and allocate a lot of coherent memory. Could that
>> potentially exhaust IOMMU resources?
>
> <scratches his head>
>
> So the GART is in the PCI space in one of the BARs of the device right?
> > (We are talking about the discrete card GART, not the poor man's AMD IOMMU?)
> The PCI space is under the 4GB, so it would be considered coherent by
> definition.

GART is not a PCI BAR; it's just a remapper for system pages.  On
radeon GPUs at least there is a memory controller with 3 programmable
apertures: vram, internal gart, and agp gart.  You can map these
resources wherever you want in the GPU's address space and then the
memory controller takes care of the translation to off-board resources
like gart pages.  On chip memory clients (display controllers, texture
blocks, render blocks, etc.) write to internal GPU addresses.  The GPU
has its own direct connection to vram, so that's not an issue.  For
AGP, the GPU specifies aperture base and size, and you point it to the
bus address of gart aperture provided by the northbridge's AGP
controller.  For internal gart, the GPU has a page table stored in
either vram or uncached system memory depending on the asic.  It
provides a contiguous linear aperture to GPU clients and the memory
controller translates the transactions to the backing pages via the
pagetable.
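
A hedged sketch of what that pagetable translation boils down to on the
driver side (simplified, not the actual radeon code; 'gpu_page_offset' and
'pages_addr' are assumed names):

  unsigned i;

  for (i = 0; i < num_pages; i++)
          /* Write one gart PTE: GPU page i -> the system page's bus address. */
          radeon_gart_set_page(rdev, gpu_page_offset + i, pages_addr[i]);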

Alex

>
> However, the PCI space with its BARs eats into the 4GB space, so if you
> have a 1GB region from 0xC0000->0x100000, then you only have 3GB
> left of the DMA32 zone.
>
> If I think of this as an accounting exercise, and if the PCI space goes further
> down (say to 0x80000, so from 2GB->4GB it is an E820 gap, and 0GB->2GB is System RAM
> with 4GB->6GB being the other System RAM, for a cumulative 4GB
> of memory in the machine), we would only have 2GB of DMA32 zone (the GFP_KERNEL
> zone is 4GB, while the GFP_DMA32 zone is 2GB).
>
> Then the answer is yes. However, wouldn't such a device be 64-bit? And
> if it is 64-bit, then the TTM API wouldn't bother to allocate pages
> from the 32-bit region, right?
>
>>
>> >>/Thomas
>> >>
>> >>*) I think gem's flink still is vulnerable to this, though, so it
>> >Is there a good test-case for this?
>>
>>
>> Not put in code. What you can do (for example in an OpenGL app) is
>> to write some code that tries to flink with a guessed bo name until
>> it succeeds. Then repeatedly from within the app, try to flink the
>> same name until something crashes. I don't think the linux OOM
>> killer can handle that situation. Should be fairly easy to put
>> together.
>
> Uhhh, OK, you just flew over what I know about graphics. Let me
> research this a bit more.
>
>>
>> /Thomas


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-11 16:21             ` Alex Deucher
@ 2011-01-11 16:59               ` Konrad Rzeszutek Wilk
  2011-01-11 18:12                 ` Alex Deucher
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-11 16:59 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Thomas Hellstrom, konrad, linux-kernel, dri-devel

> >> Another thing that I was thinking of is what happens if you have a
> >> huge gart and allocate a lot of coherent memory. Could that
> >> potentially exhaust IOMMU resources?
> >
> > <scratches his head>
> >
> > So the GART is in the PCI space in one of the BARs of the device right?
> > (We are talking about the discrete card GART, not the poor man's AMD IOMMU?)
> > The PCI space is under the 4GB, so it would be considered coherent by
> > definition.
> 
> GART is not a PCI BAR; it's just a remapper for system pages.  On
> radeon GPUs at least there is a memory controller with 3 programmable
> apertures: vram, internal gart, and agp gart.  You can map these

To access it, i.e., to program it, you would need to access the PCIe card
MMIO regions, right? So that would be considered in PCI BAR space?

> resources wherever you want in the GPU's address space and then the
> memory controller takes care of the translation to off-board resources
> like gart pages.  On chip memory clients (display controllers, texture
> blocks, render blocks, etc.) write to internal GPU addresses.  The GPU
> has its own direct connection to vram, so that's not an issue.  For
> AGP, the GPU specifies aperture base and size, and you point it to the
> bus address of gart aperture provided by the northbridge's AGP
> controller.  For internal gart, the GPU has a page table stored in

I think we are just talking about the GART on the GPU, not the old AGP
GART.

> either vram or uncached system memory depending on the asic.  It
> provides a contiguous linear aperture to GPU clients and the memory
> controller translates the transactions to the backing pages via the
> pagetable.

So I think I misunderstood what is meant by 'huge gart'. That sounds
like the linear address space provided by the GPU. And hooking up a lot of coherent
memory (so System RAM) to that linear address space would be no different than what
is currently being done. When you allocate memory using alloc_page(GFP_DMA32)
and hook up that memory to the linear space you exhaust the same amount of
ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same
pool, except that doing it from the PCI API gets you the bus address right
away.
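
In code terms the contrast is roughly this (a sketch; 'pdev' is assumed,
and the NULL device matches what this patchset currently passes):

  /* Old path: allocate first, then map to learn the bus address. */
  struct page *p = alloc_page(GFP_DMA32);
  dma_addr_t bus = pci_map_page(pdev, p, 0, PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);

  /* PCI API path: one call hands back both the kernel virtual address and
   * the bus address, drawn from the same DMA32 pool. */
  dma_addr_t bus2;
  void *va = dma_alloc_coherent(NULL, PAGE_SIZE, &bus2, GFP_DMA32);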


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-11 16:59               ` Konrad Rzeszutek Wilk
@ 2011-01-11 18:12                 ` Alex Deucher
  2011-01-11 18:28                   ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 33+ messages in thread
From: Alex Deucher @ 2011-01-11 18:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Thomas Hellstrom, konrad, linux-kernel, dri-devel

On Tue, Jan 11, 2011 at 11:59 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> >> Another thing that I was thinking of is what happens if you have a
>> >> huge gart and allocate a lot of coherent memory. Could that
>> >> potentially exhaust IOMMU resources?
>> >
>> > <scratches his head>
>> >
>> > So the GART is in the PCI space in one of the BARs of the device right?
>> > (We are talking about the discrete card GART, not the poor man's AMD IOMMU?)
>> > The PCI space is under the 4GB, so it would be considered coherent by
>> > definition.
>>
>> GART is not a PCI BAR; it's just a remapper for system pages.  On
>> radeon GPUs at least there is a memory controller with 3 programmable
>> apertures: vram, internal gart, and agp gart.  You can map these
>
> To access it, i.e., to program it, you would need to access the PCIe card
> MMIO regions, right? So that would be considered in PCI BAR space?

yes, you need access to the mmio aperture to configure the gpu.  I was
thinking you meant something akin to the framebuffer BAR, only for gart
space, which is not the case.

>
>> resources wherever you want in the GPU's address space and then the
>> memory controller takes care of the translation to off-board resources
>> like gart pages.  On chip memory clients (display controllers, texture
>> blocks, render blocks, etc.) write to internal GPU addresses.  The GPU
>> has its own direct connection to vram, so that's not an issue.  For
>> AGP, the GPU specifies aperture base and size, and you point it to the
>> bus address of gart aperture provided by the northbridge's AGP
>> controller.  For internal gart, the GPU has a page table stored in
>
> I think we are just talking about the GART on the GPU, not the old AGP
> GART.

Ok.  I just mentioned it for completeness.

>
>> either vram or uncached system memory depending on the asic.  It
>> provides a contiguous linear aperture to GPU clients and the memory
>> controller translates the transactions to the backing pages via the
>> pagetable.
>
> So I think I misunderstood what is meant by 'huge gart'. That sounds
> like the linear address space provided by the GPU. And hooking up a lot of coherent
> memory (so System RAM) to that linear address space would be no different than what
> is currently being done. When you allocate memory using alloc_page(GFP_DMA32)
> and hook up that memory to the linear space you exhaust the same amount of
> ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same
> pool, except that doing it from the PCI API gets you the bus address right
> away.
>

In this case GPU clients refers to the hw blocks on the GPU; they are
the ones that see the contiguous linear aperture.  From the
application's perspective, gart memory looks like any other pages.

Alex


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-11 18:12                 ` Alex Deucher
@ 2011-01-11 18:28                   ` Konrad Rzeszutek Wilk
  2011-01-11 19:28                     ` Alex Deucher
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-11 18:28 UTC (permalink / raw)
  To: Alex Deucher; +Cc: Thomas Hellstrom, konrad, linux-kernel, dri-devel

On Tue, Jan 11, 2011 at 01:12:57PM -0500, Alex Deucher wrote:
> On Tue, Jan 11, 2011 at 11:59 AM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> >> >> Another thing that I was thinking of is what happens if you have a
> >> >> huge gart and allocate a lot of coherent memory. Could that
> >> >> potentially exhaust IOMMU resources?
> >> >
> >> > <scratches his head>
> >> >
> >> > So the GART is in the PCI space in one of the BARs of the device right?
> >> > (We are talking about the discrete card GART, not the poor man's AMD IOMMU?)
> >> > The PCI space is under the 4GB, so it would be considered coherent by
> >> > definition.
> >>
> >> GART is not a PCI BAR; it's just a remapper for system pages.  On
> >> radeon GPUs at least there is a memory controller with 3 programmable
> >> apertures: vram, internal gart, and agp gart.  You can map these
> >
> > To access it, i.e., to program it, you would need to access the PCIe card
> > MMIO regions, right? So that would be considered in PCI BAR space?
> 
> yes, you need access to the mmio aperture to configure the gpu.  I was
> thinking you meant something akin to the framebuffer BAR, only for gart
> space, which is not the case.

Aaah, gotcha.
> 
> >
> >> resources wherever you want in the GPU's address space and then the
> >> memory controller takes care of the translation to off-board resources
> >> like gart pages.  On chip memory clients (display controllers, texture
> >> blocks, render blocks, etc.) write to internal GPU addresses.  The GPU
> >> has its own direct connection to vram, so that's not an issue.  For
> >> AGP, the GPU specifies aperture base and size, and you point it to the
> >> bus address of gart aperture provided by the northbridge's AGP
> >> controller.  For internal gart, the GPU has a page table stored in
> >
> > I think we are just talking about the GART on the GPU, not the old AGP
> > GART.
> 
> Ok.  I just mentioned it for completeness.

<nods>
> 
> >
> >> either vram or uncached system memory depending on the asic.  It
> >> provides a contiguous linear aperture to GPU clients and the memory
> >> controller translates the transactions to the backing pages via the
> >> pagetable.
> >
> > So I think I misunderstood what is meant by 'huge gart'. That sounds
> > like the linear address space provided by the GPU. And hooking up a lot of coherent
> > memory (so System RAM) to that linear address space would be no different than what
> > is currently being done. When you allocate memory using alloc_page(GFP_DMA32)
> > and hook up that memory to the linear space you exhaust the same amount of
> > ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same
> > pool, except that doing it from the PCI API gets you the bus address right
> > away.
> >
> 
> In this case GPU clients refers to the hw blocks on the GPU; they are
> the ones that see the contiguous linear aperture.  From the
> application's perspective, gart memory looks like any other pages.

<nods>. Those 'hw blocks' or 'gart memory' are in reality
just pages received via 'alloc_page()' (before this patchset and
also after this patchset). Oh wait, this 'hw blocks' or 'gart memory' can also
refer to the VRAM memory, right? In which case that is not memory allocated via
'alloc_page' but using a different mechanism. Is TTM used then? If so, how
do you stick those VRAM pages under its accounting rules? Or do the drivers
use some other mechanism for that, dependent on each driver?


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-11 18:28                   ` Konrad Rzeszutek Wilk
@ 2011-01-11 19:28                     ` Alex Deucher
  0 siblings, 0 replies; 33+ messages in thread
From: Alex Deucher @ 2011-01-11 19:28 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: Thomas Hellstrom, konrad, linux-kernel, dri-devel

On Tue, Jan 11, 2011 at 1:28 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Tue, Jan 11, 2011 at 01:12:57PM -0500, Alex Deucher wrote:
>> On Tue, Jan 11, 2011 at 11:59 AM, Konrad Rzeszutek Wilk
>> <konrad.wilk@oracle.com> wrote:
>> >> >> Another thing that I was thinking of is what happens if you have a
>> >> >> huge gart and allocate a lot of coherent memory. Could that
>> >> >> potentially exhaust IOMMU resources?
>> >> >
>> >> > <scratches his head>
>> >> >
>> >> > So the GART is in the PCI space in one of the BARs of the device right?
>> >> > (We are talking about the discrete card GART, not the poor man's AMD IOMMU?)
>> >> > The PCI space is under the 4GB, so it would be considered coherent by
>> >> > definition.
>> >>
>> >> GART is not a PCI BAR; it's just a remapper for system pages.  On
>> >> radeon GPUs at least there is a memory controller with 3 programmable
>> >> apertures: vram, internal gart, and agp gart.  You can map these
>> >
>> > To access it, i.e., to program it, you would need to access the PCIe card
>> > MMIO regions, right? So that would be considered in PCI BAR space?
>>
>> yes, you need access to the mmio aperture to configure the gpu.  I was
>> thinking you meant something akin to the framebuffer BAR, only for gart
>> space, which is not the case.
>
> Aaah, gotcha.
>>
>> >
>> >> resources wherever you want in the GPU's address space and then the
>> >> memory controller takes care of the translation to off-board resources
>> >> like gart pages.  On chip memory clients (display controllers, texture
>> >> blocks, render blocks, etc.) write to internal GPU addresses.  The GPU
>> >> has its own direct connection to vram, so that's not an issue.  For
>> >> AGP, the GPU specifies aperture base and size, and you point it to the
>> >> bus address of gart aperture provided by the northbridge's AGP
>> >> controller.  For internal gart, the GPU has a page table stored in
>> >
>> > I think we are just talking about the GART on the GPU, not the old AGP
>> > GART.
>>
>> Ok.  I just mentioned it for completeness.
>
> <nods>
>>
>> >
>> >> either vram or uncached system memory depending on the asic.  It
>> >> provides a contiguous linear aperture to GPU clients and the memory
>> >> controller translates the transactions to the backing pages via the
>> >> pagetable.
>> >
>> > So I think I misunderstood what is meant by 'huge gart'. That sounds
>> > like the linear address space provided by the GPU. And hooking up a lot of coherent
>> > memory (so System RAM) to that linear address space would be no different than what
>> > is currently being done. When you allocate memory using alloc_page(GFP_DMA32)
>> > and hook up that memory to the linear space you exhaust the same amount of
>> > ZONE_DMA32 memory as if you were to use the PCI API. It comes from the same
>> > pool, except that doing it from the PCI API gets you the bus address right
>> > away.
>> >
>>
>> In this case GPU clients refers to the hw blocks on the GPU; they are
>> the ones that see the contiguous linear aperture.  From the
>> application's perspective, gart memory looks like any other pages.
>
> <nods>. Those 'hw blocks' or 'gart memory' are in reality
> just pages received via 'alloc_page()' (before this patchset and
> also after this patchset). Oh wait, this 'hw blocks' or 'gart memory' can also
> refer to the VRAM memory, right? In which case that is not memory allocated via
> 'alloc_page' but using a different mechanism. Is TTM used then? If so, how
> do you stick those VRAM pages under its accounting rules? Or do the drivers
> use some other mechanism for that, dependent on each driver?
>

"hw blocks" refers to the different sections of the GPU (texture
fetchers, render targets, display controllers), not memory buffers.
E.g., if you want to read a texture from vram or gart, you'd program
the texture base address to the address of the texture in the GPU's
address space.  E.g., you might map 512 MB of vram at 0x00000000
and a 512 MB gart aperture at 0x20000000 in the GPU's address space.
If you have a texture at the start of vram, you'd program the texture
base address to 0x00000000, or if it was at the start of the gart
aperture, you'd program it to 0x20000000.  To the GPU, the gart looks
like a linear array, but to everything else (driver, apps), it's just
pages.  The driver manages vram access using ttm.  The GPU has access
to the entire amount of vram directly, but the CPU can only access it
via the PCI framebuffer BAR.  On systems with more vram than
framebuffer BAR space, the CPU can only access buffers in the region
covered by the BAR (usually the first 128 or 256 MB of vram depending on the
BAR).  For the CPU to access a buffer in vram, the GPU has to move it
to the area covered by the BAR or to gart memory.

Alex


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-11 15:55           ` Konrad Rzeszutek Wilk
  2011-01-11 16:21             ` Alex Deucher
@ 2011-01-12  9:12             ` Thomas Hellstrom
  2011-01-12 15:19               ` Konrad Rzeszutek Wilk
  1 sibling, 1 reply; 33+ messages in thread
From: Thomas Hellstrom @ 2011-01-12  9:12 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, airlied, linux-kernel, konrad

Hi, Konrad.

This discussion has become a bit lengthy. I'll filter out the sorted-out
stuff, which leaves me with two remaining issues:


On 01/11/2011 04:55 PM, Konrad Rzeszutek Wilk wrote:
>
> So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from
> 0->4GB. When you start allocating coherent memory from each guest
> (and yeah, say we use 2GB each), we end up with the first guest getting
> the 2GB, the second getting 1.7GB, and then the next two getting zil.
>
> You still have GFP_KERNEL memory in each guest - the first one has 2GB left,
> the second 2.3GB, and the next two each have 4GB.
>
> From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so
> is 8GB->24GB, but it still has 4GB->8GB free - so it can launch one more
> guest (but without PCI passthrough devices).
>
>    
>> On a 4GB machine or less, that would be the same as kernel memory.
>> Now, if 4 guests think they can allocate 2GB of coherent memory
>> each, you might run out of kernel memory on the host?
>>      
> So host in this case refers to the hypervisor, and it does not care
> about DMA at all - it does not have any device drivers(*) or such.
> The first guest (dom0) is the one that deals with the device drivers.
>
> *: It has one: the serial port, but that is not really that important
> for this discussion.
>    

Let's assume we're at the point where the hypervisor (or host) has exhausted the
0-4GB zone, due to guests coherent memory allocations, and that the 
physical machine has 4GB of memory, all in the 0-4GB zone. Now if the 
hypervisor was running on a Linux kernel, there would be no more 
GFP_KERNEL memory available on the *host* (hypervisor), and the 
hypervisor would crash. Now I don't know much about Xen, but it might be 
that this is not a problem with Xen at all?


>>
>> Another thing that I was thinking of is what happens if you have a
>> huge gart and allocate a lot of coherent memory. Could that
>> potentially exhaust IOMMU resources?
>>      
> <scratches his head>
>    

I need to be more specific. Let's assume we're on "bare metal", and we 
want to allocate 4GB of coherent memory. For most IOMMUs that would mean,
as you previously stated, that we actually allocate GFP_DMA32 memory. But
for some IOMMUs that would perhaps mean that we allocate *any* memory 
and set up a permanent DMA mapping in the IOMMU for the coherent pages. 
What if, in such a case, the IOMMU can only set up 2GB of coherent memory?

Or in short, can there *ever* be "bare metal" cases where the amount of 
coherent memory available is less than DMA32 memory available?

Thanks,
Thomas




* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-12  9:12             ` Thomas Hellstrom
@ 2011-01-12 15:19               ` Konrad Rzeszutek Wilk
  2011-01-24 14:49                 ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-12 15:19 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, airlied, linux-kernel, konrad

On Wed, Jan 12, 2011 at 10:12:14AM +0100, Thomas Hellstrom wrote:
> Hi, Konrad.
> 
> This discussion has become a bit lengthy. I'll filter out the
> sorted-out stuff, which leaves me with two remaining issues:

<nods>
> 
> 
> On 01/11/2011 04:55 PM, Konrad Rzeszutek Wilk wrote:
> >
> >So at the end we have 16GB taken from 8GB->24GB, and 320MB taken from
> >0->4GB. When you start allocating coherent memory from each guest
> >(and yeah, say we use 2GB each), we end up with the first guest getting
> >the 2GB, the second getting 1.7GB, and then the next two getting zil.
> >
> >You still have GFP_KERNEL memory in each guest - the first one has 2GB left,
> >the second 2.3GB, and the next two each have 4GB.
> >
> >From the hypervisor pool perspective, the 0-4GB zone is exhausted, and so
> >is 8GB->24GB, but it still has 4GB->8GB free - so it can launch one more
> >guest (but without PCI passthrough devices).
> >
> >>On a 4GB machine or less, that would be the same as kernel memory.
> >>Now, if 4 guests think they can allocate 2GB of coherent memory
> >>each, you might run out of kernel memory on the host?
> >So host in this case refers to the hypervisor, and it does not care
> >about DMA at all - it does not have any device drivers(*) or such.
> >The first guest (dom0) is the one that deals with the device drivers.
> >
> >*: It has one: the serial port, but that is not really that important
> >for this discussion.
> 
> Let's assume we're at the point where the hypervisor (or host) has exhausted
> the 0-4GB zone, due to guests coherent memory allocations, and that
> the physical machine has 4GB of memory, all in the 0-4GB zone. Now
> if the hypervisor was running on a Linux kernel, there would be no
> more GFP_KERNEL memory available on the *host* (hypervisor), and the
> hypervisor would crash. Now I don't know much about Xen, but it
> might be that this is not a problem with Xen at all?

It will have no problem. It allocates at boot all the memory it needs
and won't get bigger (or smaller) after that.

> 
> 
> >>
> >>Another thing that I was thinking of is what happens if you have a
> >>huge gart and allocate a lot of coherent memory. Could that
> >>potentially exhaust IOMMU resources?
> ><scratches his head>
> 
> I need to be more specific. Let's assume we're on "bare metal", and
> we want to allocate 4GB of coherent memory. For most IOMMUs that
> would mean, as you previously stated, that we actually allocate
> GFP_DMA32 memory. But for some IOMMUs that would perhaps mean that
> we allocate *any* memory and set up a permanent DMA mapping in the
> IOMMU for the coherent pages. What if, in such a case, the IOMMU can
> only set up 2GB of coherent memory?
> 
> Or in short, can there *ever* be "bare metal" cases where the amount
> of coherent memory available is less than DMA32 memory available?

There is no such case where the amount of coherent memory is
less than DMA32 memory. [unless the IOMMU has some chipset problem where it can't map
2^31 -> 2^32 addresses, but that is not something we should worry
about]



* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-12 15:19               ` Konrad Rzeszutek Wilk
@ 2011-01-24 14:49                 ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-24 14:49 UTC (permalink / raw)
  To: Thomas Hellstrom; +Cc: dri-devel, airlied, linux-kernel, konrad

On Wed, Jan 12, 2011 at 10:19:39AM -0500, Konrad Rzeszutek Wilk wrote:
> On Wed, Jan 12, 2011 at 10:12:14AM +0100, Thomas Hellstrom wrote:
> > Hi, Konrad.
> > 
> > This discussion has become a bit lengthy. I'll filter out the
> > sorted-out stuff, which leaves me with two remaining issues:
> 
> <nods>

ping?

... did I miss some extra discussion?


* Re: [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-07 17:11 ` [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it Konrad Rzeszutek Wilk
@ 2011-01-27 21:20   ` Konrad Rzeszutek Wilk
  2011-01-28 14:42     ` Jerome Glisse
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-27 21:20 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel, Alex Deucher, Jerome Glisse
  Cc: konrad

On Fri, Jan 07, 2011 at 12:11:43PM -0500, Konrad Rzeszutek Wilk wrote:
> If the TTM layer has used the DMA API to setup pages that are
> TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t
> array for pages that are to in DMA32 pool."), lets use it
> when programming the GART in the PCIe type cards.
> 
> This patch skips doing the pci_map_page (and pci_unmap_page) if
> there is a DMA address passed in for that page. If the dma_address
> is zero (or DMA_ERROR_CODE), then we continue on with our old
> behaviour.

Hey Jerome,

I should have CC-ed you earlier but missed that and instead just
CC-ed the mailing list. I was wondering what your thoughts are
about this patchset? Thomas took a look at the patchset and he is OK
but more eyes never hurt.

FYI, for clarity I haven't made the old calls that got moved (due to the
added "{") checkpatch.pl compliant. It complains about them being
past 80 characters. I can naturally fix that up, but thought
that it might detract from the nature of the patch.


> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/gpu/drm/radeon/radeon.h      |    4 ++-
>  drivers/gpu/drm/radeon/radeon_gart.c |   36 ++++++++++++++++++++++++---------
>  drivers/gpu/drm/radeon/radeon_ttm.c  |    5 +++-
>  3 files changed, 33 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
> index 73f600d..c9bbab9 100644
> --- a/drivers/gpu/drm/radeon/radeon.h
> +++ b/drivers/gpu/drm/radeon/radeon.h
> @@ -317,6 +317,7 @@ struct radeon_gart {
>  	union radeon_gart_table		table;
>  	struct page			**pages;
>  	dma_addr_t			*pages_addr;
> +	bool				*ttm_alloced;
>  	bool				ready;
>  };
>  
> @@ -329,7 +330,8 @@ void radeon_gart_fini(struct radeon_device *rdev);
>  void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset,
>  			int pages);
>  int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
> -		     int pages, struct page **pagelist);
> +		     int pages, struct page **pagelist,
> +		     dma_addr_t *dma_addr);
>  
>  
>  /*
> diff --git a/drivers/gpu/drm/radeon/radeon_gart.c b/drivers/gpu/drm/radeon/radeon_gart.c
> index e65b903..4a5ac4b 100644
> --- a/drivers/gpu/drm/radeon/radeon_gart.c
> +++ b/drivers/gpu/drm/radeon/radeon_gart.c
> @@ -149,8 +149,9 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset,
>  	p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
>  	for (i = 0; i < pages; i++, p++) {
>  		if (rdev->gart.pages[p]) {
> -			pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p],
> -				       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
> +			if (!rdev->gart.ttm_alloced[p])
> +				pci_unmap_page(rdev->pdev, rdev->gart.pages_addr[p],
> +				       		PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
>  			rdev->gart.pages[p] = NULL;
>  			rdev->gart.pages_addr[p] = rdev->dummy_page.addr;
>  			page_base = rdev->gart.pages_addr[p];
> @@ -165,7 +166,7 @@ void radeon_gart_unbind(struct radeon_device *rdev, unsigned offset,
>  }
>  
>  int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
> -		     int pages, struct page **pagelist)
> +		     int pages, struct page **pagelist, dma_addr_t *dma_addr)
>  {
>  	unsigned t;
>  	unsigned p;
> @@ -180,15 +181,22 @@ int radeon_gart_bind(struct radeon_device *rdev, unsigned offset,
>  	p = t / (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
>  
>  	for (i = 0; i < pages; i++, p++) {
> -		/* we need to support large memory configurations */
> -		/* assume that unbind have already been call on the range */
> -		rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i],
> +		/* On TTM path, we only use the DMA API if TTM_PAGE_FLAG_DMA32
> +		 * is requested. */
> +		if (dma_addr[i] != DMA_ERROR_CODE) {
> +			rdev->gart.ttm_alloced[p] = true;
> +			rdev->gart.pages_addr[p] = dma_addr[i];
> +		} else {
> +			/* we need to support large memory configurations */
> +			/* assume that unbind have already been call on the range */
> +			rdev->gart.pages_addr[p] = pci_map_page(rdev->pdev, pagelist[i],
>  							0, PAGE_SIZE,
>  							PCI_DMA_BIDIRECTIONAL);
> -		if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) {
> -			/* FIXME: failed to map page (return -ENOMEM?) */
> -			radeon_gart_unbind(rdev, offset, pages);
> -			return -ENOMEM;
> +			if (pci_dma_mapping_error(rdev->pdev, rdev->gart.pages_addr[p])) {
> +				/* FIXME: failed to map page (return -ENOMEM?) */
> +				radeon_gart_unbind(rdev, offset, pages);
> +				return -ENOMEM;
> +			}
>  		}
>  		rdev->gart.pages[p] = pagelist[i];
>  		page_base = rdev->gart.pages_addr[p];
> @@ -251,6 +259,12 @@ int radeon_gart_init(struct radeon_device *rdev)
>  		radeon_gart_fini(rdev);
>  		return -ENOMEM;
>  	}
> +	rdev->gart.ttm_alloced = kzalloc(sizeof(bool) *
> +					rdev->gart.num_cpu_pages, GFP_KERNEL);
> +	if (rdev->gart.ttm_alloced == NULL) {
> +		radeon_gart_fini(rdev);
> +		return -ENOMEM;
> +	}
>  	/* set GART entry to point to the dummy page by default */
>  	for (i = 0; i < rdev->gart.num_cpu_pages; i++) {
>  		rdev->gart.pages_addr[i] = rdev->dummy_page.addr;
> @@ -267,6 +281,8 @@ void radeon_gart_fini(struct radeon_device *rdev)
>  	rdev->gart.ready = false;
>  	kfree(rdev->gart.pages);
>  	kfree(rdev->gart.pages_addr);
> +	kfree(rdev->gart.ttm_alloced);
>  	rdev->gart.pages = NULL;
>  	rdev->gart.pages_addr = NULL;
> +	rdev->gart.ttm_alloced = NULL;
>  }
> diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
> index 6f156e9..ca04505 100644
> --- a/drivers/gpu/drm/radeon/radeon_ttm.c
> +++ b/drivers/gpu/drm/radeon/radeon_ttm.c
> @@ -647,6 +647,7 @@ struct radeon_ttm_backend {
>  	unsigned long			num_pages;
>  	struct page			**pages;
>  	struct page			*dummy_read_page;
> +	dma_addr_t			*dma_addrs;
>  	bool				populated;
>  	bool				bound;
>  	unsigned			offset;
> @@ -662,6 +663,7 @@ static int radeon_ttm_backend_populate(struct ttm_backend *backend,
>  
>  	gtt = container_of(backend, struct radeon_ttm_backend, backend);
>  	gtt->pages = pages;
> +	gtt->dma_addrs = dma_addrs;
>  	gtt->num_pages = num_pages;
>  	gtt->dummy_read_page = dummy_read_page;
>  	gtt->populated = true;
> @@ -674,6 +676,7 @@ static void radeon_ttm_backend_clear(struct ttm_backend *backend)
>  
>  	gtt = container_of(backend, struct radeon_ttm_backend, backend);
>  	gtt->pages = NULL;
> +	gtt->dma_addrs = NULL;
>  	gtt->num_pages = 0;
>  	gtt->dummy_read_page = NULL;
>  	gtt->populated = false;
> @@ -694,7 +697,7 @@ static int radeon_ttm_backend_bind(struct ttm_backend *backend,
>  		     gtt->num_pages, bo_mem, backend);
>  	}
>  	r = radeon_gart_bind(gtt->rdev, gtt->offset,
> -			     gtt->num_pages, gtt->pages);
> +			     gtt->num_pages, gtt->pages, gtt->dma_addrs);
>  	if (r) {
>  		DRM_ERROR("failed to bind %lu pages at 0x%08X\n",
>  			  gtt->num_pages, gtt->offset);
> -- 
> 1.7.1


* Re: [PATCH 5/5] nouveau/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-07 17:11 ` [PATCH 5/5] nouveau/ttm/PCIe: " Konrad Rzeszutek Wilk
@ 2011-01-27 21:22   ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-27 21:22 UTC (permalink / raw)
  To: dri-devel, thomas, airlied, linux-kernel, Ben Skeggs, Jerome Glisse
  Cc: konrad

On Fri, Jan 07, 2011 at 12:11:44PM -0500, Konrad Rzeszutek Wilk wrote:
> If the TTM layer has used the DMA API to setup pages that are
> TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t
> array for pages that are to in DMA32 pool."), lets use it
> when programming the GART in the PCIe type cards.
> 
> This patch skips doing the pci_map_page (and pci_unmap_page) if
> there is a DMA address passed in for that page. If the dma_address
> is zero (or DMA_ERROR_CODE), then we continue on with our old
> behaviour.

Hey Ben and Jerome,

I should have CC-ed you guys earlier but missed that and instead just
CC-ed the mailing list. I was wondering what your thoughts are
about this patchset? Thomas took a look at the patchset and he is OK
but more eyes never hurt.


> 
> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_sgdma.c |   28 +++++++++++++++++++++-------
>  1 files changed, 21 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_sgdma.c b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
> index edc140a..bbdd982 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_sgdma.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_sgdma.c
> @@ -12,6 +12,7 @@ struct nouveau_sgdma_be {
>  	struct drm_device *dev;
>  
>  	dma_addr_t *pages;
> +	bool *ttm_alloced;
>  	unsigned nr_pages;
>  
>  	unsigned pte_start;
> @@ -35,15 +36,25 @@ nouveau_sgdma_populate(struct ttm_backend *be, unsigned long num_pages,
>  	if (!nvbe->pages)
>  		return -ENOMEM;
>  
> +	nvbe->ttm_alloced = kmalloc(sizeof(bool) * num_pages, GFP_KERNEL);
> +	if (!nvbe->ttm_alloced)
> +		return -ENOMEM;
> +
>  	nvbe->nr_pages = 0;
>  	while (num_pages--) {
> -		nvbe->pages[nvbe->nr_pages] =
> -			pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
> +		if (dma_addrs[nvbe->nr_pages] != DMA_ERROR_CODE) {
> +			nvbe->pages[nvbe->nr_pages] =
> +					dma_addrs[nvbe->nr_pages];
> +		 	nvbe->ttm_alloced[nvbe->nr_pages] = true;
> +		} else {
> +			nvbe->pages[nvbe->nr_pages] =
> +				pci_map_page(dev->pdev, pages[nvbe->nr_pages], 0,
>  				     PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
> -		if (pci_dma_mapping_error(dev->pdev,
> -					  nvbe->pages[nvbe->nr_pages])) {
> -			be->func->clear(be);
> -			return -EFAULT;
> +			if (pci_dma_mapping_error(dev->pdev,
> +						  nvbe->pages[nvbe->nr_pages])) {
> +				be->func->clear(be);
> +				return -EFAULT;
> +			}
>  		}
>  
>  		nvbe->nr_pages++;
> @@ -66,11 +77,14 @@ nouveau_sgdma_clear(struct ttm_backend *be)
>  			be->func->unbind(be);
>  
>  		while (nvbe->nr_pages--) {
> -			pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
> +			if (!nvbe->ttm_alloced[nvbe->nr_pages])
> +				pci_unmap_page(dev->pdev, nvbe->pages[nvbe->nr_pages],
>  				       PAGE_SIZE, PCI_DMA_BIDIRECTIONAL);
>  		}
>  		kfree(nvbe->pages);
> +		kfree(nvbe->ttm_alloced);
>  		nvbe->pages = NULL;
> +		nvbe->ttm_alloced = NULL;
>  		nvbe->nr_pages = 0;
>  	}
>  }
> -- 
> 1.7.1


* Re: [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-27 21:20   ` Konrad Rzeszutek Wilk
@ 2011-01-28 14:42     ` Jerome Glisse
  2011-01-28 15:03       ` Konrad Rzeszutek Wilk
  2011-02-16 15:54       ` Konrad Rzeszutek Wilk
  0 siblings, 2 replies; 33+ messages in thread
From: Jerome Glisse @ 2011-01-28 14:42 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: dri-devel, thomas, airlied, linux-kernel, Alex Deucher,
	Jerome Glisse, konrad

On Thu, Jan 27, 2011 at 4:20 PM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
> On Fri, Jan 07, 2011 at 12:11:43PM -0500, Konrad Rzeszutek Wilk wrote:
>> If the TTM layer has used the DMA API to setup pages that are
>> TTM_PAGE_FLAG_DMA32 (look at patch titled: "ttm: Utilize the dma_addr_t
>> array for pages that are to in DMA32 pool."), lets use it
>> when programming the GART in the PCIe type cards.
>>
>> This patch skips doing the pci_map_page (and pci_unmap_page) if
>> there is a DMA address passed in for that page. If the dma_address
>> is zero (or DMA_ERROR_CODE), then we continue on with our old
>> behaviour.
>
> Hey Jerome,
>
> I should have CC-ed you earlier but missed that and instead just
> CC-ed the mailing list. I was wondering what your thoughts are
> about this patchset? Thomas took a look at the patchset and he is OK
> but more eyes never hurt.
>
> FYI, for clarity I haven't made the old calls that got moved (due to the
> added "{") checkpatch.pl compliant. It complains about them being
> past 80 characters. I can naturally fix that up, but thought
> that it might detract from the nature of the patch.

What happens when we don't ask for dma32 allocations from ttm? Will this
still work in a virtualized environment?

Code looks good but GART stuff can be picky; I will try to do full
scale testing in the next couple of weeks.

Cheers,
Jerome


* Re: [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-28 14:42     ` Jerome Glisse
@ 2011-01-28 15:03       ` Konrad Rzeszutek Wilk
  2011-02-16 15:54       ` Konrad Rzeszutek Wilk
  1 sibling, 0 replies; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-01-28 15:03 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: dri-devel, thomas, airlied, linux-kernel, Alex Deucher,
	Jerome Glisse, konrad

On Fri, Jan 28, 2011 at 09:42:48AM -0500, Jerome Glisse wrote:
> On Thu, Jan 27, 2011 at 4:20 PM, Konrad Rzeszutek Wilk
> <konrad.wilk@oracle.com> wrote:
> > On Fri, Jan 07, 2011 at 12:11:43PM -0500, Konrad Rzeszutek Wilk wrote:
> >> If the TTM layer has used the DMA API to set up pages that are
> >> TTM_PAGE_FLAG_DMA32 (see the patch titled "ttm: Utilize the dma_addr_t
> >> array for pages that are to be in the DMA32 pool."), let's use it
> >> when programming the GART in PCIe-type cards.
> >>
> >> This patch skips doing the pci_map_page (and pci_unmap_page) if
> >> there is a DMA address passed in for that page. If the dma_address
> >> is zero (or DMA_ERROR_CODE), then we continue on with our old
> >> behaviour.
> >
> > Hey Jerome,
> >
> > I should have CC-ed you earlier but missed that and instead just
> > CC-ed the mailing list. I was wondering what your thoughts are
> > about this patchset? Thomas took a look at the patchset and he is OK
> > but more eyes never hurt.
> >
> > FYI, for clarity I haven't made the old calls that were moved (due to
> > the added "{") checkpatch.pl compliant. It complains about them being
> > past 80 characters. I can naturally fix that up, but thought
> > that it might detract from the nature of the patch.
> 
> What happens when we don't ask TTM for a DMA32 allocation? Will this
> still work in a virtualized environment?

You mean if we don't pass in TTM_PAGE_FLAG_DMA32? It will go back
to using the old mechanism, which is to allocate pages using
alloc_page(GFP_HIGHUSER). That should be OK on bare metal or in a
virtualized environment, as long as the driver only uses the page
for non-DMA purposes, or the graphics card is 64-bit capable, in
which case it has no trouble reaching it.
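
A simplified sketch of that allocation split (the helper shape is
illustrative, not the patchset's exact code; the NULL device argument is
the one flagged as unsightly in the cover letter):

	/* DMA32 pool: let the DMA API pick a page the device can reach and
	 * record its bus address; otherwise use a plain highmem page and
	 * leave the slot unmapped. */
	struct page *p;
	if (flags & TTM_PAGE_FLAG_DMA32) {
		void *vaddr = dma_alloc_coherent(NULL, PAGE_SIZE,
						 &dma_address[i],
						 GFP_KERNEL | GFP_DMA32);
		p = vaddr ? virt_to_page(vaddr) : NULL;
	} else {
		p = alloc_page(GFP_HIGHUSER);
		dma_address[i] = DMA_ERROR_CODE; /* no DMA API mapping */
	}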

> 
> The code looks good, but GART stuff can be picky; I will try to do
> full-scale testing in the next couple of weeks.

Thank you! Much appreciated. Would it be OK if I pinged you in two weeks'
time just in case?


* Re: [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-01-28 14:42     ` Jerome Glisse
  2011-01-28 15:03       ` Konrad Rzeszutek Wilk
@ 2011-02-16 15:54       ` Konrad Rzeszutek Wilk
  2011-02-16 18:51         ` Jerome Glisse
  1 sibling, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-02-16 15:54 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: dri-devel, thomas, airlied, linux-kernel, Alex Deucher,
	Jerome Glisse, konrad

> The code looks good, but GART stuff can be picky; I will try to do
> full-scale testing in the next couple of weeks.

ping?


* Re: [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it.
  2011-02-16 15:54       ` Konrad Rzeszutek Wilk
@ 2011-02-16 18:51         ` Jerome Glisse
  0 siblings, 0 replies; 33+ messages in thread
From: Jerome Glisse @ 2011-02-16 18:51 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk
  Cc: dri-devel, thomas, airlied, linux-kernel, Alex Deucher,
	Jerome Glisse, konrad

On Wed, Feb 16, 2011 at 10:54 AM, Konrad Rzeszutek Wilk
<konrad.wilk@oracle.com> wrote:
>> The code looks good, but GART stuff can be picky; I will try to do
>> full-scale testing in the next couple of weeks.
>
> ping?
>

Sorry, I haven't had time to do one. But it looks good to me.

Cheers,
Jerome


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
                   ` (7 preceding siblings ...)
  2011-01-10 14:25 ` Thomas Hellstrom
@ 2011-03-21 13:11 ` Michel Dänzer
  2011-03-21 23:18   ` Konrad Rzeszutek Wilk
  8 siblings, 1 reply; 33+ messages in thread
From: Michel Dänzer @ 2011-03-21 13:11 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, thomas, airlied, linux-kernel, konrad

On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
> 
> 1) The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying
> with modifying the TTM API to pass in 'struct device' or 'struct pci_dev',
> but figured it would make sense to first get your input before heading
> down that route.

It's worse than unsightly: it breaks TTM on PPC. See
arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if
NULL is passed in for the device, and most of its callers BUG in that
case. The exception is dma_supported(), so a possible solution might
be to use that to check whether dma_alloc_coherent can be used.

Dave, please prevent this change from entering mainline before there's a
solution for this.
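
A guard along those lines might look like the following rough sketch, under
the assumption that the allocator can fall back to plain pages (dev may be
NULL here, which dma_supported() tolerates where the other get_dma_ops()
callers would BUG):

	/* Probe whether the DMA API is usable before taking the
	 * dma_alloc_coherent path; otherwise keep the old alloc_page path. */
	struct page *page;
	if (dma_supported(dev, DMA_BIT_MASK(32))) {
		void *vaddr = dma_alloc_coherent(dev, PAGE_SIZE, &dma_addr,
						 GFP_KERNEL | GFP_DMA32);
		page = vaddr ? virt_to_page(vaddr) : NULL;
	} else {
		page = alloc_page(GFP_DMA32);
		dma_addr = DMA_ERROR_CODE;
	}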


-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-03-21 13:11 ` Michel Dänzer
@ 2011-03-21 23:18   ` Konrad Rzeszutek Wilk
  2011-03-22 13:13     ` Michel Dänzer
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-03-21 23:18 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: dri-devel, thomas, airlied, linux-kernel, konrad

On Mon, Mar 21, 2011 at 02:11:07PM +0100, Michel Dänzer wrote:
> On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
> > 
> > 1) The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying
> > with modifying the TTM API to pass in 'struct device' or 'struct pci_dev',
> > but figured it would make sense to first get your input before heading
> > down that route.
> 
> It's worse than unsightly: it breaks TTM on PPC. See
> arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if
> NULL is passed in for the device, and most of its callers BUG in that
> case. The exception is dma_supported(), so a possible solution might
> be to use that to check whether dma_alloc_coherent can be used.
> 
> Dave, please prevent this change from entering mainline before there's a
> solution for this.

We do have a fix for it: 

git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v5

Can you tell me if that works for you?


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-03-21 23:18   ` Konrad Rzeszutek Wilk
@ 2011-03-22 13:13     ` Michel Dänzer
  2011-03-22 14:54       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 33+ messages in thread
From: Michel Dänzer @ 2011-03-22 13:13 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, thomas, airlied, linux-kernel, konrad

On Mon, 2011-03-21 at 19:18 -0400, Konrad Rzeszutek Wilk wrote: 
> On Mon, Mar 21, 2011 at 02:11:07PM +0100, Michel Dänzer wrote:
> > On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
> > > 
> > > 1) The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying
> > > with modifying the TTM API to pass in 'struct device' or 'struct pci_dev',
> > > but figured it would make sense to first get your input before heading
> > > down that route.
> > 
> > It's worse than unsightly: it breaks TTM on PPC. See
> > arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if
> > NULL is passed in for the device, and most of its callers BUG in that
> > case. The exception is dma_supported(), so a possible solution might
> > be to use that to check whether dma_alloc_coherent can be used.
> > 
> > Dave, please prevent this change from entering mainline before there's a
> > solution for this.
> 
> We do have a fix for it: 
> 
> git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v5
> 
> Can you tell me if that works for you?

Yeah,
http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=02bbfbab7dd6a107ea2f5d6e882631cd31c72eda and http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=733301920082553b52ce4453493fe6abf6aa7d1a fix the problem.


-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer




^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-03-22 13:13     ` Michel Dänzer
@ 2011-03-22 14:54       ` Konrad Rzeszutek Wilk
  2011-03-22 15:10         ` Michel Dänzer
  0 siblings, 1 reply; 33+ messages in thread
From: Konrad Rzeszutek Wilk @ 2011-03-22 14:54 UTC (permalink / raw)
  To: Michel Dänzer; +Cc: dri-devel, thomas, airlied, linux-kernel, konrad

On Tue, Mar 22, 2011 at 02:13:19PM +0100, Michel Dänzer wrote:
> On Mon, 2011-03-21 at 19:18 -0400, Konrad Rzeszutek Wilk wrote: 
> > On Mon, Mar 21, 2011 at 02:11:07PM +0100, Michel Dänzer wrote:
> > > On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
> > > > 
> > > > 1) The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying
> > > > with modifying the TTM API to pass in 'struct device' or 'struct pci_dev',
> > > > but figured it would make sense to first get your input before heading
> > > > down that route.
> > > 
> > > It's worse than unsightly: it breaks TTM on PPC. See
> > > arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if
> > > NULL is passed in for the device, and most of its callers BUG in that
> > > case. The exception is dma_supported(), so a possible solution might
> > > be to use that to check whether dma_alloc_coherent can be used.
> > > 
> > > Dave, please prevent this change from entering mainline before there's a
> > > solution for this.
> > 
> > We do have a fix for it: 
> > 
> > git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v5
> > 
> > Can you tell me if that works for you?
> 
> Yeah,
> http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=02bbfbab7dd6a107ea2f5d6e882631cd31c72eda and http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=733301920082553b52ce4453493fe6abf6aa7d1a fix the problem.
> 

Whew. Good! What kind of hardware do you have that triggered this? When I
implemented this patchset I hadn't thought about PPC because, well, I didn't
have the hardware, nor did I think there were any ATI/Nvidia cards that
worked with it. Can you give me an idea of the type of hardware this is and
where I could purchase it?

Much appreciated and thank you for testing this!


* Re: [RFC PATCH v2] Utilize the PCI API in the TTM framework.
  2011-03-22 14:54       ` Konrad Rzeszutek Wilk
@ 2011-03-22 15:10         ` Michel Dänzer
  0 siblings, 0 replies; 33+ messages in thread
From: Michel Dänzer @ 2011-03-22 15:10 UTC (permalink / raw)
  To: Konrad Rzeszutek Wilk; +Cc: dri-devel, thomas, airlied, linux-kernel, konrad

On Tue, 2011-03-22 at 10:54 -0400, Konrad Rzeszutek Wilk wrote:
> On Tue, Mar 22, 2011 at 02:13:19PM +0100, Michel Dänzer wrote:
> > On Mon, 2011-03-21 at 19:18 -0400, Konrad Rzeszutek Wilk wrote: 
> > > On Mon, Mar 21, 2011 at 02:11:07PM +0100, Michel Dänzer wrote:
> > > > On Fri, 2011-01-07 at 12:11 -0500, Konrad Rzeszutek Wilk wrote:
> > > > > 
> > > > > 1) The 'NULL' when doing dma_alloc_coherent is unsightly. I was toying
> > > > > with modifying the TTM API to pass in 'struct device' or 'struct pci_dev',
> > > > > but figured it would make sense to first get your input before heading
> > > > > down that route.
> > > > 
> > > > It's worse than unsightly: it breaks TTM on PPC. See
> > > > arch/powerpc/include/asm/dma-mapping.h: get_dma_ops() returns NULL if
> > > > NULL is passed in for the device, and most of its callers BUG in that
> > > > case. The exception is dma_supported(), so a possible solution might
> > > > be to use that to check whether dma_alloc_coherent can be used.
> > > > 
> > > > Dave, please prevent this change from entering mainline before there's a
> > > > solution for this.
> > > 
> > > We do have a fix for it: 
> > > 
> > > git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git devel/ttm.pci-api.v5
> > > 
> > > Can you tell me if that works for you?
> > 
> > Yeah,
> > http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=02bbfbab7dd6a107ea2f5d6e882631cd31c72eda and http://git.kernel.org/?p=linux/kernel/git/konrad/xen.git;a=commitdiff;h=733301920082553b52ce4453493fe6abf6aa7d1a fix the problem.
> 
> Whew. Good! What kind of hardware do you have that triggered this?
> When I implemented this patchset I hadn't thought about PPC because,
> well, I didn't have the hardware, nor did I think there were any
> ATI/Nvidia cards that worked with it. Can you give me an idea of the
> type of hardware this is and where I could purchase it?

It's an Apple PowerBook with a Mobility Radeon 9700; to buy one new
you'd have to go back in time a couple of years first. ;) You should be
able to use more recent GPUs in a second-hand Apple G5 or YDL
PowerStation, which also have reasonably fast CPUs.


-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer


Thread overview: 33+ messages
2011-01-07 17:11 [RFC PATCH v2] Utilize the PCI API in the TTM framework Konrad Rzeszutek Wilk
2011-01-07 17:11 ` [PATCH 1/5] ttm: Introduce a placeholder for DMA (bus) addresses Konrad Rzeszutek Wilk
2011-01-07 17:11 ` [PATCH 2/5] ttm: Utilize the dma_addr_t array for pages that are to be in the DMA32 pool Konrad Rzeszutek Wilk
2011-01-07 17:11 ` [PATCH 3/5] ttm: Expand (*populate) to support an array of DMA addresses Konrad Rzeszutek Wilk
2011-01-07 17:11 ` [PATCH 4/5] radeon/ttm/PCIe: Use dma_addr if TTM has set it Konrad Rzeszutek Wilk
2011-01-27 21:20   ` Konrad Rzeszutek Wilk
2011-01-28 14:42     ` Jerome Glisse
2011-01-28 15:03       ` Konrad Rzeszutek Wilk
2011-02-16 15:54       ` Konrad Rzeszutek Wilk
2011-02-16 18:51         ` Jerome Glisse
2011-01-07 17:11 ` [PATCH 5/5] nouveau/ttm/PCIe: " Konrad Rzeszutek Wilk
2011-01-27 21:22   ` Konrad Rzeszutek Wilk
2011-01-07 22:21 ` [RFC PATCH v2] Utilize the PCI API in the TTM framework Ian Campbell
2011-01-08 10:41 ` Thomas Hellstrom
2011-01-10 14:25 ` Thomas Hellstrom
2011-01-10 15:21   ` Konrad Rzeszutek Wilk
2011-01-10 15:58     ` Thomas Hellstrom
2011-01-10 16:45       ` Konrad Rzeszutek Wilk
2011-01-10 20:50         ` Thomas Hellstrom
2011-01-11 15:55           ` Konrad Rzeszutek Wilk
2011-01-11 16:21             ` Alex Deucher
2011-01-11 16:59               ` Konrad Rzeszutek Wilk
2011-01-11 18:12                 ` Alex Deucher
2011-01-11 18:28                   ` Konrad Rzeszutek Wilk
2011-01-11 19:28                     ` Alex Deucher
2011-01-12  9:12             ` Thomas Hellstrom
2011-01-12 15:19               ` Konrad Rzeszutek Wilk
2011-01-24 14:49                 ` Konrad Rzeszutek Wilk
2011-03-21 13:11 ` Michel Dänzer
2011-03-21 23:18   ` Konrad Rzeszutek Wilk
2011-03-22 13:13     ` Michel Dänzer
2011-03-22 14:54       ` Konrad Rzeszutek Wilk
2011-03-22 15:10         ` Michel Dänzer
