* [PATCH v4 0/6] drm: nouveau: memory coherency on ARM
@ 2014-07-08  8:25 ` Alexandre Courbot
  0 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:25 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: nouveau, dri-devel, linux-tegra, linux-kernel, gnurou, Alexandre Courbot

Another revision of this patchset, which is critical for GK20A operation.

Previous attempts used exclusively either TTM's regular page allocator or the
DMA API one. Both have their advantages and drawbacks: the page allocator is
fast but requires explicit synchronization on non-coherent architectures,
whereas the DMA allocator always returns coherent memory, but is also slower,
creates a permanent kernel mapping, and is more constrained as to which memory
it can use.

This version uses the best-suited allocator for each buffer's use case:
- buffers that are passed to user-space can be explicitly synced during their
  validation and preparation for CPU access, as previously shown by Lucas
  (http://lists.freedesktop.org/archives/nouveau/2013-August/014029.html). For
  these, we don't mind if the memory is non-coherent and prefer the page
  allocator.
- buffers that are used by the kernel, typically fences and GPFIFO buffers, are
  accessed rarely and thus should not trigger a costly flush or cache
  invalidation. For these, we want to guarantee coherent access and use the DMA
  API if necessary.

This series implements this behavior by allowing the TTM_PL_FLAG_UNCACHED
flag to be passed to nouveau_bo_new(). On coherent architectures this flag is
a no-op; on non-coherent architectures, it forces the creation of a coherent
buffer using the DMA API.

Several fixes and changes were necessary to enable this behavior:
- CPU addresses of DMA-allocated BOs must be made visible (patch 1) so their
  coherent mapping can be used by drivers.
- The DMA sync functions are required for BOs populated using the page
  allocator (patch 4). Pages need to be mapped to the device using the correct
  API if we are to call the sync functions (patch 2). Additionally, we need to
  know whether we are on a CPU-coherent architecture (patch 3).
- Coherent BOs need to be detected by Nouveau so their coherent kernel mapping
  can be used instead of creating a new one (patch 5).
- Finally, buffers that are used by the kernel should be requested to be
  coherent (patch 6).

Changes since v3:
- Only use the DMA allocator for BOs that strictly require to be coherent
- Fixed the way pages are mapped to the GPU on platform devices
- Thoroughly checked with CONFIG_DMA_API_DEBUG that there were no API violations

Alexandre Courbot (6):
  drm/ttm: expose CPU address of DMA-allocated pages
  drm/nouveau: map pages using DMA API on platform devices
  drm/nouveau: introduce nv_device_is_cpu_coherent()
  drm/nouveau: synchronize BOs when required
  drm/nouveau: implement explicitly coherent BOs
  drm/nouveau: allocate GPFIFOs and fences coherently

 drivers/gpu/drm/nouveau/core/engine/device/base.c  |  14 ++-
 drivers/gpu/drm/nouveau/core/include/core/device.h |   3 +
 drivers/gpu/drm/nouveau/nouveau_bo.c               | 132 +++++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_bo.h               |   3 +
 drivers/gpu/drm/nouveau/nouveau_chan.c             |   2 +-
 drivers/gpu/drm/nouveau/nouveau_gem.c              |  12 ++
 drivers/gpu/drm/nouveau/nv84_fence.c               |   4 +-
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c           |   2 +
 drivers/gpu/drm/ttm/ttm_tt.c                       |   6 +-
 include/drm/ttm/ttm_bo_driver.h                    |   2 +
 10 files changed, 167 insertions(+), 13 deletions(-)

-- 
2.0.0

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH v4 1/6] drm/ttm: expose CPU address of DMA-allocated pages
  2014-07-08  8:25 ` Alexandre Courbot
@ 2014-07-08  8:25     ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:25 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: nouveau, dri-devel, linux-tegra, linux-kernel, gnurou, Alexandre Courbot

Pages allocated using the DMA API have a coherent memory mapping. Make
this mapping visible to drivers so they can decide to use it instead of
creating their own redundant one.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_page_alloc_dma.c | 2 ++
 drivers/gpu/drm/ttm/ttm_tt.c             | 6 +++++-
 include/drm/ttm/ttm_bo_driver.h          | 2 ++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
index fb8259f69839..0301fac5badd 100644
--- a/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
+++ b/drivers/gpu/drm/ttm/ttm_page_alloc_dma.c
@@ -847,6 +847,7 @@ static int ttm_dma_pool_get_pages(struct dma_pool *pool,
 	if (count) {
 		d_page = list_first_entry(&pool->free_list, struct dma_page, page_list);
 		ttm->pages[index] = d_page->p;
+		ttm_dma->cpu_address[index] = d_page->vaddr;
 		ttm_dma->dma_address[index] = d_page->dma;
 		list_move_tail(&d_page->page_list, &ttm_dma->pages_list);
 		r = 0;
@@ -978,6 +979,7 @@ void ttm_dma_unpopulate(struct ttm_dma_tt *ttm_dma, struct device *dev)
 	INIT_LIST_HEAD(&ttm_dma->pages_list);
 	for (i = 0; i < ttm->num_pages; i++) {
 		ttm->pages[i] = NULL;
+		ttm_dma->cpu_address[i] = 0;
 		ttm_dma->dma_address[i] = 0;
 	}
 
diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 75f319090043..341594ede596 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -58,6 +58,8 @@ static void ttm_dma_tt_alloc_page_directory(struct ttm_dma_tt *ttm)
 	ttm->ttm.pages = drm_calloc_large(ttm->ttm.num_pages, sizeof(void*));
 	ttm->dma_address = drm_calloc_large(ttm->ttm.num_pages,
 					    sizeof(*ttm->dma_address));
+	ttm->cpu_address = drm_calloc_large(ttm->ttm.num_pages,
+					    sizeof(*ttm->cpu_address));
 }
 
 #ifdef CONFIG_X86
@@ -228,7 +230,7 @@ int ttm_dma_tt_init(struct ttm_dma_tt *ttm_dma, struct ttm_bo_device *bdev,
 
 	INIT_LIST_HEAD(&ttm_dma->pages_list);
 	ttm_dma_tt_alloc_page_directory(ttm_dma);
-	if (!ttm->pages || !ttm_dma->dma_address) {
+	if (!ttm->pages || !ttm_dma->dma_address || !ttm_dma->cpu_address) {
 		ttm_tt_destroy(ttm);
 		pr_err("Failed allocating page table\n");
 		return -ENOMEM;
@@ -243,6 +245,8 @@ void ttm_dma_tt_fini(struct ttm_dma_tt *ttm_dma)
 
 	drm_free_large(ttm->pages);
 	ttm->pages = NULL;
+	drm_free_large(ttm_dma->cpu_address);
+	ttm_dma->cpu_address = NULL;
 	drm_free_large(ttm_dma->dma_address);
 	ttm_dma->dma_address = NULL;
 }
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index a5183da3ef92..7d30f0666d24 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -133,6 +133,7 @@ struct ttm_tt {
  * struct ttm_dma_tt
  *
  * @ttm: Base ttm_tt struct.
+ * @cpu_address: The CPU address of the pages
  * @dma_address: The DMA (bus) addresses of the pages
  * @pages_list: used by some page allocation backend
  *
@@ -142,6 +143,7 @@ struct ttm_tt {
  */
 struct ttm_dma_tt {
 	struct ttm_tt ttm;
+	void **cpu_address;
 	dma_addr_t *dma_address;
 	struct list_head pages_list;
 };
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-08  8:25 ` Alexandre Courbot
@ 2014-07-08  8:25     ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:25 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: nouveau, dri-devel, linux-tegra, linux-kernel, gnurou, Alexandre Courbot

page_to_phys() is not the correct way to obtain the DMA address of a
buffer on a non-PCI system. Use the DMA API functions for this, which
are portable and will allow us to use other DMA API functions for
buffer synchronization.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
index 18c8c7245b73..e4e9e64988fe 100644
--- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
@@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
 		if (pci_dma_mapping_error(device->pdev, ret))
 			ret = 0;
 	} else {
-		ret = page_to_phys(page);
+		ret = dma_map_page(&device->platformdev->dev, page, 0,
+				   PAGE_SIZE, DMA_BIDIRECTIONAL);
+		if (dma_mapping_error(&device->platformdev->dev, ret))
+			ret = 0;
 	}
 
 	return ret;
@@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
 	if (nv_device_is_pci(device))
 		pci_unmap_page(device->pdev, addr, PAGE_SIZE,
 			       PCI_DMA_BIDIRECTIONAL);
+	else
+		dma_unmap_page(&device->platformdev->dev, addr,
+			       PAGE_SIZE, DMA_BIDIRECTIONAL);
 }
 
 int
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 3/6] drm/nouveau: introduce nv_device_is_cpu_coherent()
  2014-07-08  8:25 ` Alexandre Courbot
@ 2014-07-08  8:25     ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:25 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: nouveau, dri-devel, linux-tegra, linux-kernel, gnurou, Alexandre Courbot

Add a function allowing us to know whether a device is CPU-coherent,
i.e. accesses performed by the CPU on GPU-mapped buffers will
be immediately visible on the GPU side and vice-versa.

For now, a device is considered to be coherent if it uses the PCI bus on
a non-ARM architecture.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/core/engine/device/base.c  | 6 ++++++
 drivers/gpu/drm/nouveau/core/include/core/device.h | 3 +++
 2 files changed, 9 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
index e4e9e64988fe..23c7061aac3c 100644
--- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
@@ -520,6 +520,12 @@ nv_device_get_irq(struct nouveau_device *device, bool stall)
 	}
 }
 
+bool
+nv_device_is_cpu_coherent(struct nouveau_device *device)
+{
+	return (!IS_ENABLED(CONFIG_ARM) && nv_device_is_pci(device));
+}
+
 static struct nouveau_oclass
 nouveau_device_oclass = {
 	.handle = NV_ENGINE(DEVICE, 0x00),
diff --git a/drivers/gpu/drm/nouveau/core/include/core/device.h b/drivers/gpu/drm/nouveau/core/include/core/device.h
index a8a9a9cf16cb..1f9d5d87cf06 100644
--- a/drivers/gpu/drm/nouveau/core/include/core/device.h
+++ b/drivers/gpu/drm/nouveau/core/include/core/device.h
@@ -171,4 +171,7 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr);
 int
 nv_device_get_irq(struct nouveau_device *device, bool stall);
 
+bool
+nv_device_is_cpu_coherent(struct nouveau_device *device);
+
 #endif
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 4/6] drm/nouveau: synchronize BOs when required
  2014-07-08  8:25 ` Alexandre Courbot
@ 2014-07-08  8:25     ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:25 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: nouveau, dri-devel, linux-tegra, linux-kernel, gnurou, Alexandre Courbot

On architectures for which access to GPU memory is non-coherent,
caches need to be flushed and invalidated explicitly when BO control
changes between CPU and GPU.

This patch adds buffer synchronization functions which invoke the
correct API (PCI or DMA) to ensure synchronization is effective.

Based on the TTM DMA cache helper patches by Lucas Stach.

Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 56 +++++++++++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.h  |  2 ++
 drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
 3 files changed, 70 insertions(+)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 67e9e8e2e2ec..47e4e8886769 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
 		ttm_bo_kunmap(&nvbo->kmap);
 }
 
+void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+	struct nouveau_device *device = nouveau_dev(drm->dev);
+	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+	int i;
+
+	if (!ttm_dma)
+		return;
+
+	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
+		return;
+
+	if (nv_device_is_pci(device)) {
+		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+			pci_dma_sync_single_for_device(device->pdev,
+				ttm_dma->dma_address[i], PAGE_SIZE,
+				PCI_DMA_TODEVICE);
+	} else {
+		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+			dma_sync_single_for_device(nv_device_base(device),
+				ttm_dma->dma_address[i], PAGE_SIZE,
+				DMA_TO_DEVICE);
+	}
+}
+
+void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
+	struct nouveau_device *device = nouveau_dev(drm->dev);
+	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
+	int i;
+
+	if (!ttm_dma)
+		return;
+
+	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
+		return;
+
+	if (nv_device_is_pci(device)) {
+		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+			pci_dma_sync_single_for_cpu(device->pdev,
+				ttm_dma->dma_address[i], PAGE_SIZE,
+				PCI_DMA_FROMDEVICE);
+	} else {
+		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
+			dma_sync_single_for_cpu(nv_device_base(device),
+				ttm_dma->dma_address[i], PAGE_SIZE,
+				DMA_FROM_DEVICE);
+	}
+}
+
 int
 nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 		    bool no_wait_gpu)
@@ -418,6 +472,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 	if (!ttm)
 		return ret;
 
+	nouveau_bo_sync_for_device(nvbo);
+
 	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
 	nv_wr32(device, 0x70004, 0x00000001);
 	if (!nv_wait(device, 0x070004, 0x00000001, 0x00000000))
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index ff17c1f432fc..fa42298d2dca 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -81,6 +81,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
 void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
 int  nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
 			 bool no_wait_gpu);
+void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
+void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);
 
 struct nouveau_vma *
 nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index c90c0dc0afe8..08829a720891 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -896,6 +896,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
 	spin_lock(&nvbo->bo.bdev->fence_lock);
 	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
 	spin_unlock(&nvbo->bo.bdev->fence_lock);
+	nouveau_bo_sync_for_cpu(nvbo);
 	drm_gem_object_unreference_unlocked(gem);
 	return ret;
 }
@@ -904,6 +905,17 @@ int
 nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
 			   struct drm_file *file_priv)
 {
+	struct drm_nouveau_gem_cpu_fini *req = data;
+	struct drm_gem_object *gem;
+	struct nouveau_bo *nvbo;
+
+	gem = drm_gem_object_lookup(dev, file_priv, req->handle);
+	if (!gem)
+		return -ENOENT;
+	nvbo = nouveau_gem_object(gem);
+
+	nouveau_bo_sync_for_device(nvbo);
+	drm_gem_object_unreference_unlocked(gem);
 	return 0;
 }
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 5/6] drm/nouveau: implement explicitly coherent BOs
  2014-07-08  8:25 ` Alexandre Courbot
@ 2014-07-08  8:26     ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:26 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

Allow nouveau_bo_new() to recognize the TTM_PL_FLAG_UNCACHED flag, which
means that we want the allocated BO to be perfectly coherent between the
CPU and GPU. This is useful on non-coherent architectures where we do
not want to manually sync rarely-accessed buffers: typically, fences
and pushbuffers.

A TTM BO allocated with the TTM_PL_FLAG_UNCACHED flag on a non-coherent
architecture will be populated using the DMA API, and accesses to it will
be performed through the coherent mapping returned by dma_alloc_coherent().

Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c | 76 ++++++++++++++++++++++++++++++++----
 drivers/gpu/drm/nouveau/nouveau_bo.h |  1 +
 2 files changed, 69 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 47e4e8886769..23a29adfabf0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -219,6 +219,9 @@ nouveau_bo_new(struct drm_device *dev, int size, int align,
 	nvbo->tile_flags = tile_flags;
 	nvbo->bo.bdev = &drm->ttm.bdev;
 
+	if (!nv_device_is_cpu_coherent(nouveau_dev(dev)))
+		nvbo->force_coherent = flags & TTM_PL_FLAG_UNCACHED;
+
 	nvbo->page_shift = 12;
 	if (drm->client.base.vm) {
 		if (!(flags & TTM_PL_FLAG_TT) && size > 256 * 1024)
@@ -289,8 +292,9 @@ void
 nouveau_bo_placement_set(struct nouveau_bo *nvbo, uint32_t type, uint32_t busy)
 {
 	struct ttm_placement *pl = &nvbo->placement;
-	uint32_t flags = TTM_PL_MASK_CACHING |
-		(nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);
+	uint32_t flags = (nvbo->force_coherent ? TTM_PL_FLAG_UNCACHED :
+						 TTM_PL_MASK_CACHING) |
+			 (nvbo->pin_refcnt ? TTM_PL_FLAG_NO_EVICT : 0);
 
 	pl->placement = nvbo->placements;
 	set_placement_list(nvbo->placements, &pl->num_placement,
@@ -390,7 +394,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
 	if (ret)
 		return ret;
 
-	ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages, &nvbo->kmap);
+	/*
+	 * TTM buffers allocated using the DMA API already have a mapping, let's
+	 * use it instead.
+	 */
+	if (!nvbo->force_coherent)
+		ret = ttm_bo_kmap(&nvbo->bo, 0, nvbo->bo.mem.num_pages,
+				  &nvbo->kmap);
+
 	ttm_bo_unreserve(&nvbo->bo);
 	return ret;
 }
@@ -398,7 +409,14 @@ nouveau_bo_map(struct nouveau_bo *nvbo)
 void
 nouveau_bo_unmap(struct nouveau_bo *nvbo)
 {
-	if (nvbo)
+	if (!nvbo)
+		return;
+
+	/*
+	 * TTM buffers allocated using the DMA API come with a coherent
+	 * mapping that we used directly, so there is nothing to unmap.
+	 */
+	if (!nvbo->force_coherent)
 		ttm_bo_kunmap(&nvbo->kmap);
 }
 
@@ -482,12 +500,36 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 	return 0;
 }
 
+static inline void *
+_nouveau_bo_mem_index(struct nouveau_bo *nvbo, unsigned index, void *mem, u8 sz)
+{
+	struct ttm_dma_tt *dma_tt;
+	u8 *m = mem;
+
+	index *= sz;
+
+	if (m) {
+		/* kmap'd address, return the corresponding offset */
+		m += index;
+	} else {
+		/* DMA-API mapping, lookup the right address */
+		dma_tt = (struct ttm_dma_tt *)nvbo->bo.ttm;
+		m = dma_tt->cpu_address[index / PAGE_SIZE];
+		m += index % PAGE_SIZE;
+	}
+
+	return m;
+}
+#define nouveau_bo_mem_index(o, i, m) _nouveau_bo_mem_index(o, i, m, sizeof(*m))
+
 u16
 nouveau_bo_rd16(struct nouveau_bo *nvbo, unsigned index)
 {
 	bool is_iomem;
 	u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		return ioread16_native((void __force __iomem *)mem);
 	else
@@ -499,7 +541,9 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
 {
 	bool is_iomem;
 	u16 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		iowrite16_native(val, (void __force __iomem *)mem);
 	else
@@ -511,7 +555,9 @@ nouveau_bo_rd32(struct nouveau_bo *nvbo, unsigned index)
 {
 	bool is_iomem;
 	u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		return ioread32_native((void __force __iomem *)mem);
 	else
@@ -523,7 +569,9 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
 {
 	bool is_iomem;
 	u32 *mem = ttm_kmap_obj_virtual(&nvbo->kmap, &is_iomem);
-	mem = &mem[index];
+
+	mem = nouveau_bo_mem_index(nvbo, index, mem);
+
 	if (is_iomem)
 		iowrite32_native(val, (void __force __iomem *)mem);
 	else
@@ -1426,6 +1474,14 @@ nouveau_ttm_tt_populate(struct ttm_tt *ttm)
 	device = nv_device(drm->device);
 	dev = drm->dev;
 
+	/*
+	 * Objects matching this condition have been marked as force_coherent,
+	 * so use the DMA API for them.
+	 */
+	if (!nv_device_is_cpu_coherent(device) &&
+	    ttm->caching_state == tt_uncached)
+		return ttm_dma_populate(ttm_dma, dev->dev);
+
 #if __OS_HAS_AGP
 	if (drm->agp.stat == ENABLED) {
 		return ttm_agp_tt_populate(ttm);
@@ -1476,6 +1532,12 @@ nouveau_ttm_tt_unpopulate(struct ttm_tt *ttm)
 	device = nv_device(drm->device);
 	dev = drm->dev;
 
+	if (!nv_device_is_cpu_coherent(device) &&
+	    ttm->caching_state == tt_uncached) {
+		ttm_dma_unpopulate(ttm_dma, dev->dev);
+		return;
+	}
+
 #if __OS_HAS_AGP
 	if (drm->agp.stat == ENABLED) {
 		ttm_agp_tt_unpopulate(ttm);
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index fa42298d2dca..9a111b92fb34 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -11,6 +11,7 @@ struct nouveau_bo {
 	u32 valid_domains;
 	u32 placements[3];
 	u32 busy_placements[3];
+	bool force_coherent;
 	struct ttm_bo_kmap_obj kmap;
 	struct list_head head;
 
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* [PATCH v4 6/6] drm/nouveau: allocate GPFIFOs and fences coherently
  2014-07-08  8:25 ` Alexandre Courbot
@ 2014-07-08  8:26   ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-08  8:26 UTC (permalink / raw)
  To: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst
  Cc: gnurou, nouveau, linux-kernel, dri-devel, linux-tegra

Specify TTM_PL_FLAG_UNCACHED when allocating GPFIFOs and fences so that
the kernel can safely access them without explicit synchronization on
non-coherent architectures.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_chan.c | 2 +-
 drivers/gpu/drm/nouveau/nv84_fence.c   | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_chan.c b/drivers/gpu/drm/nouveau/nouveau_chan.c
index ccb6b452d6d0..155b1b192676 100644
--- a/drivers/gpu/drm/nouveau/nouveau_chan.c
+++ b/drivers/gpu/drm/nouveau/nouveau_chan.c
@@ -110,7 +110,7 @@ nouveau_channel_prep(struct nouveau_drm *drm, struct nouveau_cli *cli,
 	chan->handle = handle;
 
 	/* allocate memory for dma push buffer */
-	target = TTM_PL_FLAG_TT;
+	target = TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED;
 	if (nouveau_vram_pushbuf)
 		target = TTM_PL_FLAG_VRAM;
 
diff --git a/drivers/gpu/drm/nouveau/nv84_fence.c b/drivers/gpu/drm/nouveau/nv84_fence.c
index 9fd475c89820..b5d6737b6b8d 100644
--- a/drivers/gpu/drm/nouveau/nv84_fence.c
+++ b/drivers/gpu/drm/nouveau/nv84_fence.c
@@ -257,8 +257,8 @@ nv84_fence_create(struct nouveau_drm *drm)
 
 	if (ret == 0)
 		ret = nouveau_bo_new(drm->dev, 16 * (pfifo->max + 1), 0,
-				     TTM_PL_FLAG_TT, 0, 0, NULL,
-				     &priv->bo_gart);
+				     TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED, 0,
+				     0, NULL, &priv->bo_gart);
 	if (ret == 0) {
 		ret = nouveau_bo_pin(priv->bo_gart, TTM_PL_FLAG_TT);
 		if (ret == 0) {
-- 
2.0.0

^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-08  8:25     ` Alexandre Courbot
@ 2014-07-10 12:58         ` Daniel Vetter
  -1 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-10 12:58 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: David Airlie, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, David Herrmann

On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
> page_to_phys() is not the correct way to obtain the DMA address of a
> buffer on a non-PCI system. Use the DMA API functions for this, which
> are portable and will allow us to use other DMA API functions for
> buffer synchronization.
> 
> Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> index 18c8c7245b73..e4e9e64988fe 100644
> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
>  		if (pci_dma_mapping_error(device->pdev, ret))
>  			ret = 0;
>  	} else {
> -		ret = page_to_phys(page);
> +		ret = dma_map_page(&device->platformdev->dev, page, 0,
> +				   PAGE_SIZE, DMA_BIDIRECTIONAL);
> +		if (dma_mapping_error(&device->platformdev->dev, ret))
> +			ret = 0;
>  	}
>  
>  	return ret;
> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
>  	if (nv_device_is_pci(device))
>  		pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>  			       PCI_DMA_BIDIRECTIONAL);

pci_map/unmap are just aliases for dma_map/unmap when called on the
underlying struct device embedded in pci_device (as for platform
drivers). Dunno whether it's worth tracking a pointer to the struct
device directly and always calling dma_map/unmap.

Just a drive-by comment, since I'm interested in how you solve this: i915
has similar fun with buffer sharing on coherent and non-coherent
platforms, although we don't have the fun of mixing PCI and non-PCI
based platforms.
-Daniel

> +	else
> +		dma_unmap_page(&device->platformdev->dev, addr,
> +			       PAGE_SIZE, DMA_BIDIRECTIONAL);
>  }
>  
>  int
> -- 
> 2.0.0
> 
> _______________________________________________
> Nouveau mailing list
> Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required
  2014-07-08  8:25     ` Alexandre Courbot
@ 2014-07-10 13:04       ` Daniel Vetter
  -1 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-10 13:04 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: nouveau, linux-kernel, dri-devel, linux-tegra, Ben Skeggs

On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
> On architectures for which access to GPU memory is non-coherent,
> caches need to be flushed and invalidated explicitly when BO control
> changes between CPU and GPU.
> 
> This patch adds buffer synchronization functions which invoke the
> correct API (PCI or DMA) to ensure that synchronization is effective.
> 
> Based on the TTM DMA cache helper patches by Lucas Stach.
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c  | 56 +++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/nouveau/nouveau_bo.h  |  2 ++
>  drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 67e9e8e2e2ec..47e4e8886769 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
>  		ttm_bo_kunmap(&nvbo->kmap);
>  }
>  
> +void
> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
> +{
> +	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> +	struct nouveau_device *device = nouveau_dev(drm->dev);
> +	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> +	int i;
> +
> +	if (!ttm_dma)
> +		return;
> +
> +	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> +		return;

Is the is_cpu_coherent check really required? On coherent platforms the
sync_for_foo calls should be no-ops. It's the dma api's job to encapsulate
this knowledge so that drivers can be blissfully ignorant. The explicit
is_coherent check makes this a bit leaky. And same comment as above: the
underlying bus-specific dma-mapping functions are identical.
-Daniel

> +
> +	if (nv_device_is_pci(device)) {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			pci_dma_sync_single_for_device(device->pdev,
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				PCI_DMA_TODEVICE);
> +	} else {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			dma_sync_single_for_device(nv_device_base(device),
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				DMA_TO_DEVICE);
> +	}
> +}
> +
> +void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
> +{
> +	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> +	struct nouveau_device *device = nouveau_dev(drm->dev);
> +	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> +	int i;
> +
> +	if (!ttm_dma)
> +		return;
> +
> +	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> +		return;
> +
> +	if (nv_device_is_pci(device)) {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			pci_dma_sync_single_for_cpu(device->pdev,
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				PCI_DMA_FROMDEVICE);
> +	} else {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			dma_sync_single_for_cpu(nv_device_base(device),
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				DMA_FROM_DEVICE);
> +	}
> +}
> +
>  int
>  nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
>  		    bool no_wait_gpu)
> @@ -418,6 +472,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
>  	if (!ttm)
>  		return ret;
>  
> +	nouveau_bo_sync_for_device(nvbo);
> +
>  	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
>  	nv_wr32(device, 0x70004, 0x00000001);
>  	if (!nv_wait(device, 0x070004, 0x00000001, 0x00000000))
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
> index ff17c1f432fc..fa42298d2dca 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.h
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
> @@ -81,6 +81,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
>  void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
>  int  nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
>  			 bool no_wait_gpu);
> +void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
> +void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);
>  
>  struct nouveau_vma *
>  nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index c90c0dc0afe8..08829a720891 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -896,6 +896,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
>  	spin_lock(&nvbo->bo.bdev->fence_lock);
>  	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
>  	spin_unlock(&nvbo->bo.bdev->fence_lock);
> +	nouveau_bo_sync_for_cpu(nvbo);
>  	drm_gem_object_unreference_unlocked(gem);
>  	return ret;
>  }
> @@ -904,6 +905,17 @@ int
>  nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
>  			   struct drm_file *file_priv)
>  {
> +	struct drm_nouveau_gem_cpu_fini *req = data;
> +	struct drm_gem_object *gem;
> +	struct nouveau_bo *nvbo;
> +
> +	gem = drm_gem_object_lookup(dev, file_priv, req->handle);
> +	if (!gem)
> +		return -ENOENT;
> +	nvbo = nouveau_gem_object(gem);
> +
> +	nouveau_bo_sync_for_device(nvbo);
> +	drm_gem_object_unreference_unlocked(gem);
>  	return 0;
>  }
>  
> -- 
> 2.0.0
> 
> _______________________________________________
> Nouveau mailing list
> Nouveau@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required
@ 2014-07-10 13:04       ` Daniel Vetter
  0 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-10 13:04 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst, nouveau, linux-kernel,
	dri-devel, linux-tegra

On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
> On architectures for which access to GPU memory is non-coherent,
> caches need to be flushed and invalidated explicitly when BO control
> changes between CPU and GPU.
> 
> This patch adds buffer synchronization functions which invoke the
> correct API (PCI or DMA) to ensure that synchronization is effective.
> 
> Based on the TTM DMA cache helper patches by Lucas Stach.
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c  | 56 +++++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/nouveau/nouveau_bo.h  |  2 ++
>  drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
>  3 files changed, 70 insertions(+)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index 67e9e8e2e2ec..47e4e8886769 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
>  		ttm_bo_kunmap(&nvbo->kmap);
>  }
>  
> +void
> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
> +{
> +	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> +	struct nouveau_device *device = nouveau_dev(drm->dev);
> +	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> +	int i;
> +
> +	if (!ttm_dma)
> +		return;
> +
> +	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> +		return;

Is the is_cpu_coherent check really required? On coherent platforms the
sync_for_foo calls should be no-ops. It's the dma api's job to encapsulate
this knowledge so that drivers can be blissfully ignorant. The explicit
is_coherent check makes this a bit leaky. And same comment as above: the
underlying bus-specific dma-mapping functions are identical.
-Daniel

> +
> +	if (nv_device_is_pci(device)) {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			pci_dma_sync_single_for_device(device->pdev,
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				PCI_DMA_TODEVICE);
> +	} else {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			dma_sync_single_for_device(nv_device_base(device),
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				DMA_TO_DEVICE);
> +	}
> +}
> +
> +void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
> +{
> +	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> +	struct nouveau_device *device = nouveau_dev(drm->dev);
> +	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> +	int i;
> +
> +	if (!ttm_dma)
> +		return;
> +
> +	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> +		return;
> +
> +	if (nv_device_is_pci(device)) {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			pci_dma_sync_single_for_cpu(device->pdev,
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				PCI_DMA_FROMDEVICE);
> +	} else {
> +		for (i = 0; i < ttm_dma->ttm.num_pages; i++)
> +			dma_sync_single_for_cpu(nv_device_base(device),
> +				ttm_dma->dma_address[i], PAGE_SIZE,
> +				DMA_FROM_DEVICE);
> +	}
> +}
> +
>  int
>  nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
>  		    bool no_wait_gpu)
> @@ -418,6 +472,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
>  	if (!ttm)
>  		return ret;
>  
> +	nouveau_bo_sync_for_device(nvbo);
> +
>  	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
>  	nv_wr32(device, 0x70004, 0x00000001);
>  	if (!nv_wait(device, 0x070004, 0x00000001, 0x00000000))
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
> index ff17c1f432fc..fa42298d2dca 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.h
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
> @@ -81,6 +81,8 @@ void nouveau_bo_wr32(struct nouveau_bo *, unsigned index, u32 val);
>  void nouveau_bo_fence(struct nouveau_bo *, struct nouveau_fence *);
>  int  nouveau_bo_validate(struct nouveau_bo *, bool interruptible,
>  			 bool no_wait_gpu);
> +void nouveau_bo_sync_for_device(struct nouveau_bo *nvbo);
> +void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo);
>  
>  struct nouveau_vma *
>  nouveau_bo_vma_find(struct nouveau_bo *, struct nouveau_vm *);
> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index c90c0dc0afe8..08829a720891 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -896,6 +896,7 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
>  	spin_lock(&nvbo->bo.bdev->fence_lock);
>  	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
>  	spin_unlock(&nvbo->bo.bdev->fence_lock);
> +	nouveau_bo_sync_for_cpu(nvbo);
>  	drm_gem_object_unreference_unlocked(gem);
>  	return ret;
>  }
> @@ -904,6 +905,17 @@ int
>  nouveau_gem_ioctl_cpu_fini(struct drm_device *dev, void *data,
>  			   struct drm_file *file_priv)
>  {
> +	struct drm_nouveau_gem_cpu_fini *req = data;
> +	struct drm_gem_object *gem;
> +	struct nouveau_bo *nvbo;
> +
> +	gem = drm_gem_object_lookup(dev, file_priv, req->handle);
> +	if (!gem)
> +		return -ENOENT;
> +	nvbo = nouveau_gem_object(gem);
> +
> +	nouveau_bo_sync_for_device(nvbo);
> +	drm_gem_object_unreference_unlocked(gem);
>  	return 0;
>  }
>  
> -- 
> 2.0.0
> 
> _______________________________________________
> Nouveau mailing list
> Nouveau@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/nouveau

-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-10 12:58         ` [Nouveau] " Daniel Vetter
@ 2014-07-11  2:35             ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-11  2:35 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Alexandre Courbot

On 07/10/2014 09:58 PM, Daniel Vetter wrote:
> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>> page_to_phys() is not the correct way to obtain the DMA address of a
>> buffer on a non-PCI system. Use the DMA API functions for this, which
>> are portable and will allow us to use other DMA API functions for
>> buffer synchronization.
>>
>> Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>> ---
>>   drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>> index 18c8c7245b73..e4e9e64988fe 100644
>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
>>   		if (pci_dma_mapping_error(device->pdev, ret))
>>   			ret = 0;
>>   	} else {
>> -		ret = page_to_phys(page);
>> +		ret = dma_map_page(&device->platformdev->dev, page, 0,
>> +				   PAGE_SIZE, DMA_BIDIRECTIONAL);
>> +		if (dma_mapping_error(&device->platformdev->dev, ret))
>> +			ret = 0;
>>   	}
>>
>>   	return ret;
>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
>>   	if (nv_device_is_pci(device))
>>   		pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>>   			       PCI_DMA_BIDIRECTIONAL);
>
> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
> device embedded in pci_device (like for platform drivers). Dunno whether
> it's worth to track a pointer to the struct device directly and always
> call dma_unmap/map.

Isn't it (theoretically) possible to have a platform that does not use 
the DMA API for its PCI implementation and thus requires the pci_* 
functions to be called? I could not find such a case in -next, which 
suggests that all PCI platforms have been converted to the DMA API 
already and that we could indeed refactor this to always use the DMA 
functions.

But at the same time the way we use APIs should not be directed by their 
implementation, but by their intent - and unless the PCI API has been 
deprecated in some way (something I am not aware of), the rule is still 
that you should use it on a PCI device.
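For illustration, Daniel's suggestion of tracking the struct device directly
would reduce the mapping path to something like the sketch below (the generic
device->dev member is an assumption, not code that exists in this series):

```c
/* Sketch of the single-path variant (untested). Assumes nouveau_device
 * gains a struct device *dev member set to &pdev->dev on PCI or to
 * &platformdev->dev on platform devices -- an assumption, not existing
 * code. */
dma_addr_t
nv_device_map_page(struct nouveau_device *device, struct page *page)
{
	dma_addr_t ret;

	ret = dma_map_page(device->dev, page, 0, PAGE_SIZE,
			   DMA_BIDIRECTIONAL);
	if (dma_mapping_error(device->dev, ret))
		ret = 0;

	return ret;
}

void
nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
{
	dma_unmap_page(device->dev, addr, PAGE_SIZE, DMA_BIDIRECTIONAL);
}
```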

>
> Just drive-by comment since I'm interested in how you solve this - i915
> has similar fun with buffer sharing and coherent and non-coherent
> platforms. Although we don't have fun with pci and non-pci based
> platforms.

Yeah, I am not familiar with i915 but it seems like we are in the same 
boat here (except that ARM is more constrained as to its memory mappings). 
The strategy in this series is to map buffers used by user-space cached 
and explicitly synchronize them (since the ownership transition between 
user and GPU is always clearly performed by syscalls), and to use coherent 
mappings for buffers used by the kernel, which are accessed more randomly. 
This has solved all our coherency issues and resulted in the best 
performance so far.
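In code, the split described above comes down to the placement flags passed at
allocation time; roughly (argument lists abbreviated, flag usage as introduced
by this series):

```c
/* Kernel-internal buffer (fence, GPFIFO): coherent memory, so random
 * accesses by the kernel never need an explicit sync.
 * TTM_PL_FLAG_UNCACHED is a no-op on coherent architectures. */
ret = nouveau_bo_new(dev, size, align,
		     TTM_PL_FLAG_TT | TTM_PL_FLAG_UNCACHED,
		     /* ... */, &nvbo);

/* User-space buffer: cached mapping; caches are flushed/invalidated
 * explicitly at ownership transitions (validate / cpu_prep ioctls). */
ret = nouveau_bo_new(dev, size, align, TTM_PL_FLAG_TT,
		     /* ... */, &nvbo);
```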

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Nouveau] [PATCH v4 4/6] drm/nouveau: synchronize BOs when required
  2014-07-10 13:04       ` Daniel Vetter
@ 2014-07-11  2:40           ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-11  2:40 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: Ben Skeggs, David Airlie, David Herrmann, Lucas Stach,
	Thierry Reding, Maarten Lankhorst,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Alexandre Courbot

On 07/10/2014 10:04 PM, Daniel Vetter wrote:
> On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
>> On architectures for which access to GPU memory is non-coherent,
>> caches need to be flushed and invalidated explicitly when BO control
>> changes between CPU and GPU.
>>
>> This patch adds buffer synchronization functions which invoke the
>> correct API (PCI or DMA) to ensure synchronization is effective.
>>
>> Based on the TTM DMA cache helper patches by Lucas Stach.
>>
>> Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
>> Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>> ---
>>   drivers/gpu/drm/nouveau/nouveau_bo.c  | 56 +++++++++++++++++++++++++++++++++++
>>   drivers/gpu/drm/nouveau/nouveau_bo.h  |  2 ++
>>   drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
>>   3 files changed, 70 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> index 67e9e8e2e2ec..47e4e8886769 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
>>   		ttm_bo_kunmap(&nvbo->kmap);
>>   }
>>
>> +void
>> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
>> +{
>> +	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
>> +	struct nouveau_device *device = nouveau_dev(drm->dev);
>> +	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
>> +	int i;
>> +
>> +	if (!ttm_dma)
>> +		return;
>> +
>> +	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
>> +		return;
>
> Is the is_cpu_coherent check really required? On coherent platforms the
> sync_for_foo should be a noop. It's the dma api's job to encapsulate this
> knowledge so that drivers can be blissfully ignorant. The explicit
> is_coherent check makes this a bit leaky. And the same comment as before
> applies: the underlying bus-specific dma-mapping functions are identical.

I think you are right, the is_cpu_coherent check should not be needed 
here. I still think we should have separate paths for the PCI/DMA cases 
though, unless you can point me to a source that clearly states that the 
PCI API is deprecated and that the DMA API should be used instead.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-11  2:35             ` Alexandre Courbot
@ 2014-07-11  2:50                 ` Ben Skeggs
  -1 siblings, 0 replies; 34+ messages in thread
From: Ben Skeggs @ 2014-07-11  2:50 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Ben Skeggs,
	Daniel Vetter, linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Fri, Jul 11, 2014 at 12:35 PM, Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
>>
>> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>>>
>>> page_to_phys() is not the correct way to obtain the DMA address of a
>>> buffer on a non-PCI system. Use the DMA API functions for this, which
>>> are portable and will allow us to use other DMA API functions for
>>> buffer synchronization.
>>>
>>> Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>>> ---
>>>   drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> index 18c8c7245b73..e4e9e64988fe 100644
>>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device,
>>> struct page *page)
>>>                 if (pci_dma_mapping_error(device->pdev, ret))
>>>                         ret = 0;
>>>         } else {
>>> -               ret = page_to_phys(page);
>>> +               ret = dma_map_page(&device->platformdev->dev, page, 0,
>>> +                                  PAGE_SIZE, DMA_BIDIRECTIONAL);
>>> +               if (dma_mapping_error(&device->platformdev->dev, ret))
>>> +                       ret = 0;
>>>         }
>>>
>>>         return ret;
>>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device,
>>> dma_addr_t addr)
>>>         if (nv_device_is_pci(device))
>>>                 pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>>>                                PCI_DMA_BIDIRECTIONAL);
>>
>>
>> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
>> device embedded in pci_device (like for platform drivers). Dunno whether
>> it's worth to track a pointer to the struct device directly and always
>> call dma_unmap/map.
>
>
> Isn't it (theoretically) possible to have a platform that does not use the
> DMA API for its PCI implementation and thus requires the pci_* functions to
> be called? I could not find such a case in -next, which suggests that all
> PCI platforms have been converted to the DMA API already and that we could
> indeed refactor this to always use the DMA functions.
>
> But at the same time the way we use APIs should not be directed by their
> implementation, but by their intent - and unless the PCI API has been
> deprecated in some way (something I am not aware of), the rule is still that
> you should use it on a PCI device.
>
>
>>
>> Just drive-by comment since I'm interested in how you solve this - i915
>> has similar fun with buffer sharing and coherent and non-coherent
>> platforms. Although we don't have fun with pci and non-pci based
>> platforms.
>
>
> Yeah, I am not familiar with i915 but it seems like we are in the same boat
> here (except that ARM is more constrained as to its memory mappings). The
> strategy in this series is to map buffers used by user-space cached and
> explicitly synchronize them (since the ownership transition between user and
> GPU is always clearly performed by syscalls), and to use coherent mappings
> for buffers used by the kernel, which are accessed more randomly. This has
> solved all our coherency issues and resulted in the best performance so far.
I wonder if we might want to use unsnooped cached mappings of pages on
non-ARM platforms also, to avoid the overhead of the cache snooping?

>
>
> _______________________________________________
> dri-devel mailing list
> dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
> http://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-11  2:50                 ` [Nouveau] " Ben Skeggs
@ 2014-07-11  2:57                     ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-11  2:57 UTC (permalink / raw)
  To: Ben Skeggs
  Cc: Daniel Vetter, Alexandre Courbot,
	nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs

On 07/11/2014 11:50 AM, Ben Skeggs wrote:
> On Fri, Jul 11, 2014 at 12:35 PM, Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org> wrote:
>> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>>>>
>>>> page_to_phys() is not the correct way to obtain the DMA address of a
>>>> buffer on a non-PCI system. Use the DMA API functions for this, which
>>>> are portable and will allow us to use other DMA API functions for
>>>> buffer synchronization.
>>>>
>>>> Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>    drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> index 18c8c7245b73..e4e9e64988fe 100644
>>>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device,
>>>> struct page *page)
>>>>                  if (pci_dma_mapping_error(device->pdev, ret))
>>>>                          ret = 0;
>>>>          } else {
>>>> -               ret = page_to_phys(page);
>>>> +               ret = dma_map_page(&device->platformdev->dev, page, 0,
>>>> +                                  PAGE_SIZE, DMA_BIDIRECTIONAL);
>>>> +               if (dma_mapping_error(&device->platformdev->dev, ret))
>>>> +                       ret = 0;
>>>>          }
>>>>
>>>>          return ret;
>>>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device,
>>>> dma_addr_t addr)
>>>>          if (nv_device_is_pci(device))
>>>>                  pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>>>>                                 PCI_DMA_BIDIRECTIONAL);
>>>
>>>
>>> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
>>> device embedded in pci_device (like for platform drivers). Dunno whether
>>> it's worth to track a pointer to the struct device directly and always
>>> call dma_unmap/map.
>>
>>
>> Isn't it (theoretically) possible to have a platform that does not use the
>> DMA API for its PCI implementation and thus requires the pci_* functions to
>> be called? I could not find such a case in -next, which suggests that all
>> PCI platforms have been converted to the DMA API already and that we could
>> indeed refactor this to always use the DMA functions.
>>
>> But at the same time the way we use APIs should not be directed by their
>> implementation, but by their intent - and unless the PCI API has been
>> deprecated in some way (something I am not aware of), the rule is still that
>> you should use it on a PCI device.
>>
>>
>>>
>>> Just drive-by comment since I'm interested in how you solve this - i915
>>> has similar fun with buffer sharing and coherent and non-coherent
>>> platforms. Although we don't have fun with pci and non-pci based
>>> platforms.
>>
>>
>> Yeah, I am not familiar with i915 but it seems like we are in the same boat
>> here (except that ARM is more constrained as to its memory mappings). The
>> strategy in this series is to map buffers used by user-space cached and
>> explicitly synchronize them (since the ownership transition between user and
>> GPU is always clearly performed by syscalls), and to use coherent mappings
>> for buffers used by the kernel, which are accessed more randomly. This has
>> solved all our coherency issues and resulted in the best performance so far.
> I wonder if we might want to use unsnooped cached mappings of pages on
> non-ARM platforms also, to avoid the overhead of the cache snooping?

You might want to indeed, now that coherency is guaranteed by the sync 
functions originally introduced by Lucas. The only issue I could see is 
that they always invalidate the full buffer whereas bus snooping only 
affects pages that are actually touched. Someone would need to try this 
on a desktop machine and see how it affects performance.

I'd be all for it though, since it would also allow us to get rid of 
the ungraceful nv_device_is_cpu_coherent() function and simplify 
nouveau_bo.c a bit.
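A range-based sync would be one purely hypothetical way to mitigate the
full-buffer invalidation mentioned above; an untested sketch (the
nouveau_bo_device() helper returning the underlying struct device is assumed):

```c
/* Hypothetical, untested: sync only pages [first, last] of the BO for
 * CPU access instead of the whole buffer, so the cost scales with the
 * pages actually touched. Not part of this series. */
void
nouveau_bo_sync_range_for_cpu(struct nouveau_bo *nvbo,
			      unsigned long first, unsigned long last)
{
	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
	unsigned long i;

	if (!ttm_dma || last >= ttm_dma->ttm.num_pages)
		return;

	for (i = first; i <= last; i++)
		dma_sync_single_for_cpu(nouveau_bo_device(nvbo),
					ttm_dma->dma_address[i],
					PAGE_SIZE, DMA_FROM_DEVICE);
}
```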

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
@ 2014-07-11  2:57                     ` Alexandre Courbot
  0 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-11  2:57 UTC (permalink / raw)
  To: Ben Skeggs
  Cc: Daniel Vetter, Alexandre Courbot, nouveau, linux-kernel,
	dri-devel, linux-tegra, Ben Skeggs

On 07/11/2014 11:50 AM, Ben Skeggs wrote:
> On Fri, Jul 11, 2014 at 12:35 PM, Alexandre Courbot <acourbot@nvidia.com> wrote:
>> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
>>>
>>> On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
>>>>
>>>> page_to_phys() is not the correct way to obtain the DMA address of a
>>>> buffer on a non-PCI system. Use the DMA API functions for this, which
>>>> are portable and will allow us to use other DMA API functions for
>>>> buffer synchronization.
>>>>
>>>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>>>> ---
>>>>    drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
>>>>    1 file changed, 7 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> index 18c8c7245b73..e4e9e64988fe 100644
>>>> --- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> +++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
>>>> @@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device,
>>>> struct page *page)
>>>>                  if (pci_dma_mapping_error(device->pdev, ret))
>>>>                          ret = 0;
>>>>          } else {
>>>> -               ret = page_to_phys(page);
>>>> +               ret = dma_map_page(&device->platformdev->dev, page, 0,
>>>> +                                  PAGE_SIZE, DMA_BIDIRECTIONAL);
>>>> +               if (dma_mapping_error(&device->platformdev->dev, ret))
>>>> +                       ret = 0;
>>>>          }
>>>>
>>>>          return ret;
>>>> @@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device,
>>>> dma_addr_t addr)
>>>>          if (nv_device_is_pci(device))
>>>>                  pci_unmap_page(device->pdev, addr, PAGE_SIZE,
>>>>                                 PCI_DMA_BIDIRECTIONAL);
>>>
>>>
>>> pci_map/unmap alias to dma_unmap/map when called on the underlying struct
>>> device embedded in pci_device (like for platform drivers). Dunno whether
>>> it's worth to track a pointer to the struct device directly and always
>>> call dma_unmap/map.
>>
>>
>> Isn't it (theoretically) possible to have a platform that does not use the
>> DMA API for its PCI implementation and thus requires the pci_* functions to
>> be called? I could not find such a case in -next, which suggests that all
>> PCI platforms have been converted to the DMA API already and that we could
>> indeed refactor this to always use the DMA functions.
>>
>> But at the same time the way we use APIs should not be directed by their
>> implementation, but by their intent - and unless the PCI API has been
>> deprecated in some way (something I am not aware of), the rule is still that
>> you should use it on a PCI device.
>>
>>
>>>
>>> Just drive-by comment since I'm interested in how you solve this - i915
>>> has similar fun with buffer sharing and coherent and non-coherent
>>> platforms. Although we don't have fun with pci and non-pci based
>>> platforms.
>>
>>
>> Yeah, I am not familiar with i915 but it seems like we are in a similar boat
>> here (except that ARM is more constrained as to its memory mappings). The
>> strategy in this series is: map buffers used by user-space cached and
>> explicitly synchronize them (since the ownership transition from user to GPU
>> is always clearly performed by syscalls), and use coherent mappings for
>> buffers used by the kernel, which are accessed more randomly. This has solved
>> all our coherency issues and resulted in the best performance so far.
> I wonder if we might want to use unsnooped cached mappings of pages on
> non-ARM platforms also, to avoid the overhead of the cache snooping?

You might want to indeed, now that coherency is guaranteed by the sync 
functions originally introduced by Lucas. The only issue I could see is 
that they always invalidate the full buffer whereas bus snooping only 
affects pages that are actually touched. Someone would need to try this 
on a desktop machine and see how it affects performance.

I'd be all for it though, since it would also allow us to get rid of 
this ungraceful nv_device_is_cpu_coherent() function and result in 
simplifying nouveau_bo.c a bit.

^ permalink raw reply	[flat|nested] 34+ messages in thread
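
The ownership-transfer strategy described in the message above (cached CPU
mappings plus an explicit sync at each user/GPU handover) can be sketched as a
small userspace mock. Everything below is a stand-in, not nouveau code: the
struct, function names, and counters are assumptions, and the "sync" calls
merely record where dma_sync_single_for_device()/dma_sync_single_for_cpu()
would be invoked.

```c
/* Userspace mock: DMA-API calls are replaced by counters.  Names mirror
 * the pattern under discussion but nothing here is kernel code. */
enum owner { OWNER_CPU, OWNER_GPU };

struct mock_bo {
	enum owner owner;
	int dev_syncs;   /* where dma_sync_single_for_device() would run */
	int cpu_syncs;   /* where dma_sync_single_for_cpu() would run */
};

/* Called on the syscall path that hands a BO to the GPU (validation). */
void bo_sync_for_device(struct mock_bo *bo)
{
	if (bo->owner == OWNER_GPU)
		return;          /* GPU already owns it, nothing to flush */
	bo->dev_syncs++;         /* flush dirty CPU cache lines to memory */
	bo->owner = OWNER_GPU;
}

/* Called on the syscall path that prepares a BO for CPU access. */
void bo_sync_for_cpu(struct mock_bo *bo)
{
	if (bo->owner == OWNER_CPU)
		return;
	bo->cpu_syncs++;         /* invalidate stale CPU cache lines */
	bo->owner = OWNER_CPU;
}
```

Because the transitions only happen at syscall boundaries, each handover costs
at most one sync, which is the property the series relies on.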

* Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-11  2:35             ` Alexandre Courbot
@ 2014-07-11  7:38               ` Daniel Vetter
  -1 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-11  7:38 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: Alexandre Courbot, nouveau, linux-kernel, dri-devel, linux-tegra,
	Ben Skeggs

On Fri, Jul 11, 2014 at 11:35:23AM +0900, Alexandre Courbot wrote:
> On 07/10/2014 09:58 PM, Daniel Vetter wrote:
> >On Tue, Jul 08, 2014 at 05:25:57PM +0900, Alexandre Courbot wrote:
> >>page_to_phys() is not the correct way to obtain the DMA address of a
> >>buffer on a non-PCI system. Use the DMA API functions for this, which
> >>are portable and will allow us to use other DMA API functions for
> >>buffer synchronization.
> >>
> >>Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> >>---
> >>  drivers/gpu/drm/nouveau/core/engine/device/base.c | 8 +++++++-
> >>  1 file changed, 7 insertions(+), 1 deletion(-)
> >>
> >>diff --git a/drivers/gpu/drm/nouveau/core/engine/device/base.c b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> >>index 18c8c7245b73..e4e9e64988fe 100644
> >>--- a/drivers/gpu/drm/nouveau/core/engine/device/base.c
> >>+++ b/drivers/gpu/drm/nouveau/core/engine/device/base.c
> >>@@ -489,7 +489,10 @@ nv_device_map_page(struct nouveau_device *device, struct page *page)
> >>  		if (pci_dma_mapping_error(device->pdev, ret))
> >>  			ret = 0;
> >>  	} else {
> >>-		ret = page_to_phys(page);
> >>+		ret = dma_map_page(&device->platformdev->dev, page, 0,
> >>+				   PAGE_SIZE, DMA_BIDIRECTIONAL);
> >>+		if (dma_mapping_error(&device->platformdev->dev, ret))
> >>+			ret = 0;
> >>  	}
> >>
> >>  	return ret;
> >>@@ -501,6 +504,9 @@ nv_device_unmap_page(struct nouveau_device *device, dma_addr_t addr)
> >>  	if (nv_device_is_pci(device))
> >>  		pci_unmap_page(device->pdev, addr, PAGE_SIZE,
> >>  			       PCI_DMA_BIDIRECTIONAL);
> >
> >pci_map/unmap alias to dma_unmap/map when called on the underlying struct
> >device embedded in pci_device (like for platform drivers). Dunno whether
> >it's worth to track a pointer to the struct device directly and always
> >call dma_unmap/map.
> 
> Isn't it (theoretically) possible to have a platform that does not use the
> DMA API for its PCI implementation and thus requires the pci_* functions to
> be called? I could not find such a case in -next, which suggests that all
> PCI platforms have been converted to the DMA API already and that we could
> indeed refactor this to always use the DMA functions.
> 
> But at the same time the way we use APIs should not be directed by their
> implementation, but by their intent - and unless the PCI API has been
> deprecated in some way (something I am not aware of), the rule is still that
> you should use it on a PCI device.

Hm, somehow I thought it was recommended to just use the dma api
directly. It's what we're doing in i915 at least, but now I'm not so sure
any more. My guess is that this is just history really, from when the dma
api was pci-only.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread
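
The refactor floated in this exchange (track the generic struct device and
always call dma_map_page(), rather than branching between pci_map_page() and
dma_map_page()) might look roughly like the userspace mock below. Every type
and function here is a stand-in for illustration only; the identity "mapping"
in particular is a stub, not how the real DMA API behaves.

```c
#include <stdint.h>

typedef uint64_t mock_dma_addr_t;

struct mock_device { const char *bus; };

/* Stub for dma_map_page(): bus-agnostic by design, which is the point. */
static mock_dma_addr_t mock_dma_map_page(struct mock_device *dev,
					 uintptr_t page, unsigned size)
{
	(void)dev;
	(void)size;
	return (mock_dma_addr_t)page;  /* identity "mapping" for the mock */
}

/* Whether probed via PCI or as a platform device, both embed a generic
 * device, so one mapping path can serve both buses. */
struct mock_nouveau_device {
	struct mock_device pci_dev;       /* would be &pdev->dev */
	struct mock_device platform_dev;  /* would be &platformdev->dev */
	int is_pci;
};

static struct mock_device *generic_dev(struct mock_nouveau_device *d)
{
	return d->is_pci ? &d->pci_dev : &d->platform_dev;
}

mock_dma_addr_t nv_device_map_page(struct mock_nouveau_device *d,
				   uintptr_t page)
{
	/* Single call site for both buses -- no pci_map_page() branch. */
	return mock_dma_map_page(generic_dev(d), page, 4096);
}
```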

* Re: [PATCH v4 4/6] drm/nouveau: synchronize BOs when required
  2014-07-11  2:40           ` Alexandre Courbot
@ 2014-07-11  7:41               ` Daniel Vetter
  -1 siblings, 0 replies; 34+ messages in thread
From: Daniel Vetter @ 2014-07-11  7:41 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: David Airlie, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, Daniel Vetter,
	David Herrmann

On Fri, Jul 11, 2014 at 11:40:27AM +0900, Alexandre Courbot wrote:
> On 07/10/2014 10:04 PM, Daniel Vetter wrote:
> >On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
> >>On architectures for which access to GPU memory is non-coherent,
> >>caches need to be flushed and invalidated explicitly when BO control
> >>changes between CPU and GPU.
> >>
> >>This patch adds buffer synchronization functions which invoke the
> >>correct API (PCI or DMA) to ensure synchronization is effective.
> >>
> >>Based on the TTM DMA cache helper patches by Lucas Stach.
> >>
> >>Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
> >>Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
> >>---
> >>  drivers/gpu/drm/nouveau/nouveau_bo.c  | 56 +++++++++++++++++++++++++++++++++++
> >>  drivers/gpu/drm/nouveau/nouveau_bo.h  |  2 ++
> >>  drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
> >>  3 files changed, 70 insertions(+)
> >>
> >>diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>index 67e9e8e2e2ec..47e4e8886769 100644
> >>--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> >>@@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
> >>  		ttm_bo_kunmap(&nvbo->kmap);
> >>  }
> >>
> >>+void
> >>+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
> >>+{
> >>+	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
> >>+	struct nouveau_device *device = nouveau_dev(drm->dev);
> >>+	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
> >>+	int i;
> >>+
> >>+	if (!ttm_dma)
> >>+		return;
> >>+
> >>+	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
> >>+		return;
> >
> >Is the is_cpu_coherent check really required? On coherent platforms the
> >sync_for_foo should be a noop. It's the dma api's job to encapsulate this
> >knowledge so that drivers can be blissfully ignorant. The explicit
> >is_coherent check makes this a bit leaky. And same comment that underlying
> >the bus-specifics dma-mapping functions are identical.
> 
> I think you are right, the is_cpu_coherent check should not be needed here.
> I still think we should have separate paths for the PCI/DMA cases though,
> unless you can point me to a source that clearly states that the PCI API is
> deprecated and that DMA should be used instead.

Ah, on 2nd look I've found it again. Quoting
Documentation/DMA-API-HOWTO.txt:

"Note that the DMA API works with any bus independent of the underlying
microprocessor architecture. You should use the DMA API rather than the
bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
pci_map_*() interfaces."

The advice is fairly strong here I think ;-) And imo the idea makes sense,
since it allows drivers like nouveau to care much less about the actual
bus used to get data to/from the IP block. And if you look at intel gfx it
makes even more sense, since the pci layer we have is really just a thin
fake shim whacked on top of the hw (on SoCs at least).

Cheers, Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH v4 4/6] drm/nouveau: synchronize BOs when required
  2014-07-11  7:41               ` [Nouveau] " Daniel Vetter
@ 2014-07-11  9:35                   ` Alexandre Courbot
  -1 siblings, 0 replies; 34+ messages in thread
From: Alexandre Courbot @ 2014-07-11  9:35 UTC (permalink / raw)
  To: Daniel Vetter
  Cc: David Airlie, nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA, Ben Skeggs, David Herrmann

On 07/11/2014 04:41 PM, Daniel Vetter wrote:
> On Fri, Jul 11, 2014 at 11:40:27AM +0900, Alexandre Courbot wrote:
>> On 07/10/2014 10:04 PM, Daniel Vetter wrote:
>>> On Tue, Jul 08, 2014 at 05:25:59PM +0900, Alexandre Courbot wrote:
>>>> On architectures for which access to GPU memory is non-coherent,
>>>> caches need to be flushed and invalidated explicitly when BO control
>>>> changes between CPU and GPU.
>>>>
>>>> This patch adds buffer synchronization functions which invoke the
>>>> correct API (PCI or DMA) to ensure synchronization is effective.
>>>>
>>>> Based on the TTM DMA cache helper patches by Lucas Stach.
>>>>
>>>> Signed-off-by: Lucas Stach <dev-8ppwABl0HbeELgA04lAiVw@public.gmane.org>
>>>> Signed-off-by: Alexandre Courbot <acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
>>>> ---
>>>>   drivers/gpu/drm/nouveau/nouveau_bo.c  | 56 +++++++++++++++++++++++++++++++++++
>>>>   drivers/gpu/drm/nouveau/nouveau_bo.h  |  2 ++
>>>>   drivers/gpu/drm/nouveau/nouveau_gem.c | 12 ++++++++
>>>>   3 files changed, 70 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> index 67e9e8e2e2ec..47e4e8886769 100644
>>>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> @@ -402,6 +402,60 @@ nouveau_bo_unmap(struct nouveau_bo *nvbo)
>>>>   		ttm_bo_kunmap(&nvbo->kmap);
>>>>   }
>>>>
>>>> +void
>>>> +nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
>>>> +{
>>>> +	struct nouveau_drm *drm = nouveau_bdev(nvbo->bo.bdev);
>>>> +	struct nouveau_device *device = nouveau_dev(drm->dev);
>>>> +	struct ttm_dma_tt *ttm_dma = (struct ttm_dma_tt *)nvbo->bo.ttm;
>>>> +	int i;
>>>> +
>>>> +	if (!ttm_dma)
>>>> +		return;
>>>> +
>>>> +	if (nv_device_is_cpu_coherent(device) || nvbo->force_coherent)
>>>> +		return;
>>>
>>> Is the is_cpu_coherent check really required? On coherent platforms the
>>> sync_for_foo should be a noop. It's the dma api's job to encapsulate this
>>> knowledge so that drivers can be blissfully ignorant. The explicit
>>> is_coherent check makes this a bit leaky. And same comment that underlying
>>> the bus-specifics dma-mapping functions are identical.
>>
>> I think you are right, the is_cpu_coherent check should not be needed here.
>> I still think we should have separate paths for the PCI/DMA cases though,
>> unless you can point me to a source that clearly states that the PCI API is
>> deprecated and that DMA should be used instead.
>
> Ah, on 2nd look I've found it again. Quoting
> Documentation/DMA-API-HOWTO.txt:
>
> "Note that the DMA API works with any bus independent of the underlying
> microprocessor architecture. You should use the DMA API rather than the
> bus-specific DMA API, i.e., use the dma_map_*() interfaces rather than the
> pci_map_*() interfaces."
>
> The advice is fairly strong here I think ;-) And imo the idea makes sense,
> since it allows drivers like nouveau here to care much less about the
> actual bus used to get data to/from the ip block. And if you look at intel
> gfx it makes even more sense since the pci layer we have is really just a
> thin fake shim whacked on top of the hw (on SoCs at least).

Indeed, I stand corrected. :) That's good news actually, as it will 
simplify the code. Thanks for pointing that out!

I will send a new revision that makes use of the DMA API exclusively and 
will remove the nv_device_map/unmap() functions which are pretty useless 
now.

^ permalink raw reply	[flat|nested] 34+ messages in thread
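
The quoted hunk of nouveau_bo_sync_for_device() is truncated after the
early-return checks; the part that follows in the real patch iterates over the
BO's pages. A userspace mock of that per-page loop could look like this. All
names are assumptions, and the stub only counts pages instead of touching
caches.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t mock_dma_addr_t;

/* Mock of the bits of ttm_dma_tt the loop needs. */
struct mock_ttm_dma_tt {
	size_t num_pages;
	const mock_dma_addr_t *dma_address;  /* one DMA address per page */
};

static size_t synced_pages;  /* stands in for the DMA API's side effects */

/* Stub for dma_sync_single_for_device(). */
static void mock_dma_sync_single_for_device(mock_dma_addr_t addr, size_t len)
{
	(void)addr;
	(void)len;
	synced_pages++;
}

/* Flush every page of the BO before handing it to the GPU. */
size_t mock_bo_sync_for_device(const struct mock_ttm_dma_tt *ttm_dma)
{
	size_t i;

	synced_pages = 0;
	if (!ttm_dma)
		return 0;  /* BO not backed by a DMA-aware TTM, nothing to do */
	for (i = 0; i < ttm_dma->num_pages; i++)
		mock_dma_sync_single_for_device(ttm_dma->dma_address[i], 4096);
	return synced_pages;
}
```

This whole-buffer walk is exactly what the later discussion about partial
syncs wants to avoid for large, sparsely-touched buffers.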

* Re: [Nouveau] [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices
  2014-07-11  2:57                     ` Alexandre Courbot
@ 2014-07-11  9:53                       ` Lucas Stach
  -1 siblings, 0 replies; 34+ messages in thread
From: Lucas Stach @ 2014-07-11  9:53 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: Alexandre Courbot, nouveau, linux-kernel, dri-devel, Ben Skeggs,
	linux-tegra

Am Freitag, den 11.07.2014, 11:57 +0900 schrieb Alexandre Courbot:
[...]
> >> Yeah, I am not familiar with i915 but it seems like we are in a similar boat
> >> here (except that ARM is more constrained as to its memory mappings). The
> >> strategy in this series is: map buffers used by user-space cached and
> >> explicitly synchronize them (since the ownership transition from user to GPU
> >> is always clearly performed by syscalls), and use coherent mappings for
> >> buffers used by the kernel, which are accessed more randomly. This has solved
> >> all our coherency issues and resulted in the best performance so far.
> > I wonder if we might want to use unsnooped cached mappings of pages on
> > non-ARM platforms also, to avoid the overhead of the cache snooping?
> 
> You might want to indeed, now that coherency is guaranteed by the sync 
> functions originally introduced by Lucas. The only issue I could see is 
> that they always invalidate the full buffer whereas bus snooping only 
> affects pages that are actually touched. Someone would need to try this 
> on a desktop machine and see how it affects performance.
> 
> I'd be all for it though, since it would also allow us to get rid of 
> this ungraceful nv_device_is_cpu_coherent() function and result in 
> simplifying nouveau_bo.c a bit.

This will need some testing to get hard numbers, but I suspect that
invalidating the whole buffer isn't too bad, as the prefetch machinery
works very well with the access patterns we see in graphics drivers.

Flushing out the whole buffer should be even less problematic, as it
will only flush out dirty lines that would need to be flushed for GPU
read snooping anyway.

In the long run we might want a separate CPU prepare/finish ioctl where
we can indicate the area of interest. This might help to avoid some of
the invalidation overhead, especially for userspace-suballocated buffers.

Regards,
Lucas

-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |

^ permalink raw reply	[flat|nested] 34+ messages in thread
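
The page-range arithmetic such a prepare/finish ioctl would need (sync only
the pages overlapping the indicated area instead of the whole buffer) is
straightforward. This is a sketch; the 4 KiB page size and the function names
are assumptions for illustration.

```c
#include <stddef.h>

#define MOCK_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Index of the first page touched by [offset, offset + len). */
static size_t first_page(size_t offset)
{
	return offset / MOCK_PAGE_SIZE;
}

/* Index of the last page touched by [offset, offset + len), len > 0. */
static size_t last_page(size_t offset, size_t len)
{
	return (offset + len - 1) / MOCK_PAGE_SIZE;
}

/* Number of pages a partial sync would have to flush/invalidate. */
size_t pages_to_sync(size_t offset, size_t len)
{
	if (len == 0)
		return 0;
	return last_page(offset, len) - first_page(offset) + 1;
}
```

For a userspace suballocator handing out small slices of a big BO, this turns
an O(buffer-size) sync into one or two pages per slice, which is where most of
the saving would come from.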

end of thread, other threads:[~2014-07-11  9:55 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-08  8:25 [PATCH v4 0/6] drm: nouveau: memory coherency on ARM Alexandre Courbot
2014-07-08  8:25 ` Alexandre Courbot
     [not found] ` <1404807961-30530-1-git-send-email-acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-07-08  8:25   ` [PATCH v4 1/6] drm/ttm: expose CPU address of DMA-allocated pages Alexandre Courbot
2014-07-08  8:25     ` Alexandre Courbot
2014-07-08  8:25   ` [PATCH v4 2/6] drm/nouveau: map pages using DMA API on platform devices Alexandre Courbot
2014-07-08  8:25     ` Alexandre Courbot
     [not found]     ` <1404807961-30530-3-git-send-email-acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-07-10 12:58       ` Daniel Vetter
2014-07-10 12:58         ` [Nouveau] " Daniel Vetter
     [not found]         ` <20140710125849.GF17271-dv86pmgwkMBes7Z6vYuT8azUEOm+Xw19@public.gmane.org>
2014-07-11  2:35           ` Alexandre Courbot
2014-07-11  2:35             ` Alexandre Courbot
     [not found]             ` <53BF4D6B.70904-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-07-11  2:50               ` Ben Skeggs
2014-07-11  2:50                 ` [Nouveau] " Ben Skeggs
     [not found]                 ` <CACAvsv7eER4VmbR81Ym=YE7fQZ9cNuJsb5372SAuSX+PQfYyrQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-07-11  2:57                   ` Alexandre Courbot
2014-07-11  2:57                     ` Alexandre Courbot
2014-07-11  9:53                     ` Lucas Stach
2014-07-11  9:53                       ` Lucas Stach
2014-07-11  7:38             ` Daniel Vetter
2014-07-11  7:38               ` Daniel Vetter
2014-07-08  8:25   ` [PATCH v4 3/6] drm/nouveau: introduce nv_device_is_cpu_coherent() Alexandre Courbot
2014-07-08  8:25     ` Alexandre Courbot
2014-07-08  8:25   ` [PATCH v4 4/6] drm/nouveau: synchronize BOs when required Alexandre Courbot
2014-07-08  8:25     ` Alexandre Courbot
2014-07-10 13:04     ` [Nouveau] " Daniel Vetter
2014-07-10 13:04       ` Daniel Vetter
     [not found]       ` <20140710130449.GG17271-dv86pmgwkMBes7Z6vYuT8azUEOm+Xw19@public.gmane.org>
2014-07-11  2:40         ` Alexandre Courbot
2014-07-11  2:40           ` Alexandre Courbot
     [not found]           ` <53BF4E9B.7090606-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-07-11  7:41             ` Daniel Vetter
2014-07-11  7:41               ` [Nouveau] " Daniel Vetter
     [not found]               ` <20140711074138.GW17271-dv86pmgwkMBes7Z6vYuT8azUEOm+Xw19@public.gmane.org>
2014-07-11  9:35                 ` Alexandre Courbot
2014-07-11  9:35                   ` [Nouveau] " Alexandre Courbot
2014-07-08  8:26   ` [PATCH v4 5/6] drm/nouveau: implement explicitly coherent BOs Alexandre Courbot
2014-07-08  8:26     ` Alexandre Courbot
2014-07-08  8:26 ` [PATCH v4 6/6] drm/nouveau: allocate GPFIFOs and fences coherently Alexandre Courbot
2014-07-08  8:26   ` Alexandre Courbot
