* [PATCH 0/4] drm/ttm: nouveau: memory coherency fixes for ARM
@ 2014-05-19  7:10 ` Alexandre Courbot
  0 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-19  7:10 UTC (permalink / raw)
  To: David Airlie, Ben Skeggs, Lucas Stach, Thierry Reding
  Cc: nouveau, linux-kernel, dri-devel, linux-tegra

This small series introduces TTM helper functions as well as Nouveau hooks that
are needed to ensure buffer coherency on ARM. Most of this series is a
forward-port of some patches Lucas Stach sent last year and that are also
needed for Nouveau GK20A support:

http://lists.freedesktop.org/archives/nouveau/2013-August/014026.html

Another patch takes care of flushing the CPU write-buffer when writing BOs
through a non-BAR path.

Alexandre Courbot (1):
  drm/nouveau: introduce CPU cache flushing macro

Lucas Stach (3):
  drm/ttm: recognize ARM arch in ioprot handler
  drm/ttm: introduce dma cache sync helpers
  drm/nouveau: hook up cache sync functions

 drivers/gpu/drm/nouveau/core/os.h     | 17 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 40 +++++++++++++++++++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_bo.h  | 20 ++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_gem.c |  8 ++++++-
 drivers/gpu/drm/ttm/ttm_bo_util.c     |  2 +-
 drivers/gpu/drm/ttm/ttm_tt.c          | 25 ++++++++++++++++++++++
 include/drm/ttm/ttm_bo_driver.h       | 28 ++++++++++++++++++++++++
 7 files changed, 136 insertions(+), 4 deletions(-)

-- 
1.9.2

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 1/4] drm/ttm: recognize ARM arch in ioprot handler
  2014-05-19  7:10 ` Alexandre Courbot
@ 2014-05-19  7:10   ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-19  7:10 UTC (permalink / raw)
  To: David Airlie, Ben Skeggs, Lucas Stach, Thierry Reding
  Cc: gnurou, nouveau, linux-kernel, dri-devel, linux-tegra

From: Lucas Stach <dev@lynxeye.de>

Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_bo_util.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo_util.c b/drivers/gpu/drm/ttm/ttm_bo_util.c
index 1df856f78568..30e5d90cb7bc 100644
--- a/drivers/gpu/drm/ttm/ttm_bo_util.c
+++ b/drivers/gpu/drm/ttm/ttm_bo_util.c
@@ -500,7 +500,7 @@ pgprot_t ttm_io_prot(uint32_t caching_flags, pgprot_t tmp)
 			pgprot_val(tmp) |= _PAGE_GUARDED;
 	}
 #endif
-#if defined(__ia64__)
+#if defined(__ia64__) || defined(__arm__)
 	if (caching_flags & TTM_PL_FLAG_WC)
 		tmp = pgprot_writecombine(tmp);
 	else
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 2/4] drm/ttm: introduce dma cache sync helpers
  2014-05-19  7:10 ` Alexandre Courbot
@ 2014-05-19  7:10     ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-19  7:10 UTC (permalink / raw)
  To: David Airlie, Ben Skeggs, Lucas Stach, Thierry Reding
  Cc: nouveau, linux-kernel, dri-devel, linux-tegra

From: Lucas Stach <dev@lynxeye.de>

On arches with non-coherent PCI, we need to flush caches ourselfes at
the appropriate places. Introduce two small helpers to make things easy
for TTM based drivers.

Signed-off-by: Lucas Stach <dev@lynxeye.de>
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/ttm/ttm_tt.c    | 25 +++++++++++++++++++++++++
 include/drm/ttm/ttm_bo_driver.h | 28 ++++++++++++++++++++++++++++
 2 files changed, 53 insertions(+)

diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
index 75f319090043..05a316b71ad1 100644
--- a/drivers/gpu/drm/ttm/ttm_tt.c
+++ b/drivers/gpu/drm/ttm/ttm_tt.c
@@ -38,6 +38,7 @@
 #include <linux/swap.h>
 #include <linux/slab.h>
 #include <linux/export.h>
+#include <linux/dma-mapping.h>
 #include <drm/drm_cache.h>
 #include <drm/drm_mem_util.h>
 #include <drm/ttm/ttm_module.h>
@@ -248,6 +249,30 @@ void ttm_dma_tt_fini(struct ttm_dma_tt *ttm_dma)
 }
 EXPORT_SYMBOL(ttm_dma_tt_fini);
 
+void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
+				      struct device *dev)
+{
+	int i;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++) {
+		dma_sync_single_for_device(dev, ttm_dma->dma_address[i],
+					   PAGE_SIZE, DMA_TO_DEVICE);
+	}
+}
+EXPORT_SYMBOL(ttm_dma_tt_cache_sync_for_device);
+
+void ttm_dma_tt_cache_sync_for_cpu(struct ttm_dma_tt *ttm_dma,
+				   struct device *dev)
+{
+	int i;
+
+	for (i = 0; i < ttm_dma->ttm.num_pages; i++) {
+		dma_sync_single_for_cpu(dev, ttm_dma->dma_address[i],
+					PAGE_SIZE, DMA_FROM_DEVICE);
+	}
+}
+EXPORT_SYMBOL(ttm_dma_tt_cache_sync_for_cpu);
+
 void ttm_tt_unbind(struct ttm_tt *ttm)
 {
 	int ret;
diff --git a/include/drm/ttm/ttm_bo_driver.h b/include/drm/ttm/ttm_bo_driver.h
index a5183da3ef92..52fb709568fc 100644
--- a/include/drm/ttm/ttm_bo_driver.h
+++ b/include/drm/ttm/ttm_bo_driver.h
@@ -41,6 +41,7 @@
 #include <linux/fs.h>
 #include <linux/spinlock.h>
 #include <linux/reservation.h>
+#include <linux/device.h>
 
 struct ttm_backend_func {
 	/**
@@ -690,6 +691,33 @@ extern int ttm_tt_swapout(struct ttm_tt *ttm,
  */
 extern void ttm_tt_unpopulate(struct ttm_tt *ttm);
 
+/**
+ * ttm_dma_tt_cache_sync_for_device:
+ *
+ * @ttm A struct ttm_tt of the type returned by ttm_dma_tt_init.
+ * @dev A struct device representing the device to which to sync.
+ *
+ * This function will flush the CPU caches on arches where snooping in the
+ * TT is not available. On fully coherent arches this will turn into an (almost)
+ * noop. This makes sure that data written by the CPU is visible to the device.
+ */
+extern void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
+					     struct device *dev);
+
+/**
+ * ttm_dma_tt_cache_sync_for_cpu:
+ *
+ * @ttm A struct ttm_tt of the type returned by ttm_dma_tt_init.
+ * @dev A struct device representing the device from which to sync.
+ *
+ * This function will invalidate the CPU caches on arches where snooping in the
+ * TT is not available. On fully coherent arches this will turn into an (almost)
+ * noop. This makes sure that the CPU does not read any stale cached or
+ * prefetched data.
+ */
+extern void ttm_dma_tt_cache_sync_for_cpu(struct ttm_dma_tt *ttm_dma,
+					  struct device *dev);
+
 /*
  * ttm_bo.c
  */
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 3/4] drm/nouveau: hook up cache sync functions
  2014-05-19  7:10 ` Alexandre Courbot
@ 2014-05-19  7:10   ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-19  7:10 UTC (permalink / raw)
  To: David Airlie, Ben Skeggs, Lucas Stach, Thierry Reding
  Cc: gnurou, nouveau, linux-kernel, dri-devel, linux-tegra

From: Lucas Stach <dev@lynxeye.de>

Signed-off-by: Lucas Stach <dev@lynxeye.de>
[acourbot@nvidia.com: make conditional and platform-friendly]
Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c  | 32 ++++++++++++++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.h  | 20 ++++++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_gem.c |  8 +++++++-
 3 files changed, 59 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index b6dc85c614be..0886f47e5244 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -407,6 +407,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
 {
 	int ret;
 
+	nouveau_bo_sync_for_device(nvbo);
+
 	ret = ttm_bo_validate(&nvbo->bo, &nvbo->placement,
 			      interruptible, no_wait_gpu);
 	if (ret)
@@ -487,6 +489,36 @@ nouveau_bo_invalidate_caches(struct ttm_bo_device *bdev, uint32_t flags)
 	return 0;
 }
 
+#ifdef NOUVEAU_NEED_CACHE_SYNC
+void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
+{
+	struct nouveau_device *device;
+	struct ttm_tt *ttm = nvbo->bo.ttm;
+
+	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
+
+	if (nvbo->bo.ttm && nvbo->bo.ttm->caching_state == tt_cached)
+		ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)nvbo->bo.ttm,
+					      nv_device_base(device));
+}
+
+void
+nouveau_bo_sync_for_device(struct nouveau_bo *nvbo)
+{
+	struct ttm_tt *ttm = nvbo->bo.ttm;
+
+	if (ttm && ttm->caching_state == tt_cached) {
+		struct nouveau_device *device;
+
+		device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
+
+		ttm_dma_tt_cache_sync_for_device((struct ttm_dma_tt *)ttm,
+						 nv_device_base(device));
+	}
+}
+#endif
+
 static int
 nouveau_bo_init_mem_type(struct ttm_bo_device *bdev, uint32_t type,
 			 struct ttm_mem_type_manager *man)
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index ff17c1f432fc..ead214931223 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -89,6 +89,26 @@ int  nouveau_bo_vma_add(struct nouveau_bo *, struct nouveau_vm *,
 			struct nouveau_vma *);
 void nouveau_bo_vma_del(struct nouveau_bo *, struct nouveau_vma *);
 
+#if IS_ENABLED(CONFIG_ARCH_TEGRA)
+#define NOUVEAU_NEED_CACHE_SYNC
+#endif
+
+#ifdef NOUVEAU_NEED_CACHE_SYNC
+void nouveau_bo_sync_for_cpu(struct nouveau_bo *);
+void nouveau_bo_sync_for_device(struct nouveau_bo *);
+#else
+static inline void
+nouveau_bo_sync_for_cpu(struct nouveau_bo *)
+{
+}
+
+static inline void
+nouveau_bo_sync_for_device(struct nouveau_bo *)
+{
+}
+#endif
+
+
 /* TODO: submit equivalent to TTM generic API upstream? */
 static inline void __iomem *
 nvbo_kmap_obj_iovirtual(struct nouveau_bo *nvbo)
diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
index c90c0dc0afe8..b7e42fdc9634 100644
--- a/drivers/gpu/drm/nouveau/nouveau_gem.c
+++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
@@ -897,7 +897,13 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
 	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
 	spin_unlock(&nvbo->bo.bdev->fence_lock);
 	drm_gem_object_unreference_unlocked(gem);
-	return ret;
+
+	if (ret)
+		return ret;
+
+	nouveau_bo_sync_for_cpu(nvbo);
+
+	return 0;
 }
 
 int
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
  2014-05-19  7:10 ` Alexandre Courbot
@ 2014-05-19  7:10     ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-19  7:10 UTC (permalink / raw)
  To: David Airlie, Ben Skeggs, Lucas Stach, Thierry Reding
  Cc: nouveau, linux-kernel, dri-devel, linux-tegra

Some architectures (e.g. ARM) need the CPU buffers to be explicitly
flushed for a memory write to take effect. Not doing so results in
synchronization issues, especially after writing to BOs.

This patch introduces a macro that flushes the caches on ARM and
translates to a no-op on other architectures, and uses it when
writing to in-memory BOs. It will also be useful for implementations of
instmem that access shared memory directly instead of going through
PRAMIN.

Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
---
 drivers/gpu/drm/nouveau/core/os.h    | 17 +++++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_bo.c |  8 ++++++--
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
index d0ced94ca54c..274b4460bb03 100644
--- a/drivers/gpu/drm/nouveau/core/os.h
+++ b/drivers/gpu/drm/nouveau/core/os.h
@@ -38,4 +38,21 @@
 #endif /* def __BIG_ENDIAN else */
 #endif /* !ioread32_native */
 
+#if defined(__arm__)
+
+#define nv_cpu_cache_flush_area(va, size)	\
+do {						\
+	phys_addr_t pa = virt_to_phys(va);	\
+	__cpuc_flush_dcache_area(va, size);	\
+	outer_flush_range(pa, pa + size);	\
+} while (0)
+
+#else
+
+#define nv_cpu_cache_flush_area(va, size)	\
+do {						\
+} while (0)
+
+#endif /* defined(__arm__) */
+
 #endif
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 0886f47e5244..b9c9729c5733 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
 	mem = &mem[index];
 	if (is_iomem)
 		iowrite16_native(val, (void __force __iomem *)mem);
-	else
+	else {
 		*mem = val;
+		nv_cpu_cache_flush_area(mem, 2);
+	}
 }
 
 u32
@@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
 	mem = &mem[index];
 	if (is_iomem)
 		iowrite32_native(val, (void __force __iomem *)mem);
-	else
+	else {
 		*mem = val;
+		nv_cpu_cache_flush_area(mem, 4);
+	}
 }
 
 static struct ttm_tt *
-- 
1.9.2

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH 2/4] drm/ttm: introduce dma cache sync helpers
  2014-05-19  7:10     ` Alexandre Courbot
@ 2014-05-19  8:33       ` Thierry Reding
  -1 siblings, 0 replies; 41+ messages in thread
From: Thierry Reding @ 2014-05-19  8:33 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: gnurou, nouveau, linux-kernel, dri-devel, Ben Skeggs, linux-tegra


On Mon, May 19, 2014 at 04:10:56PM +0900, Alexandre Courbot wrote:
> From: Lucas Stach <dev@lynxeye.de>
> 
> On arches with non-coherent PCI,

I guess since this applies to gk20a

> we need to flush caches ourselfes at

"ourselves". Or perhaps even reword to something like: "..., caches need
to be flushed and invalidated explicitly", since dma_sync_for_cpu() does
invalidate rather than flush.

> the appropriate places. Introduce two small helpers to make things easy
> for TTM based drivers.
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> ---
>  drivers/gpu/drm/ttm/ttm_tt.c    | 25 +++++++++++++++++++++++++
>  include/drm/ttm/ttm_bo_driver.h | 28 ++++++++++++++++++++++++++++
>  2 files changed, 53 insertions(+)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
[...]
> +void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
> +				      struct device *dev)
> +{
> +	int i;

This should probably be unsigned long to match the type of
ttm_dma->ttm.num_pages.
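
Applied to the helper in the patch, that fix might look like this (a sketch only; apart from the index type nothing changes):

```c
/* Sketch: the same loop as in the patch, with the index type matching num_pages. */
void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
				      struct device *dev)
{
	unsigned long i;	/* same type as ttm_dma->ttm.num_pages */

	for (i = 0; i < ttm_dma->ttm.num_pages; i++)
		dma_sync_single_for_device(dev, ttm_dma->dma_address[i],
					   PAGE_SIZE, DMA_TO_DEVICE);
}
```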

Thierry

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/4] drm/nouveau: hook up cache sync functions
  2014-05-19  7:10   ` Alexandre Courbot
@ 2014-05-19  8:46       ` Thierry Reding
  -1 siblings, 0 replies; 41+ messages in thread
From: Thierry Reding @ 2014-05-19  8:46 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: David Airlie, nouveau, linux-kernel, dri-devel, Ben Skeggs, linux-tegra


On Mon, May 19, 2014 at 04:10:57PM +0900, Alexandre Courbot wrote:
> From: Lucas Stach <dev@lynxeye.de>
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> [acourbot@nvidia.com: make conditional and platform-friendly]
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

Perhaps having a proper commit message here would be good.

> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
[...]
> +#ifdef NOUVEAU_NEED_CACHE_SYNC
> +void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
> +{
> +	struct nouveau_device *device;
> +	struct ttm_tt *ttm = nvbo->bo.ttm;
> +
> +	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
> +
> +	if (nvbo->bo.ttm && nvbo->bo.ttm->caching_state == tt_cached)
> +		ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)nvbo->bo.ttm,
> +					      nv_device_base(device));

Can we be certain at this point that the struct ttm_tt is in fact a
struct ttm_dma_tt?

> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
[...]
> +#if IS_ENABLED(CONFIG_ARCH_TEGRA)
> +#define NOUVEAU_NEED_CACHE_SYNC
> +#endif

I know I gave this as an example myself when we discussed this offline,
but I'm now thinking that this might actually be better off in Kconfig.
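
If it did move to Kconfig, the check could become an ordinary config symbol, e.g. (a sketch only; the NOUVEAU_CACHE_SYNC symbol name is hypothetical and not part of this series):

```kconfig
# Hypothetical fragment for drivers/gpu/drm/nouveau/Kconfig
config NOUVEAU_CACHE_SYNC
	bool
	depends on DRM_NOUVEAU
	default y if ARCH_TEGRA
```

with the header testing #ifdef CONFIG_NOUVEAU_CACHE_SYNC instead of defining its own macro.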

> +#ifdef NOUVEAU_NEED_CACHE_SYNC
> +void nouveau_bo_sync_for_cpu(struct nouveau_bo *);
> +void nouveau_bo_sync_for_device(struct nouveau_bo *);
> +#else
> +static inline void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *)
> +{
> +}
> +
> +static inline void
> +nouveau_bo_sync_for_device(struct nouveau_bo *)
> +{
> +}
> +#endif
> +
> +

There's a gratuituous blank line here.

> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index c90c0dc0afe8..b7e42fdc9634 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -897,7 +897,13 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
>  	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
>  	spin_unlock(&nvbo->bo.bdev->fence_lock);
>  	drm_gem_object_unreference_unlocked(gem);
> -	return ret;
> +
> +	if (ret)
> +		return ret;
> +
> +	nouveau_bo_sync_for_cpu(nvbo);
> +
> +	return 0;
>  }

This could be rewritten as:

	if (!ret)
		nouveau_bo_sync_for_cpu(nvbo);

	return ret;

Which would be slightly shorter.

On second thought, perhaps part of nouveau_gem_ioctl_cpu_prep() could be
refactored into a separate function to make this more symmetric. If we
put that in nouveau_bo.c and name it nouveau_bo_wait() for example, the
dummies can go away and both nouveau_bo_sync_for_{cpu,device}() can be
made static. I also think that's cleaner because it has both variants of
the nouveau_bo_sync_for_*() calls in the same file.

Thierry

[-- Attachment #1.2: Type: application/pgp-signature, Size: 836 bytes --]

[-- Attachment #2: Type: text/plain, Size: 181 bytes --]

_______________________________________________
Nouveau mailing list
Nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW@public.gmane.org
http://lists.freedesktop.org/mailman/listinfo/nouveau

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 3/4] drm/nouveau: hook up cache sync functions
@ 2014-05-19  8:46       ` Thierry Reding
  0 siblings, 0 replies; 41+ messages in thread
From: Thierry Reding @ 2014-05-19  8:46 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: David Airlie, Ben Skeggs, Lucas Stach, nouveau, dri-devel,
	linux-tegra, linux-kernel, gnurou

On Mon, May 19, 2014 at 04:10:57PM +0900, Alexandre Courbot wrote:
> From: Lucas Stach <dev@lynxeye.de>
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> [acourbot@nvidia.com: make conditional and platform-friendly]
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>

Perhaps having a proper commit message here would be good.

> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
[...]
> +#ifdef NOUVEAU_NEED_CACHE_SYNC
> +void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
> +{
> +	struct nouveau_device *device;
> +	struct ttm_tt *ttm = nvbo->bo.ttm;
> +
> +	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
> +
> +	if (nvbo->bo.ttm && nvbo->bo.ttm->caching_state == tt_cached)
> +		ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)nvbo->bo.ttm,
> +					      nv_device_base(device));

Can we be certain at this point that the struct ttm_tt is in fact a
struct ttm_dma_tt?

> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
[...]
> +#if IS_ENABLED(CONFIG_ARCH_TEGRA)
> +#define NOUVEAU_NEED_CACHE_SYNC
> +#endif

I know I gave this as an example myself when we discussed this offline,
but I'm now thinking that this might actually be better off in Kconfig.

> +#ifdef NOUVEAU_NEED_CACHE_SYNC
> +void nouveau_bo_sync_for_cpu(struct nouveau_bo *);
> +void nouveau_bo_sync_for_device(struct nouveau_bo *);
> +#else
> +static inline void
> +nouveau_bo_sync_for_cpu(struct nouveau_bo *)
> +{
> +}
> +
> +static inline void
> +nouveau_bo_sync_for_device(struct nouveau_bo *)
> +{
> +}
> +#endif
> +
> +

There's a gratuitous blank line here.

> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
> index c90c0dc0afe8..b7e42fdc9634 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
> @@ -897,7 +897,13 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
>  	ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
>  	spin_unlock(&nvbo->bo.bdev->fence_lock);
>  	drm_gem_object_unreference_unlocked(gem);
> -	return ret;
> +
> +	if (ret)
> +		return ret;
> +
> +	nouveau_bo_sync_for_cpu(nvbo);
> +
> +	return 0;
>  }

This could be rewritten as:

	if (!ret)
		nouveau_bo_sync_for_cpu(nvbo);

	return ret;

Which would be slightly shorter.

On second thought, perhaps part of nouveau_gem_ioctl_cpu_prep() could be
refactored into a separate function to make this more symmetric. If we
put that in nouveau_bo.c and name it nouveau_bo_wait() for example, the
dummies can go away and both nouveau_bo_sync_for_{cpu,device}() can be
made static. I also think that's cleaner because it has both variants of
the nouveau_bo_sync_for_*() calls in the same file.
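As a rough sketch of that suggestion (a compilable userspace stand-in: the struct, the TTM wait, and the sync call below are stubs, and `nouveau_bo_wait()` is the hypothetical helper name from the paragraph above, not an existing function):

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the real nouveau/TTM types and calls; in the driver this
 * would operate on struct nouveau_bo and call ttm_bo_wait(). */
struct nouveau_bo {
	bool synced;
};

static int ttm_bo_wait_stub(struct nouveau_bo *nvbo, bool intr, bool no_wait)
{
	(void)nvbo; (void)intr; (void)no_wait;
	return 0;	/* pretend the fence wait succeeded */
}

static void nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
{
	nvbo->synced = true;	/* real code: ttm_dma_tt_cache_sync_for_cpu() */
}

/* Hypothetical helper: wait for the BO to go idle and, only on success,
 * make its contents visible to the CPU. With this living in nouveau_bo.c,
 * the sync functions can become static and the dummies disappear. */
static int nouveau_bo_wait(struct nouveau_bo *nvbo, bool intr, bool no_wait)
{
	int ret = ttm_bo_wait_stub(nvbo, intr, no_wait);

	if (!ret)
		nouveau_bo_sync_for_cpu(nvbo);

	return ret;
}
```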

Thierry

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
@ 2014-05-19  9:02         ` Thierry Reding
  0 siblings, 0 replies; 41+ messages in thread
From: Thierry Reding @ 2014-05-19  9:02 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: David Airlie, Ben Skeggs, Lucas Stach, nouveau, dri-devel,
	linux-tegra, linux-kernel, gnurou

On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> flushed for a memory write to take effect. Not doing so results in
> synchronization issues, especially after writing to BOs.

It seems to me that the above is generally true for all architectures,
not just ARM.

Also: s/explicitely/explicitly/

> This patch introduces a macro that flushes the caches on ARM and
> translates to a no-op on other architectures, and uses it when
> writing to in-memory BOs. It will also be useful for implementations of
> instmem that access shared memory directly instead of going through
> PRAMIN.

Presumably instmem can access shared memory on all architectures, so
this doesn't seem like a property of the architecture but rather of the
memory pool backing the instmem.

In that case I wonder if this shouldn't be moved into an operation that
is implemented by the backing memory pool and be a noop where the cache
doesn't need explicit flushing.

> diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
> index d0ced94ca54c..274b4460bb03 100644
> --- a/drivers/gpu/drm/nouveau/core/os.h
> +++ b/drivers/gpu/drm/nouveau/core/os.h
> @@ -38,4 +38,21 @@
>  #endif /* def __BIG_ENDIAN else */
>  #endif /* !ioread32_native */
>  
> +#if defined(__arm__)
> +
> +#define nv_cpu_cache_flush_area(va, size)	\
> +do {						\
> +	phys_addr_t pa = virt_to_phys(va);	\
> +	__cpuc_flush_dcache_area(va, size);	\
> +	outer_flush_range(pa, pa + size);	\
> +} while (0)

Couldn't this be a static inline function?
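For illustration, here is the macro rewritten as a static inline, as a compilable userspace sketch; the flush primitives and `virt_to_phys` below are stubs standing in for the kernel's `__cpuc_flush_dcache_area()`, `outer_flush_range()` and `virt_to_phys()`:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uintptr_t phys_addr_t;

/* Userspace stubs for the ARM cache maintenance primitives; the first
 * one records the size it was asked to flush. */
static size_t last_flush_size;

static void cpuc_flush_dcache_area_stub(void *va, size_t size)
{
	(void)va;
	last_flush_size = size;
}

static void outer_flush_range_stub(phys_addr_t start, phys_addr_t end)
{
	(void)start; (void)end;
}

static phys_addr_t virt_to_phys_stub(void *va)
{
	return (phys_addr_t)va;	/* identity mapping, for the sketch only */
}

/* Unlike the do/while macro, an inline function type-checks its
 * arguments and evaluates 'va' exactly once. */
static inline void nv_cpu_cache_flush_area(void *va, size_t size)
{
	phys_addr_t pa = virt_to_phys_stub(va);

	cpuc_flush_dcache_area_stub(va, size);
	outer_flush_range_stub(pa, pa + size);
}
```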

> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
[...]
> index 0886f47e5244..b9c9729c5733 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>  	mem = &mem[index];
>  	if (is_iomem)
>  		iowrite16_native(val, (void __force __iomem *)mem);
> -	else
> +	else {
>  		*mem = val;
> +		nv_cpu_cache_flush_area(mem, 2);
> +	}
>  }
>  
>  u32
> @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>  	mem = &mem[index];
>  	if (is_iomem)
>  		iowrite32_native(val, (void __force __iomem *)mem);
> -	else
> +	else {
>  		*mem = val;
> +		nv_cpu_cache_flush_area(mem, 4);
> +	}

This looks rather like a sledgehammer to me. Effectively this turns nvbo
into an uncached buffer. With additional overhead of constantly flushing
caches. Wouldn't it make more sense to locate the places where these are
called and flush the cache after all the writes have completed?
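A sketch of the batched alternative described above: do all the stores, then flush the covered range once. The writer function and the flush stub (which only counts invocations) are illustrative stand-ins, not the real nouveau API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static unsigned int flush_calls;

/* Stub for nv_cpu_cache_flush_area(); only records that a flush ran. */
static void cache_flush_area_stub(void *va, size_t size)
{
	(void)va; (void)size;
	flush_calls++;
}

/* Hypothetical batched writer: instead of flushing after every 32-bit
 * store, perform all the stores and flush the whole range once. */
static void bo_write_batch(uint32_t *mem, const uint32_t *vals, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		mem[i] = vals[i];

	cache_flush_area_stub(mem, n * sizeof(*mem));
}
```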

Thierry


* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
  2014-05-19  9:02         ` Thierry Reding
@ 2014-05-19  9:22           ` Lucas Stach
  -1 siblings, 0 replies; 41+ messages in thread
From: Lucas Stach @ 2014-05-19  9:22 UTC (permalink / raw)
  To: Thierry Reding
  Cc: gnurou, nouveau, linux-kernel, dri-devel, Ben Skeggs, linux-tegra

Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> > flushed for a memory write to take effect. Not doing so results in
> > synchronization issues, especially after writing to BOs.
> 
> It seems to me that the above is generally true for all architectures,
> not just ARM.
> 
No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
snoop the CPU caches and therefore an explicit cache flush is not
required.

> Also: s/explicitely/explicitly/
> 
> > This patch introduces a macro that flushes the caches on ARM and
> > translates to a no-op on other architectures, and uses it when
> > writing to in-memory BOs. It will also be useful for implementations of
> > instmem that access shared memory directly instead of going through
> > PRAMIN.
> 
> Presumably instmem can access shared memory on all architectures, so
> this doesn't seem like a property of the architecture but rather of the
> memory pool backing the instmem.
> 
> In that case I wonder if this shouldn't be moved into an operation that
> is implemented by the backing memory pool and be a noop where the cache
> doesn't need explicit flushing.
> 
> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
> > index d0ced94ca54c..274b4460bb03 100644
> > --- a/drivers/gpu/drm/nouveau/core/os.h
> > +++ b/drivers/gpu/drm/nouveau/core/os.h
> > @@ -38,4 +38,21 @@
> >  #endif /* def __BIG_ENDIAN else */
> >  #endif /* !ioread32_native */
> >  
> > +#if defined(__arm__)
> > +
> > +#define nv_cpu_cache_flush_area(va, size)	\
> > +do {						\
> > +	phys_addr_t pa = virt_to_phys(va);	\
> > +	__cpuc_flush_dcache_area(va, size);	\
> > +	outer_flush_range(pa, pa + size);	\
> > +} while (0)
> 
> Couldn't this be a static inline function?
> 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> [...]
> > index 0886f47e5244..b9c9729c5733 100644
> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
> >  	mem = &mem[index];
> >  	if (is_iomem)
> >  		iowrite16_native(val, (void __force __iomem *)mem);
> > -	else
> > +	else {
> >  		*mem = val;
> > +		nv_cpu_cache_flush_area(mem, 2);
> > +	}
> >  }
> >  
> >  u32
> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
> >  	mem = &mem[index];
> >  	if (is_iomem)
> >  		iowrite32_native(val, (void __force __iomem *)mem);
> > -	else
> > +	else {
> >  		*mem = val;
> > +		nv_cpu_cache_flush_area(mem, 4);
> > +	}
> 
> This looks rather like a sledgehammer to me. Effectively this turns nvbo
> into an uncached buffer. With additional overhead of constantly flushing
> caches. Wouldn't it make more sense to locate the places where these are
> called and flush the cache after all the writes have completed?
> 
I don't think the explicit flushing for those things makes sense. I
think it is a lot more effective to just map the BOs write-combined on
PCI non-coherent arches. This way any writes will be buffered. Reads
will be slow, but I don't think nouveau is reading back a lot from those
buffers.
Using the write-combining buffer doesn't need any additional
synchronization as it will get flushed on pushbuf kickoff anyways.

Regards,
Lucas
-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |


* Re: [PATCH 3/4] drm/nouveau: hook up cache sync functions
@ 2014-05-19  9:31       ` Lucas Stach
  0 siblings, 0 replies; 41+ messages in thread
From: Lucas Stach @ 2014-05-19  9:31 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: David Airlie, Ben Skeggs, Lucas Stach, Thierry Reding, gnurou,
	nouveau, linux-kernel, dri-devel, linux-tegra

Am Montag, den 19.05.2014, 16:10 +0900 schrieb Alexandre Courbot:
> From: Lucas Stach <dev@lynxeye.de>
> 
> Signed-off-by: Lucas Stach <dev@lynxeye.de>
> [acourbot@nvidia.com: make conditional and platform-friendly]
> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> ---
>  drivers/gpu/drm/nouveau/nouveau_bo.c  | 32 ++++++++++++++++++++++++++++++++
>  drivers/gpu/drm/nouveau/nouveau_bo.h  | 20 ++++++++++++++++++++
>  drivers/gpu/drm/nouveau/nouveau_gem.c |  8 +++++++-
>  3 files changed, 59 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> index b6dc85c614be..0886f47e5244 100644
> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> @@ -407,6 +407,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
>  {
>  	int ret;
>  
> +	nouveau_bo_sync_for_device(nvbo);
> +
>  	ret = ttm_bo_validate(&nvbo->bo, &nvbo->placement,
>  			      interruptible, no_wait_gpu);
>  	if (ret)
> @@ -487,6 +489,36 @@ nouveau_bo_invalidate_caches(struct ttm_bo_device *bdev, uint32_t flags)
>  	return 0;
>  }
>  
> +#ifdef NOUVEAU_NEED_CACHE_SYNC

I don't like this ifdef at all. I know calling these functions will add a
little overhead on x86 where it isn't strictly required, but I think
it's negligible.

When I looked at them, the dma_sync_single_for_[device|cpu] functions
which are called here map to just a drain of the PCI store buffer on
x86, which should be fast enough to be done unconditionally. They won't
do any time-consuming cache synchronization on PCI-coherent arches.
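The shape of the code without the ifdef might look like the sketch below, where the sync stub models dma_sync_single_for_device(): real cache maintenance on ARM, a cheap store-buffer drain on coherent arches, safe to call unconditionally either way (all names here are illustrative stand-ins):

```c
#include <assert.h>
#include <stdbool.h>

static bool device_view_current;

/* Stub modelling dma_sync_single_for_device(): on PCI-coherent arches
 * this is nearly free, on ARM it performs the actual cache flush. */
static void dma_sync_single_for_device_stub(void)
{
	device_view_current = true;
}

/* With no #ifdef, one function body serves every architecture; the
 * coherent ones just pay a negligible cost. */
static void nouveau_bo_sync_for_device(void)
{
	dma_sync_single_for_device_stub();
}
```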

Regards,
Lucas

-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |



* Re: [PATCH 3/4] drm/nouveau: hook up cache sync functions
@ 2014-05-19  9:44         ` Lucas Stach
  0 siblings, 0 replies; 41+ messages in thread
From: Lucas Stach @ 2014-05-19  9:44 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Alexandre Courbot, gnurou, nouveau, linux-kernel, dri-devel,
	Ben Skeggs, linux-tegra

Am Montag, den 19.05.2014, 10:46 +0200 schrieb Thierry Reding:
> On Mon, May 19, 2014 at 04:10:57PM +0900, Alexandre Courbot wrote:
> > From: Lucas Stach <dev@lynxeye.de>
> > 
> > Signed-off-by: Lucas Stach <dev@lynxeye.de>
> > [acourbot@nvidia.com: make conditional and platform-friendly]
> > Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
> 
> Perhaps having a proper commit message here would be good.
> 
> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> [...]
> > +#ifdef NOUVEAU_NEED_CACHE_SYNC
> > +void
> > +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
> > +{
> > +	struct nouveau_device *device;
> > +	struct ttm_tt *ttm = nvbo->bo.ttm;
> > +
> > +	device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
> > +
> > +	if (nvbo->bo.ttm && nvbo->bo.ttm->caching_state == tt_cached)
> > +		ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)nvbo->bo.ttm,
> > +					      nv_device_base(device));
> 
> Can we be certain at this point that the struct ttm_tt is in fact a
> struct ttm_dma_tt?

Yes. For all cases except AGP (where things are mapped WC anyway, so
caching_state is always != cached), nouveau_bo_driver uses
nouveau_ttm_tt_create() as its ttm_tt_create callback. This in turn
calls nouveau_sgdma_create_ttm(), which uses ttm_dma_tt_init()
unconditionally.
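This argument works because struct ttm_dma_tt embeds a struct ttm_tt as its first member, so the `(struct ttm_dma_tt *)` cast is really a container downcast. A minimal illustration of that pattern, using simplified stand-in structs rather than the real TTM definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins: ttm_dma_tt embeds ttm_tt at offset 0, so a
 * ttm_tt pointer that originated from a ttm_dma_tt casts back safely. */
struct ttm_tt {
	int caching_state;
};

struct ttm_dma_tt {
	struct ttm_tt ttm;	/* must be the first member */
	void *dma_address;	/* extra DMA bookkeeping */
};

/* What nouveau_bo_sync_for_cpu() relies on: as long as every ttm_tt in
 * this driver was created inside a ttm_dma_tt (via ttm_dma_tt_init()),
 * the downcast recovers the containing structure. */
static struct ttm_dma_tt *to_ttm_dma_tt(struct ttm_tt *ttm)
{
	return (struct ttm_dma_tt *)ttm;
}
```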

Regards,
Lucas
-- 
Pengutronix e.K.             | Lucas Stach                 |
Industrial Linux Solutions   | http://www.pengutronix.de/  |



* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
@ 2014-05-19 10:03               ` Thierry Reding
  0 siblings, 0 replies; 41+ messages in thread
From: Thierry Reding @ 2014-05-19 10:03 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Alexandre Courbot, gnurou, nouveau, linux-kernel, dri-devel,
	Ben Skeggs, linux-tegra

On Mon, May 19, 2014 at 11:22:11AM +0200, Lucas Stach wrote:
> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
> > On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> > > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> > > flushed for a memory write to take effect. Not doing so results in
> > > synchronization issues, especially after writing to BOs.
> > 
> > It seems to me that the above is generally true for all architectures,
> > not just ARM.
> > 
> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
> snoop the CPU caches and therefore an explicit cache flush is not
> required.

I was criticizing the wording in the commit message. Perhaps it could be
enhanced with what you just said.

> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > [...]
> > > index 0886f47e5244..b9c9729c5733 100644
> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
> > > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
> > >  	mem = &mem[index];
> > >  	if (is_iomem)
> > >  		iowrite16_native(val, (void __force __iomem *)mem);
> > > -	else
> > > +	else {
> > >  		*mem = val;
> > > +		nv_cpu_cache_flush_area(mem, 2);
> > > +	}
> > >  }
> > >  
> > >  u32
> > > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
> > >  	mem = &mem[index];
> > >  	if (is_iomem)
> > >  		iowrite32_native(val, (void __force __iomem *)mem);
> > > -	else
> > > +	else {
> > >  		*mem = val;
> > > +		nv_cpu_cache_flush_area(mem, 4);
> > > +	}
> > 
> > This looks rather like a sledgehammer to me. Effectively this turns nvbo
> > into an uncached buffer. With additional overhead of constantly flushing
> > caches. Wouldn't it make more sense to locate the places where these are
> > called and flush the cache after all the writes have completed?
> > 
> I don't think the explicit flushing for those things makes sense. I
> think it is a lot more effective to just map the BOs write-combined on
> PCI non-coherent arches. This way any writes will be buffered. Reads
> will be slow, but I don't think nouveau is reading back a lot from those
> buffers.
> Using the write-combining buffer doesn't need any additional
> synchronization as it will get flushed on pushbuf kickoff anyways.

Sounds good to me.

Thierry

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
  2014-05-19 10:03               ` Thierry Reding
@ 2014-05-19 10:27                 ` Daniel Vetter
  -1 siblings, 0 replies; 41+ messages in thread
From: Daniel Vetter @ 2014-05-19 10:27 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Lucas Stach, nouveau, linux-kernel, dri-devel, Ben Skeggs, linux-tegra

On Mon, May 19, 2014 at 12:03:17PM +0200, Thierry Reding wrote:
> On Mon, May 19, 2014 at 11:22:11AM +0200, Lucas Stach wrote:
> > Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
> > > On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
> > > > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
> > > > flushed for a memory write to take effect. Not doing so results in
> > > > synchronization issues, especially after writing to BOs.
> > > 
> > > It seems to me that the above is generally true for all architectures,
> > > not just ARM.
> > > 
> > No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
> > snoop the CPU caches and therefore an explicit cache flush is not
> > required.
> 
> I was criticizing the wording in the commit message. Perhaps it could be
> enhanced with what you just said.

Shouldn't this be done in the dma mapping layer? I know that i915 does all
the cpu cache flushing itself, but that's because the x86 dma layer
refuses to believe that there are non-coherent platforms on x86. But on
arm it can cope.

This is somewhat important for dma-buf buffer sharing since if the cpu
cache control is done in drivers you must do double-flushing on shared
buffers. Atm you have to do that anyway, but at least this would make it
easier. The other problem is that ttm reinvents half of the dma mapping
functions.

Just my 2 cents.
-Daniel
-- 
Daniel Vetter
Software Engineer, Intel Corporation
+41 (0) 79 365 57 48 - http://blog.ffwll.ch

* Re: [PATCH 2/4] drm/ttm: introduce dma cache sync helpers
  2014-05-19  8:33       ` Thierry Reding
@ 2014-05-23  5:49         ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-23  5:49 UTC (permalink / raw)
  To: Thierry Reding
  Cc: David Airlie, nouveau, Linux Kernel Mailing List, dri-devel,
	Ben Skeggs, linux-tegra

On Mon, May 19, 2014 at 5:33 PM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Mon, May 19, 2014 at 04:10:56PM +0900, Alexandre Courbot wrote:
>> From: Lucas Stach <dev@lynxeye.de>
>>
>> On arches with non-coherent PCI,
>
> I guess since this applies to gk20a
>
>> we need to flush caches ourselfes at
>
> "ourselves". Or perhaps even reword to something like: "..., caches need
> to be flushed and invalidated explicitly", since dma_sync_for_cpu() does
> invalidate rather than flush.

Rephrased as "On arches for which access to GPU memory is non-coherent, caches
need to be flushed and invalidated explicitly at the appropriate places."

>
>> the appropriate places. Introduce two small helpers to make things easy
>> for TTM based drivers.
>>
>> Signed-off-by: Lucas Stach <dev@lynxeye.de>
>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>> ---
>>  drivers/gpu/drm/ttm/ttm_tt.c    | 25 +++++++++++++++++++++++++
>>  include/drm/ttm/ttm_bo_driver.h | 28 ++++++++++++++++++++++++++++
>>  2 files changed, 53 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/ttm/ttm_tt.c b/drivers/gpu/drm/ttm/ttm_tt.c
> [...]
>> +void ttm_dma_tt_cache_sync_for_device(struct ttm_dma_tt *ttm_dma,
>> +                                   struct device *dev)
>> +{
>> +     int i;
>
> This should probably be unsigned long to match the type of
> ttm_dma->ttm.num_pages.

Fixed.

Thanks,
Alex.

* Re: [PATCH 3/4] drm/nouveau: hook up cache sync functions
  2014-05-19  8:46       ` Thierry Reding
@ 2014-05-23  6:00         ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-23  6:00 UTC (permalink / raw)
  To: Thierry Reding
  Cc: David Airlie, nouveau, Linux Kernel Mailing List, dri-devel,
	Ben Skeggs, linux-tegra

On Mon, May 19, 2014 at 5:46 PM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Mon, May 19, 2014 at 04:10:57PM +0900, Alexandre Courbot wrote:
>> From: Lucas Stach <dev@lynxeye.de>
>>
>> Signed-off-by: Lucas Stach <dev@lynxeye.de>
>> [acourbot@nvidia.com: make conditional and platform-friendly]
>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>
> Perhaps having a proper commit message here would be good.
>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
> [...]
>> +#ifdef NOUVEAU_NEED_CACHE_SYNC
>> +void
>> +nouveau_bo_sync_for_cpu(struct nouveau_bo *nvbo)
>> +{
>> +     struct nouveau_device *device;
>> +     struct ttm_tt *ttm = nvbo->bo.ttm;
>> +
>> +     device = nouveau_dev(nouveau_bdev(ttm->bdev)->dev);
>> +
>> +     if (nvbo->bo.ttm && nvbo->bo.ttm->caching_state == tt_cached)
>> +             ttm_dma_tt_cache_sync_for_cpu((struct ttm_dma_tt *)nvbo->bo.ttm,
>> +                                           nv_device_base(device));
>
> Can we be certain at this point that the struct ttm_tt is in fact a
> struct ttm_dma_tt?
>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
> [...]
>> +#if IS_ENABLED(CONFIG_ARCH_TEGRA)
>> +#define NOUVEAU_NEED_CACHE_SYNC
>> +#endif
>
> I know I gave this as an example myself when we discussed this offline,
> but I'm now thinking that this might actually be better off in Kconfig.
>
>> +#ifdef NOUVEAU_NEED_CACHE_SYNC
>> +void nouveau_bo_sync_for_cpu(struct nouveau_bo *);
>> +void nouveau_bo_sync_for_device(struct nouveau_bo *);
>> +#else
>> +static inline void
>> +nouveau_bo_sync_for_cpu(struct nouveau_bo *)
>> +{
>> +}
>> +
>> +static inline void
>> +nouveau_bo_sync_for_device(struct nouveau_bo *)
>> +{
>> +}
>> +#endif
>> +
>> +
>
> There's a gratuitous blank line here.

Fixed.

>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_gem.c b/drivers/gpu/drm/nouveau/nouveau_gem.c
>> index c90c0dc0afe8..b7e42fdc9634 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_gem.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_gem.c
>> @@ -897,7 +897,13 @@ nouveau_gem_ioctl_cpu_prep(struct drm_device *dev, void *data,
>>       ret = ttm_bo_wait(&nvbo->bo, true, true, no_wait);
>>       spin_unlock(&nvbo->bo.bdev->fence_lock);
>>       drm_gem_object_unreference_unlocked(gem);
>> -     return ret;
>> +
>> +     if (ret)
>> +             return ret;
>> +
>> +     nouveau_bo_sync_for_cpu(nvbo);
>> +
>> +     return 0;
>>  }
>
> This could be rewritten as:
>
>         if (!ret)
>                 nouveau_bo_sync_for_cpu(nvbo);
>
>         return ret;
>
> Which would be slightly shorter.

I prefer to have a clear, easy to read code flow here by keeping
error-handling within conditions (and not the other way round). This
kind of optimization is very well done by the compiler.

>
> On second thought, perhaps part of nouveau_gem_ioctl_cpu_prep() could be
> refactored into a separate function to make this more symmetric. If we
> put that in nouveau_bo.c and name it nouveau_bo_wait() for example, the
> dummies can go away and both nouveau_bo_sync_for_{cpu,device}() can be
> made static. I also think that's cleaner because it has both variants of
> the nouveau_bo_sync_for_*() calls in the same file.

Yep, agreed. I will give it a try in the next version of the series.

Thanks,
Alex.

* Re: [PATCH 3/4] drm/nouveau: hook up cache sync functions
  2014-05-19  9:31       ` Lucas Stach
@ 2014-05-23  6:01           ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-23  6:01 UTC (permalink / raw)
  To: Lucas Stach
  Cc: David Airlie, nouveau, Linux Kernel Mailing List, dri-devel,
	Ben Skeggs, linux-tegra

On Mon, May 19, 2014 at 6:31 PM, Lucas Stach <l.stach@pengutronix.de> wrote:
> Am Montag, den 19.05.2014, 16:10 +0900 schrieb Alexandre Courbot:
>> From: Lucas Stach <dev@lynxeye.de>
>>
>> Signed-off-by: Lucas Stach <dev@lynxeye.de>
>> [acourbot@nvidia.com: make conditional and platform-friendly]
>> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com>
>> ---
>>  drivers/gpu/drm/nouveau/nouveau_bo.c  | 32 ++++++++++++++++++++++++++++++++
>>  drivers/gpu/drm/nouveau/nouveau_bo.h  | 20 ++++++++++++++++++++
>>  drivers/gpu/drm/nouveau/nouveau_gem.c |  8 +++++++-
>>  3 files changed, 59 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> index b6dc85c614be..0886f47e5244 100644
>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> @@ -407,6 +407,8 @@ nouveau_bo_validate(struct nouveau_bo *nvbo, bool interruptible,
>>  {
>>       int ret;
>>
>> +     nouveau_bo_sync_for_device(nvbo);
>> +
>>       ret = ttm_bo_validate(&nvbo->bo, &nvbo->placement,
>>                             interruptible, no_wait_gpu);
>>       if (ret)
>> @@ -487,6 +489,36 @@ nouveau_bo_invalidate_caches(struct ttm_bo_device *bdev, uint32_t flags)
>>       return 0;
>>  }
>>
>> +#ifdef NOUVEAU_NEED_CACHE_SYNC
>
> I don't like this ifdef at all. I know calling these functions will add a
> little overhead to x86 where it isn't strictly required, but I think
> it's negligible.
>
> When I looked at them, the dma_sync_single_for_[device|cpu] functions
> which are called here map out to just a drain of the PCI store buffer on
> x86, which should be fast enough to be done unconditionally. They won't
> do any time-consuming cache synchronization on PCI coherent arches.

If Ben agrees with it I am also perfectly fine with getting rid of this #ifdef.
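An alternative to dropping the #ifdef entirely is Thierry's earlier suggestion of moving the condition into Kconfig. A rough sketch of what that fragment might look like (the symbol name is hypothetical, not from the series):

```kconfig
config NOUVEAU_CACHE_SYNC
	bool
	depends on DRM_NOUVEAU
	default y if ARCH_TEGRA
```

The driver code would then test `IS_ENABLED(CONFIG_NOUVEAU_CACHE_SYNC)` instead of defining its own NOUVEAU_NEED_CACHE_SYNC macro in nouveau_bo.h.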

* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
  2014-05-19 10:03               ` Thierry Reding
@ 2014-05-23  6:58                 ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-05-23  6:58 UTC (permalink / raw)
  To: Thierry Reding
  Cc: Lucas Stach, Alexandre Courbot, nouveau, Linux Kernel Mailing List,
	dri-devel, Ben Skeggs, linux-tegra

On Mon, May 19, 2014 at 7:03 PM, Thierry Reding
<thierry.reding@gmail.com> wrote:
> On Mon, May 19, 2014 at 11:22:11AM +0200, Lucas Stach wrote:
>> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>> > On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>> > > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>> > > flushed for a memory write to take effect. Not doing so results in
>> > > synchronization issues, especially after writing to BOs.
>> >
>> > It seems to me that the above is generally true for all architectures,
>> > not just ARM.
>> >
>> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
>> snoop the CPU caches and therefore an explicit cache flush is not
>> required.
>
> I was criticizing the wording in the commit message. Perhaps it could be
> enhanced with what you just said.
>
>> > > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > [...]
>> > > index 0886f47e5244..b9c9729c5733 100644
>> > > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>> > >   mem = &mem[index];
>> > >   if (is_iomem)
>> > >           iowrite16_native(val, (void __force __iomem *)mem);
>> > > - else
>> > > + else {
>> > >           *mem = val;
>> > > +         nv_cpu_cache_flush_area(mem, 2);
>> > > + }
>> > >  }
>> > >
>> > >  u32
>> > > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>> > >   mem = &mem[index];
>> > >   if (is_iomem)
>> > >           iowrite32_native(val, (void __force __iomem *)mem);
>> > > - else
>> > > + else {
>> > >           *mem = val;
>> > > +         nv_cpu_cache_flush_area(mem, 4);
>> > > + }
>> >
>> > This looks rather like a sledgehammer to me. Effectively this turns nvbo
>> > into an uncached buffer. With additional overhead of constantly flushing
>> > caches. Wouldn't it make more sense to locate the places where these are
>> > called and flush the cache after all the writes have completed?
>> >
>> I don't think the explicit flushing for those things makes sense. I
>> think it is a lot more effective to just map the BOs write-combined on
>> PCI non-coherent arches. This way any writes will be buffered. Reads
>> will be slow, but I don't think nouveau is reading back a lot from those
>> buffers.
>> Using the write-combining buffer doesn't need any additional
>> synchronization as it will get flushed on pushbuf kickoff anyways.
>
> Sounds good to me.

I will need to wrap my head around TTM some more to understand how to
do this the right way, but it is true that brute-forcing in-memory BO
mappings to be WC makes the addition of nv_cpu_cache_flush_area()
unnecessary. Is that the direction we want to take with this?

* Re: [PATCH 2/4] drm/ttm: introduce dma cache sync helpers
  2014-05-23  5:49         ` Alexandre Courbot
@ 2014-05-23  7:31           ` Thierry Reding
  -1 siblings, 0 replies; 41+ messages in thread
From: Thierry Reding @ 2014-05-23  7:31 UTC (permalink / raw)
  To: Alexandre Courbot
  Cc: nouveau, Linux Kernel Mailing List, dri-devel, Ben Skeggs, linux-tegra


On Fri, May 23, 2014 at 02:49:40PM +0900, Alexandre Courbot wrote:
> On Mon, May 19, 2014 at 5:33 PM, Thierry Reding
> <thierry.reding@gmail.com> wrote:
> > On Mon, May 19, 2014 at 04:10:56PM +0900, Alexandre Courbot wrote:
> >> From: Lucas Stach <dev@lynxeye.de>
> >>
> >> On arches with non-coherent PCI,
> >
> > I guess since this applies to gk20a
> >
> >> we need to flush caches ourselfes at
> >
> > "ourselves". Or perhaps even reword to something like: "..., caches need
> > to be flushed and invalidated explicitly", since dma_sync_for_cpu() does
> > invalidate rather than flush.
> 
> Rephrased as "On arches for which access to GPU memory is non-coherent, caches
> need to be flushed and invalidated explicitly at the appropriate places."

Nit: s/arches/architectures/

Thierry


* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
  2014-05-19  9:22           ` Lucas Stach
@ 2014-06-09 10:41               ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-06-09 10:41 UTC (permalink / raw)
  To: Lucas Stach
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Linux Kernel Mailing List,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Ben Skeggs,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Mon, May 19, 2014 at 6:22 PM, Lucas Stach <l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> wrote:
> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>> > flushed for a memory write to take effect. Not doing so results in
>> > synchronization issues, especially after writing to BOs.
>>
>> It seems to me that the above is generally true for all architectures,
>> not just ARM.
>>
> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
> snoop the CPU caches and therefore an explicit cache flush is not
> required.
>
>> Also: s/explicitely/explicitly/
>>
>> > This patch introduces a macro that flushes the caches on ARM and
>> > translates to a no-op on other architectures, and uses it when
>> > writing to in-memory BOs. It will also be useful for implementations of
>> > instmem that access shared memory directly instead of going through
>> > PRAMIN.
>>
>> Presumably instmem can access shared memory on all architectures, so
>> this doesn't seem like a property of the architecture but rather of the
>> memory pool backing the instmem.
>>
>> In that case I wonder if this shouldn't be moved into an operation that
>> is implemented by the backing memory pool and be a noop where the cache
>> doesn't need explicit flushing.
>>
>> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
>> > index d0ced94ca54c..274b4460bb03 100644
>> > --- a/drivers/gpu/drm/nouveau/core/os.h
>> > +++ b/drivers/gpu/drm/nouveau/core/os.h
>> > @@ -38,4 +38,21 @@
>> >  #endif /* def __BIG_ENDIAN else */
>> >  #endif /* !ioread32_native */
>> >
>> > +#if defined(__arm__)
>> > +
>> > +#define nv_cpu_cache_flush_area(va, size)  \
>> > +do {                                               \
>> > +   phys_addr_t pa = virt_to_phys(va);      \
>> > +   __cpuc_flush_dcache_area(va, size);     \
>> > +   outer_flush_range(pa, pa + size);       \
>> > +} while (0)
>>
>> Couldn't this be a static inline function?
>>
>> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> [...]
>> > index 0886f47e5244..b9c9729c5733 100644
>> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>> >     mem = &mem[index];
>> >     if (is_iomem)
>> >             iowrite16_native(val, (void __force __iomem *)mem);
>> > -   else
>> > +   else {
>> >             *mem = val;
>> > +           nv_cpu_cache_flush_area(mem, 2);
>> > +   }
>> >  }
>> >
>> >  u32
>> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>> >     mem = &mem[index];
>> >     if (is_iomem)
>> >             iowrite32_native(val, (void __force __iomem *)mem);
>> > -   else
>> > +   else {
>> >             *mem = val;
>> > +           nv_cpu_cache_flush_area(mem, 4);
>> > +   }
>>
>> This looks rather like a sledgehammer to me. Effectively this turns nvbo
>> into an uncached buffer. With additional overhead of constantly flushing
>> caches. Wouldn't it make more sense to locate the places where these are
>> called and flush the cache after all the writes have completed?
>>
> I don't think the explicit flushing for those things makes sense. I
> think it is a lot more effective to just map the BOs write-combined on
> PCI non-coherent arches. This way any writes will be buffered. Reads
> will be slow, but I don't think nouveau is reading back a lot from those
> buffers.
> Using the write-combining buffer doesn't need any additional
> synchronization as it will get flushed on pushbuf kickoff anyways.

I tried to go that way, and something interesting happened.

What I did: remove this patch and instead set the following caching
parameters for the TTM_PL_TT case in nouveau_bo_init_mem_type():

    man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
    man->default_caching = TTM_PL_FLAG_WC;

What happened: no runtime errors, unlike what happened when caching is
enabled. However, many of the vertex and texture buffers seem to be
partially corrupted. In glmark2 the 3d models had many vertices (but
not all) at the wrong position. Note that not all the scenes ended up
being corrupted - in particular, when two consecutive scenes used the
same model, the second instance would be uncorrupted.

Forcing the caching to TTM_PL_FLAG_UNCACHED led to the same result.
What is interesting is that while data like vertices and textures got
corrupted, pushbuffers and shader programs seem to be just fine, as I
could not see any runtime error.

I don't really understand what kind of caching behavior could lead to
that. If anyone has any idea, I'd love to hear.

Thanks,
Alex.



* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
  2014-06-09 10:41               ` Alexandre Courbot
@ 2014-06-12 13:50                   ` Alexandre Courbot
  -1 siblings, 0 replies; 41+ messages in thread
From: Alexandre Courbot @ 2014-06-12 13:50 UTC (permalink / raw)
  To: Lucas Stach
  Cc: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW,
	Linux Kernel Mailing List,
	dri-devel-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Ben Skeggs,
	linux-tegra-u79uwXL29TY76Z2rM5mHXA

On Mon, Jun 9, 2014 at 7:41 PM, Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On Mon, May 19, 2014 at 6:22 PM, Lucas Stach <l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> wrote:
>> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>>> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>>> > Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>>> > flushed for a memory write to take effect. Not doing so results in
>>> > synchronization issues, especially after writing to BOs.
>>>
>>> It seems to me that the above is generally true for all architectures,
>>> not just ARM.
>>>
>> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
>> snoop the CPU caches and therefore an explicit cache flush is not
>> required.
>>
>>> Also: s/explicitely/explicitly/
>>>
>>> > This patch introduces a macro that flushes the caches on ARM and
>>> > translates to a no-op on other architectures, and uses it when
>>> > writing to in-memory BOs. It will also be useful for implementations of
>>> > instmem that access shared memory directly instead of going through
>>> > PRAMIN.
>>>
>>> Presumably instmem can access shared memory on all architectures, so
>>> this doesn't seem like a property of the architecture but rather of the
>>> memory pool backing the instmem.
>>>
>>> In that case I wonder if this shouldn't be moved into an operation that
>>> is implemented by the backing memory pool and be a noop where the cache
>>> doesn't need explicit flushing.
>>>
>>> > diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
>>> > index d0ced94ca54c..274b4460bb03 100644
>>> > --- a/drivers/gpu/drm/nouveau/core/os.h
>>> > +++ b/drivers/gpu/drm/nouveau/core/os.h
>>> > @@ -38,4 +38,21 @@
>>> >  #endif /* def __BIG_ENDIAN else */
>>> >  #endif /* !ioread32_native */
>>> >
>>> > +#if defined(__arm__)
>>> > +
>>> > +#define nv_cpu_cache_flush_area(va, size)  \
>>> > +do {                                               \
>>> > +   phys_addr_t pa = virt_to_phys(va);      \
>>> > +   __cpuc_flush_dcache_area(va, size);     \
>>> > +   outer_flush_range(pa, pa + size);       \
>>> > +} while (0)
>>>
>>> Couldn't this be a static inline function?
>>>
>>> > diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> [...]
>>> > index 0886f47e5244..b9c9729c5733 100644
>>> > --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> > +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>> > @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>>> >     mem = &mem[index];
>>> >     if (is_iomem)
>>> >             iowrite16_native(val, (void __force __iomem *)mem);
>>> > -   else
>>> > +   else {
>>> >             *mem = val;
>>> > +           nv_cpu_cache_flush_area(mem, 2);
>>> > +   }
>>> >  }
>>> >
>>> >  u32
>>> > @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>>> >     mem = &mem[index];
>>> >     if (is_iomem)
>>> >             iowrite32_native(val, (void __force __iomem *)mem);
>>> > -   else
>>> > +   else {
>>> >             *mem = val;
>>> > +           nv_cpu_cache_flush_area(mem, 4);
>>> > +   }
>>>
>>> This looks rather like a sledgehammer to me. Effectively this turns nvbo
>>> into an uncached buffer. With additional overhead of constantly flushing
>>> caches. Wouldn't it make more sense to locate the places where these are
>>> called and flush the cache after all the writes have completed?
>>>
>> I don't think the explicit flushing for those things makes sense. I
>> think it is a lot more effective to just map the BOs write-combined on
>> PCI non-coherent arches. This way any writes will be buffered. Reads
>> will be slow, but I don't think nouveau is reading back a lot from those
>> buffers.
>> Using the write-combining buffer doesn't need any additional
>> synchronization as it will get flushed on pushbuf kickoff anyways.
>
> I tried to go that way, and something interesting happened.
>
> What I did: remove this patch and instead set the following caching
> parameters for the TTM_PL_TT case in nouveau_bo_init_mem_type():
>
>     man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
>     man->default_caching = TTM_PL_FLAG_WC;
>
> What happened: no runtime errors as what happened when caching is
> enabled. However, many of the vertex and texture buffers seem to be
> partially corrupted. In glmark2 the 3d models had many vertices (but
> not all) at the wrong position. Note that not all the scenes ended up
> being corrupted - in particular, when two consecutive scenes used the
> same model, the second instance would be uncorrupted.
>
> Forcing the caching to TTM_PL_FLAG_UNCACHED led to the same result.
> What is interesting is that while data like vertices and textures got
> corrupted, pushbuffers and shader programs seem to be just fine, as I
> could not see any runtime error.

An interesting fact: if I change ttm_bo_kmap_ttm() such that kernel
mappings of BOs are always performed write-combined, and leave the
TTM_PL_TT default caching at TTM_PL_FLAG_CACHED so user-space mappings
remain cached, the corruptions just vanish. It seems that setting
user-space mappings to anything non-cached is what leads to this
puzzling behavior. Some subtlety of ARM mappings is certainly getting
past me here.

If we need to implement different policies for kernel and user-space
mappings, this might complicate things a bit, especially since support
needs to be in TTM and not only Nouveau. I will submit an RFC tomorrow
if I don't hear better ideas by then.

Alex.



* Re: [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro
       [not found]                   ` <CAAVeFuJYe5wVH_gTok80hT=4GbwhYq4C9c7S5No_V11qjs3brQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2014-06-12 18:15                     ` Roy Spliet
  0 siblings, 0 replies; 41+ messages in thread
From: Roy Spliet @ 2014-06-12 18:15 UTC (permalink / raw)
  To: nouveau-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Perhaps a bit late, but here are some pointers I recall from my
previous experience: on ARM, the caching policy of different mappings of
the same physical memory object must always correspond perfectly. If the
userspace mapping is uncached, so should the kernel mapping be. If the
memory is allocated using get_(zeroed_)page(), the caching policy for
the returned pointer defaults to WRITE_ALLOC on SMP systems.
One useful tool for debugging cache problems is the pagewalk API
(mm/pagewalk.c), which will help you find the raw page table entry (and
thus its caching properties) for a virtual memory address.

Roy

op 12-06-14 15:50, Alexandre Courbot schreef:
> On Mon, Jun 9, 2014 at 7:41 PM, Alexandre Courbot <gnurou-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
>> On Mon, May 19, 2014 at 6:22 PM, Lucas Stach <l.stach-bIcnvbaLZ9MEGnE8C9+IrQ@public.gmane.org> wrote:
>>> Am Montag, den 19.05.2014, 11:02 +0200 schrieb Thierry Reding:
>>>> On Mon, May 19, 2014 at 04:10:58PM +0900, Alexandre Courbot wrote:
>>>>> Some architectures (e.g. ARM) need the CPU buffers to be explicitely
>>>>> flushed for a memory write to take effect. Not doing so results in
>>>>> synchronization issues, especially after writing to BOs.
>>>> It seems to me that the above is generally true for all architectures,
>>>> not just ARM.
>>>>
>>> No, on PCI coherent arches, like x86 and some PowerPCs, the GPU will
>>> snoop the CPU caches and therefore an explicit cache flush is not
>>> required.
>>>
>>>> Also: s/explicitely/explicitly/
>>>>
>>>>> This patch introduces a macro that flushes the caches on ARM and
>>>>> translates to a no-op on other architectures, and uses it when
>>>>> writing to in-memory BOs. It will also be useful for implementations of
>>>>> instmem that access shared memory directly instead of going through
>>>>> PRAMIN.
>>>> Presumably instmem can access shared memory on all architectures, so
>>>> this doesn't seem like a property of the architecture but rather of the
>>>> memory pool backing the instmem.
>>>>
>>>> In that case I wonder if this shouldn't be moved into an operation that
>>>> is implemented by the backing memory pool and be a noop where the cache
>>>> doesn't need explicit flushing.
>>>>
>>>>> diff --git a/drivers/gpu/drm/nouveau/core/os.h b/drivers/gpu/drm/nouveau/core/os.h
>>>>> index d0ced94ca54c..274b4460bb03 100644
>>>>> --- a/drivers/gpu/drm/nouveau/core/os.h
>>>>> +++ b/drivers/gpu/drm/nouveau/core/os.h
>>>>> @@ -38,4 +38,21 @@
>>>>>   #endif /* def __BIG_ENDIAN else */
>>>>>   #endif /* !ioread32_native */
>>>>>
>>>>> +#if defined(__arm__)
>>>>> +
>>>>> +#define nv_cpu_cache_flush_area(va, size)  \
>>>>> +do {                                               \
>>>>> +   phys_addr_t pa = virt_to_phys(va);      \
>>>>> +   __cpuc_flush_dcache_area(va, size);     \
>>>>> +   outer_flush_range(pa, pa + size);       \
>>>>> +} while (0)
>>>> Couldn't this be a static inline function?
>>>>
>>>>> diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>> [...]
>>>>> index 0886f47e5244..b9c9729c5733 100644
>>>>> --- a/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>>> +++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
>>>>> @@ -437,8 +437,10 @@ nouveau_bo_wr16(struct nouveau_bo *nvbo, unsigned index, u16 val)
>>>>>      mem = &mem[index];
>>>>>      if (is_iomem)
>>>>>              iowrite16_native(val, (void __force __iomem *)mem);
>>>>> -   else
>>>>> +   else {
>>>>>              *mem = val;
>>>>> +           nv_cpu_cache_flush_area(mem, 2);
>>>>> +   }
>>>>>   }
>>>>>
>>>>>   u32
>>>>> @@ -461,8 +463,10 @@ nouveau_bo_wr32(struct nouveau_bo *nvbo, unsigned index, u32 val)
>>>>>      mem = &mem[index];
>>>>>      if (is_iomem)
>>>>>              iowrite32_native(val, (void __force __iomem *)mem);
>>>>> -   else
>>>>> +   else {
>>>>>              *mem = val;
>>>>> +           nv_cpu_cache_flush_area(mem, 4);
>>>>> +   }
>>>> This looks rather like a sledgehammer to me. Effectively this turns nvbo
>>>> into an uncached buffer. With additional overhead of constantly flushing
>>>> caches. Wouldn't it make more sense to locate the places where these are
>>>> called and flush the cache after all the writes have completed?
>>>>
>>> I don't think the explicit flushing for those things makes sense. I
>>> think it is a lot more effective to just map the BOs write-combined on
>>> PCI non-coherent arches. This way any writes will be buffered. Reads
>>> will be slow, but I don't think nouveau is reading back a lot from those
>>> buffers.
>>> Using the write-combining buffer doesn't need any additional
>>> synchronization as it will get flushed on pushbuf kickoff anyways.
>> I tried to go that way, and something interesting happened.
>>
>> What I did: remove this patch and instead set the following caching
>> parameters for the TTM_PL_TT case in nouveau_bo_init_mem_type():
>>
>>      man->available_caching = TTM_PL_FLAG_UNCACHED | TTM_PL_FLAG_WC;
>>      man->default_caching = TTM_PL_FLAG_WC;
>>
>> What happened: no runtime errors as what happened when caching is
>> enabled. However, many of the vertex and texture buffers seem to be
>> partially corrupted. In glmark2 the 3d models had many vertices (but
>> not all) at the wrong position. Note that not all the scenes ended up
>> being corrupted - in particular, when two consecutive scenes used the
>> same model, the second instance would be uncorrupted.
>>
>> Forcing the caching to TTM_PL_FLAG_UNCACHED led to the same result.
>> What is interesting is that while data like vertices and textures got
>> corrupted, pushbuffers and shader programs seem to be just fine, as I
>> could not see any runtime error.
> An interesting fact: if I change ttm_bo_kmap_ttm() such as kernel
> mappings of BOs are always performed write-combined, and leave the
> TTM_PL_TT default caching to TTM_PL_FLAG_CACHED so user-space mappings
> remain cached, the corruptions just vanish. It seems to be the fact of
> setting user-space mappings to anything non-cached that leads to this
> puzzling behavior. Certainly some subtlety of ARM mappings are getting
> over my head here.
>
> If we need to implement different policies for kernel and user-space
> mappings, this might complicate things a bit, especially since support
> needs to be in TTM and not only Nouveau. I will submit a RFC tomorrow
> if I don't hear better ideas by then.
>
> Alex.


end of thread, other threads:[~2014-06-12 18:15 UTC | newest]

Thread overview: 41+ messages
2014-05-19  7:10 [PATCH 0/4] drm/ttm: nouveau: memory coherency fixes for ARM Alexandre Courbot
2014-05-19  7:10 ` Alexandre Courbot
2014-05-19  7:10 ` [PATCH 1/4] drm/ttm: recognize ARM arch in ioprot handler Alexandre Courbot
2014-05-19  7:10   ` Alexandre Courbot
2014-05-19  7:10 ` [PATCH 3/4] drm/nouveau: hook up cache sync functions Alexandre Courbot
2014-05-19  7:10   ` Alexandre Courbot
     [not found]   ` <1400483458-9648-4-git-send-email-acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-05-19  8:46     ` Thierry Reding
2014-05-19  8:46       ` Thierry Reding
2014-05-19  9:44       ` Lucas Stach
2014-05-19  9:44         ` Lucas Stach
2014-05-23  6:00       ` Alexandre Courbot
2014-05-23  6:00         ` Alexandre Courbot
2014-05-19  9:31     ` Lucas Stach
2014-05-19  9:31       ` Lucas Stach
     [not found]       ` <1400491887.8467.15.camel-WzVe3FnzCwFR6QfukMTsflXZhhPuCNm+@public.gmane.org>
2014-05-23  6:01         ` Alexandre Courbot
2014-05-23  6:01           ` Alexandre Courbot
     [not found] ` <1400483458-9648-1-git-send-email-acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-05-19  7:10   ` [PATCH 2/4] drm/ttm: introduce dma cache sync helpers Alexandre Courbot
2014-05-19  7:10     ` Alexandre Courbot
2014-05-19  8:33     ` Thierry Reding
2014-05-19  8:33       ` Thierry Reding
2014-05-23  5:49       ` Alexandre Courbot
2014-05-23  5:49         ` Alexandre Courbot
2014-05-23  7:31         ` Thierry Reding
2014-05-23  7:31           ` Thierry Reding
2014-05-19  7:10   ` [PATCH 4/4] drm/nouveau: introduce CPU cache flushing macro Alexandre Courbot
2014-05-19  7:10     ` Alexandre Courbot
     [not found]     ` <1400483458-9648-5-git-send-email-acourbot-DDmLM1+adcrQT0dZR+AlfA@public.gmane.org>
2014-05-19  9:02       ` Thierry Reding
2014-05-19  9:02         ` Thierry Reding
2014-05-19  9:22         ` Lucas Stach
2014-05-19  9:22           ` Lucas Stach
     [not found]           ` <1400491331.8467.8.camel-WzVe3FnzCwFR6QfukMTsflXZhhPuCNm+@public.gmane.org>
2014-05-19 10:03             ` Thierry Reding
2014-05-19 10:03               ` Thierry Reding
2014-05-19 10:27               ` Daniel Vetter
2014-05-19 10:27                 ` [Nouveau] " Daniel Vetter
2014-05-23  6:58               ` Alexandre Courbot
2014-05-23  6:58                 ` Alexandre Courbot
2014-06-09 10:41             ` Alexandre Courbot
2014-06-09 10:41               ` Alexandre Courbot
     [not found]               ` <CAAVeFu+KZ9AqB5ji5-AA+qzEFDWd7y0=J1eSEPqQ-OyhmXufig-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-06-12 13:50                 ` Alexandre Courbot
2014-06-12 13:50                   ` Alexandre Courbot
     [not found]                   ` <CAAVeFuJYe5wVH_gTok80hT=4GbwhYq4C9c7S5No_V11qjs3brQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2014-06-12 18:15                     ` Roy Spliet
