* "Fixes" for page flipping under PRIME on AMD & nouveau
@ 2016-08-17 16:12 Mario Kleiner
  2016-08-17 16:12 ` [PATCH 1/2] drm/nouveau: Fix pageflipping of PRIME imported scanout bo's Mario Kleiner
                   ` (3 more replies)
  0 siblings, 4 replies; 24+ messages in thread
From: Mario Kleiner @ 2016-08-17 16:12 UTC (permalink / raw)
  To: dri-devel; +Cc: michel.daenzer, jglisse, bskeggs, alexander.deucher, airlied

Hi,

I spent some time playing with DRI3/Present + PRIME to test how
well page flipping works for Optimus/Enduro style setups on the
current kernel/mesa/xorg. I want page flipping because
neuroscience/medical applications need the reliable timing/timestamping
and tear-free presentation that we can currently only get via page
flipping, not via the copyswap path.

Intel as display gpu + nouveau for render offload worked nicely
on intel-ddx with page flipping, proper timing, dmabuf fence sync
and all.

AMD uses copy swaps because radeon/amdgpu kms can't switch the
scanout mode from tiled to linear on the fly during flips. That's
a todo in itself. For the moment I used the ati-ddx with Option
"ColorTiling/ColorTiling2D" "off" to force my pair of old Radeon
HD-5770s into linear mode so page flipping can be used for
prime. The current modesetting-ddx will use page flipping in
any case as it doesn't detect the tiling format mismatch.

nouveau uses page flips.

Turns out that prime + page flipping currently doesn't work
on nouveau and amd. The first offload rendered images from
the imported dmabufs show up properly, but then the display
is stuck alternating between the first two or three rendered
frames.

The problem is that during the pageflip ioctl we pin the
dmabuf into VRAM in preparation for scanout, then unpin it
when we are done with it at next flip, but the buffer stays
in the VRAM memory domain. Next time we flip to the buffer
again, the driver skips the DMA copy from GTT to VRAM during
pinning, because the buffer's content apparently already resides
in VRAM. Therefore it doesn't update the VRAM copy with the updated
dmabuf content in system RAM, so freshly rendered frames from the
prime export/render offload gpu never reach the display gpu and one
only sees stale images.
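
To make the failure mode concrete, here is a minimal toy model in plain C.
It is not driver code: the `toy_*` names, the struct and its fields are all
made up for illustration. The only behaviour it borrows from the description
above is that a pin into VRAM skips the copy when the bo is already
considered to be in VRAM, and that an unpin leaves the domain alone.

```c
/* Toy memory domains; the names only mirror the terms used in the mail. */
enum toy_domain { TOY_GTT, TOY_VRAM };

/* Hypothetical stand-in for an imported scanout bo, not a driver struct. */
struct toy_bo {
	enum toy_domain domain; /* where the driver believes the data lives */
	int pin_count;
	int ram_frame;          /* frame number in the dmabuf backing store */
	int vram_frame;         /* frame number held by the VRAM copy */
	int uploads;            /* number of GTT -> VRAM copies performed */
};

/* The exporting gpu renders a new frame into system RAM. */
static void toy_render(struct toy_bo *bo, int frame)
{
	bo->ram_frame = frame;
}

/* Pin for scanout: copy only if the bo is not already in VRAM. */
static void toy_pin_vram(struct toy_bo *bo)
{
	if (bo->domain != TOY_VRAM) {
		bo->vram_frame = bo->ram_frame;
		bo->domain = TOY_VRAM;
		bo->uploads++;
	}
	bo->pin_count++;
}

/* Unpin drops the pin but leaves the memory domain untouched. */
static void toy_unpin(struct toy_bo *bo)
{
	bo->pin_count--;
}

/* Flip to the bo and return the frame number that would be scanned out. */
static int toy_flip(struct toy_bo *bo)
{
	int shown;

	toy_pin_vram(bo);
	shown = bo->vram_frame;
	toy_unpin(bo); /* simplification: really happens at the next flip */
	return shown;
}
```

In this model, rendering frame 1, flipping, rendering frame 2 and flipping
again shows frame 1 both times with only one upload counted, which matches
the stale image behaviour described above.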

The attached patches for nouveau and radeon kms seem to work
reasonably well: page flipping works, the display updates tear-free,
dmabuf fence sync works, and onset timing/timestamping is correct.
They simply pin the buffer back into GTT, then unpin, to force
a move of the buffer into the GTT domain, and thereby force the
following pin to do a new copy from GTT -> VRAM. The code tries
to avoid a useless copy from VRAM -> GTT during the pin op.
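
The effect of the workaround can be sketched with a small toy model in
plain C. Everything here (the `sim_*` names, the struct, the `imported`
flag) is hypothetical and only illustrates the idea; it is not how the
drivers or ttm are structured. The move back to GTT only changes the
recorded domain, mirroring the attempt to skip the VRAM -> GTT data copy.

```c
/* Toy memory domains, named after the terms used in the mail. */
enum sim_domain { SIM_GTT, SIM_VRAM };

/* Hypothetical stand-in for a scanout bo, not a driver struct. */
struct sim_bo {
	enum sim_domain domain; /* where the driver believes the data lives */
	int ram_frame;          /* frame number in the dmabuf backing store */
	int vram_frame;         /* frame number held by the VRAM copy */
	int uploads;            /* number of GTT -> VRAM copies performed */
};

/* The exporting gpu renders a new frame into system RAM. */
static void sim_render(struct sim_bo *bo, int frame)
{
	bo->ram_frame = frame;
}

/*
 * Models the GTT pin + unpin: the bo is marked as living in GTT again.
 * No data is copied back, since the VRAM content is stale anyway.
 */
static void sim_reset_to_gtt(struct sim_bo *bo)
{
	bo->domain = SIM_GTT;
}

/* Models the VRAM pin: upload only if the bo is not already in VRAM. */
static void sim_move_to_vram(struct sim_bo *bo)
{
	if (bo->domain != SIM_VRAM) {
		bo->vram_frame = bo->ram_frame;
		bo->uploads++;
		bo->domain = SIM_VRAM;
	}
}

/* Flip to the bo and return the frame number that would be scanned out. */
static int sim_flip(struct sim_bo *bo, int imported)
{
	if (imported)
		sim_reset_to_gtt(bo); /* forces the upload below */
	sim_move_to_vram(bo);
	return bo->vram_frame;
}
```

With `imported` set, every flip in this model performs a fresh upload and
shows the latest rendered frame; with it cleared, the second flip still
shows the first frame.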

However, the approach feels very much like a hack, so I assume
this is not the proper way of doing it? I looked at what ttm has
to offer, but couldn't find anything elegant and obvious. Maybe
there is a way to evict a bo without actually copying data back
to RAM? Or to invalidate the VRAM copy as stale? Maybe I just
missed something, as I'm not very familiar with ttm.

Thoughts or suggestions?

Another insight from my hacks so far is that nouveau seems to
be fast as prime exporter/renderoffload, but rather slow as
display gpu/prime importer, as tested on a 2008 or 2009
MacBookPro dual-Nvidia laptop.

AMD, as tested with dual Radeon HD-5770s, seems to be fast as prime
importer/display gpu, but very slow as prime exporter/render offload,
e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. It seems
that Mesa's blitImage function is the slow bit here. On r600 it seems
to draw a textured triangle strip to detile the gpu renderbuffer and
copy it into GTT. As drawing a textured fullscreen quad is normally
much faster, something special seems to be going on there wrt. DMA?
However, I don't have a realistic Enduro test setup with an AMD
iGPU + dGPU, only this cobbled-together pair of HD-5770s in a MacPro,
so this could be wrong.
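
For a rough sense of scale, the 16 msecs figure can be turned into an
effective copy rate. The helper below is only a back-of-the-envelope
sketch: its name is made up, and the 4 bytes per pixel used in the note
after it is an assumption, since the pixel format is not stated above.

```c
/* Effective copy rate in MB/s, with 1 MB taken as 10^6 bytes. */
static double copy_rate_mb_per_s(unsigned width, unsigned height,
				 unsigned bytes_per_pixel, double msecs)
{
	/* Total framebuffer size in bytes. */
	double bytes = (double)width * height * bytes_per_pixel;

	return bytes / (msecs / 1000.0) / 1e6;
}
```

Under that assumption, `copy_rate_mb_per_s(1920, 1080, 4, 16.0)` works out
to roughly 518 MB/s (8294400 bytes in 16 msecs).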

thanks,
-mario

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH 1/2] drm/nouveau: Fix pageflipping of PRIME imported scanout bo's.
  2016-08-17 16:12 "Fixes" for page flipping under PRIME on AMD & nouveau Mario Kleiner
@ 2016-08-17 16:12 ` Mario Kleiner
  2016-08-17 16:12 ` [PATCH 2/2] drm/radeon: " Mario Kleiner
                   ` (2 subsequent siblings)
  3 siblings, 0 replies; 24+ messages in thread
From: Mario Kleiner @ 2016-08-17 16:12 UTC (permalink / raw)
  To: dri-devel; +Cc: michel.daenzer, jglisse, bskeggs, alexander.deucher, airlied

Scanout bo's which are dmabuf backed in RAM and
imported via prime will not update their content
with new rendering from the renderoffload gpu
once they've been flipped onto the scanout once.
The reason is that at preparation of first flip
they get pinned into VRAM, then unpinned at some
later point, but they stay in the VRAM memory domain,
so updates to the system RAM dmabuf object by the
exporting render offload gpu don't lead to updates
of the content in VRAM - it becomes stale.

For prime imported dmabufs we solve this by first
pinning the bo into GTT, which will reset the bo's
domain back to GTT, then unpinning again, so the
followup pinning into VRAM will actually upload an up
to date display buffer from dmabuf GTT backing store.

During the pinning into GTT, we skip the actual data move
from VRAM to GTT to avoid a needless bo copy of stale
image data.

Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 drivers/gpu/drm/nouveau/nouveau_bo.c      | 35 +++++++++++++++++++++++++++++--
 drivers/gpu/drm/nouveau/nouveau_bo.h      |  1 +
 drivers/gpu/drm/nouveau/nouveau_display.c | 17 +++++++++++++++
 drivers/gpu/drm/nouveau/nouveau_prime.c   |  1 +
 4 files changed, 52 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.c b/drivers/gpu/drm/nouveau/nouveau_bo.c
index 6190035..87052e4 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.c
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.c
@@ -38,6 +38,18 @@
 #include "nouveau_ttm.h"
 #include "nouveau_gem.h"
 
+static inline bool nouveau_dmabuf_skip_op(struct ttm_buffer_object *bo,
+					  struct ttm_mem_reg *new_mem)
+{
+	struct nouveau_bo *nvbo = nouveau_bo(bo);
+
+	/*
+	 * Return true if an expensive operation as part of a dmabuf
+	 * bo copy from VRAM to GTT can be skipped on this bo.
+	 */
+	return nvbo->prime_imported && new_mem && new_mem->mem_type == TTM_PL_TT;
+}
+
 /*
  * NV10-NV40 tiling helpers
  */
@@ -1026,13 +1038,15 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict, bool intr,
 	struct nouveau_channel *chan = drm->ttm.chan;
 	struct nouveau_cli *cli = (void *)chan->user.client;
 	struct nouveau_fence *fence;
+	bool skip_prime = !evict && nouveau_dmabuf_skip_op(bo, new_mem);
 	int ret;
 
 	/* create temporary vmas for the transfer and attach them to the
 	 * old nvkm_mem node, these will get cleaned up after ttm has
 	 * destroyed the ttm_mem_reg
 	 */
-	if (drm->device.info.family >= NV_DEVICE_INFO_V0_TESLA) {
+	if (drm->device.info.family >= NV_DEVICE_INFO_V0_TESLA &&
+	    !skip_prime) {
 		ret = nouveau_bo_move_prep(drm, bo, new_mem);
 		if (ret)
 			return ret;
@@ -1041,7 +1055,21 @@ nouveau_bo_move_m2mf(struct ttm_buffer_object *bo, int evict, bool intr,
 	mutex_lock_nested(&cli->mutex, SINGLE_DEPTH_NESTING);
 	ret = nouveau_fence_sync(nouveau_bo(bo), chan, true, intr);
 	if (ret == 0) {
-		ret = drm->ttm.move(chan, bo, &bo->mem, new_mem);
+		/*
+		 * For prime-imported dmabufs which are page-flipped to the
+		 * display as scanout bo's and thereby pinned into VRAM, we
+		 * need to do a pseudo-move back into GTT memory domain once
+		 * they are replaced by a new scanout bo. This enforces an
+		 * update to the new content from dmabuf storage at next flip,
+		 * otherwise we'd display a stale image. The move back into
+		 * GTT goes through most "administrative moves" of a real
+		 * bo move, but we skip the actual copy of the now stale old
+		 * image data from VRAM back to GTT dmabuf backing to save a
+		 * useless copy.
+		 */
+		if (!skip_prime)
+			ret = drm->ttm.move(chan, bo, &bo->mem, new_mem);
+
 		if (ret == 0) {
 			ret = nouveau_fence_new(chan, false, &fence);
 			if (ret == 0) {
@@ -1202,6 +1230,9 @@ nouveau_bo_move_ntfy(struct ttm_buffer_object *bo, struct ttm_mem_reg *new_mem)
 	if (bo->destroy != nouveau_bo_del_ttm)
 		return;
 
+	if (nouveau_dmabuf_skip_op(bo, new_mem))
+		return;
+
 	list_for_each_entry(vma, &nvbo->vma_list, head) {
 		if (new_mem && new_mem->mem_type != TTM_PL_SYSTEM &&
 			      (new_mem->mem_type == TTM_PL_VRAM ||
diff --git a/drivers/gpu/drm/nouveau/nouveau_bo.h b/drivers/gpu/drm/nouveau/nouveau_bo.h
index e423609..4e415e0 100644
--- a/drivers/gpu/drm/nouveau/nouveau_bo.h
+++ b/drivers/gpu/drm/nouveau/nouveau_bo.h
@@ -39,6 +39,7 @@ struct nouveau_bo {
 	int pin_refcnt;
 
 	struct ttm_bo_kmap_obj dma_buf_vmap;
+	bool prime_imported;
 };
 
 static inline struct nouveau_bo *
diff --git a/drivers/gpu/drm/nouveau/nouveau_display.c b/drivers/gpu/drm/nouveau/nouveau_display.c
index afbf557..bb49159 100644
--- a/drivers/gpu/drm/nouveau/nouveau_display.c
+++ b/drivers/gpu/drm/nouveau/nouveau_display.c
@@ -736,6 +736,22 @@ nouveau_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb,
 		return -ENOMEM;
 
 	if (new_bo != old_bo) {
+		/* Is this a scanout buffer from an imported prime dmabuf? */
+		if (new_bo->prime_imported && !new_bo->pin_refcnt) {
+			/*
+			 * Pretend it "moved out" of VRAM, so a fresh copy of
+			 * new dmabuf content from export gpu gets reuploaded
+			 * from GTT backing store when pinning into VRAM.
+			 */
+			DRM_DEBUG_PRIME("Flip to prime imported dmabuf %p\n",
+					new_bo);
+			if (nouveau_bo_pin(new_bo, TTM_PL_FLAG_TT, false))
+				DRM_ERROR("Fail gtt pin imported buf %p\n",
+					  new_bo);
+			else
+				nouveau_bo_unpin(new_bo);
+		}
+
 		ret = nouveau_bo_pin(new_bo, TTM_PL_FLAG_VRAM, true);
 		if (ret)
 			goto fail_free;
@@ -808,6 +824,7 @@ nouveau_crtc_page_flip(struct drm_crtc *crtc, struct drm_framebuffer *fb,
 	ttm_bo_unreserve(&old_bo->bo);
 	if (old_bo != new_bo)
 		nouveau_bo_unpin(old_bo);
+
 	nouveau_fence_unref(&fence);
 	return 0;
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_prime.c b/drivers/gpu/drm/nouveau/nouveau_prime.c
index a0a9704..2bd76f6 100644
--- a/drivers/gpu/drm/nouveau/nouveau_prime.c
+++ b/drivers/gpu/drm/nouveau/nouveau_prime.c
@@ -75,6 +75,7 @@ struct drm_gem_object *nouveau_gem_prime_import_sg_table(struct drm_device *dev,
 		return ERR_PTR(ret);
 
 	nvbo->valid_domains = NOUVEAU_GEM_DOMAIN_GART;
+	nvbo->prime_imported = true;
 
 	/* Initialize the embedded gem-object. We return a single gem-reference
 	 * to the caller, instead of a normal nouveau_bo ttm reference. */
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH 2/2] drm/radeon: Fix pageflipping of PRIME imported scanout bo's.
  2016-08-17 16:12 "Fixes" for page flipping under PRIME on AMD & nouveau Mario Kleiner
  2016-08-17 16:12 ` [PATCH 1/2] drm/nouveau: Fix pageflipping of PRIME imported scanout bo's Mario Kleiner
@ 2016-08-17 16:12 ` Mario Kleiner
  2016-08-17 16:27 ` "Fixes" for page flipping under PRIME on AMD & nouveau Christian König
  2016-08-18  2:23 ` Michel Dänzer
  3 siblings, 0 replies; 24+ messages in thread
From: Mario Kleiner @ 2016-08-17 16:12 UTC (permalink / raw)
  To: dri-devel; +Cc: michel.daenzer, jglisse, bskeggs, alexander.deucher, airlied

Scanout bo's which are dmabuf backed in RAM and
imported via prime will not update their content
with new rendering from the renderoffload gpu
once they've been flipped onto the scanout once.
The reason is that at preparation of first flip
they get pinned into VRAM, then unpinned at some
later point, but they stay in the VRAM memory domain,
so updates to the system RAM dmabuf object by the
exporting render offload gpu don't lead to updates
of the content in VRAM - it becomes stale.

For prime imported dmabufs we solve this by first
pinning the bo into GTT, which will reset the bo's
domain back to GTT, then unpinning again, so the
followup pinning into VRAM will actually upload an up
to date display buffer from dmabuf GTT backing store.

During the pinning into GTT, we skip the actual data move
from VRAM to GTT to avoid a needless bo copy of stale
image data.

Signed-off-by: Mario Kleiner <mario.kleiner.de@gmail.com>
---
 drivers/gpu/drm/radeon/radeon.h         |  1 +
 drivers/gpu/drm/radeon/radeon_display.c | 28 ++++++++++++++++++++++++++++
 drivers/gpu/drm/radeon/radeon_prime.c   |  1 +
 drivers/gpu/drm/radeon/radeon_ttm.c     | 14 ++++++++++++++
 4 files changed, 44 insertions(+)

diff --git a/drivers/gpu/drm/radeon/radeon.h b/drivers/gpu/drm/radeon/radeon.h
index 5633ee3..c200e8a 100644
--- a/drivers/gpu/drm/radeon/radeon.h
+++ b/drivers/gpu/drm/radeon/radeon.h
@@ -508,6 +508,7 @@ struct radeon_bo {
 	struct drm_gem_object		gem_base;
 
 	struct ttm_bo_kmap_obj		dma_buf_vmap;
+	bool				prime_imported;
 	pid_t				pid;
 
 	struct radeon_mn		*mn;
diff --git a/drivers/gpu/drm/radeon/radeon_display.c b/drivers/gpu/drm/radeon/radeon_display.c
index c3206fb..1082267 100644
--- a/drivers/gpu/drm/radeon/radeon_display.c
+++ b/drivers/gpu/drm/radeon/radeon_display.c
@@ -550,6 +550,34 @@ static int radeon_crtc_page_flip(struct drm_crtc *crtc,
 		DRM_ERROR("failed to reserve new rbo buffer before flip\n");
 		goto cleanup;
 	}
+
+	/*
+	 * Repin into GTT in case of imported prime dmabuf,
+	 * then unpin again. Restores source dmabuf location
+	 * to GTT, where the actual dmabuf backing store gets
+	 * updated by the exporting render offload gpu at swap.
+	 */
+	if (new_rbo->prime_imported) {
+		DRM_DEBUG_PRIME("Flip to prime imported dmabuf %p\n", new_rbo);
+
+		r = radeon_bo_pin(new_rbo, RADEON_GEM_DOMAIN_GTT, NULL);
+		if (unlikely(r != 0)) {
+			DRM_ERROR("failed to gtt pin buffer %p before flip\n",
+				  new_rbo);
+		}
+		else {
+			r = radeon_bo_unpin(new_rbo);
+		}
+
+		if (unlikely(r != 0)) {
+			radeon_bo_unreserve(new_rbo);
+			r = -EINVAL;
+			DRM_ERROR("failed to gtt unpin buffer %p before flip\n",
+				  new_rbo);
+			goto cleanup;
+		}
+	}
+
 	/* Only 27 bit offset for legacy CRTC */
 	r = radeon_bo_pin_restricted(new_rbo, RADEON_GEM_DOMAIN_VRAM,
 				     ASIC_IS_AVIVO(rdev) ? 0 : 1 << 27, &base);
diff --git a/drivers/gpu/drm/radeon/radeon_prime.c b/drivers/gpu/drm/radeon/radeon_prime.c
index f3609c9..693c362 100644
--- a/drivers/gpu/drm/radeon/radeon_prime.c
+++ b/drivers/gpu/drm/radeon/radeon_prime.c
@@ -69,6 +69,7 @@ struct drm_gem_object *radeon_gem_prime_import_sg_table(struct drm_device *dev,
 	ww_mutex_lock(&resv->lock, NULL);
 	ret = radeon_bo_create(rdev, attach->dmabuf->size, PAGE_SIZE, false,
 			       RADEON_GEM_DOMAIN_GTT, 0, sg, resv, &bo);
+	bo->prime_imported = true;
 	ww_mutex_unlock(&resv->lock);
 	if (ret)
 		return ERR_PTR(ret);
diff --git a/drivers/gpu/drm/radeon/radeon_ttm.c b/drivers/gpu/drm/radeon/radeon_ttm.c
index 0c00e19..87b3f59 100644
--- a/drivers/gpu/drm/radeon/radeon_ttm.c
+++ b/drivers/gpu/drm/radeon/radeon_ttm.c
@@ -256,6 +256,7 @@ static int radeon_move_blit(struct ttm_buffer_object *bo,
 			struct ttm_mem_reg *old_mem)
 {
 	struct radeon_device *rdev;
+	struct radeon_bo *rbo;
 	uint64_t old_start, new_start;
 	struct radeon_fence *fence;
 	unsigned num_pages;
@@ -296,6 +297,19 @@ static int radeon_move_blit(struct ttm_buffer_object *bo,
 	BUILD_BUG_ON((PAGE_SIZE % RADEON_GPU_PAGE_SIZE) != 0);
 
 	num_pages = new_mem->num_pages * (PAGE_SIZE / RADEON_GPU_PAGE_SIZE);
+
+	/*
+	 * Prime imported dmabuf, previously used as scanout buffer in a page
+	 * flip? If so, skip actual data move back from VRAM into GTT, as this
+	 * would only copy back stale image data.
+	 */
+	rbo = container_of(bo, struct radeon_bo, tbo);
+	if (rbo->prime_imported && old_mem->mem_type == TTM_PL_VRAM &&
+	    new_mem->mem_type == TTM_PL_TT) {
+		DRM_DEBUG_PRIME("Skip for dmabuf back-move %p.\n", rbo);
+		num_pages = 0;
+	}
+
 	fence = radeon_copy(rdev, old_start, new_start, num_pages, bo->resv);
 	if (IS_ERR(fence))
 		return PTR_ERR(fence);
-- 
2.7.0

^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 16:12 "Fixes" for page flipping under PRIME on AMD & nouveau Mario Kleiner
  2016-08-17 16:12 ` [PATCH 1/2] drm/nouveau: Fix pageflipping of PRIME imported scanout bo's Mario Kleiner
  2016-08-17 16:12 ` [PATCH 2/2] drm/radeon: " Mario Kleiner
@ 2016-08-17 16:27 ` Christian König
  2016-08-17 16:35   ` Mario Kleiner
  2016-08-18  2:23 ` Michel Dänzer
  3 siblings, 1 reply; 24+ messages in thread
From: Christian König @ 2016-08-17 16:27 UTC (permalink / raw)
  To: Mario Kleiner, dri-devel
  Cc: alexander.deucher, airlied, jglisse, michel.daenzer, bskeggs

> AMD uses copy swaps because radeon/amdgpu kms can't switch the
> scanout mode from tiled to linear on the fly during flips.
Well, I'm not an expert on this, but as far as I know the bigger problem
is that the dedicated AMD hardware generations you are targeting usually
can't reliably scan out from system memory without a rather complicated
setup.

So that is a complete NAK to the radeon changes.

Regards,
Christian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 16:27 ` "Fixes" for page flipping under PRIME on AMD & nouveau Christian König
@ 2016-08-17 16:35   ` Mario Kleiner
  2016-08-17 17:02     ` Christian König
  2016-08-17 17:43     ` Alex Deucher
  0 siblings, 2 replies; 24+ messages in thread
From: Mario Kleiner @ 2016-08-17 16:35 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: michel.daenzer, jglisse, bskeggs, alexander.deucher, airlied

On 08/17/2016 06:27 PM, Christian König wrote:
>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>> scanout mode from tiled to linear on the fly during flips.
> Well, I'm not an expert on this, but as far as I know the bigger problem
> is that the dedicated AMD hardware generations you are targeting usually
> can't reliably scan out from system memory without a rather complicated
> setup.
>
> So that is a complete NAK to the radeon changes.

Hi Christian,

thanks for the feedback, but I think that's a misunderstanding. The
patches don't make them scan out from system memory; they just enforce a
fresh copy from RAM/GTT -> VRAM before scanning out a buffer again. I
just assume there is a more elegant/clean way than this "fake" pin/unpin
to GTT to essentially tell the driver that its current VRAM content is
stale and needs a refresh from the up-to-date dmabuf in system RAM.

Btw. I'll be offline for the next few hours, just wanted to get this out
now.

thanks,
-mario

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 16:35   ` Mario Kleiner
@ 2016-08-17 17:02     ` Christian König
  2016-08-17 23:29       ` Mario Kleiner
  2016-08-17 17:43     ` Alex Deucher
  1 sibling, 1 reply; 24+ messages in thread
From: Christian König @ 2016-08-17 17:02 UTC (permalink / raw)
  To: Mario Kleiner, dri-devel
  Cc: alexander.deucher, airlied, jglisse, michel.daenzer, bskeggs

On 17.08.2016 at 18:35, Mario Kleiner wrote:
> On 08/17/2016 06:27 PM, Christian König wrote:
>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>> scanout mode from tiled to linear on the fly during flips.
>> Well, I'm not an expert on this, but as far as I know the bigger problem
>> is that the dedicated AMD hardware generations you are targeting usually
>> can't reliably scan out from system memory without a rather complicated
>> setup.
>>
>> So that is a complete NAK to the radeon changes.
>
> Hi Christian,
>
> thanks for the feedback, but I think that's a misunderstanding. The
> patches don't make them scan out from system memory; they just enforce
> a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again.
> I just assume there is a more elegant/clean way than this "fake"
> pin/unpin to GTT to essentially tell the driver that its current VRAM
> content is stale and needs a refresh from the up-to-date dmabuf in
> system RAM.

I was already wondering how the heck you got that working.

What do you mean by a fresh copy from GTT to VRAM? A buffer exported
by DMA-buf should never move as long as it is exported, same for a
buffer pinned to VRAM.

So using a DMA-buf for scanout is impossible and actually not valuable,
because it shouldn't matter if we copy from GTT to VRAM because of a
buffer migration or because of a copy triggered by the DDX.

What are you actually trying to do here?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 16:35   ` Mario Kleiner
  2016-08-17 17:02     ` Christian König
@ 2016-08-17 17:43     ` Alex Deucher
  2016-08-17 23:51       ` Mario Kleiner
  1 sibling, 1 reply; 24+ messages in thread
From: Alex Deucher @ 2016-08-17 17:43 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Daenzer, Michel, Jerome Glisse, Maling list - DRI developers,
	Ben Skeggs, Deucher, Alexander, Dave Airlie

On Wed, Aug 17, 2016 at 12:35 PM, Mario Kleiner
<mario.kleiner.de@gmail.com> wrote:
> On 08/17/2016 06:27 PM, Christian König wrote:
>>>
>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>> scanout mode from tiled to linear on the fly during flips.
>>
>> Well I'm not an expert on this, but as far as I know the bigger problem
>> is that the dedicated AMD hardware generations you are targeting usually
>> can't reliable scanout from system memory without a rather complicated
>> setup.
>>
>> So that is a complete NAK to the radeon changes.
>
>
> Hi Christian,
>
> thanks for the feedback, but i think that's a misunderstanding. The patches
> don't make them scanout from system memory, they just enforce a fresh copy
> from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there
> is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially
> tell the driver that its current VRAM content is stale and needs a refresh
> from the up to date dmabuf in system RAM.
>

I think the ddx should handle the copy rather than the kernel.  That
also takes care of the tiling.  I.e., copy from the linear shared
buffer in system memory to the tiled scanout buffer in vram.  The ddx
should also be able to take damage into account and only copy the
delta.  From a bandwidth perspective, I'm not sure how much sense
pageflipping makes since there are so many copies already.

Alex


* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 17:02     ` Christian König
@ 2016-08-17 23:29       ` Mario Kleiner
  2016-08-18  7:41         ` Christian König
  0 siblings, 1 reply; 24+ messages in thread
From: Mario Kleiner @ 2016-08-17 23:29 UTC (permalink / raw)
  To: Christian König, dri-devel
  Cc: alexander.deucher, airlied, jglisse, michel.daenzer, bskeggs

On 08/17/2016 07:02 PM, Christian König wrote:
> Am 17.08.2016 um 18:35 schrieb Mario Kleiner:
>> On 08/17/2016 06:27 PM, Christian König wrote:
>>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>>> scanout mode from tiled to linear on the fly during flips.
>>> Well I'm not an expert on this, but as far as I know the bigger problem
>>> is that the dedicated AMD hardware generations you are targeting usually
>>> can't reliable scanout from system memory without a rather complicated
>>> setup.
>>>
>>> So that is a complete NAK to the radeon changes.
>>
>> Hi Christian,
>>
>> thanks for the feedback, but i think that's a misunderstanding. The
>> patches don't make them scanout from system memory, they just enforce
>> a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again.
>> I just assume there is a more elegant/clean way than this "fake"
>> pin/unpin to GTT to essentially tell the driver that its current VRAM
>> content is stale and needs a refresh from the up to date dmabuf in
>> system RAM.
>
> I was already wondering how the heck you got that working.
>
> What do you mean with a fresh copy from GTT to VRAM? A buffer exported
> by DMA-buf should never move as long as it is exported, same for a
> buffer pinned to VRAM.
>

Under DRI3/Present, the way it is currently implemented in the X-Server 
and Mesa, the display gpu (= normally integrated one) is importing the 
dma-buf that was exported by the render offload gpu. So the actual 
dmabuf doesn't move, but just stays where it is in system RAM.

Afaiu the prime importing display gpu generates its own gem buffer 
handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather
tables to access the dmabuf in system RAM. As far as page flipping is
concerned, so far those gem buffers / radeon_bo's aren't treated any
differently from native ones. During pageflip setup they get pinned into
VRAM, which moves (=copies) their content from the RAM dmabuf backing 
store into VRAM. Then they get flipped and scanned out as usual. The 
disconnect happens when such a buffer gets flipped off the scanout (and 
unpinned) and later on page-flipped to the scanout again. Now the driver 
just reuses the bo that still likely resides in VRAM (although not 
pinned anymore) and forgets that it was associated with some dmabuf 
backing in RAM which may have updated visual content. So the exporting 
renderoffload gpu happily renders new frames into the dmabuf in ram, 
while radeon kms happily displays stale frames from its own copy in VRAM.

> So using a DMA-buf for scanout is impossible and actually not valuable
> cause it shouldn't matter if we copy from GTT to VRAM because of a
> buffer migration or because of a copy triggered by the DDX.
>
> What are you actually trying to do here?
>

Make a typical Enduro laptop with an AMD iGPU + AMD dGPU work under 
DRI3/Present, without tearing and other ugliness, e.g.,

DRI_PRIME=1 glxgears -fullscreen

-> discrete gpu renders, integrated gpu displays the rendered frames.

Currently the drivers use copies for handling the PresentPixmap
requests, which sort of works in showing the right pictures, but gives
bad tearing and undefined timing. With copies we are too slow to keep
ahead of the scanout and Present doesn't even guarantee that the copy
starts vsync'ed. So at all levels, from delays in the X-Server, Mesa's
way of doing things, command submission and the hw itself, we end up
blitting in the middle of scanout. And the presentation timing isn't
ever trustworthy for timing sensitive applications unless we present via
page flipping.
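
To make the "keep ahead of the scanout" condition concrete, here is a
minimal sketch under a deliberately simple model that is not taken from
any driver: scanout reads lines top to bottom at a constant rate once
vblank ends, and the copy writes lines top to bottom at a constant rate.
The struct, the function name and all numbers are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical timing model, not a kernel or X-Server structure.
 * All times are in microseconds, measured from the start of vblank.
 */
struct toy_scanout {
    double vblank_us;  /* duration of vertical blanking */
    double active_us;  /* time scanout needs to read all visible lines */
};

/*
 * Returns true if a top-to-bottom copy that starts at copy_start_us and
 * takes copy_duration_us always writes a line before scanout reads it.
 * Lines are treated as a continuum. Both positions advance linearly in
 * time, so it is enough to check the first and the last line.
 */
bool toy_copy_stays_ahead(const struct toy_scanout *s,
                          double copy_start_us, double copy_duration_us)
{
    assert(s->vblank_us >= 0.0 && s->active_us > 0.0);
    assert(copy_start_us >= 0.0 && copy_duration_us >= 0.0);

    /* The copy must reach the first line before scanout starts reading. */
    bool first_line_ok = copy_start_us <= s->vblank_us;
    /* The copy must reach the last line before scanout gets there. */
    bool last_line_ok = copy_start_us + copy_duration_us
                        <= s->vblank_us + s->active_us;

    return first_line_ok && last_line_ok;
}
```

With a made-up 500 us vblank and 16167 us of active scanout (roughly
60 Hz), a copy that starts 1000 us after vblank start is already behind
the beam in this model, however fast it runs.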

The hack in my patch tricks the driver into migrating the bo back to GTT 
(skipping the actual pointless data copy though) and then back into VRAM 
to force a copy of fresh content from the imported dmabuf into VRAM, so 
page flipping flips up to date content into the scanout.
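
The sequence described above can be sketched as a toy model. This is
only an illustration of the described behaviour, assuming a bo that
remembers its memory domain across unpin; it is not the radeon, nouveau
or ttm code, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model, not the radeon/nouveau/ttm API. */
enum toy_domain { TOY_DOMAIN_GTT, TOY_DOMAIN_VRAM };

struct toy_bo {
    enum toy_domain domain; /* where the driver believes the bo lives */
    int dmabuf_frame;       /* frame currently in the RAM dmabuf */
    int vram_frame;         /* frame currently in the VRAM copy */
    int pin_count;
};

/* Pin for scanout: copy GTT -> VRAM only if the bo is not in VRAM yet. */
int toy_pin_vram(struct toy_bo *bo)
{
    if (bo->domain != TOY_DOMAIN_VRAM) {
        bo->vram_frame = bo->dmabuf_frame;
        bo->domain = TOY_DOMAIN_VRAM;
    }
    bo->pin_count++;
    return bo->vram_frame; /* frame the display engine would scan out */
}

/* Unpin leaves the bo in the VRAM domain. */
void toy_unpin(struct toy_bo *bo)
{
    assert(bo->pin_count > 0);
    bo->pin_count--;
}

/* The workaround: move the bo back to GTT without copying VRAM -> GTT. */
void toy_bounce_to_gtt(struct toy_bo *bo)
{
    assert(bo->pin_count == 0);
    bo->domain = TOY_DOMAIN_GTT;
}

/*
 * One flip cycle: the offload gpu renders `frame` into the dmabuf, the
 * display gpu pins the bo, scans it out and unpins it again.
 * Returns the frame that actually reached the display.
 */
int toy_flip(struct toy_bo *bo, int frame, bool bounce)
{
    int shown;

    bo->dmabuf_frame = frame;
    if (bounce)
        toy_bounce_to_gtt(bo);
    shown = toy_pin_vram(bo);
    toy_unpin(bo);
    return shown;
}
```

In this model a second flip without the bounce still shows the first
frame, because the pin finds the bo in the VRAM domain and skips the
copy. With the bounce the pin copies again and the new frame shows up.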

-mario


* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 17:43     ` Alex Deucher
@ 2016-08-17 23:51       ` Mario Kleiner
  2016-08-18  2:32         ` Michel Dänzer
  0 siblings, 1 reply; 24+ messages in thread
From: Mario Kleiner @ 2016-08-17 23:51 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Daenzer, Michel, Jerome Glisse, Maling list - DRI developers,
	Ben Skeggs, Deucher, Alexander, Dave Airlie

On 08/17/2016 07:43 PM, Alex Deucher wrote:
> On Wed, Aug 17, 2016 at 12:35 PM, Mario Kleiner
> <mario.kleiner.de@gmail.com> wrote:
>> On 08/17/2016 06:27 PM, Christian König wrote:
>>>>
>>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>>> scanout mode from tiled to linear on the fly during flips.
>>>
>>> Well I'm not an expert on this, but as far as I know the bigger problem
>>> is that the dedicated AMD hardware generations you are targeting usually
>>> can't reliable scanout from system memory without a rather complicated
>>> setup.
>>>
>>> So that is a complete NAK to the radeon changes.
>>
>>
>> Hi Christian,
>>
>> thanks for the feedback, but i think that's a misunderstanding. The patches
>> don't make them scanout from system memory, they just enforce a fresh copy
>> from RAM/GTT -> VRAM before scanning out a buffer again. I just assume there
>> is a more elegant/clean way than this "fake" pin/unpin to GTT to essentially
>> tell the driver that its current VRAM content is stale and needs a refresh
>> from the up to date dmabuf in system RAM.
>>
>
> I think the ddx should handle the copy rather than the kernel.  That
> also takes care of the tiling.  I.e., copy from the linear shared
> buffer in system memory to the tiled scanout buffer in vram.  The ddx
> should also be able to take damage into account and only copy the
> delta.  From a bandwidth perspective, I'm not sure how much sense
> pageflipping makes since there are so many copies already.
>
> Alex

That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the 
mismatch in tiling flags and uses the DRI3/Present copy path instead of 
the pageflip path. The problem is that the server's Present
implementation doesn't request a vsync'ed start of the copy operation 
and the whole procedure is too slow to keep ahead of the scanout, so it 
tears pretty badly for many animations. Also no page flipping = no 
reliable timestamps. And the modesetting ddx doesn't handle it at all, 
as it doesn't know about the tiling mismatch.

You are right, going through page flipping doesn't save any bandwidth,
may even use more without damage handling, but it prevents tearing and 
undefined presentation timing.

So it sounds as if the bug is not that page flipping doesn't quite work 
without my hack, but that i even managed to get this far?

There is this other approach from NVidia's Alex Goins for their 
proprietary driver, whose patches landed in the X-Server 1.19 master 
branch a couple of weeks ago. I haven't read his patches in detail yet, 
and i so far couldn't successfully test them with the reference 
implementation in modesetting ddx 1.19. Afaik there the display gpu 
exports a pair of scanout friendly, page flipping compatible dmabufs (i 
assume linear, contiguous, accessible by the display engines), and the 
offload gpu imports those and renders into them. That saves one extra 
copy, so should be somewhat more efficient.

Setting it up seems to be more involved and less flexible though. So far 
i couldn't make it work here for testing. Maybe bugs, maybe mistakes on 
my side, maybe i just have the wrong hardware for it. Need to read the 
patches first in detail to understand how it is supposed to work.

-mario


* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 16:12 "Fixes" for page flipping under PRIME on AMD & nouveau Mario Kleiner
                   ` (2 preceding siblings ...)
  2016-08-17 16:27 ` "Fixes" for page flipping under PRIME on AMD & nouveau Christian König
@ 2016-08-18  2:23 ` Michel Dänzer
  2016-08-18 19:21   ` Marek Olšák
  2016-08-26 19:57   ` Mario Kleiner
  3 siblings, 2 replies; 24+ messages in thread
From: Michel Dänzer @ 2016-08-18  2:23 UTC (permalink / raw)
  To: Mario Kleiner; +Cc: alexander.deucher, airlied, jglisse, bskeggs, dri-devel

On 18/08/16 01:12 AM, Mario Kleiner wrote:
> 
> Intel as display gpu + nouveau for render offload worked nicely
> on intel-ddx with page flipping, proper timing, dmabuf fence sync
> and all.

How about with AMD instead of nouveau in this case?


> Turns out that prime + page flipping currently doesn't work
> on nouveau and amd. The first offload rendered images from
> the imported dmabufs show up properly, but then the display
> is stuck alternating between the first two or three rendered
> frames.
> 
> The problem is that during the pageflip ioctl we pin the
> dmabuf into VRAM in preparation for scanout, then unpin it
> when we are done with it at next flip, but the buffer stays
> in the VRAM memory domain.

Sounds like you found a bug here: BOs which are being shared between
different GPUs should always be pinned to GTT, moving them to VRAM (and
consequently the page flip) should fail.

The latest versions of DCE support scanning out from GTT, so that might
be a good solution at least for Carrizo and newer APUs, not sure it
makes sense for dGPUs though.


> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
> importer/display gpu, but very slow as prime exporter/render offload,
> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
> that Mesa's blitImage function is the slow bit here. On r600 it seems
> to draw a textured triangle strip to detile the gpu renderbuffer and
> copy it into GTT. As drawing a textured fullscreen quad is normally
> much faster, something special seems to be going on there wrt. DMA?

Maybe the rasterization as two triangles results in bad PCIe bandwidth
utilization. Using the asynchronous DMA engine for these transfers would
probably be ideal, but having the 3D engine rasterize a single rectangle
(either using the rectangle primitive or a large triangle with scissor)
might already help.
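
For scale, the quoted number can be turned into an effective transfer
rate. A minimal sketch, assuming 4 bytes per pixel (the thread does not
state the pixel format); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one frame in bytes. bytes_per_pixel = 4 is an assumption. */
uint64_t toy_frame_bytes(uint32_t width, uint32_t height,
                         uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Effective rate in bytes per second for moving `bytes` in `usecs`. */
uint64_t toy_bytes_per_second(uint64_t bytes, uint64_t usecs)
{
    assert(usecs > 0);
    return bytes * 1000000u / usecs;
}
```

Under that assumption a 1920x1080 frame is 8294400 bytes, and moving it
in 16 msecs works out to 518400000 bytes per second, i.e. roughly
518 MB/s.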


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 23:51       ` Mario Kleiner
@ 2016-08-18  2:32         ` Michel Dänzer
  2016-08-18  7:49           ` Christian König
  2016-08-26 20:07           ` Mario Kleiner
  0 siblings, 2 replies; 24+ messages in thread
From: Michel Dänzer @ 2016-08-18  2:32 UTC (permalink / raw)
  To: Mario Kleiner, Alex Deucher
  Cc: Deucher, Alexander, Dave Airlie, Jerome Glisse, Ben Skeggs,
	Maling list - DRI developers

On 18/08/16 08:51 AM, Mario Kleiner wrote:
>
> That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the
> mismatch in tiling flags and uses the DRI3/Present copy path instead of
> the pageflip path. The problem is that the servers Present
> implementation doesn't request a vsync'ed start of the copy operation [...]

It waits for vblank before starting the copy.


> There is this other approach from NVidia's Alex Goins for their
> proprietary driver, whose patches landed in the X-Server 1.19 master
> branch a couple of weeks ago. I haven't read his patches in detail yet,
> and i so far couldn't successfully test them with the reference
> implementation in modesetting ddx 1.19. Afaik there the display gpu
> exports a pair of scanout friendly, page flipping compatible dmabufs (i
> assume linear, contiguous, accessible by the display engines),

FWIW, that wouldn't be possible with our "older" GPUs which can't scan
out from GTT: A BO can be either shared with another GPU or scanout
friendly, not both at the same time.


> and the offload gpu imports those and renders into them. That saves
> one extra copy, so should be somewhat more efficient.

Using two shared buffers actually isn't as efficient as possible wrt
inter-GPU bandwidth.


> Setting it up seems to be more involved and less flexible though. So far
> i couldn't make it work here for testing. Maybe bugs, maybe mistakes on
> my side, maybe i just have the wrong hardware for it.

Yeah, my impression has been it's a rather complicated solution geared
towards the Intel iGPU + proprietary nVidia use case.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer


* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-17 23:29       ` Mario Kleiner
@ 2016-08-18  7:41         ` Christian König
  2016-08-18  7:52           ` Michel Dänzer
  0 siblings, 1 reply; 24+ messages in thread
From: Christian König @ 2016-08-18  7:41 UTC (permalink / raw)
  To: Mario Kleiner, dri-devel
  Cc: alexander.deucher, airlied, jglisse, michel.daenzer, bskeggs

> Afaiu the prime importing display gpu generates its own gem buffer 
> handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather 
> tables to access the dmabuf in system ram. As far as page flipping is 
> concerned, so far those gem buffers / radeon_bo's aren't treated any 
> different than native ones. During pageflip setup they get pinned into 
> VRAM, which moves (=copies) their content from the RAM dmabuf backing 
> store into VRAM. 

Your understanding isn't correct. Buffers imported using prime always 
stay in GTT, they can't be moved to VRAM.

It's the DDX which copies the buffer content from the imported prime 
handle into a native one which is enabled for scanout.

Regards,
Christian.

Am 18.08.2016 um 01:29 schrieb Mario Kleiner:
> On 08/17/2016 07:02 PM, Christian König wrote:
>> Am 17.08.2016 um 18:35 schrieb Mario Kleiner:
>>> On 08/17/2016 06:27 PM, Christian König wrote:
>>>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>>>> scanout mode from tiled to linear on the fly during flips.
>>>> Well I'm not an expert on this, but as far as I know the bigger 
>>>> problem
>>>> is that the dedicated AMD hardware generations you are targeting 
>>>> usually
>>>> can't reliably scan out from system memory without a rather complicated
>>>> setup.
>>>>
>>>> So that is a complete NAK to the radeon changes.
>>>
>>> Hi Christian,
>>>
>>> thanks for the feedback, but i think that's a misunderstanding. The
>>> patches don't make them scanout from system memory, they just enforce
>>> a fresh copy from RAM/GTT -> VRAM before scanning out a buffer again.
>>> I just assume there is a more elegant/clean way than this "fake"
>>> pin/unpin to GTT to essentially tell the driver that its current VRAM
>>> content is stale and needs a refresh from the up to date dmabuf in
>>> system RAM.
>>
>> I was already wondering how the heck you got that working.
>>
>> What do you mean with a fresh copy from GTT to VRAM? A buffer exported
>> by DMA-buf should never move as long as it is exported, same for a
>> buffer pinned to VRAM.
>>
>
> Under DRI3/Present, the way it is currently implemented in the 
> X-Server and Mesa, the display gpu (= normally integrated one) is 
> importing the dma-buf that was exported by the render offload gpu. So 
> the actual dmabuf doesn't move, but just stays where it is in system RAM.
>
> Afaiu the prime importing display gpu generates its own gem buffer 
> handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather 
> tables to access the dmabuf in system ram. As far as page flipping is 
> concerned, so far those gem buffers / radeon_bo's aren't treated any 
> different than native ones. During pageflip setup they get pinned into 
> VRAM, which moves (=copies) their content from the RAM dmabuf backing 
> store into VRAM. Then they get flipped and scanned out as usual. The 
> disconnect happens when such a buffer gets flipped off the scanout 
> (and unpinned) and later on page-flipped to the scanout again. Now the 
> driver just reuses the bo that still likely resides in VRAM (although 
> not pinned anymore) and forgets that it was associated with some 
> dmabuf backing in RAM which may have updated visual content. So the 
> exporting renderoffload gpu happily renders new frames into the dmabuf 
> in ram, while radeon kms happily displays stale frames from its own 
> copy in VRAM.
>
>> So using a DMA-buf for scanout is impossible and actually not valuable
>> because it shouldn't matter if we copy from GTT to VRAM because of a
>> buffer migration or because of a copy triggered by the DDX.
>>
>> What are you actually trying to do here?
>>
>
> Make a typical Enduro laptop with an AMD iGPU + AMD dGPU work under 
> DRI3/Present, without tearing and other ugliness, e.g.,
>
> DRI_PRIME=1 glxgears -fullscreen
>
> -> discrete gpu renders, integrated gpu displays the rendered frames.
>
> Currently the drivers use copies for handling the PresentPixmap 
> requests, which sort of works in showing the right pictures, but gives 
> bad tearing and undefined timing. With copies we are too slow to keep 
> ahead of the scanout and Present doesn't even guarantee that the copy 
> starts vsync'ed. So at all levels, from delays in the x-server, mesa's 
> way of doing things, command submission and the hw itself we end up 
> blitting in the middle of scanout. And the presentation timing isn't 
> ever trustworthy for timing sensitive applications unless we present 
> via page flipping.
>
> The hack in my patch tricks the driver into migrating the bo back to 
> GTT (skipping the actual pointless data copy though) and then back 
> into VRAM to force a copy of fresh content from the imported dmabuf 
> into VRAM, so page flipping flips up to date content into the scanout.
>
> -mario
>
>> Regards,
>> Christian.
>>
>>>
>>> Btw. i'll be offline for the next few hours, just wanted to get this
>>> out now.
>>>
>>> thanks,
>>> -mario
>>>
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 17.08.2016 um 18:12 schrieb Mario Kleiner:
>>>>> Hi,
>>>>>
>>>>> i spent some time playing with DRI3/Present + PRIME for testing
>>>>> how well it works for Optimus/Enduro style setups wrt. page flipping
>>>>> on the current kernel/mesa/xorg. I want page flipping, because
>>>>> neuroscience/medical applications need the reliable 
>>>>> timing/timestamping
>>>>> and tear free presentation we currently only can get via page
>>>>> flipping, but not the copyswap path.
>>>>>
>>>>> Intel as display gpu + nouveau for render offload worked nicely
>>>>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>>>>> and all.
>>>>>
>>>>> AMD uses copy swaps because radeon/amdgpu kms can't switch the
>>>>> scanout mode from tiled to linear on the fly during flips. That's
>>>>> a todo in itself. For the moment i used the ati-ddx with Option
>>>>> "ColorTiling/ColorTiling2D" "off" to force my pair of old Radeon
>>>>> HD-5770's into linear mode so page flipping can be used for
>>>>> prime. The current modesetting-ddx will use page flipping in
>>>>> any case as it doesn't detect the tiling format mismatch.
>>>>>
>>>>> nouveau uses page flips.
>>>>>
>>>>> Turns out that prime + page flipping currently doesn't work
>>>>> on nouveau and amd. The first offload rendered images from
>>>>> the imported dmabufs show up properly, but then the display
>>>>> is stuck alternating between the first two or three rendered
>>>>> frames.
>>>>>
>>>>> The problem is that during the pageflip ioctl we pin the
>>>>> dmabuf into VRAM in preparation for scanout, then unpin it
>>>>> when we are done with it at next flip, but the buffer stays
>>>>> in the VRAM memory domain. Next time we flip to the buffer
>>>>> again, the driver skips the DMA copy from GTT to VRAM during
>>>>> pinning, because the buffer's content apparently already resides
>>>>> in VRAM. Therefore it doesn't update the VRAM copy with the updated
>>>>> dmabuf content in system RAM, so freshly rendered frames from the
>>>>> prime export/render offload gpu never reach the display gpu and one
>>>>> only sees stale images.
>>>>>
>>>>> The attached patches for nouveau and radeon kms seem to work
>>>>> pretty ok, page flipping works, display updates, tear-free,
>>>>> dmabuf fence sync works, onset timing/timestamping is correct.
>>>>> They simply pin the buffer back into GTT, then unpin, to force
>>>>> a move of the buffer into the GTT domain, and thereby force the
>>>>> following pin to do a new copy from GTT -> VRAM. The code tries
>>>>> to avoid a useless copy from VRAM -> GTT during the pin op.
>>>>>
>>>>> However, the approach feels very much like a hack, so i assume
>>>>> this is not the proper way of doing it? I looked what ttm has
>>>>> to offer, but couldn't find anything elegant and obvious. Maybe
>>>>> there is a way to evict a bo without actually copying data back
>>>>> to RAM? Or to invalidate the VRAM copy as stale? Maybe i just
>>>>> missed something, as i'm not very familiar with ttm.
>>>>>
>>>>> Thoughts or suggestions?
>>>>>
>>>>> Another insight with my hacks is so far that nouveau seems to
>>>>> be fast as prime exporter/renderoffload, but rather slow as
>>>>> display gpu/prime importer, as tested on a 2008 or 2009
>>>>> MacBookPro dual-Nvidia laptop.
>>>>>
>>>>> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
>>>>> importer/display gpu, but very slow as prime exporter/render offload,
>>>>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>>>>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>>>>> to draw a textured triangle strip to detile the gpu renderbuffer and
>>>>> copy it into GTT. As drawing a textured fullscreen quad is normally
>>>>> much faster, something special seems to be going on there wrt. DMA?
>>>>> However, i don't have a realistic real Enduro test setup with AMD
>>>>> iGPU + dGPU, only this cobbled together dual HD-5770's in a MacPro,
>>>>> so this could be wrong.
>>>>>
>>>>> thanks,
>>>>> -mario
>>>>>
>>>>
>>>>
>>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  2:32         ` Michel Dänzer
@ 2016-08-18  7:49           ` Christian König
  2016-08-26 20:07           ` Mario Kleiner
  1 sibling, 0 replies; 24+ messages in thread
From: Christian König @ 2016-08-18  7:49 UTC (permalink / raw)
  To: Michel Dänzer, Mario Kleiner, Alex Deucher
  Cc: Deucher, Alexander, Dave Airlie, Jerome Glisse, Ben Skeggs,
	Maling list - DRI developers

Am 18.08.2016 um 04:32 schrieb Michel Dänzer:
> On 18/08/16 08:51 AM, Mario Kleiner wrote:
>> There is this other approach from NVidia's Alex Goins for their
>> proprietary driver, whose patches landed in the X-Server 1.19 master
>> branch a couple of weeks ago. I haven't read his patches in detail yet,
>> and i so far couldn't successfully test them with the reference
>> implementation in modesetting ddx 1.19. Afaik there the display gpu
>> exports a pair of scanout friendly, page flipping compatible dmabufs (i
>> assume linear, contiguous, accessible by the display engines),
> FWIW, that wouldn't be possible with our "older" GPUs which can't scan
> out from GTT: A BO can be either shared with another GPU or scanout
> friendly, not both at the same time.

And even for newer GPUs it is quite complicated to set up.

As far as I understood it you need to make sure that at least:
1. A whole line buffer is contiguous. E.g. if you want to scan out 
1920x1080 32bpp without tiling you need 1920*4=7680 bytes of linear 
memory. The result is that you need to specially allocate your GTT buffer.
2. You can't use multiple layer page tables for the system domain (we 
already do this).
3. The MC needs to guarantee enough PCIe bandwidth for the CRTC. This 
means you need to reprogram some priorities in the MC differently, which 
can only be done when the whole GPU is idle and for which we haven't 
released documentation at all.

But keep in mind that this is only *AFAIK* and from a document on how 
the DCE works I read quite a while ago.
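
As a rough illustration of the arithmetic in point 1, here is a minimal
C sketch. The helper names and the 4 KiB page size are assumptions made
for this example only; they are not taken from the radeon/amdgpu code.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed page size for this example; real systems may differ. */
#define EXAMPLE_PAGE_SIZE 4096u

/* Bytes of linear memory needed for one untiled scanline. */
static uint32_t linear_line_bytes(uint32_t width_px, uint32_t bits_per_pixel)
{
    assert(bits_per_pixel % 8 == 0);
    return width_px * (bits_per_pixel / 8);
}

/* Number of pages one scanline spans if it starts on a page boundary. */
static uint32_t pages_per_line(uint32_t line_bytes)
{
    return (line_bytes + EXAMPLE_PAGE_SIZE - 1) / EXAMPLE_PAGE_SIZE;
}
```

For 1920 pixels at 32bpp, linear_line_bytes() gives 7680 bytes, which
spans two such pages. That would be consistent with the need for a
specially allocated GTT buffer.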

Regards,
Christian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  7:41         ` Christian König
@ 2016-08-18  7:52           ` Michel Dänzer
  2016-08-18  8:20             ` Christian König
  0 siblings, 1 reply; 24+ messages in thread
From: Michel Dänzer @ 2016-08-18  7:52 UTC (permalink / raw)
  To: Christian König, Mario Kleiner
  Cc: alexander.deucher, airlied, jglisse, bskeggs, dri-devel

On 18/08/16 04:41 PM, Christian König wrote:
>> Afaiu the prime importing display gpu generates its own gem buffer
>> handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather
>> tables to access the dmabuf in system ram. As far as page flipping is
>> concerned, so far those gem buffers / radeon_bo's aren't treated any
>> different than native ones. During pageflip setup they get pinned into
>> VRAM, which moves (=copies) their content from the RAM dmabuf backing
>> store into VRAM. 
> 
> Your understanding isn't correct. Buffers imported using prime always
> stay in GTT, they can't be moved to VRAM.

That's the theory, but based on Mario's description it's clear that
there is at least one bug which either actually allows a shared buffer
to be moved to VRAM, or at least doesn't propagate the error correctly,
so the page flip operation "succeeds".


> It's the DDX which copies the buffer content from the imported prime
> handle into a native one which is enabled to scan out.

There is no such code which could explain what Mario is seeing.


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  7:52           ` Michel Dänzer
@ 2016-08-18  8:20             ` Christian König
  2016-08-18  8:26               ` Michel Dänzer
  0 siblings, 1 reply; 24+ messages in thread
From: Christian König @ 2016-08-18  8:20 UTC (permalink / raw)
  To: Michel Dänzer, Mario Kleiner
  Cc: alexander.deucher, airlied, jglisse, bskeggs, dri-devel

Am 18.08.2016 um 09:52 schrieb Michel Dänzer:
> On 18/08/16 04:41 PM, Christian König wrote:
>>> Afaiu the prime importing display gpu generates its own gem buffer
>>> handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather
>>> tables to access the dmabuf in system ram. As far as page flipping is
>>> concerned, so far those gem buffers / radeon_bo's aren't treated any
>>> different than native ones. During pageflip setup they get pinned into
>>> VRAM, which moves (=copies) their content from the RAM dmabuf backing
>>> store into VRAM.
>> Your understanding isn't correct. Buffers imported using prime always
>> stay in GTT, they can't be moved to VRAM.
> That's the theory, but based on Mario's description it's clear that
> there is at least one bug which either actually allows a shared buffer
> to be moved to VRAM, or at least doesn't propagate the error correctly,
> so the page flip operation "succeeds".
>
>
>> It's the DDX which copies the buffer content from the imported prime
>> handle into a native one which is enabled to scan out.
> There is no such code which could explain what Mario is seeing.

How should this work then otherwise?

I agree that I don't understand fully either what is happening here, but 
I find it quite unlikely that we actually scan out from system memory 
without the proper hardware setup.

On the other hand, it could be possible that we accidentally move a prime 
imported buffer to VRAM, but this would clearly be a rather severe bug 
which we hopefully would have noticed already.

Any other idea what actually happens here?

Regards,
Christian.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  8:20             ` Christian König
@ 2016-08-18  8:26               ` Michel Dänzer
  0 siblings, 0 replies; 24+ messages in thread
From: Michel Dänzer @ 2016-08-18  8:26 UTC (permalink / raw)
  To: Christian König, Mario Kleiner
  Cc: alexander.deucher, airlied, jglisse, bskeggs, dri-devel

On 18/08/16 05:20 PM, Christian König wrote:
> Am 18.08.2016 um 09:52 schrieb Michel Dänzer:
>> On 18/08/16 04:41 PM, Christian König wrote:
>>>> Afaiu the prime importing display gpu generates its own gem buffer
>>>> handle (prime_fd_to_handle) from that dmabuf, importing scatter-gather
>>>> tables to access the dmabuf in system ram. As far as page flipping is
>>>> concerned, so far those gem buffers / radeon_bo's aren't treated any
>>>> different than native ones. During pageflip setup they get pinned into
>>>> VRAM, which moves (=copies) their content from the RAM dmabuf backing
>>>> store into VRAM.
>>> Your understanding isn't correct. Buffers imported using prime always
>>> stay in GTT, they can't be moved to VRAM.
>> That's the theory, but based on Mario's description it's clear that
>> there is at least one bug which either actually allows a shared buffer
>> to be moved to VRAM, or at least doesn't propagate the error correctly,
>> so the page flip operation "succeeds".
>>
>>
>>> It's the DDX which copies the buffer content from the imported prime
>>> handle into a native one which is enabled to scan out.
>> There is no such code which could explain what Mario is seeing.
> 
> How should this work then otherwise?

[...]

> On the other hand that we accidentally move a prime imported buffer to
> VRAM could be possible, but this would clearly be a rather severe bug we
> hopefully have noticed already.

That's what seems to be happening, based on Mario's description and patches.
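
As a rough illustration of the failure mode described in Mario's cover
letter, the following toy model in C shows why a pin that skips the copy
leaves stale content on screen. Everything in it (the toy_* names, the
struct layout, the frame counters) is hypothetical and made up for this
sketch; it is not the radeon, nouveau or TTM code.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model only: NOT the real driver or TTM API. */
enum toy_domain { TOY_DOMAIN_GTT, TOY_DOMAIN_VRAM };

struct toy_bo {
    enum toy_domain domain; /* where the driver believes the content lives */
    int pin_count;          /* number of active pins */
    int dmabuf_frame;       /* frame currently in the dmabuf (system RAM) */
    int vram_frame;         /* frame held by the VRAM copy */
};

/* Pin for scanout: copy RAM -> VRAM only if the bo is not in VRAM yet. */
static void toy_pin_vram(struct toy_bo *bo)
{
    if (bo->domain != TOY_DOMAIN_VRAM) {
        bo->vram_frame = bo->dmabuf_frame; /* models the GTT -> VRAM copy */
        bo->domain = TOY_DOMAIN_VRAM;
    }
    bo->pin_count++;
}

/* Unpin: the placement stays as it is, which is the root of the problem. */
static void toy_unpin(struct toy_bo *bo)
{
    assert(bo->pin_count > 0);
    bo->pin_count--;
}

/* Models the workaround: flag the bo as GTT again without copying back. */
static void toy_force_gtt(struct toy_bo *bo)
{
    assert(bo->pin_count == 0);
    bo->domain = TOY_DOMAIN_GTT;
}

/* One flip: offload gpu renders new_frame, display gpu pins and shows it.
 * Returns the frame number that actually reaches the display. */
static int toy_flip(struct toy_bo *bo, int new_frame, bool use_workaround)
{
    int shown;

    bo->dmabuf_frame = new_frame;
    if (use_workaround)
        toy_force_gtt(bo);
    toy_pin_vram(bo);
    shown = bo->vram_frame;
    toy_unpin(bo);
    return shown;
}
```

In this model, flipping to the same bo a second time without the
workaround still shows the first frame, while a flip with the workaround
shows the frame that was just rendered.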


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  2:23 ` Michel Dänzer
@ 2016-08-18 19:21   ` Marek Olšák
  2016-08-26 20:10     ` Mario Kleiner
  2016-08-26 19:57   ` Mario Kleiner
  1 sibling, 1 reply; 24+ messages in thread
From: Marek Olšák @ 2016-08-18 19:21 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: dri-devel, Jerome Glisse, Ben Skeggs, Deucher, Alexander, Dave Airlie

On Thu, Aug 18, 2016 at 4:23 AM, Michel Dänzer <michel@daenzer.net> wrote:
> Maybe the rasterization as two triangles results in bad PCIe bandwidth
> utilization. Using the asynchronous DMA engine for these transfers would
> probably be ideal, but having the 3D engine rasterize a single rectangle
> (either using the rectangle primitive or a large triangle with scissor)
> might already help.

There is only one thing that's bad for PCIe when the surface is
linear: the 3D engine. Disabling all but the first shader engine and
all but the first 2 RBs should improve performance for blits from VRAM
to GTT. The closed driver does that, but I don't remember if the
destination must be linear, must be in GTT, or both. In any case, SDMA
should still be the best for VRAM->GTT blits.

Marek

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  2:23 ` Michel Dänzer
  2016-08-18 19:21   ` Marek Olšák
@ 2016-08-26 19:57   ` Mario Kleiner
  2016-08-29  3:16     ` Michel Dänzer
  1 sibling, 1 reply; 24+ messages in thread
From: Mario Kleiner @ 2016-08-26 19:57 UTC (permalink / raw)
  To: Michel Dänzer
  Cc: alexander.deucher, airlied, jglisse, bskeggs, dri-devel

To pick this up again after a week of manic testing :)

On 08/18/2016 04:23 AM, Michel Dänzer wrote:
> On 18/08/16 01:12 AM, Mario Kleiner wrote:
>>
>> Intel as display gpu + nouveau for render offload worked nicely
>> on intel-ddx with page flipping, proper timing, dmabuf fence sync
>> and all.
>
> How about with AMD instead of nouveau in this case?
>

I don't have any real AMD Enduro laptop with either Intel + AMD or AMD + 
AMD atm., so i tested with my hacked up setups, but there things look 
very good:

a) A standard PC with Intel Haswell + AMD Tonga Pro R9 380. Seems to 
work correctly, page-flipping used, no visual artifacts or other 
problems, my measurement equipment also shows perfect timing and no 
glitches. Performance is very good, even without Marek's recent SDMA + 
PRIME patch series. It seems, though, that with his patches some of the 
many criteria for using it don't get satisfied, so it uses a fallback 
path on my machine.

One thing that confuses me so far is that visual results and measurement 
suggest it works nicely, properly serializing the rendering/detiling 
blit and the pageflip. But when i ftrace the Intel driver's 
reservation_object_wait_timeout_rcu() call, where it normally waits for 
the dmabuf fence to complete, i never see it blocking for more than 
a few dozen microseconds, and i couldn't find any other place where it 
blocks on detiling blit completion yet. Iow. it seems to work correctly 
in practice, but i don't know where it actually blocks. Could also be 
that the flip work func in Intel's driver just executes after the 
detiling blit has already completed.

b) A MacPro with dual Radeon HD-5770 and NVidia GeForce, and my pageflip 
hacks applied. I ported Marek's Mesa SDMA patch to r600, and with that i 
get very good performance for AMD Evergreen as renderoffload gpu both 
for the NVidia + AMD and AMD + AMD combo. So this solved the performance 
problems on the older gpus. I assume Intel + old radeon-kms would just 
behave equally well. So thanks Marek, that was perfect!
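
For perspective on the 16 msecs per 1920x1080 framebuffer reported
earlier in the thread, here is a small C sketch of the arithmetic. It
assumes 4 bytes per pixel, which the thread does not state for that
measurement, and the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one linear framebuffer in bytes. */
static uint64_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel)
{
    return (uint64_t)width * height * bytes_per_pixel;
}

/* Effective throughput in bytes per second when `bytes` are copied
 * in `copy_usecs` microseconds. */
static uint64_t copy_bytes_per_sec(uint64_t bytes, uint64_t copy_usecs)
{
    assert(copy_usecs > 0);
    return bytes * 1000000ull / copy_usecs;
}
```

Under that assumption one frame is 8294400 bytes, and copying it in
16 msecs corresponds to roughly 518 MB/s, which is only just enough to
keep up with a 60 Hz display.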

I guess that means we are really good now wrt. renderoffload whenever an 
Intel iGPU is used for display, regardless if nouveau or AMD is used as 
dGPU :)

>
>> Turns out that prime + page flipping currently doesn't work
>> on nouveau and amd. The first offload rendered images from
>> the imported dmabufs show up properly, but then the display
>> is stuck alternating between the first two or three rendered
>> frames.
>>
>> The problem is that during the pageflip ioctl we pin the
>> dmabuf into VRAM in preparation for scanout, then unpin it
>> when we are done with it at next flip, but the buffer stays
>> in the VRAM memory domain.
>
> Sounds like you found a bug here: BOs which are being shared between
> different GPUs should always be pinned to GTT, moving them to VRAM (and
> consequently the page flip) should fail.
>

Seems so, although i hoped i was fixing a bug, not exploiting a 
loophole. In practice i haven't observed trouble with the hack so far. I 
haven't looked deeply enough into how the dma api below dmabuf 
operates, so this is just guesswork, but i suspect the reason that this 
doesn't blow up in an obvious way is that if the render offload gpu 
exports the dmabuf then the pages get pinned/locked into system RAM, so 
the pages can't move around or get paged out to swap, as long as the 
dmabuf stays exported. When the dmabuf importing AMD or nouveau display 
gpu then moves the bo from GTT to VRAM (or pseudo-moves it back with my 
hack) all that changes is some pin refcount for the RAM pages, but the 
refcount always stays non-zero and system RAM isn't freed or moved 
around during the session. I just wonder if this bug couldn't somehow be 
turned into a proper feature?

I'm tempted to keep my patches as a temporary stop gap measure in some 
kernel on GitHub, so my users could use them to get NVidia+NVidia or at 
least old AMD+AMD setups with radeon-kms + ati-ddx working well enough 
for their research work until some proper solution comes around. But if 
you think there is some major way this could blow up, corrupt data, or 
hang/crash during normal use, then better not. I don't know how many of 
my users have such systems, as my advice to them so far was to "stay the 
hell away from anything with hybrid graphics/Optimus/Enduro in its name 
if they value their work". Now i could change my purchase advice to 
"anything hybrid with a Intel iGPU is probably ok in terms of 
correctness/timing/performance for not too demanding performance needs".

> The latest versions of DCE support scanning out from GTT, so that might
> be a good solution at least for Carrizo and newer APUs, not sure it
> makes sense for dGPUs though.

That would be good to have. But that means DCE-11 or later only? What is 
the constraint on older parts, does it need contiguous memory? I 
personally don't care about the dGPU case, i only use these dGPUs for 
testing because i don't have access to any real Enduro laptops with APUs.

-mario

>
>
>> AMD, as tested with dual Radeon HD-5770 seems to be fast as prime
>> importer/display gpu, but very slow as prime exporter/render offload,
>> e.g., taking 16 msecs to get a 1920x1080 framebuffer into RAM. Seems
>> that Mesa's blitImage function is the slow bit here. On r600 it seems
>> to draw a textured triangle strip to detile the gpu renderbuffer and
>> copy it into GTT. As drawing a textured fullscreen quad is normally
>> much faster, something special seems to be going on there wrt. DMA?
>
> Maybe the rasterization as two triangles results in bad PCIe bandwidth
> utilization. Using the asynchronous DMA engine for these transfers would
> probably be ideal, but having the 3D engine rasterize a single rectangle
> (either using the rectangle primitive or a large triangle with scissor)
> might already help.
>
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18  2:32         ` Michel Dänzer
  2016-08-18  7:49           ` Christian König
@ 2016-08-26 20:07           ` Mario Kleiner
  2016-08-29  3:06             ` Michel Dänzer
  1 sibling, 1 reply; 24+ messages in thread
From: Mario Kleiner @ 2016-08-26 20:07 UTC (permalink / raw)
  To: Michel Dänzer, Alex Deucher
  Cc: Deucher, Alexander, Dave Airlie, Jerome Glisse, Ben Skeggs,
	Maling list - DRI developers

On 08/18/2016 04:32 AM, Michel Dänzer wrote:
> On 18/08/16 08:51 AM, Mario Kleiner wrote:
>>
>> That's what the ati-ddx/amdgpu-ddx does at the moment, as it detects the
>> mismatch in tiling flags and uses the DRI3/Present copy path instead of
>> the pageflip path. The problem is that the server's Present
>> implementation doesn't request a vsync'ed start of the copy operation [...]
>
> It waits for vblank before starting the copy.
>

Yes, a vblank event triggers the present_execute in the server. But all 
the latency from vblank event dispatch to the copy command packet 
hitting the gpu is still way too bad to avoid tearing. I tried again and 
couldn't find a single intel/amd/nvidia gpu here that doesn't tear more 
or less badly depending on load with DRI3/Present Copyswaps. Even 
tearfree wouldn't be good enough for my kind of applications as crucial 
timing/timestamps could still be off frequently by at least 1 frame.

>
>> There is this other approach from NVidia's Alex Goins for their
>> proprietary driver, whose patches landed in the X-Server 1.19 master
>> branch a couple of weeks ago. I haven't read his patches in detail yet,
>> and i so far couldn't successfully test them with the reference
>> implementation in modesetting ddx 1.19. Afaik there the display gpu
>> exports a pair of scanout friendly, page flipping compatible dmabufs (i
>> assume linear, contiguous, accessible by the display engines),
>
> FWIW, that wouldn't be possible with our "older" GPUs which can't scan
> out from GTT: A BO can be either shared with another GPU or scanout
> friendly, not both at the same time.
>

Ok, good to know.

>
>> and the offload gpu imports those and renders into them. That saves
>> one extra copy, so should be somewhat more efficient.
>
> Using two shared buffers actually isn't as efficient as possible wrt
> inter-GPU bandwidth.
>

Out of interest, why? You'd have only one detiling copy VRAM -> RAM? Or 
is it about switching some kind of GTT mappings with two buffers that is 
inefficient?

>
>> Setting it up seems to be more involved and less flexible though. So far
>> i couldn't make it work here for testing. Maybe bugs, maybe mistakes on
>> my side, maybe i just have the wrong hardware for it.
>
> Yeah, my impression has been it's a rather complicated solution geared
> towards the Intel iGPU + proprietary nVidia use case.
>
>

Setting up output source/output sink is not fun, as i learned now, 
rather clumsy and complex compared to render offload. I hope the real 
thing will come with some fool-proof one-click setup GUI, otherwise i 
don't have great hopes, given the technical skill level of my users. I 
still didn't manage to get it working, not even with the new Nvidia 
proprietary beta drivers on a real Optimus laptop.

-mario

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-18 19:21   ` Marek Olšák
@ 2016-08-26 20:10     ` Mario Kleiner
  2016-08-26 20:33       ` Alex Deucher
  0 siblings, 1 reply; 24+ messages in thread
From: Mario Kleiner @ 2016-08-26 20:10 UTC (permalink / raw)
  To: Marek Olšák, Michel Dänzer
  Cc: Deucher, Alexander, Dave Airlie, Jerome Glisse, Ben Skeggs, dri-devel

On 08/18/2016 09:21 PM, Marek Olšák wrote:
> On Thu, Aug 18, 2016 at 4:23 AM, Michel Dänzer <michel@daenzer.net> wrote:
>> Maybe the rasterization as two triangles results in bad PCIe bandwidth
>> utilization. Using the asynchronous DMA engine for these transfers would
>> probably be ideal, but having the 3D engine rasterize a single rectangle
>> (either using the rectangle primitive or a large triangle with scissor)
>> might already help.
>
> There is only one thing that's bad for PCIe when the surface is
> linear: the 3D engine. Disabling all but the first shader engine and
> all but the first 2 RBs should improve performance for blits from VRAM
> to GTT. The closed driver does that, but I don't remember if the
> destination must be linear, must be in GTT, or both. In any case, SDMA
> should still be the best for VRAM->GTT blits.
>
> Marek
>

Friday evening education question:

So if you have multiple render backends active they compete for PCIe bus 
access and some kind of "thrashing" happens in the arbitration, 
drastically reducing the bandwidth?

thanks,
-mario

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-26 20:10     ` Mario Kleiner
@ 2016-08-26 20:33       ` Alex Deucher
  0 siblings, 0 replies; 24+ messages in thread
From: Alex Deucher @ 2016-08-26 20:33 UTC (permalink / raw)
  To: Mario Kleiner
  Cc: Michel Dänzer, dri-devel, Jerome Glisse, Ben Skeggs,
	Deucher, Alexander, Dave Airlie

On Fri, Aug 26, 2016 at 4:10 PM, Mario Kleiner
<mario.kleiner.de@gmail.com> wrote:
> On 08/18/2016 09:21 PM, Marek Olšák wrote:
>>
>> On Thu, Aug 18, 2016 at 4:23 AM, Michel Dänzer <michel@daenzer.net> wrote:
>>>
>>> Maybe the rasterization as two triangles results in bad PCIe bandwidth
>>> utilization. Using the asynchronous DMA engine for these transfers would
>>> probably be ideal, but having the 3D engine rasterize a single rectangle
>>> (either using the rectangle primitive or a large triangle with scissor)
>>> might already help.
>>
>>
>> There is only one thing that's bad for PCIe when the surface is
>> linear: the 3D engine. Disabling all but the first shader engine and
>> all but the first 2 RBs should improve performance for blits from VRAM
>> to GTT. The closed driver does that, but I don't remember if the
>> destination must be linear, must be in GTT, or both. In any case, SDMA
>> should still be the best for VRAM->GTT blits.
>>
>> Marek
>>
>
> Friday evening education question:
>
> So if you have multiple render backends active they compete for PCIe bus
> access and some kind of "thrashing" happens in the arbitration, drastically
> reducing the bandwidth?

I think it has more to do with the access patterns.  The requests
can't be scheduled as efficiently compared to contiguous linear
accesses.

Alex

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-26 20:07           ` Mario Kleiner
@ 2016-08-29  3:06             ` Michel Dänzer
  0 siblings, 0 replies; 24+ messages in thread
From: Michel Dänzer @ 2016-08-29  3:06 UTC (permalink / raw)
  To: Mario Kleiner, Alex Deucher
  Cc: Deucher, Alexander, Dave Airlie, Jerome Glisse, Ben Skeggs,
	Mailing list - DRI developers

On 27/08/16 05:07 AM, Mario Kleiner wrote:
> On 08/18/2016 04:32 AM, Michel Dänzer wrote:
>> On 18/08/16 08:51 AM, Mario Kleiner wrote:
>>>
>>> and the offload gpu imports those and renders into them. That saves
>>> one extra copy, so should be somewhat more efficient.
>>
>> Using two shared buffers actually isn't as efficient as possible wrt
>> inter-GPU bandwidth.
> 
> Out of interest, why? You'd have only one detiling copy VRAM -> RAM?

Yeah, that's basically it. With a single shared buffer, only the parts
which have changed since last time need to be copied between the GPUs;
the slave GPU can copy the other changed parts from its other local
scanout pixmap (with TearFree enabled; note that this isn't quite
implemented yet in our drivers for slave output, but I'm planning to do
it soon). With two shared pixmaps, some changed parts have to be copied
between GPUs several times.
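
To illustrate the point above, here is a minimal sketch (not taken from any driver). It assumes the frame is split into at most 32 tiles and that each frame's damage is a bitmask of changed tiles; the function names and the tile model are hypothetical. With one shared buffer, only the tiles damaged in the current frame cross between the GPUs. With two alternating shared buffers, the buffer being refreshed still holds the frame before last, so the previous frame's damage has to cross again.

```c
#include <stddef.h>
#include <stdint.h>

/* Count set bits; each bit stands for one damaged tile (hypothetical model). */
static unsigned tiles_in(uint32_t mask)
{
	unsigned n = 0;

	while (mask) {
		mask &= mask - 1;	/* clear the lowest set bit */
		n++;
	}
	return n;
}

/*
 * One shared buffer: it is refreshed every frame, so only the tiles
 * damaged in the current frame are copied between the GPUs.
 */
unsigned long inter_gpu_tiles_one_shared(const uint32_t *damage, size_t frames)
{
	unsigned long total = 0;

	for (size_t i = 0; i < frames; i++)
		total += tiles_in(damage[i]);
	return total;
}

/*
 * Two shared buffers used alternately: the buffer refreshed for frame i
 * last held frame i - 2, so it needs the damage of frames i - 1 and i.
 * Assumption: the first frame's damage covers whatever both buffers need.
 */
unsigned long inter_gpu_tiles_two_shared(const uint32_t *damage, size_t frames)
{
	unsigned long total = 0;

	for (size_t i = 0; i < frames; i++) {
		uint32_t stale = i > 0 ? damage[i - 1] : 0;

		total += tiles_in(damage[i] | stale);
	}
	return total;
}
```

In this model any tile that changes in one frame and not in the next is counted twice in the two-buffer case, which is the "copied between GPUs several times" effect described above.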


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

* Re: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-26 19:57   ` Mario Kleiner
@ 2016-08-29  3:16     ` Michel Dänzer
  2016-08-29 13:20       ` Deucher, Alexander
  0 siblings, 1 reply; 24+ messages in thread
From: Michel Dänzer @ 2016-08-29  3:16 UTC (permalink / raw)
  To: Mario Kleiner; +Cc: alexander.deucher, airlied, jglisse, bskeggs, dri-devel

On 27/08/16 04:57 AM, Mario Kleiner wrote:
> On 08/18/2016 04:23 AM, Michel Dänzer wrote:
>> On 18/08/16 01:12 AM, Mario Kleiner wrote:
> 
> One thing that confuses me so far is that visual results and measurement
> suggest it works nicely, properly serializing the rendering/detiling
> blit and the pageflip. But when i ftrace the Intel driver's
> reservation_object_wait_timeout_rcu() call where it normally waits for
> the dmabuf fence to complete then i never see it blocking for more than
> a few dozen microseconds, and i couldn't find any other place where it
> blocks on detiling blit completion yet. Iow. it seems to work correctly
> in practice, but i don't know where it actually blocks.

It actually doesn't work correctly in all cases yet:
https://bugs.freedesktop.org/show_bug.cgi?id=95472


>>> Turns out that prime + page flipping currently doesn't work
>>> on nouveau and amd. The first offload rendered images from
>>> the imported dmabufs show up properly, but then the display
>>> is stuck alternating between the first two or three rendered
>>> frames.
>>>
>>> The problem is that during the pageflip ioctl we pin the
>>> dmabuf into VRAM in preparation for scanout, then unpin it
>>> when we are done with it at next flip, but the buffer stays
>>> in the VRAM memory domain.
>>
>> Sounds like you found a bug here: BOs which are being shared between
>> different GPUs should always be pinned to GTT, moving them to VRAM (and
>> consequently the page flip) should fail.
> 
> Seems so, although i hoped i was fixing a bug, not exploiting a
> loophole. In practice i haven't observed trouble with the hack so far. I
> haven't looked deeply enough into how the dma api below dmabuf
> operates, so this is just guesswork, but i suspect the reason that this
> doesn't blow up in an obvious way is that if the render offload gpu
> exports the dmabuf then the pages get pinned/locked into system RAM, so
> the pages can't move around or get paged out to swap, as long as the
> dmabuf stays exported. When the dmabuf importing AMD or nouveau display
> gpu then moves the bo from GTT to VRAM (or pseudo-moves it back with my
> hack) all that changes is some pin refcount for the RAM pages, but the
> refcount always stays non-zero and system RAM isn't freed or moved
> around during the session. I just wonder if this bug couldn't somehow be
> turned into a proper feature?

I'm afraid not; BOs which are being shared between devices are supposed
to be pinned to GTT, and pinned BOs aren't supposed to move.
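
As a rough illustration of that rule, here is a toy model. It is not the radeon, amdgpu or nouveau code; `struct toy_bo`, `toy_bo_pin()` and `toy_bo_unpin()` are hypothetical names made up for this sketch. It assumes a pin request names the target domain, that a shared BO may only be pinned to GTT, and that a BO with a non-zero pin count may not change domain.

```c
#include <errno.h>
#include <stdbool.h>

enum mem_domain { DOMAIN_GTT, DOMAIN_VRAM };

/* Hypothetical buffer object, reduced to what the rule needs. */
struct toy_bo {
	enum mem_domain domain;	/* where the BO currently lives */
	unsigned pin_count;	/* > 0 means the BO must not move */
	bool shared;		/* exported or imported via dma-buf */
};

/* Pin the BO into the requested domain, or refuse if that breaks the rule. */
int toy_bo_pin(struct toy_bo *bo, enum mem_domain domain)
{
	/* BOs shared between devices are only allowed in GTT. */
	if (bo->shared && domain != DOMAIN_GTT)
		return -EINVAL;

	if (bo->pin_count > 0) {
		/* Already pinned: same domain nests, another domain fails. */
		if (bo->domain != domain)
			return -EINVAL;
		bo->pin_count++;
		return 0;
	}

	/* Unpinned: the BO may be moved before it is pinned. */
	bo->domain = domain;
	bo->pin_count = 1;
	return 0;
}

/* Drop one pin reference; the BO may move again once the count is zero. */
void toy_bo_unpin(struct toy_bo *bo)
{
	if (bo->pin_count > 0)
		bo->pin_count--;
}
```

Under these assumptions a page flip that tries to pin an imported scanout BO into VRAM gets -EINVAL instead of silently moving it, which is the behaviour described above as the expected one.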

However, something similar to your patches could be done in the DDX
drivers, using the dedicated scanout pixmap mechanism.


>> The latest versions of DCE support scanning out from GTT, so that might
>> be a good solution at least for Carrizo and newer APUs, not sure it
>> makes sense for dGPUs though.
> 
> That would be good to have. But that means DCE-11 or later only? What is
> the constraint on older parts, does it need contiguous memory?

Presumably. Anyway, from Christian's description it sounds like it'll be
tricky to get this working even with current APUs. :(


-- 
Earthling Michel Dänzer               |               http://www.amd.com
Libre software enthusiast             |             Mesa and X developer

* RE: "Fixes" for page flipping under PRIME on AMD & nouveau
  2016-08-29  3:16     ` Michel Dänzer
@ 2016-08-29 13:20       ` Deucher, Alexander
  0 siblings, 0 replies; 24+ messages in thread
From: Deucher, Alexander @ 2016-08-29 13:20 UTC (permalink / raw)
  To: 'Michel Dänzer', Mario Kleiner
  Cc: airlied, jglisse, bskeggs, dri-devel

> -----Original Message-----
> From: Michel Dänzer [mailto:michel@daenzer.net]
> Sent: Sunday, August 28, 2016 11:17 PM
> To: Mario Kleiner
> Cc: dri-devel@lists.freedesktop.org; jglisse@redhat.com;
> bskeggs@redhat.com; Deucher, Alexander; airlied@redhat.com
> Subject: Re: "Fixes" for page flipping under PRIME on AMD & nouveau
> 
> On 27/08/16 04:57 AM, Mario Kleiner wrote:
> > On 08/18/2016 04:23 AM, Michel Dänzer wrote:
> >> On 18/08/16 01:12 AM, Mario Kleiner wrote:
> >
> > One thing that confuses me so far is that visual results and measurement
> > suggest it works nicely, properly serializing the rendering/detiling
> > blit and the pageflip. But when i ftrace the Intel driver's
> > reservation_object_wait_timeout_rcu() call where it normally waits for
> > the dmabuf fence to complete then i never see it blocking for more than
> > a few dozen microseconds, and i couldn't find any other place where it
> > blocks on detiling blit completion yet. Iow. it seems to work correctly
> > in practice, but i don't know where it actually blocks.
> 
> It actually doesn't work correctly in all cases yet:
> https://bugs.freedesktop.org/show_bug.cgi?id=95472
> 
> 
> >>> Turns out that prime + page flipping currently doesn't work
> >>> on nouveau and amd. The first offload rendered images from
> >>> the imported dmabufs show up properly, but then the display
> >>> is stuck alternating between the first two or three rendered
> >>> frames.
> >>>
> >>> The problem is that during the pageflip ioctl we pin the
> >>> dmabuf into VRAM in preparation for scanout, then unpin it
> >>> when we are done with it at next flip, but the buffer stays
> >>> in the VRAM memory domain.
> >>
> >> Sounds like you found a bug here: BOs which are being shared between
> >> different GPUs should always be pinned to GTT, moving them to VRAM (and
> >> consequently the page flip) should fail.
> >
> > Seems so, although i hoped i was fixing a bug, not exploiting a
> > loophole. In practice i haven't observed trouble with the hack so far. I
> > haven't looked deeply enough into how the dma api below dmabuf
> > operates, so this is just guesswork, but i suspect the reason that this
> > doesn't blow up in an obvious way is that if the render offload gpu
> > exports the dmabuf then the pages get pinned/locked into system RAM, so
> > the pages can't move around or get paged out to swap, as long as the
> > dmabuf stays exported. When the dmabuf importing AMD or nouveau display
> > gpu then moves the bo from GTT to VRAM (or pseudo-moves it back with my
> > hack) all that changes is some pin refcount for the RAM pages, but the
> > refcount always stays non-zero and system RAM isn't freed or moved
> > around during the session. I just wonder if this bug couldn't somehow be
> > turned into a proper feature?
> 
> I'm afraid not; BOs which are being shared between devices are supposed
> to be pinned to GTT, and pinned BOs aren't supposed to move.
> 
> However, something similar to your patches could be done in the DDX
> drivers, using the dedicated scanout pixmap mechanism.
> 
> 
> >> The latest versions of DCE support scanning out from GTT, so that might
> >> be a good solution at least for Carrizo and newer APUs, not sure it
> >> makes sense for dGPUs though.
> >
> > That would be good to have. But that means DCE-11 or later only? What is
> > the constraint on older parts, does it need contiguous memory?
> 
> Presumably. Anyway, from Christian's description it sounds like it'll be
> tricky to get this working even with current APUs. :(

It only works for DCE11 APUs (not dGPUs) using single level page tables for gart and has fairly strict alignment requirements.  The watermark setup and bandwidth management also have much stricter requirements.  I think DAL has most of what is needed in place on the display side assuming the rest of the stack provides a buffer with the right alignment.

Alex


Thread overview: 24+ messages
2016-08-17 16:12 "Fixes" for page flipping under PRIME on AMD & nouveau Mario Kleiner
2016-08-17 16:12 ` [PATCH 1/2] drm/nouveau: Fix pageflipping of PRIME imported scanout bo's Mario Kleiner
2016-08-17 16:12 ` [PATCH 2/2] drm/radeon: " Mario Kleiner
2016-08-17 16:27 ` "Fixes" for page flipping under PRIME on AMD & nouveau Christian König
2016-08-17 16:35   ` Mario Kleiner
2016-08-17 17:02     ` Christian König
2016-08-17 23:29       ` Mario Kleiner
2016-08-18  7:41         ` Christian König
2016-08-18  7:52           ` Michel Dänzer
2016-08-18  8:20             ` Christian König
2016-08-18  8:26               ` Michel Dänzer
2016-08-17 17:43     ` Alex Deucher
2016-08-17 23:51       ` Mario Kleiner
2016-08-18  2:32         ` Michel Dänzer
2016-08-18  7:49           ` Christian König
2016-08-26 20:07           ` Mario Kleiner
2016-08-29  3:06             ` Michel Dänzer
2016-08-18  2:23 ` Michel Dänzer
2016-08-18 19:21   ` Marek Olšák
2016-08-26 20:10     ` Mario Kleiner
2016-08-26 20:33       ` Alex Deucher
2016-08-26 19:57   ` Mario Kleiner
2016-08-29  3:16     ` Michel Dänzer
2016-08-29 13:20       ` Deucher, Alexander
