* [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
@ 2018-02-28  7:21 Monk Liu
       [not found] ` <1519802463-9090-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Monk Liu @ 2018-02-28  7:21 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

recover_vram_from_shadow was found to sometimes execute
in parallel with the SDMA scheduler, so stop all
schedulers before doing the gpu reset/recover
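
As a rough illustration of the ordering the patch enforces (park every
scheduler first, then reset/recover, then unpark them all), here is a
minimal standalone C model; the types and ring names are made up for the
sketch and are not the real amdgpu structures, which operate on
adev->rings[] with kthread_park()/kthread_unpark():

#include <stdbool.h>
#include <stdio.h>

#define MAX_RINGS 4

struct ring {
	const char *name;
	bool parked;
};

static void gpu_recover(struct ring *rings, int timed_out_ring)
{
	int i;

	/* 1) park every scheduler first so nothing races the reset */
	for (i = 0; i < MAX_RINGS; i++) {
		rings[i].parked = true;
		printf("parked %s\n", rings[i].name);
	}

	/* 2) only then do the reset and the VRAM recovery (stubbed) */
	printf("reset + recover vram while all rings are quiescent\n");

	/* 3) resubmit jobs (only the timed-out ring's, or all of them
	 *    when no specific ring is given) and unpark everything */
	for (i = 0; i < MAX_RINGS; i++) {
		if (timed_out_ring < 0 || timed_out_ring == i)
			printf("resubmit jobs of %s\n", rings[i].name);
		rings[i].parked = false;
	}
}

int main(void)
{
	struct ring rings[MAX_RINGS] = {
		{ "gfx", false }, { "sdma0", false },
		{ "sdma1", false }, { "compute", false },
	};

	gpu_recover(rings, 1);	/* pretend sdma0 hit the timeout */
	return 0;
}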

Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40 +++++++++++-------------------
 1 file changed, 15 insertions(+), 25 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 75d1733..e9d81a8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	/* block TTM */
 	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
+
 	/* store modesetting */
 	if (amdgpu_device_has_dc_support(adev))
 		state = drm_atomic_helper_suspend(adev->ddev);
 
-	/* block scheduler */
+	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
 
 		if (!ring || !ring->sched.thread)
 			continue;
 
-		/* only focus on the ring hit timeout if &job not NULL */
+		kthread_park(ring->sched.thread);
+
 		if (job && job->ring->idx != i)
 			continue;
 
-		kthread_park(ring->sched.thread);
 		drm_sched_hw_job_reset(&ring->sched, &job->base);
 
 		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
@@ -2707,33 +2708,22 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 			}
 			dma_fence_put(fence);
 		}
+	}
 
-		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-			struct amdgpu_ring *ring = adev->rings[i];
-
-			if (!ring || !ring->sched.thread)
-				continue;
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
 
-			/* only focus on the ring hit timeout if &job not NULL */
-			if (job && job->ring->idx != i)
-				continue;
+		if (!ring || !ring->sched.thread)
+			continue;
 
+		/* only need recovery sched of the given job's ring
+		 * or all rings (in the case @job is NULL)
+		 * after above amdgpu_reset accomplished
+		 */
+		if ((!job || job->ring->idx == i) && !r)
 			drm_sched_job_recovery(&ring->sched);
-			kthread_unpark(ring->sched.thread);
-		}
-	} else {
-		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-			struct amdgpu_ring *ring = adev->rings[i];
 
-			if (!ring || !ring->sched.thread)
-				continue;
-
-			/* only focus on the ring hit timeout if &job not NULL */
-			if (job && job->ring->idx != i)
-				continue;
-
-			kthread_unpark(adev->rings[i]->sched.thread);
-		}
+		kthread_unpark(ring->sched.thread);
 	}
 
 	if (amdgpu_device_has_dc_support(adev)) {
-- 
2.7.4


* [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
       [not found] ` <1519802463-9090-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28  7:21   ` Monk Liu
       [not found]     ` <1519802463-9090-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28  7:21   ` [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer Monk Liu
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 32+ messages in thread
From: Monk Liu @ 2018-02-28  7:21 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1) create a routine "handle_vram_lost" to do the vram
recovery and call it from amdgpu_device_reset/reset_sriov;
this way no extra parameter is needed to hold the
VRAM LOST information and the related macros can be removed.

2) report a vram_recover failure if it times out, and set the TMO
equal to lockup_timeout when vram_recover runs under SRIOV
runtime mode.

3) report an error if any IP reset fails for SR-IOV
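
The new amdgpu_handle_vram_lost() below relies on the
dma_fence_wait_timeout() return convention: negative means the wait was
interrupted, 0 means it timed out, and a positive value is the remaining
timeout on success. A small standalone model of how that result is
interpreted (wait_stub() is only a stand-in, not the real dma-fence API):

#include <stdio.h>

static long wait_stub(int fence_id, long timeout)
{
	/* pretend fence 2 times out; the others signal immediately */
	return fence_id == 2 ? 0 : timeout;
}

int main(void)
{
	long tmo = 100;	/* stand-in for msecs_to_jiffies(100) */
	long r = 1;
	int i;

	for (i = 0; i < 4; i++) {
		r = wait_stub(i, tmo);
		if (r == 0) {		/* 0 means the wait timed out */
			printf("wait fence %d timeout\n", i);
			break;
		} else if (r < 0) {	/* negative means interrupted */
			printf("wait fence %d interrupted\n", i);
			break;
		}
		/* r > 0: the fence signalled with r time left */
	}

	printf("%s\n", r > 0 ? "recover vram bo from shadow done"
			     : "recover vram bo from shadow failed");
	return 0;
}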

Change-Id: I686e2b6133844c14948c206a2315c064a78c1d9c
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 136 +++++++++++++++--------------
 2 files changed, 72 insertions(+), 68 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 5bddfc1..abbc3f1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -181,10 +181,6 @@ extern int amdgpu_cik_support;
 #define CIK_CURSOR_WIDTH 128
 #define CIK_CURSOR_HEIGHT 128
 
-/* GPU RESET flags */
-#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0)
-#define AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
-
 struct amdgpu_device;
 struct amdgpu_ib;
 struct amdgpu_cs_parser;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index e9d81a8..39ece7f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -1591,6 +1591,8 @@ static int amdgpu_device_ip_reinit_early_sriov(struct amdgpu_device *adev)
 
 			r = block->version->funcs->hw_init(adev);
 			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, r?"failed":"successed");
+			if (r)
+				return r;
 		}
 	}
 
@@ -1624,6 +1626,8 @@ static int amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev)
 
 			r = block->version->funcs->hw_init(adev);
 			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, r?"failed":"successed");
+			if (r)
+				return r;
 		}
 	}
 
@@ -2471,17 +2475,71 @@ static int amdgpu_device_recover_vram_from_shadow(struct amdgpu_device *adev,
 	return r;
 }
 
+static int amdgpu_handle_vram_lost(struct amdgpu_device *adev)
+{
+	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
+	struct amdgpu_bo *bo, *tmp;
+	struct dma_fence *fence = NULL, *next = NULL;
+	long r = 1;
+	int i = 0;
+	long tmo;
+
+	if (amdgpu_sriov_runtime(adev))
+		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
+	else
+		tmo = msecs_to_jiffies(100);
+
+	DRM_INFO("recover vram bo from shadow start\n");
+	mutex_lock(&adev->shadow_list_lock);
+	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
+		next = NULL;
+		amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
+		if (fence) {
+			r = dma_fence_wait_timeout(fence, false, tmo);
+			if (r == 0)
+				pr_err("wait fence %p[%d] timeout\n", fence, i);
+			else if (r < 0)
+				pr_err("wait fence %p[%d] interrupted\n", fence, i);
+			if (r < 1) {
+				dma_fence_put(fence);
+				fence = next;
+				break;
+			}
+			i++;
+		}
+
+		dma_fence_put(fence);
+		fence = next;
+	}
+	mutex_unlock(&adev->shadow_list_lock);
+
+	if (fence) {
+		r = dma_fence_wait_timeout(fence, false, tmo);
+		if (r == 0)
+			pr_err("wait fence %p[%d] timeout\n", fence, i);
+		else if (r < 0)
+			pr_err("wait fence %p[%d] interrupted\n", fence, i);
+
+	}
+	dma_fence_put(fence);
+
+	if (r > 0)
+		DRM_INFO("recover vram bo from shadow done\n");
+	else
+		DRM_ERROR("recover vram bo from shadow failed\n");
+
+	return (r > 0?0:1);
+}
+
 /*
  * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
  *
  * @adev: amdgpu device pointer
- * @reset_flags: output param tells caller the reset result
  *
  * attempt to do soft-reset or full-reset and reinitialize Asic
  * return 0 means successed otherwise failed
 */
-static int amdgpu_device_reset(struct amdgpu_device *adev,
-			       uint64_t* reset_flags)
+static int amdgpu_device_reset(struct amdgpu_device *adev)
 {
 	bool need_full_reset, vram_lost = 0;
 	int r;
@@ -2496,7 +2554,6 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
 			DRM_INFO("soft reset failed, will fallback to full reset!\n");
 			need_full_reset = true;
 		}
-
 	}
 
 	if (need_full_reset) {
@@ -2545,13 +2602,8 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
 		}
 	}
 
-	if (reset_flags) {
-		if (vram_lost)
-			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
-
-		if (need_full_reset)
-			(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
-	}
+	if (!r && ((need_full_reset && !(adev->flags & AMD_IS_APU)) || vram_lost))
+		r = amdgpu_handle_vram_lost(adev);
 
 	return r;
 }
@@ -2560,14 +2612,11 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
  * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
  *
  * @adev: amdgpu device pointer
- * @reset_flags: output param tells caller the reset result
  *
  * do VF FLR and reinitialize Asic
  * return 0 means successed otherwise failed
 */
-static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
-				     uint64_t *reset_flags,
-				     bool from_hypervisor)
+static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, bool from_hypervisor)
 {
 	int r;
 
@@ -2588,28 +2637,20 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
 
 	/* now we are okay to resume SMC/CP/SDMA */
 	r = amdgpu_device_ip_reinit_late_sriov(adev);
+	amdgpu_virt_release_full_gpu(adev, true);
 	if (r)
 		goto error;
 
 	amdgpu_irq_gpu_reset_resume_helper(adev);
 	r = amdgpu_ib_ring_tests(adev);
-	if (r)
-		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
 
-error:
-	/* release full control of GPU after ib test */
-	amdgpu_virt_release_full_gpu(adev, true);
-
-	if (reset_flags) {
-		if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
-			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
-			atomic_inc(&adev->vram_lost_counter);
-		}
-
-		/* VF FLR or hotlink reset is always full-reset */
-		(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
+	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
+		atomic_inc(&adev->vram_lost_counter);
+		r = amdgpu_handle_vram_lost(adev);
 	}
 
+error:
+
 	return r;
 }
 
@@ -2673,42 +2714,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	}
 
 	if (amdgpu_sriov_vf(adev))
-		r = amdgpu_device_reset_sriov(adev, &reset_flags, job ? false : true);
+		r = amdgpu_device_reset_sriov(adev, job ? false : true);
 	else
-		r = amdgpu_device_reset(adev, &reset_flags);
-
-	if (!r) {
-		if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
-			(reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
-			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
-			struct amdgpu_bo *bo, *tmp;
-			struct dma_fence *fence = NULL, *next = NULL;
-
-			DRM_INFO("recover vram bo from shadow\n");
-			mutex_lock(&adev->shadow_list_lock);
-			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
-				next = NULL;
-				amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
-				if (fence) {
-					r = dma_fence_wait(fence, false);
-					if (r) {
-						WARN(r, "recovery from shadow isn't completed\n");
-						break;
-					}
-				}
-
-				dma_fence_put(fence);
-				fence = next;
-			}
-			mutex_unlock(&adev->shadow_list_lock);
-			if (fence) {
-				r = dma_fence_wait(fence, false);
-				if (r)
-					WARN(r, "recovery from shadow isn't completed\n");
-			}
-			dma_fence_put(fence);
-		}
-	}
+		r = amdgpu_device_reset(adev);
 
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
-- 
2.7.4


* [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found] ` <1519802463-9090-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28  7:21   ` [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling Monk Liu
@ 2018-02-28  7:21   ` Monk Liu
       [not found]     ` <1519802463-9090-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28  7:21   ` [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ Monk Liu
  2018-02-28 12:20   ` [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover Christian König
  3 siblings, 1 reply; 32+ messages in thread
From: Monk Liu @ 2018-02-28  7:21 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

at this point the SDMA may be under GPU RESET, so its ring->ready
may not be true; keep going and the GPU scheduler will reschedule
this job if it fails.

also give a warning in copy_buffer when going through direct_submit
while ring->ready is false
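
As a tiny standalone model of the distinction drawn here: the
direct-submit path warns loudly on a not-ready ring, while the scheduled
path is allowed to proceed and be resubmitted after recovery. The names
below are illustrative only, not the amdgpu API:

#include <stdbool.h>
#include <stdio.h>

static int submit(bool direct, bool ring_ready)
{
	if (direct) {
		/* direct submission bypasses the scheduler, so a
		 * not-ready ring deserves a loud warning here */
		if (!ring_ready)
			fprintf(stderr, "WARN: direct submit on a not-ready ring\n");
		return ring_ready ? 0 : -1;
	}
	/* scheduled submission: queue it; the scheduler resubmits the
	 * job once the ring comes back after GPU reset */
	printf("queued; scheduler will (re)submit when the ring is ready\n");
	return 0;
}

int main(void)
{
	submit(true, false);	/* direct path during reset: warn */
	submit(false, false);	/* scheduled path during reset: proceed */
	return 0;
}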

Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
index e38e6db..7b75ac9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
@@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
 	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
 	WARN_ON(job->ibs[0].length_dw > num_dw);
 	if (direct_submit) {
+		WARN_ON(!ring->ready);
 		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
 				       NULL, fence);
 		job->fence = dma_fence_get(*fence);
@@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
 	struct amdgpu_job *job;
 	int r;
 
-	if (!ring->ready) {
-		DRM_ERROR("Trying to clear memory with ring turned off.\n");
-		return -EINVAL;
-	}
-
 	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
 		r = amdgpu_ttm_alloc_gart(&bo->tbo);
 		if (r)
-- 
2.7.4


* [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ
       [not found] ` <1519802463-9090-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28  7:21   ` [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling Monk Liu
  2018-02-28  7:21   ` [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer Monk Liu
@ 2018-02-28  7:21   ` Monk Liu
       [not found]     ` <1519802463-9090-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28 12:20   ` [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover Christian König
  3 siblings, 1 reply; 32+ messages in thread
From: Monk Liu @ 2018-02-28  7:21 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

sometimes the GPU is switched to other VFs and won't switch
back soon, so the KIQ register access will not signal within
a short period; instead of returning a TMO error we should in
fact sleep 1ms and try again later (when not in interrupt
context).

And since a retry scheme is provided, the MAX_KIQ_REG_WAIT
timeout shouldn't be set to a long time; 10ms is enough.

If the GPU is already in the reset state, don't retry the KIQ
register access, otherwise it would always hang because the KIQ
is already dead.
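
As a sketch of the retry policy described above: a bounded wait that is
retried only when retrying is safe (not in interrupt context, not during
GPU reset). The helpers below are stubs standing in for
amdgpu_fence_wait_polling(), in_interrupt() and adev->in_gpu_reset:

#include <stdbool.h>
#include <stdio.h>

#define MAX_REG_WAIT_US 10000	/* 10 ms, as in the patch */

static int attempts;

static long fence_wait_stub(long timeout_us)
{
	/* pretend the fence only signals on the third attempt */
	return ++attempts >= 3 ? 1 : 0;
}

static bool in_irq_stub(void)   { return false; }
static bool in_reset_stub(void) { return false; }

static int kiq_reg_access(void)
{
	long r;

retry:
	r = fence_wait_stub(MAX_REG_WAIT_US);
	if (r < 1) {
		printf("kiq fence wait failed, attempt %d\n", attempts);
		/* only retry when it is safe to yield and the KIQ can
		 * still come back (not in IRQ, not in GPU reset) */
		if (!in_irq_stub() && !in_reset_stub())
			goto retry;
		return -1;
	}
	return 0;
}

int main(void)
{
	int ret = kiq_reg_access();

	printf("kiq access %s after %d attempt(s)\n",
	       ret ? "failed" : "succeeded", attempts);
	return 0;
}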

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index b832651..7081eb6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -22,7 +22,7 @@
  */
 
 #include "amdgpu.h"
-#define MAX_KIQ_REG_WAIT	100000000 /* in usecs */
+#define MAX_KIQ_REG_WAIT	10000 /* in usecs, 10ms */
 
 uint64_t amdgpu_csa_vaddr(struct amdgpu_device *adev)
 {
@@ -152,9 +152,14 @@ uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
 	amdgpu_ring_commit(ring);
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 
+retry_read:
 	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
 	if (r < 1) {
 		DRM_ERROR("wait for kiq fence error: %ld\n", r);
+		if (!in_interrupt() && !adev->in_gpu_reset) {
+			schedule();
+			goto retry_read;
+		}
 		return ~0;
 	}
 	val = adev->wb.wb[adev->virt.reg_val_offs];
@@ -179,9 +184,15 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v)
 	amdgpu_ring_commit(ring);
 	spin_unlock_irqrestore(&kiq->ring_lock, flags);
 
+retry_write:
 	r = amdgpu_fence_wait_polling(ring, seq, MAX_KIQ_REG_WAIT);
-	if (r < 1)
+	if (r < 1) {
 		DRM_ERROR("wait for kiq fence error: %ld\n", r);
+		if (!in_interrupt() && !adev->in_gpu_reset) {
+			schedule();
+			goto retry_write;
+		}
+	}
 }
 
 /**
-- 
2.7.4


* RE: [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ
       [not found]     ` <1519802463-9090-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28  7:41       ` Liu, Monk
  0 siblings, 0 replies; 32+ messages in thread
From: Liu, Monk @ 2018-02-28  7:41 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Drop it, will send V2 patch


* Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
       [not found] ` <1519802463-9090-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (2 preceding siblings ...)
  2018-02-28  7:21   ` [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ Monk Liu
@ 2018-02-28 12:20   ` Christian König
       [not found]     ` <9e575c74-c6ce-76ef-a09c-1dec5a4807a3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  3 siblings, 1 reply; 32+ messages in thread
From: Christian König @ 2018-02-28 12:20 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW, Grodzovsky, Andrey

Andrey, please give this set a good round of testing as well.

On 28.02.2018 at 08:21, Monk Liu wrote:
> found recover_vram_from_shadow sometimes get executed
> in paralle with SDMA scheduler, should stop all
> schedulers before doing gpu reset/recover
>
> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

For now this patch is Reviewed-by: Christian König 
<christian.koenig@amd.com>.

Regards,
Christian.



* Re: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
       [not found]     ` <1519802463-9090-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28 12:22       ` Christian König
  2018-02-28 13:29       ` Andrey Grodzovsky
  2018-02-28 14:25       ` Alex Deucher
  2 siblings, 0 replies; 32+ messages in thread
From: Christian König @ 2018-02-28 12:22 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 28.02.2018 at 08:21, Monk Liu wrote:
> 1)create a routine "handle_vram_lost" to do the vram
> recovery, and put it into amdgpu_device_reset/reset_sriov,
> this way no need of the extra paramter to hold the
> VRAM LOST information and the related macros can be removed.
>
> 3)show vram_recover failure if time out, and set TMO equal to
> lockup_timeout if vram_recover is under SRIOV runtime mode.
>
> 4)report error if any ip reset failed for SR-IOV
>
> Change-Id: I686e2b6133844c14948c206a2315c064a78c1d9c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Looks good to me, but not 100% sure if all of this is technically correct.

Patch is Acked-by: Christian König <christian.koenig@amd.com> for now.

Andrey and/or David should probably take a look as well.

Christian.


* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]     ` <1519802463-9090-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28 12:23       ` Christian König
       [not found]         ` <01879d0d-edea-680e-c9f2-1005d94f1dfd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Christian König @ 2018-02-28 12:23 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 28.02.2018 at 08:21, Monk Liu wrote:
> because this time SDMA may under GPU RESET so its ring->ready
> may not true, keep going and GPU scheduler will reschedule
> this job if it failed.
>
> give a warning on copy_buffer when go through direct_submit
> while ring->ready is false

NAK, that test has already saved us quite a bunch of trouble with the fb 
layer.

Why exactly are you running into issues with that?

Christian.



* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]         ` <01879d0d-edea-680e-c9f2-1005d94f1dfd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-28 12:34           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449BEDBC1FCE72C0021C88184C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-02-28 12:34 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Because when the SDMA is hung by, say, process A, another process B may already be running inside fill_buffer().
So just let process B continue; don't block it, otherwise process B would fail for a purely software reason.

Let it run: even if process B's job eventually fails, GPU recovery will resubmit it (since it is a kernel job).

Without this, other processes would be greatly harmed by the one black sheep that triggered the GPU recovery.

/Monk



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018-02-28 20:24
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

On 28.02.2018 at 08:21, Monk Liu wrote:
> because this time SDMA may under GPU RESET so its ring->ready may not 
> true, keep going and GPU scheduler will reschedule this job if it 
> failed.
>
> give a warning on copy_buffer when go through direct_submit while 
> ring->ready is false

NAK, that test has already saved us quite a bunch of trouble with the fb layer.

Why exactly are you running into issues with that?

Christian.



* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]             ` <BLUPR12MB0449BEDBC1FCE72C0021C88184C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-28 12:40               ` Liu, Monk
  2018-02-28 12:45               ` Christian König
  1 sibling, 0 replies; 32+ messages in thread
From: Liu, Monk @ 2018-02-28 12:40 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

What about this for fill_buffer():
 if (ring->ready != true && !adev->in_gpu_reset) {
	return -EINVAL;
 }


/Monk


-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: 2018-02-28 20:35
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Because when SDMA was hang by like process A, and meanwhile another process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .

Let it run and finally process B's job would fail and GPU recover will repeat it again (since it is a kernel job) 

Without this solution other process will be greatly harmed by one black sheep that triggering GPU recover 

/Monk



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
Sent: 2018-02-28 20:24
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

On 28.02.2018 at 08:21, Monk Liu wrote:
> because this time SDMA may under GPU RESET so its ring->ready may not 
> true, keep going and GPU scheduler will reschedule this job if it 
> failed.
>
> give a warning on copy_buffer when go through direct_submit while
> ring->ready is false

NAK, that test has already saved us quite a bunch of trouble with the fb layer.

Why exactly are you running into issues with that?

Christian.



* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]             ` <BLUPR12MB0449BEDBC1FCE72C0021C88184C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2018-02-28 12:40               ` Liu, Monk
@ 2018-02-28 12:45               ` Christian König
       [not found]                 ` <3a8945f9-848e-dd19-373d-5dddc69f76cb-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Christian König @ 2018-02-28 12:45 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Good point, but in this case we need some other handling.

Please change the test to use ring->ring_obj instead, this way we still 
bail out if somebody tries to submit commands before the ring is even 
allocated.

And you also need to fix a couple of other places in amdgpu_ttm.c.

Regards,
Christian.
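
Purely as an illustration of the suggestion above (this is not the
actual follow-up patch, which may well look different), the early check
in amdgpu_fill_buffer() could become something like:

	/* bail out only if the ring buffer was never allocated; a ring
	 * that is merely not ready (e.g. during GPU reset) is let
	 * through so the scheduler can resubmit the job afterwards */
	if (!ring->ring_obj) {
		DRM_ERROR("Trying to clear memory with an unallocated ring.\n");
		return -EINVAL;
	}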

On 28.02.2018 at 13:34, Liu, Monk wrote:
> Because when SDMA was hang by like process A, and meanwhile another process B is already running into the code of fill_buffer()
> So just let process B continue, don't block it otherwise process B would fail by software reason .
>
> Let it run and finally process B's job would fail and GPU recover will repeat it again (since it is a kernel job)
>
> Without this solution other process will be greatly harmed by one black sheep that triggering GPU recover
>
> /Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018-02-28 20:24
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
> On 28.02.2018 at 08:21, Monk Liu wrote:
>> because this time SDMA may under GPU RESET so its ring->ready may not
>> true, keep going and GPU scheduler will reschedule this job if it
>> failed.
>>
>> give a warning on copy_buffer when go through direct_submit while
>> ring->ready is false
> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>
> Why exactly are you running into issues with that?
>
> Christian.


* Re: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
       [not found]     ` <1519802463-9090-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28 12:22       ` Christian König
@ 2018-02-28 13:29       ` Andrey Grodzovsky
       [not found]         ` <13337cd9-78e9-df36-f2ab-749cf182177b-5C7GfCeVMHo@public.gmane.org>
  2018-02-28 14:25       ` Alex Deucher
  2 siblings, 1 reply; 32+ messages in thread
From: Andrey Grodzovsky @ 2018-02-28 13:29 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW



On 02/28/2018 02:21 AM, Monk Liu wrote:
> 1)create a routine "handle_vram_lost" to do the vram
> recovery, and put it into amdgpu_device_reset/reset_sriov,
> this way no need of the extra paramter to hold the
> VRAM LOST information and the related macros can be removed.
>
> 3)show vram_recover failure if time out, and set TMO equal to
> lockup_timeout if vram_recover is under SRIOV runtime mode.
>
> 4)report error if any ip reset failed for SR-IOV
>
> Change-Id: I686e2b6133844c14948c206a2315c064a78c1d9c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 136 +++++++++++++++--------------
>   2 files changed, 72 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 5bddfc1..abbc3f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -181,10 +181,6 @@ extern int amdgpu_cik_support;
>   #define CIK_CURSOR_WIDTH 128
>   #define CIK_CURSOR_HEIGHT 128
>   
> -/* GPU RESET flags */
> -#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0)
> -#define AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
> -
>   struct amdgpu_device;
>   struct amdgpu_ib;
>   struct amdgpu_cs_parser;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e9d81a8..39ece7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1591,6 +1591,8 @@ static int amdgpu_device_ip_reinit_early_sriov(struct amdgpu_device *adev)
>   
>   			r = block->version->funcs->hw_init(adev);
>   			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, r?"failed":"successed");
> +			if (r)
> +				return r;
>   		}
>   	}
>   
> @@ -1624,6 +1626,8 @@ static int amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev)
>   
>   			r = block->version->funcs->hw_init(adev);
>   			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, r?"failed":"successed");
> +			if (r)
> +				return r;
>   		}
>   	}
>   
> @@ -2471,17 +2475,71 @@ static int amdgpu_device_recover_vram_from_shadow(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +static int amdgpu_handle_vram_lost(struct amdgpu_device *adev)
> +{
> +	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> +	struct amdgpu_bo *bo, *tmp;
> +	struct dma_fence *fence = NULL, *next = NULL;
> +	long r = 1;
> +	int i = 0;
> +	long tmo;
> +
> +	if (amdgpu_sriov_runtime(adev))
> +		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
> +	else
> +		tmo = msecs_to_jiffies(100);
> +
> +	DRM_INFO("recover vram bo from shadow start\n");
> +	mutex_lock(&adev->shadow_list_lock);
> +	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> +		next = NULL;
> +		amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
> +		if (fence) {
> +			r = dma_fence_wait_timeout(fence, false, tmo);
> +			if (r == 0)
> +				pr_err("wait fence %p[%d] timeout\n", fence, i);
> +			else if (r < 0)
> +				pr_err("wait fence %p[%d] interrupted\n", fence, i);
> +			if (r < 1) {
> +				dma_fence_put(fence);
> +				fence = next;
> +				break;
> +			}
> +			i++;
> +		}
> +
> +		dma_fence_put(fence);
> +		fence = next;
> +	}
> +	mutex_unlock(&adev->shadow_list_lock);
> +
> +	if (fence) {
> +		r = dma_fence_wait_timeout(fence, false, tmo);
> +		if (r == 0)
> +			pr_err("wait fence %p[%d] timeout\n", fence, i);
> +		else if (r < 0)
> +			pr_err("wait fence %p[%d] interrupted\n", fence, i);
> +
> +	}
> +	dma_fence_put(fence);
> +
> +	if (r > 0)
> +		DRM_INFO("recover vram bo from shadow done\n");
> +	else
> +		DRM_ERROR("recover vram bo from shadow failed\n");
> +
> +	return (r > 0?0:1);
> +}
> +
>   /*
>    * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>    *
>    * @adev: amdgpu device pointer
> - * @reset_flags: output param tells caller the reset result
>    *
>    * attempt to do soft-reset or full-reset and reinitialize Asic
>    * return 0 means successed otherwise failed
>   */
> -static int amdgpu_device_reset(struct amdgpu_device *adev,
> -			       uint64_t* reset_flags)
> +static int amdgpu_device_reset(struct amdgpu_device *adev)
>   {
>   	bool need_full_reset, vram_lost = 0;
>   	int r;
> @@ -2496,7 +2554,6 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>   			DRM_INFO("soft reset failed, will fallback to full reset!\n");
>   			need_full_reset = true;
>   		}
> -
>   	}
>   
>   	if (need_full_reset) {
> @@ -2545,13 +2602,8 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>   		}
>   	}
>   
> -	if (reset_flags) {
> -		if (vram_lost)
> -			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> -
> -		if (need_full_reset)
> -			(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> -	}
> +	if (!r && ((need_full_reset && !(adev->flags & AMD_IS_APU)) || vram_lost))
> +		r = amdgpu_handle_vram_lost(adev);
>   
>   	return r;
>   }
> @@ -2560,14 +2612,11 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>    * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
>    *
>    * @adev: amdgpu device pointer
> - * @reset_flags: output param tells caller the reset result
>    *
>    * do VF FLR and reinitialize Asic
>    * return 0 means successed otherwise failed
>   */
> -static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> -				     uint64_t *reset_flags,
> -				     bool from_hypervisor)
> +static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, bool from_hypervisor)
>   {
>   	int r;
>   
> @@ -2588,28 +2637,20 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>   
>   	/* now we are okay to resume SMC/CP/SDMA */
>   	r = amdgpu_device_ip_reinit_late_sriov(adev);
> +	amdgpu_virt_release_full_gpu(adev, true);
>   	if (r)
>   		goto error;
>   
>   	amdgpu_irq_gpu_reset_resume_helper(adev);
>   	r = amdgpu_ib_ring_tests(adev);
> -	if (r)
> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);

This two-line deletion seems unintentional.

>   
> -error:
> -	/* release full control of GPU after ib test */
> -	amdgpu_virt_release_full_gpu(adev, true);
> -
> -	if (reset_flags) {
> -		if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> -			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> -			atomic_inc(&adev->vram_lost_counter);
> -		}
> -
> -		/* VF FLR or hotlink reset is always full-reset */
> -		(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> +	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> +		atomic_inc(&adev->vram_lost_counter);
> +		r = amdgpu_handle_vram_lost(adev);
>   	}
>   
> +error:
> +
>   	return r;

Is it intentional that the VRAM lost counter is no longer incremented in
the error case?

Andrey
>   }
>   
> @@ -2673,42 +2714,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	}
>   
>   	if (amdgpu_sriov_vf(adev))
> -		r = amdgpu_device_reset_sriov(adev, &reset_flags, job ? false : true);
> +		r = amdgpu_device_reset_sriov(adev, job ? false : true);
>   	else
> -		r = amdgpu_device_reset(adev, &reset_flags);
> -
> -	if (!r) {
> -		if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
> -			(reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
> -			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> -			struct amdgpu_bo *bo, *tmp;
> -			struct dma_fence *fence = NULL, *next = NULL;
> -
> -			DRM_INFO("recover vram bo from shadow\n");
> -			mutex_lock(&adev->shadow_list_lock);
> -			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -				next = NULL;
> -				amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
> -				if (fence) {
> -					r = dma_fence_wait(fence, false);
> -					if (r) {
> -						WARN(r, "recovery from shadow isn't completed\n");
> -						break;
> -					}
> -				}
> -
> -				dma_fence_put(fence);
> -				fence = next;
> -			}
> -			mutex_unlock(&adev->shadow_list_lock);
> -			if (fence) {
> -				r = dma_fence_wait(fence, false);
> -				if (r)
> -					WARN(r, "recovery from shadow isn't completed\n");
> -			}
> -			dma_fence_put(fence);
> -		}
> -	}
> +		r = amdgpu_device_reset(adev);
>   
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];


* Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
       [not found]     ` <9e575c74-c6ce-76ef-a09c-1dec5a4807a3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-02-28 13:30       ` Andrey Grodzovsky
       [not found]         ` <e4756f8b-d8f9-9849-aad4-a23193e367f6-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Andrey Grodzovsky @ 2018-02-28 13:30 UTC (permalink / raw)
  To: christian.koenig-5C7GfCeVMHo, Monk Liu,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Will do once Monk sends V2 for [PATCH 4/4] drm/amdgpu: try again kiq
access if not in IRQ.

Andrey


On 02/28/2018 07:20 AM, Christian König wrote:
> Andrey please give this set a good testing as well.
>
> On 28.02.2018 at 08:21, Monk Liu wrote:
>> found recover_vram_from_shadow sometimes get executed
>> in paralle with SDMA scheduler, should stop all
>> schedulers before doing gpu reset/recover
>>
>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>
> For now this patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>.
>
> Regards,
> Christian.
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
       [not found]         ` <e4756f8b-d8f9-9849-aad4-a23193e367f6-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28 13:31           ` Liu, Monk
       [not found]             ` <BLUPR12MB04491C15749E9DCAB3DEF1A684C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-02-28 13:31 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian,
	amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Already sent 

-----Original Message-----
From: Grodzovsky, Andrey 
Sent: February 28, 2018 21:31
To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover

Will do once Monk sends V2 for  [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ

Andrey


On 02/28/2018 07:20 AM, Christian König wrote:
> Andrey please give this set a good testing as well.
>
> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>> found recover_vram_from_shadow sometimes get executed in paralle with 
>> SDMA scheduler, should stop all schedulers before doing gpu 
>> reset/recover
>>
>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>
> For now this patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>.
>
> Regards,
> Christian.
>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40
>> +++++++++++-------------------
>>   1 file changed, 15 insertions(+), 25 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 75d1733..e9d81a8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct 
>> amdgpu_device *adev,
>>         /* block TTM */
>>       resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>> +
>>       /* store modesetting */
>>       if (amdgpu_device_has_dc_support(adev))
>>           state = drm_atomic_helper_suspend(adev->ddev);
>>   -    /* block scheduler */
>> +    /* block all schedulers and reset given job's ring */
>>       for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>           struct amdgpu_ring *ring = adev->rings[i];
>>             if (!ring || !ring->sched.thread)
>>               continue;
>>   -        /* only focus on the ring hit timeout if &job not NULL */
>> +        kthread_park(ring->sched.thread);
>> +
>>           if (job && job->ring->idx != i)
>>               continue;
>>   -        kthread_park(ring->sched.thread);
>>           drm_sched_hw_job_reset(&ring->sched, &job->base);
>>             /* after all hw jobs are reset, hw fence is meaningless, 
>> so force_completion */ @@ -2707,33 +2708,22 @@ int 
>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>               }
>>               dma_fence_put(fence);
>>           }
>> +    }
>>   -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> -            struct amdgpu_ring *ring = adev->rings[i];
>> -
>> -            if (!ring || !ring->sched.thread)
>> -                continue;
>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> +        struct amdgpu_ring *ring = adev->rings[i];
>>   -            /* only focus on the ring hit timeout if &job not NULL 
>> */
>> -            if (job && job->ring->idx != i)
>> -                continue;
>> +        if (!ring || !ring->sched.thread)
>> +            continue;
>>   +        /* only need recovery sched of the given job's ring
>> +         * or all rings (in the case @job is NULL)
>> +         * after above amdgpu_reset accomplished
>> +         */
>> +        if ((!job || job->ring->idx == i) && !r)
>>               drm_sched_job_recovery(&ring->sched);
>> -            kthread_unpark(ring->sched.thread);
>> -        }
>> -    } else {
>> -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> -            struct amdgpu_ring *ring = adev->rings[i];
>>   -            if (!ring || !ring->sched.thread)
>> -                continue;
>> -
>> -            /* only focus on the ring hit timeout if &job not NULL 
>> */
>> -            if (job && job->ring->idx != i)
>> -                continue;
>> -
>> -            kthread_unpark(adev->rings[i]->sched.thread);
>> -        }
>> +        kthread_unpark(ring->sched.thread);
>>       }
>>         if (amdgpu_device_has_dc_support(adev)) {
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
       [not found]         ` <13337cd9-78e9-df36-f2ab-749cf182177b-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28 13:36           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449D9513F4798028569ED0884C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-02-28 13:36 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

>   		goto error;
>   
>   	amdgpu_irq_gpu_reset_resume_helper(adev);
>   	r = amdgpu_ib_ring_tests(adev);
> -	if (r)
> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);

This 2-line deletion seems unintentional

No, it is intentional: if the IB test fails, the error is already reported by the faulting IP block.

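For reference, the per-ring reporting referred to here already happens inside
amdgpu_ib_ring_tests(); a paraphrased sketch of it (not the verbatim kernel
source, the message strings are approximate):

	int amdgpu_ib_ring_tests(struct amdgpu_device *adev)
	{
		int i, r, ret = 0;

		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
			struct amdgpu_ring *ring = adev->rings[i];

			if (!ring || !ring->ready)
				continue;

			r = amdgpu_ring_test_ib(ring, AMDGPU_IB_TEST_TIMEOUT);
			if (r) {
				/* the failing IP's ring is named right here */
				ring->ready = false;
				ret = r;
				DRM_ERROR("amdgpu: failed testing IB on ring %d (%d).\n",
					  i, r);
			}
		}
		return ret;
	}

so the dev_err() removed from the SR-IOV reset path only duplicated that message.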

> +	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> +		atomic_inc(&adev->vram_lost_counter);
> +		r = amdgpu_handle_vram_lost(adev);
>   	}
>   
> +error:
> +
>   	return r;

Is it intentional that the VRAM lost counter is no longer incremented in the error case?


[ML] Yes, there is no need to increase the counter: if the GPU reset itself failed, everything is meaningless anyway because the driver can no longer be used.



-----Original Message-----
From: Grodzovsky, Andrey 
Sent: February 28, 2018 21:29
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling



On 02/28/2018 02:21 AM, Monk Liu wrote:
> 1)create a routine "handle_vram_lost" to do the vram recovery, and put 
> it into amdgpu_device_reset/reset_sriov, this way no need of the extra 
> paramter to hold the VRAM LOST information and the related macros can 
> be removed.
>
> 3)show vram_recover failure if time out, and set TMO equal to 
> lockup_timeout if vram_recover is under SRIOV runtime mode.
>
> 4)report error if any ip reset failed for SR-IOV
>
> Change-Id: I686e2b6133844c14948c206a2315c064a78c1d9c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 -
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 136 +++++++++++++++--------------
>   2 files changed, 72 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 5bddfc1..abbc3f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -181,10 +181,6 @@ extern int amdgpu_cik_support;
>   #define CIK_CURSOR_WIDTH 128
>   #define CIK_CURSOR_HEIGHT 128
>   
> -/* GPU RESET flags */
> -#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0) -#define 
> AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
> -
>   struct amdgpu_device;
>   struct amdgpu_ib;
>   struct amdgpu_cs_parser;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e9d81a8..39ece7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1591,6 +1591,8 @@ static int 
> amdgpu_device_ip_reinit_early_sriov(struct amdgpu_device *adev)
>   
>   			r = block->version->funcs->hw_init(adev);
>   			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, 
> r?"failed":"successed");
> +			if (r)
> +				return r;
>   		}
>   	}
>   
> @@ -1624,6 +1626,8 @@ static int 
> amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev)
>   
>   			r = block->version->funcs->hw_init(adev);
>   			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, 
> r?"failed":"successed");
> +			if (r)
> +				return r;
>   		}
>   	}
>   
> @@ -2471,17 +2475,71 @@ static int amdgpu_device_recover_vram_from_shadow(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> +static int amdgpu_handle_vram_lost(struct amdgpu_device *adev) {
> +	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> +	struct amdgpu_bo *bo, *tmp;
> +	struct dma_fence *fence = NULL, *next = NULL;
> +	long r = 1;
> +	int i = 0;
> +	long tmo;
> +
> +	if (amdgpu_sriov_runtime(adev))
> +		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
> +	else
> +		tmo = msecs_to_jiffies(100);
> +
> +	DRM_INFO("recover vram bo from shadow start\n");
> +	mutex_lock(&adev->shadow_list_lock);
> +	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> +		next = NULL;
> +		amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
> +		if (fence) {
> +			r = dma_fence_wait_timeout(fence, false, tmo);
> +			if (r == 0)
> +				pr_err("wait fence %p[%d] timeout\n", fence, i);
> +			else if (r < 0)
> +				pr_err("wait fence %p[%d] interrupted\n", fence, i);
> +			if (r < 1) {
> +				dma_fence_put(fence);
> +				fence = next;
> +				break;
> +			}
> +			i++;
> +		}
> +
> +		dma_fence_put(fence);
> +		fence = next;
> +	}
> +	mutex_unlock(&adev->shadow_list_lock);
> +
> +	if (fence) {
> +		r = dma_fence_wait_timeout(fence, false, tmo);
> +		if (r == 0)
> +			pr_err("wait fence %p[%d] timeout\n", fence, i);
> +		else if (r < 0)
> +			pr_err("wait fence %p[%d] interrupted\n", fence, i);
> +
> +	}
> +	dma_fence_put(fence);
> +
> +	if (r > 0)
> +		DRM_INFO("recover vram bo from shadow done\n");
> +	else
> +		DRM_ERROR("recover vram bo from shadow failed\n");
> +
> +	return (r > 0?0:1);
> +}
> +
>   /*
>    * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>    *
>    * @adev: amdgpu device pointer
> - * @reset_flags: output param tells caller the reset result
>    *
>    * attempt to do soft-reset or full-reset and reinitialize Asic
>    * return 0 means successed otherwise failed
>   */
> -static int amdgpu_device_reset(struct amdgpu_device *adev,
> -			       uint64_t* reset_flags)
> +static int amdgpu_device_reset(struct amdgpu_device *adev)
>   {
>   	bool need_full_reset, vram_lost = 0;
>   	int r;
> @@ -2496,7 +2554,6 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>   			DRM_INFO("soft reset failed, will fallback to full reset!\n");
>   			need_full_reset = true;
>   		}
> -
>   	}
>   
>   	if (need_full_reset) {
> @@ -2545,13 +2602,8 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>   		}
>   	}
>   
> -	if (reset_flags) {
> -		if (vram_lost)
> -			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> -
> -		if (need_full_reset)
> -			(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> -	}
> +	if (!r && ((need_full_reset && !(adev->flags & AMD_IS_APU)) || vram_lost))
> +		r = amdgpu_handle_vram_lost(adev);
>   
>   	return r;
>   }
> @@ -2560,14 +2612,11 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>    * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
>    *
>    * @adev: amdgpu device pointer
> - * @reset_flags: output param tells caller the reset result
>    *
>    * do VF FLR and reinitialize Asic
>    * return 0 means successed otherwise failed
>   */
> -static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> -				     uint64_t *reset_flags,
> -				     bool from_hypervisor)
> +static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, bool 
> +from_hypervisor)
>   {
>   	int r;
>   
> @@ -2588,28 +2637,20 @@ static int amdgpu_device_reset_sriov(struct 
> amdgpu_device *adev,
>   
>   	/* now we are okay to resume SMC/CP/SDMA */
>   	r = amdgpu_device_ip_reinit_late_sriov(adev);
> +	amdgpu_virt_release_full_gpu(adev, true);
>   	if (r)
>   		goto error;
>   
>   	amdgpu_irq_gpu_reset_resume_helper(adev);
>   	r = amdgpu_ib_ring_tests(adev);
> -	if (r)
> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);

This 2 lines deletion seems unintentional

>   
> -error:
> -	/* release full control of GPU after ib test */
> -	amdgpu_virt_release_full_gpu(adev, true);
> -
> -	if (reset_flags) {
> -		if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> -			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> -			atomic_inc(&adev->vram_lost_counter);
> -		}
> -
> -		/* VF FLR or hotlink reset is always full-reset */
> -		(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> +	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> +		atomic_inc(&adev->vram_lost_counter);
> +		r = amdgpu_handle_vram_lost(adev);
>   	}
>   
> +error:
> +
>   	return r;

Not doing incrementation of VRAM lost counter in case of error anymore is intentional ?

Andrey
>   }
>   
> @@ -2673,42 +2714,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	}
>   
>   	if (amdgpu_sriov_vf(adev))
> -		r = amdgpu_device_reset_sriov(adev, &reset_flags, job ? false : true);
> +		r = amdgpu_device_reset_sriov(adev, job ? false : true);
>   	else
> -		r = amdgpu_device_reset(adev, &reset_flags);
> -
> -	if (!r) {
> -		if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
> -			(reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
> -			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> -			struct amdgpu_bo *bo, *tmp;
> -			struct dma_fence *fence = NULL, *next = NULL;
> -
> -			DRM_INFO("recover vram bo from shadow\n");
> -			mutex_lock(&adev->shadow_list_lock);
> -			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -				next = NULL;
> -				amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
> -				if (fence) {
> -					r = dma_fence_wait(fence, false);
> -					if (r) {
> -						WARN(r, "recovery from shadow isn't completed\n");
> -						break;
> -					}
> -				}
> -
> -				dma_fence_put(fence);
> -				fence = next;
> -			}
> -			mutex_unlock(&adev->shadow_list_lock);
> -			if (fence) {
> -				r = dma_fence_wait(fence, false);
> -				if (r)
> -					WARN(r, "recovery from shadow isn't completed\n");
> -			}
> -			dma_fence_put(fence);
> -		}
> -	}
> +		r = amdgpu_device_reset(adev);
>   
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
       [not found]             ` <BLUPR12MB0449D9513F4798028569ED0884C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-28 13:41               ` Andrey Grodzovsky
  0 siblings, 0 replies; 32+ messages in thread
From: Andrey Grodzovsky @ 2018-02-28 13:41 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Patch looks good to me then.

Acked-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>


Andrey


On 02/28/2018 08:36 AM, Liu, Monk wrote:
>>    		goto error;
>>    
>>    	amdgpu_irq_gpu_reset_resume_helper(adev);
>>    	r = amdgpu_ib_ring_tests(adev);
>> -	if (r)
>> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> This 2 lines deletion seems unintentional
>
> No, intentional, because if IB TEST failed the error will be reported by the fault IP
>
>
>> +	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>> +		atomic_inc(&adev->vram_lost_counter);
>> +		r = amdgpu_handle_vram_lost(adev);
>>    	}
>>    
>> +error:
>> +
>>    	return r;
> Not doing incrementation of VRAM lost counter in case of error anymore is intentional ?
>
>
> [ML] yeah, no need to increase the counter because GPU reset is failed so everything is meaningless, driver cannot used anymore
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey
> Sent: February 28, 2018 21:29
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
>
>
>
> On 02/28/2018 02:21 AM, Monk Liu wrote:
>> 1)create a routine "handle_vram_lost" to do the vram recovery, and put
>> it into amdgpu_device_reset/reset_sriov, this way no need of the extra
>> paramter to hold the VRAM LOST information and the related macros can
>> be removed.
>>
>> 3)show vram_recover failure if time out, and set TMO equal to
>> lockup_timeout if vram_recover is under SRIOV runtime mode.
>>
>> 4)report error if any ip reset failed for SR-IOV
>>
>> Change-Id: I686e2b6133844c14948c206a2315c064a78c1d9c
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 -
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 136 +++++++++++++++--------------
>>    2 files changed, 72 insertions(+), 68 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 5bddfc1..abbc3f1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -181,10 +181,6 @@ extern int amdgpu_cik_support;
>>    #define CIK_CURSOR_WIDTH 128
>>    #define CIK_CURSOR_HEIGHT 128
>>    
>> -/* GPU RESET flags */
>> -#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0) -#define
>> AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
>> -
>>    struct amdgpu_device;
>>    struct amdgpu_ib;
>>    struct amdgpu_cs_parser;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index e9d81a8..39ece7f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -1591,6 +1591,8 @@ static int
>> amdgpu_device_ip_reinit_early_sriov(struct amdgpu_device *adev)
>>    
>>    			r = block->version->funcs->hw_init(adev);
>>    			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name,
>> r?"failed":"successed");
>> +			if (r)
>> +				return r;
>>    		}
>>    	}
>>    
>> @@ -1624,6 +1626,8 @@ static int
>> amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev)
>>    
>>    			r = block->version->funcs->hw_init(adev);
>>    			DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name,
>> r?"failed":"successed");
>> +			if (r)
>> +				return r;
>>    		}
>>    	}
>>    
>> @@ -2471,17 +2475,71 @@ static int amdgpu_device_recover_vram_from_shadow(struct amdgpu_device *adev,
>>    	return r;
>>    }
>>    
>> +static int amdgpu_handle_vram_lost(struct amdgpu_device *adev) {
>> +	struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
>> +	struct amdgpu_bo *bo, *tmp;
>> +	struct dma_fence *fence = NULL, *next = NULL;
>> +	long r = 1;
>> +	int i = 0;
>> +	long tmo;
>> +
>> +	if (amdgpu_sriov_runtime(adev))
>> +		tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
>> +	else
>> +		tmo = msecs_to_jiffies(100);
>> +
>> +	DRM_INFO("recover vram bo from shadow start\n");
>> +	mutex_lock(&adev->shadow_list_lock);
>> +	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
>> +		next = NULL;
>> +		amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
>> +		if (fence) {
>> +			r = dma_fence_wait_timeout(fence, false, tmo);
>> +			if (r == 0)
>> +				pr_err("wait fence %p[%d] timeout\n", fence, i);
>> +			else if (r < 0)
>> +				pr_err("wait fence %p[%d] interrupted\n", fence, i);
>> +			if (r < 1) {
>> +				dma_fence_put(fence);
>> +				fence = next;
>> +				break;
>> +			}
>> +			i++;
>> +		}
>> +
>> +		dma_fence_put(fence);
>> +		fence = next;
>> +	}
>> +	mutex_unlock(&adev->shadow_list_lock);
>> +
>> +	if (fence) {
>> +		r = dma_fence_wait_timeout(fence, false, tmo);
>> +		if (r == 0)
>> +			pr_err("wait fence %p[%d] timeout\n", fence, i);
>> +		else if (r < 0)
>> +			pr_err("wait fence %p[%d] interrupted\n", fence, i);
>> +
>> +	}
>> +	dma_fence_put(fence);
>> +
>> +	if (r > 0)
>> +		DRM_INFO("recover vram bo from shadow done\n");
>> +	else
>> +		DRM_ERROR("recover vram bo from shadow failed\n");
>> +
>> +	return (r > 0?0:1);
>> +}
>> +
>>    /*
>>     * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>>     *
>>     * @adev: amdgpu device pointer
>> - * @reset_flags: output param tells caller the reset result
>>     *
>>     * attempt to do soft-reset or full-reset and reinitialize Asic
>>     * return 0 means successed otherwise failed
>>    */
>> -static int amdgpu_device_reset(struct amdgpu_device *adev,
>> -			       uint64_t* reset_flags)
>> +static int amdgpu_device_reset(struct amdgpu_device *adev)
>>    {
>>    	bool need_full_reset, vram_lost = 0;
>>    	int r;
>> @@ -2496,7 +2554,6 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>>    			DRM_INFO("soft reset failed, will fallback to full reset!\n");
>>    			need_full_reset = true;
>>    		}
>> -
>>    	}
>>    
>>    	if (need_full_reset) {
>> @@ -2545,13 +2602,8 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>>    		}
>>    	}
>>    
>> -	if (reset_flags) {
>> -		if (vram_lost)
>> -			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
>> -
>> -		if (need_full_reset)
>> -			(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
>> -	}
>> +	if (!r && ((need_full_reset && !(adev->flags & AMD_IS_APU)) || vram_lost))
>> +		r = amdgpu_handle_vram_lost(adev);
>>    
>>    	return r;
>>    }
>> @@ -2560,14 +2612,11 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>>     * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
>>     *
>>     * @adev: amdgpu device pointer
>> - * @reset_flags: output param tells caller the reset result
>>     *
>>     * do VF FLR and reinitialize Asic
>>     * return 0 means successed otherwise failed
>>    */
>> -static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>> -				     uint64_t *reset_flags,
>> -				     bool from_hypervisor)
>> +static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, bool
>> +from_hypervisor)
>>    {
>>    	int r;
>>    
>> @@ -2588,28 +2637,20 @@ static int amdgpu_device_reset_sriov(struct
>> amdgpu_device *adev,
>>    
>>    	/* now we are okay to resume SMC/CP/SDMA */
>>    	r = amdgpu_device_ip_reinit_late_sriov(adev);
>> +	amdgpu_virt_release_full_gpu(adev, true);
>>    	if (r)
>>    		goto error;
>>    
>>    	amdgpu_irq_gpu_reset_resume_helper(adev);
>>    	r = amdgpu_ib_ring_tests(adev);
>> -	if (r)
>> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> This 2 lines deletion seems unintentional
>
>>    
>> -error:
>> -	/* release full control of GPU after ib test */
>> -	amdgpu_virt_release_full_gpu(adev, true);
>> -
>> -	if (reset_flags) {
>> -		if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>> -			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
>> -			atomic_inc(&adev->vram_lost_counter);
>> -		}
>> -
>> -		/* VF FLR or hotlink reset is always full-reset */
>> -		(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
>> +	if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
>> +		atomic_inc(&adev->vram_lost_counter);
>> +		r = amdgpu_handle_vram_lost(adev);
>>    	}
>>    
>> +error:
>> +
>>    	return r;
> Not doing incrementation of VRAM lost counter in case of error anymore is intentional ?
>
> Andrey
>>    }
>>    
>> @@ -2673,42 +2714,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	}
>>    
>>    	if (amdgpu_sriov_vf(adev))
>> -		r = amdgpu_device_reset_sriov(adev, &reset_flags, job ? false : true);
>> +		r = amdgpu_device_reset_sriov(adev, job ? false : true);
>>    	else
>> -		r = amdgpu_device_reset(adev, &reset_flags);
>> -
>> -	if (!r) {
>> -		if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
>> -			(reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
>> -			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
>> -			struct amdgpu_bo *bo, *tmp;
>> -			struct dma_fence *fence = NULL, *next = NULL;
>> -
>> -			DRM_INFO("recover vram bo from shadow\n");
>> -			mutex_lock(&adev->shadow_list_lock);
>> -			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
>> -				next = NULL;
>> -				amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
>> -				if (fence) {
>> -					r = dma_fence_wait(fence, false);
>> -					if (r) {
>> -						WARN(r, "recovery from shadow isn't completed\n");
>> -						break;
>> -					}
>> -				}
>> -
>> -				dma_fence_put(fence);
>> -				fence = next;
>> -			}
>> -			mutex_unlock(&adev->shadow_list_lock);
>> -			if (fence) {
>> -				r = dma_fence_wait(fence, false);
>> -				if (r)
>> -					WARN(r, "recovery from shadow isn't completed\n");
>> -			}
>> -			dma_fence_put(fence);
>> -		}
>> -	}
>> +		r = amdgpu_device_reset(adev);
>>    
>>    	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    		struct amdgpu_ring *ring = adev->rings[i];

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
       [not found]             ` <BLUPR12MB04491C15749E9DCAB3DEF1A684C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-02-28 13:42               ` Andrey Grodzovsky
  2018-02-28 16:40               ` Andrey Grodzovsky
  1 sibling, 0 replies; 32+ messages in thread
From: Andrey Grodzovsky @ 2018-02-28 13:42 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Found it.

Thanks.


On 02/28/2018 08:31 AM, Liu, Monk wrote:
> Already sent
>
> -----Original Message-----
> From: Grodzovsky, Andrey
> Sent: February 28, 2018 21:31
> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
>
> Will do once Monk sends V2 for  [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ
>
> Andrey
>
>
> On 02/28/2018 07:20 AM, Christian König wrote:
>> Andrey please give this set a good testing as well.
>>
>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>> found recover_vram_from_shadow sometimes get executed in paralle with
>>> SDMA scheduler, should stop all schedulers before doing gpu
>>> reset/recover
>>>
>>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> For now this patch is Reviewed-by: Christian König
>> <christian.koenig@amd.com>.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40
>>> +++++++++++-------------------
>>>    1 file changed, 15 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 75d1733..e9d81a8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct
>>> amdgpu_device *adev,
>>>          /* block TTM */
>>>        resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>> +
>>>        /* store modesetting */
>>>        if (amdgpu_device_has_dc_support(adev))
>>>            state = drm_atomic_helper_suspend(adev->ddev);
>>>    -    /* block scheduler */
>>> +    /* block all schedulers and reset given job's ring */
>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>            struct amdgpu_ring *ring = adev->rings[i];
>>>              if (!ring || !ring->sched.thread)
>>>                continue;
>>>    -        /* only focus on the ring hit timeout if &job not NULL */
>>> +        kthread_park(ring->sched.thread);
>>> +
>>>            if (job && job->ring->idx != i)
>>>                continue;
>>>    -        kthread_park(ring->sched.thread);
>>>            drm_sched_hw_job_reset(&ring->sched, &job->base);
>>>              /* after all hw jobs are reset, hw fence is meaningless,
>>> so force_completion */ @@ -2707,33 +2708,22 @@ int
>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>                }
>>>                dma_fence_put(fence);
>>>            }
>>> +    }
>>>    -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -            struct amdgpu_ring *ring = adev->rings[i];
>>> -
>>> -            if (!ring || !ring->sched.thread)
>>> -                continue;
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>    -            /* only focus on the ring hit timeout if &job not NULL
>>> */
>>> -            if (job && job->ring->idx != i)
>>> -                continue;
>>> +        if (!ring || !ring->sched.thread)
>>> +            continue;
>>>    +        /* only need recovery sched of the given job's ring
>>> +         * or all rings (in the case @job is NULL)
>>> +         * after above amdgpu_reset accomplished
>>> +         */
>>> +        if ((!job || job->ring->idx == i) && !r)
>>>                drm_sched_job_recovery(&ring->sched);
>>> -            kthread_unpark(ring->sched.thread);
>>> -        }
>>> -    } else {
>>> -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -            struct amdgpu_ring *ring = adev->rings[i];
>>>    -            if (!ring || !ring->sched.thread)
>>> -                continue;
>>> -
>>> -            /* only focus on the ring hit timeout if &job not NULL
>>> */
>>> -            if (job && job->ring->idx != i)
>>> -                continue;
>>> -
>>> -            kthread_unpark(adev->rings[i]->sched.thread);
>>> -        }
>>> +        kthread_unpark(ring->sched.thread);
>>>        }
>>>          if (amdgpu_device_has_dc_support(adev)) {

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling
       [not found]     ` <1519802463-9090-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2018-02-28 12:22       ` Christian König
  2018-02-28 13:29       ` Andrey Grodzovsky
@ 2018-02-28 14:25       ` Alex Deucher
  2 siblings, 0 replies; 32+ messages in thread
From: Alex Deucher @ 2018-02-28 14:25 UTC (permalink / raw)
  To: Monk Liu; +Cc: amd-gfx list

On Wed, Feb 28, 2018 at 2:21 AM, Monk Liu <Monk.Liu@amd.com> wrote:
> 1)create a routine "handle_vram_lost" to do the vram
> recovery, and put it into amdgpu_device_reset/reset_sriov,
> this way no need of the extra paramter to hold the
> VRAM LOST information and the related macros can be removed.
>
> 3)show vram_recover failure if time out, and set TMO equal to
> lockup_timeout if vram_recover is under SRIOV runtime mode.
>
> 4)report error if any ip reset failed for SR-IOV
>
> Change-Id: I686e2b6133844c14948c206a2315c064a78c1d9c
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 -
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 136 +++++++++++++++--------------
>  2 files changed, 72 insertions(+), 68 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 5bddfc1..abbc3f1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -181,10 +181,6 @@ extern int amdgpu_cik_support;
>  #define CIK_CURSOR_WIDTH 128
>  #define CIK_CURSOR_HEIGHT 128
>
> -/* GPU RESET flags */
> -#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0)
> -#define AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
> -
>  struct amdgpu_device;
>  struct amdgpu_ib;
>  struct amdgpu_cs_parser;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index e9d81a8..39ece7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -1591,6 +1591,8 @@ static int amdgpu_device_ip_reinit_early_sriov(struct amdgpu_device *adev)
>
>                         r = block->version->funcs->hw_init(adev);
>                         DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, r?"failed":"successed");
> +                       if (r)
> +                               return r;
>                 }
>         }
>
> @@ -1624,6 +1626,8 @@ static int amdgpu_device_ip_reinit_late_sriov(struct amdgpu_device *adev)
>
>                         r = block->version->funcs->hw_init(adev);
>                         DRM_INFO("RE-INIT: %s %s\n", block->version->funcs->name, r?"failed":"successed");
> +                       if (r)
> +                               return r;
>                 }
>         }
>
> @@ -2471,17 +2475,71 @@ static int amdgpu_device_recover_vram_from_shadow(struct amdgpu_device *adev,
>         return r;
>  }
>
> +static int amdgpu_handle_vram_lost(struct amdgpu_device *adev)

Please rename this function to amdgpu_device_handle_vram_lost() for
consistency with other functions in the file.

Alex


> +{
> +       struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> +       struct amdgpu_bo *bo, *tmp;
> +       struct dma_fence *fence = NULL, *next = NULL;
> +       long r = 1;
> +       int i = 0;
> +       long tmo;
> +
> +       if (amdgpu_sriov_runtime(adev))
> +               tmo = msecs_to_jiffies(amdgpu_lockup_timeout);
> +       else
> +               tmo = msecs_to_jiffies(100);
> +
> +       DRM_INFO("recover vram bo from shadow start\n");
> +       mutex_lock(&adev->shadow_list_lock);
> +       list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> +               next = NULL;
> +               amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
> +               if (fence) {
> +                       r = dma_fence_wait_timeout(fence, false, tmo);
> +                       if (r == 0)
> +                               pr_err("wait fence %p[%d] timeout\n", fence, i);
> +                       else if (r < 0)
> +                               pr_err("wait fence %p[%d] interrupted\n", fence, i);
> +                       if (r < 1) {
> +                               dma_fence_put(fence);
> +                               fence = next;
> +                               break;
> +                       }
> +                       i++;
> +               }
> +
> +               dma_fence_put(fence);
> +               fence = next;
> +       }
> +       mutex_unlock(&adev->shadow_list_lock);
> +
> +       if (fence) {
> +               r = dma_fence_wait_timeout(fence, false, tmo);
> +               if (r == 0)
> +                       pr_err("wait fence %p[%d] timeout\n", fence, i);
> +               else if (r < 0)
> +                       pr_err("wait fence %p[%d] interrupted\n", fence, i);
> +
> +       }
> +       dma_fence_put(fence);
> +
> +       if (r > 0)
> +               DRM_INFO("recover vram bo from shadow done\n");
> +       else
> +               DRM_ERROR("recover vram bo from shadow failed\n");
> +
> +       return (r > 0?0:1);
> +}
> +
>  /*
>   * amdgpu_device_reset - reset ASIC/GPU for bare-metal or passthrough
>   *
>   * @adev: amdgpu device pointer
> - * @reset_flags: output param tells caller the reset result
>   *
>   * attempt to do soft-reset or full-reset and reinitialize Asic
>   * return 0 means successed otherwise failed
>  */
> -static int amdgpu_device_reset(struct amdgpu_device *adev,
> -                              uint64_t* reset_flags)
> +static int amdgpu_device_reset(struct amdgpu_device *adev)
>  {
>         bool need_full_reset, vram_lost = 0;
>         int r;
> @@ -2496,7 +2554,6 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>                         DRM_INFO("soft reset failed, will fallback to full reset!\n");
>                         need_full_reset = true;
>                 }
> -
>         }
>
>         if (need_full_reset) {
> @@ -2545,13 +2602,8 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>                 }
>         }
>
> -       if (reset_flags) {
> -               if (vram_lost)
> -                       (*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> -
> -               if (need_full_reset)
> -                       (*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> -       }
> +       if (!r && ((need_full_reset && !(adev->flags & AMD_IS_APU)) || vram_lost))
> +               r = amdgpu_handle_vram_lost(adev);
>
>         return r;
>  }
> @@ -2560,14 +2612,11 @@ static int amdgpu_device_reset(struct amdgpu_device *adev,
>   * amdgpu_device_reset_sriov - reset ASIC for SR-IOV vf
>   *
>   * @adev: amdgpu device pointer
> - * @reset_flags: output param tells caller the reset result
>   *
>   * do VF FLR and reinitialize Asic
>   * return 0 means successed otherwise failed
>  */
> -static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
> -                                    uint64_t *reset_flags,
> -                                    bool from_hypervisor)
> +static int amdgpu_device_reset_sriov(struct amdgpu_device *adev, bool from_hypervisor)
>  {
>         int r;
>
> @@ -2588,28 +2637,20 @@ static int amdgpu_device_reset_sriov(struct amdgpu_device *adev,
>
>         /* now we are okay to resume SMC/CP/SDMA */
>         r = amdgpu_device_ip_reinit_late_sriov(adev);
> +       amdgpu_virt_release_full_gpu(adev, true);
>         if (r)
>                 goto error;
>
>         amdgpu_irq_gpu_reset_resume_helper(adev);
>         r = amdgpu_ib_ring_tests(adev);
> -       if (r)
> -               dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
>
> -error:
> -       /* release full control of GPU after ib test */
> -       amdgpu_virt_release_full_gpu(adev, true);
> -
> -       if (reset_flags) {
> -               if (adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> -                       (*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> -                       atomic_inc(&adev->vram_lost_counter);
> -               }
> -
> -               /* VF FLR or hotlink reset is always full-reset */
> -               (*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> +       if (!r && adev->virt.gim_feature & AMDGIM_FEATURE_GIM_FLR_VRAMLOST) {
> +               atomic_inc(&adev->vram_lost_counter);
> +               r = amdgpu_handle_vram_lost(adev);
>         }
>
> +error:
> +
>         return r;
>  }
>
> @@ -2673,42 +2714,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>         }
>
>         if (amdgpu_sriov_vf(adev))
> -               r = amdgpu_device_reset_sriov(adev, &reset_flags, job ? false : true);
> +               r = amdgpu_device_reset_sriov(adev, job ? false : true);
>         else
> -               r = amdgpu_device_reset(adev, &reset_flags);
> -
> -       if (!r) {
> -               if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
> -                       (reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
> -                       struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> -                       struct amdgpu_bo *bo, *tmp;
> -                       struct dma_fence *fence = NULL, *next = NULL;
> -
> -                       DRM_INFO("recover vram bo from shadow\n");
> -                       mutex_lock(&adev->shadow_list_lock);
> -                       list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -                               next = NULL;
> -                               amdgpu_device_recover_vram_from_shadow(adev, ring, bo, &next);
> -                               if (fence) {
> -                                       r = dma_fence_wait(fence, false);
> -                                       if (r) {
> -                                               WARN(r, "recovery from shadow isn't completed\n");
> -                                               break;
> -                                       }
> -                               }
> -
> -                               dma_fence_put(fence);
> -                               fence = next;
> -                       }
> -                       mutex_unlock(&adev->shadow_list_lock);
> -                       if (fence) {
> -                               r = dma_fence_wait(fence, false);
> -                               if (r)
> -                                       WARN(r, "recovery from shadow isn't completed\n");
> -                       }
> -                       dma_fence_put(fence);
> -               }
> -       }
> +               r = amdgpu_device_reset(adev);
>
>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                 struct amdgpu_ring *ring = adev->rings[i];
> --
> 2.7.4
>
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
       [not found]             ` <BLUPR12MB04491C15749E9DCAB3DEF1A684C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2018-02-28 13:42               ` Andrey Grodzovsky
@ 2018-02-28 16:40               ` Andrey Grodzovsky
       [not found]                 ` <a54b5a4f-a370-87ec-7bac-33e6036107f9-5C7GfCeVMHo@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Andrey Grodzovsky @ 2018-02-28 16:40 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

No new issues found with those patches while testing GPU reset using the 
libdrm deadlock detection test on Ellesmere.

The patches are Tested-By: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

P.S

Noticed existing issues (before Monk's patches)

Multiple [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize 
parser !

And an occasional unlock imbalance from amdgpu_cs_ioctl

DEBUG_LOCKS_WARN_ON(depth <= 0)
[   93.069011 <    0.000017>] WARNING: CPU: 3 PID: 2215 at 
kernel/locking/lockdep.c:3682 lock_release+0x2e8/0x360

On CZ, a full reset hangs the system.

Gonna take a look at those issues.

Thanks,

Andrey


On 02/28/2018 08:31 AM, Liu, Monk wrote:
> Already sent
>
> -----Original Message-----
> From: Grodzovsky, Andrey
> Sent: February 28, 2018 21:31
> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
>
> Will do once Monk sends V2 for  [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ
>
> Andrey
>
>
> On 02/28/2018 07:20 AM, Christian König wrote:
>> Andrey please give this set a good testing as well.
>>
>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>> found recover_vram_from_shadow sometimes get executed in paralle with
>>> SDMA scheduler, should stop all schedulers before doing gpu
>>> reset/recover
>>>
>>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> For now this patch is Reviewed-by: Christian König
>> <christian.koenig@amd.com>.
>>
>> Regards,
>> Christian.
>>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40
>>> +++++++++++-------------------
>>>    1 file changed, 15 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 75d1733..e9d81a8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct
>>> amdgpu_device *adev,
>>>          /* block TTM */
>>>        resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>> +
>>>        /* store modesetting */
>>>        if (amdgpu_device_has_dc_support(adev))
>>>            state = drm_atomic_helper_suspend(adev->ddev);
>>>    -    /* block scheduler */
>>> +    /* block all schedulers and reset given job's ring */
>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>            struct amdgpu_ring *ring = adev->rings[i];
>>>              if (!ring || !ring->sched.thread)
>>>                continue;
>>>    -        /* only focus on the ring hit timeout if &job not NULL */
>>> +        kthread_park(ring->sched.thread);
>>> +
>>>            if (job && job->ring->idx != i)
>>>                continue;
>>>    -        kthread_park(ring->sched.thread);
>>>            drm_sched_hw_job_reset(&ring->sched, &job->base);
>>>              /* after all hw jobs are reset, hw fence is meaningless,
>>> so force_completion */ @@ -2707,33 +2708,22 @@ int
>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>                }
>>>                dma_fence_put(fence);
>>>            }
>>> +    }
>>>    -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -            struct amdgpu_ring *ring = adev->rings[i];
>>> -
>>> -            if (!ring || !ring->sched.thread)
>>> -                continue;
>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>    -            /* only focus on the ring hit timeout if &job not NULL
>>> */
>>> -            if (job && job->ring->idx != i)
>>> -                continue;
>>> +        if (!ring || !ring->sched.thread)
>>> +            continue;
>>>    +        /* only need recovery sched of the given job's ring
>>> +         * or all rings (in the case @job is NULL)
>>> +         * after above amdgpu_reset accomplished
>>> +         */
>>> +        if ((!job || job->ring->idx == i) && !r)
>>>                drm_sched_job_recovery(&ring->sched);
>>> -            kthread_unpark(ring->sched.thread);
>>> -        }
>>> -    } else {
>>> -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>> -            struct amdgpu_ring *ring = adev->rings[i];
>>>    -            if (!ring || !ring->sched.thread)
>>> -                continue;
>>> -
>>> -            /* only focus on the ring hit timeout if &job not NULL
>>> */
>>> -            if (job && job->ring->idx != i)
>>> -                continue;
>>> -
>>> -            kthread_unpark(adev->rings[i]->sched.thread);
>>> -        }
>>> +        kthread_unpark(ring->sched.thread);
>>>        }
>>>          if (amdgpu_device_has_dc_support(adev)) {

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover
       [not found]                 ` <a54b5a4f-a370-87ec-7bac-33e6036107f9-5C7GfCeVMHo@public.gmane.org>
@ 2018-02-28 17:42                   ` Andrey Grodzovsky
  0 siblings, 0 replies; 32+ messages in thread
From: Andrey Grodzovsky @ 2018-02-28 17:42 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

After looking more

"Failed to initialize parser !" is expected since the reason is jobs 
from guilty context (context causing the GPU hang) are canceled.

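The rejection comes from the CS path bailing out early for a guilty context,
before the parser is set up; roughly of this shape (an illustrative sketch,
field and label names are assumptions rather than the verbatim code):

	/* in amdgpu_cs_parser_init(): skip jobs coming from a context that
	 * was marked guilty during GPU recovery
	 */
	if (atomic_read(&p->ctx->guilty) == 1) {
		ret = -ECANCELED;	/* submission canceled, not a driver bug */
		goto free_chunk;
	}

amdgpu_cs_ioctl() then logs "Failed to initialize parser !" for any non-zero
return from parser init, which is the message seen in the test run above.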
Locking imbalance is actually gone with Monk's patches.

Thanks,

Andrey


On 02/28/2018 11:40 AM, Andrey Grodzovsky wrote:
> No new issues found with those patches, testing GPU reset using libdrm 
> deadlock detection test on Ellsmire.
>
> The patches are Tested-By: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
> P.S
>
> Noticed existing issues (before Monk's patches)
>
> Multiple [drm:amdgpu_cs_ioctl [amdgpu]] *ERROR* Failed to initialize 
> parser !
>
> And occasional unlock imbalance forn amdgpu_cs_ioctl
>
> DEBUG_LOCKS_WARN_ON(depth <= 0)
> [   93.069011 <    0.000017>] WARNING: CPU: 3 PID: 2215 at 
> kernel/locking/lockdep.c:3682 lock_release+0x2e8/0x360
>
> On CZ full reset hangs the system.
>
> Gonna take a look at those issues.
>
> Thanks,
>
> Andrey
>
>
> From
> On 02/28/2018 08:31 AM, Liu, Monk wrote:
>> Already sent
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey
>> Sent: February 28, 2018 21:31
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Liu, Monk 
>> <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu 
>> recover
>>
>> Will do once Monk sends V2 for  [PATCH 4/4] drm/amdgpu: try again kiq 
>> access if not in IRQ
>>
>> Andrey
>>
>>
>> On 02/28/2018 07:20 AM, Christian König wrote:
>>> Andrey please give this set a good testing as well.
>>>
>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>> found recover_vram_from_shadow sometimes get executed in paralle with
>>>> SDMA scheduler, should stop all schedulers before doing gpu
>>>> reset/recover
>>>>
>>>> Change-Id: Ibaef3e3c015f3cf88f84b2eaf95cda95ae1a64e3
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> For now this patch is Reviewed-by: Christian König
>>> <christian.koenig@amd.com>.
>>>
>>> Regards,
>>> Christian.
>>>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 40
>>>> +++++++++++-------------------
>>>>    1 file changed, 15 insertions(+), 25 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 75d1733..e9d81a8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -2649,22 +2649,23 @@ int amdgpu_device_gpu_recover(struct
>>>> amdgpu_device *adev,
>>>>          /* block TTM */
>>>>        resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>>>> +
>>>>        /* store modesetting */
>>>>        if (amdgpu_device_has_dc_support(adev))
>>>>            state = drm_atomic_helper_suspend(adev->ddev);
>>>>    -    /* block scheduler */
>>>> +    /* block all schedulers and reset given job's ring */
>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>            struct amdgpu_ring *ring = adev->rings[i];
>>>>              if (!ring || !ring->sched.thread)
>>>>                continue;
>>>>    -        /* only focus on the ring hit timeout if &job not NULL */
>>>> +        kthread_park(ring->sched.thread);
>>>> +
>>>>            if (job && job->ring->idx != i)
>>>>                continue;
>>>>    -        kthread_park(ring->sched.thread);
>>>>            drm_sched_hw_job_reset(&ring->sched, &job->base);
>>>>              /* after all hw jobs are reset, hw fence is meaningless,
>>>> so force_completion */ @@ -2707,33 +2708,22 @@ int
>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>                }
>>>>                dma_fence_put(fence);
>>>>            }
>>>> +    }
>>>>    -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> -            struct amdgpu_ring *ring = adev->rings[i];
>>>> -
>>>> -            if (!ring || !ring->sched.thread)
>>>> -                continue;
>>>> +    for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> +        struct amdgpu_ring *ring = adev->rings[i];
>>>>    -            /* only focus on the ring hit timeout if &job not NULL
>>>> */
>>>> -            if (job && job->ring->idx != i)
>>>> -                continue;
>>>> +        if (!ring || !ring->sched.thread)
>>>> +            continue;
>>>>    +        /* only need recovery sched of the given job's ring
>>>> +         * or all rings (in the case @job is NULL)
>>>> +         * after above amdgpu_reset accomplished
>>>> +         */
>>>> +        if ((!job || job->ring->idx == i) && !r)
>>>>                drm_sched_job_recovery(&ring->sched);
>>>> -            kthread_unpark(ring->sched.thread);
>>>> -        }
>>>> -    } else {
>>>> -        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>> -            struct amdgpu_ring *ring = adev->rings[i];
>>>>    -            if (!ring || !ring->sched.thread)
>>>> -                continue;
>>>> -
>>>> -            /* only focus on the ring hit timeout if &job not NULL
>>>> */
>>>> -            if (job && job->ring->idx != i)
>>>> -                continue;
>>>> -
>>>> - kthread_unpark(adev->rings[i]->sched.thread);
>>>> -        }
>>>> +        kthread_unpark(ring->sched.thread);
>>>>        }
>>>>          if (amdgpu_device_has_dc_support(adev)) {
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                 ` <3a8945f9-848e-dd19-373d-5dddc69f76cb-5C7GfCeVMHo@public.gmane.org>
@ 2018-03-01  6:01                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB04498FBA36C747652758D56E84C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-03-01  6:01 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.

I don't understand how fill_buffer() could run in a case where ring->ring_obj is not even allocated ... where is such a case?


/Monk

-----Original Message-----
From: Koenig, Christian 
Sent: February 28, 2018 20:46
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Good point, but in this case we need some other handling.

Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.

And you also need to fix a couple of other places in amdgpu_ttm.c.

Regards,
Christian.

Am 28.02.2018 um 13:34 schrieb Liu, Monk:
> Because when SDMA was hang by like process A, and meanwhile another 
> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>
> Let it run and finally process B's job would fail and GPU recover will 
> repeat it again (since it is a kernel job)
>
> Without this solution other process will be greatly harmed by one 
> black sheep that triggering GPU recover
>
> /Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: February 28, 2018 20:24
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>> because this time SDMA may under GPU RESET so its ring->ready may not 
>> true, keep going and GPU scheduler will reschedule this job if it 
>> failed.
>>
>> give a warning on copy_buffer when go through direct_submit while
>> ring->ready is false
> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>
> Why exactly are you running into issues with that?
>
> Christian.
>
>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>    1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> index e38e6db..7b75ac9 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>    	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>    	WARN_ON(job->ibs[0].length_dw > num_dw);
>>    	if (direct_submit) {
>> +		WARN_ON(!ring->ready);
>>    		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>    				       NULL, fence);
>>    		job->fence = dma_fence_get(*fence);
>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>    	struct amdgpu_job *job;
>>    	int r;
>>    
>> -	if (!ring->ready) {
>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>> -		return -EINVAL;
>> -	}
>> -
>>    	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>    		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>    		if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                     ` <BLUPR12MB04498FBA36C747652758D56E84C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-01  8:16                       ` Christian König
       [not found]                         ` <7af9b6fb-28e8-4e25-5d4a-5b566a00cbea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Christian König @ 2018-03-01  8:16 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Ah, crap yeah that won't work since we don't free the ring.

The key point is that we need to distinguish between the ring not working
temporarily because we are in a GPU reset and it not working at all because
we are missing firmware or something like that.

And no, checking the gpu_reset flag is totally racy and can't be done 
either. How about checking accel_working instead?

Christian.

Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月28日 20:46
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
> Good point, but in this case we need some other handling.
>
> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>
> And you also need to fix a couple of other places in amdgpu_ttm.c.
>
> Regards,
> Christian.
>
> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>> Because when SDMA was hang by like process A, and meanwhile another
>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>
>> Let it run and finally process B's job would fail and GPU recover will
>> repeat it again (since it is a kernel job)
>>
>> Without this solution other process will be greatly harmed by one
>> black sheep that triggering GPU recover
>>
>> /Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年2月28日 20:24
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>> for fill_buffer
>>
>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>> because this time SDMA may under GPU RESET so its ring->ready may not
>>> true, keep going and GPU scheduler will reschedule this job if it
>>> failed.
>>>
>>> give a warning on copy_buffer when go through direct_submit while
>>> ring->ready is false
>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>
>> Why exactly are you running into issues with that?
>>
>> Christian.
>>
>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>     1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index e38e6db..7b75ac9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>     	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>     	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>     	if (direct_submit) {
>>> +		WARN_ON(!ring->ready);
>>>     		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>     				       NULL, fence);
>>>     		job->fence = dma_fence_get(*fence);
>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>     	struct amdgpu_job *job;
>>>     	int r;
>>>     
>>> -	if (!ring->ready) {
>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>> -		return -EINVAL;
>>> -	}
>>> -
>>>     	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>     		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>     		if (r)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                         ` <7af9b6fb-28e8-4e25-5d4a-5b566a00cbea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-03-01  8:23                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB04490BF8390F6270B265969984C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-03-01  8:23 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

accel_working is only for the GFX ring; I don't think testing it is appropriate for SDMA jobs.

If we are in the GPU reset stage, we can let SDMA jobs live with the ring not being ready; can you point out where the problem is?

In fact we did stress TDR testing with this patch, and it really fixes a couple of over-kill issues.


/Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018年3月1日 16:16
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Ah, crap yeah that won't work since we don't free the ring.

Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.

And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?

Christian.

Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>
>
> /Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年2月28日 20:46
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Good point, but in this case we need some other handling.
>
> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>
> And you also need to fix a couple of other places in amdgpu_ttm.c.
>
> Regards,
> Christian.
>
> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>> Because when SDMA was hang by like process A, and meanwhile another 
>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>
>> Let it run and finally process B's job would fail and GPU recover 
>> will repeat it again (since it is a kernel job)
>>
>> Without this solution other process will be greatly harmed by one 
>> black sheep that triggering GPU recover
>>
>> /Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年2月28日 20:24
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>> because this time SDMA may under GPU RESET so its ring->ready may 
>>> not true, keep going and GPU scheduler will reschedule this job if 
>>> it failed.
>>>
>>> give a warning on copy_buffer when go through direct_submit while
>>> ring->ready is false
>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>
>> Why exactly are you running into issues with that?
>>
>> Christian.
>>
>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>     1 file changed, 1 insertion(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> index e38e6db..7b75ac9 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>     	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>     	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>     	if (direct_submit) {
>>> +		WARN_ON(!ring->ready);
>>>     		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>     				       NULL, fence);
>>>     		job->fence = dma_fence_get(*fence);
>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>     	struct amdgpu_job *job;
>>>     	int r;
>>>     
>>> -	if (!ring->ready) {
>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>> -		return -EINVAL;
>>> -	}
>>> -
>>>     	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>     		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>     		if (r)
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                             ` <BLUPR12MB04490BF8390F6270B265969984C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-01  8:41                               ` Christian König
       [not found]                                 ` <6dedfc7f-cf69-676f-463d-be52cda1b1bb-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Christian König @ 2018-03-01  8:41 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
amdgpu_evict_flags() and amdgpu_bo_move() need to know whether the SDMA ring
is available during suspend/resume.

E.g. suspend works like this:
1. Evict all normal BOs from VRAM using the SDMA.
2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, 
firmware etc...
4. Evict all remaining BOs from VRAM using CPU copies.

Resume works the other way around:
1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
2. Pin those BOs in VRAM and enable the hardware.
3. Use the SDMA to move back all other BOs into VRAM.

To figure out whether we should use the CPU copy or the SDMA, the ring->ready
flag is used. So removing this check is really a no-go because it would
break suspend/resume.

The only other alternative I can see is to add a separate flag in
amdgpu_mman indicating whether buffer_funcs should be used or not. See all
the callers of amdgpu_ttm_set_active_vram_size() as well.

Something like renaming amdgpu_ttm_set_active_vram_size() into 
amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset 
is set.
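
To make this concrete, here is a minimal sketch of the idea (plain C; the
struct, field and helper names below are made up for illustration and do not
claim to match the actual amdgpu code):

#include <stdbool.h>

/* Placeholder for the proposed flag in amdgpu_mman; names are assumptions. */
struct sketch_mman {
	bool buffer_funcs_enabled;	/* true while SDMA copies may be used */
};

/* Would replace the amdgpu_ttm_set_active_vram_size() callers: requests to
 * flip the flag are simply ignored while a GPU reset is in flight, so TTM
 * keeps using the SDMA path and the scheduler re-submits any job that hung. */
static void sketch_set_buffer_funcs_status(struct sketch_mman *mman,
					   bool enable, bool in_gpu_reset)
{
	if (in_gpu_reset)
		return;

	mman->buffer_funcs_enabled = enable;
}

/* amdgpu_evict_flags()/amdgpu_bo_move() would then test this flag instead
 * of ring->ready when deciding between SDMA and CPU copies. */
static bool sketch_use_sdma_copy(const struct sketch_mman *mman)
{
	return mman->buffer_funcs_enabled;
}

With something like that, suspend still disables the SDMA path explicitly,
while a GPU reset leaves the flag alone.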

BTW: Changing the active VRAM size in GPU reset is quite a bug you 
stumbled over here, so this really needs fixing anyway.

Regards,
Christian.

Am 01.03.2018 um 09:23 schrieb Liu, Monk:
> Accel_working is only for GFX ring, I don't think test it is appropriate for SDMA jobs ....
>
> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
>
> In fact we did stress TDR test with this patch and it can really fix couple over kill issues
>
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月1日 16:16
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
> Ah, crap yeah that won't work since we don't free the ring.
>
> Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.
>
> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>
> Christian.
>
> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年2月28日 20:46
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>> for fill_buffer
>>
>> Good point, but in this case we need some other handling.
>>
>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>
>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>
>> Regards,
>> Christian.
>>
>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>> Because when SDMA was hang by like process A, and meanwhile another
>>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>>
>>> Let it run and finally process B's job would fail and GPU recover
>>> will repeat it again (since it is a kernel job)
>>>
>>> Without this solution other process will be greatly harmed by one
>>> black sheep that triggering GPU recover
>>>
>>> /Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年2月28日 20:24
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>>> for fill_buffer
>>>
>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>> because this time SDMA may under GPU RESET so its ring->ready may
>>>> not true, keep going and GPU scheduler will reschedule this job if
>>>> it failed.
>>>>
>>>> give a warning on copy_buffer when go through direct_submit while
>>>> ring->ready is false
>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>
>>> Why exactly are you running into issues with that?
>>>
>>> Christian.
>>>
>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>      1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index e38e6db..7b75ac9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>      	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>      	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>      	if (direct_submit) {
>>>> +		WARN_ON(!ring->ready);
>>>>      		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>      				       NULL, fence);
>>>>      		job->fence = dma_fence_get(*fence);
>>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>      	struct amdgpu_job *job;
>>>>      	int r;
>>>>      
>>>> -	if (!ring->ready) {
>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>> -		return -EINVAL;
>>>> -	}
>>>> -
>>>>      	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>      		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>      		if (r)
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                 ` <6dedfc7f-cf69-676f-463d-be52cda1b1bb-5C7GfCeVMHo@public.gmane.org>
@ 2018-03-01  8:51                                   ` Liu, Monk
       [not found]                                     ` <BLUPR12MB04491577FB4B6D6A5BCC3B8784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-03-01  8:51 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.

As long as we are in GPU reset, we don't return an error when the ring is not ready, because eventually the job either succeeds or fails and gets rescheduled by the scheduler, since it is a kernel job.
What's your concern?

/Monk


-----Original Message-----
From: Koenig, Christian 
Sent: 2018年3月1日 16:41
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.

E.g. suspend works as this:
1. Evict all normal BOs from VRAM using the SDMA.
2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
4. Evict all remaining BOs from VRAM using CPU copies.

Resume works the other way around:
1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
2. Pin those BOs in VRAM and enable the hardware.
3. Use the SDMA to move back all other BOs into VRAM.

To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.

The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.

Something like renaming amdgpu_ttm_set_active_vram_size() into
amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.

BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.

Regards,
Christian.

Am 01.03.2018 um 09:23 schrieb Liu, Monk:
> Accel_working is only for GFX ring, I don't think test it is appropriate for SDMA jobs ....
>
> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
>
> In fact we did stress TDR test with this patch and it can really fix 
> couple over kill issues
>
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月1日 16:16
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Ah, crap yeah that won't work since we don't free the ring.
>
> Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.
>
> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>
> Christian.
>
> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年2月28日 20:46
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>> Good point, but in this case we need some other handling.
>>
>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>
>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>
>> Regards,
>> Christian.
>>
>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>> Because when SDMA was hang by like process A, and meanwhile another 
>>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>>
>>> Let it run and finally process B's job would fail and GPU recover 
>>> will repeat it again (since it is a kernel job)
>>>
>>> Without this solution other process will be greatly harmed by one 
>>> black sheep that triggering GPU recover
>>>
>>> /Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年2月28日 20:24
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>> because this time SDMA may under GPU RESET so its ring->ready may 
>>>> not true, keep going and GPU scheduler will reschedule this job if 
>>>> it failed.
>>>>
>>>> give a warning on copy_buffer when go through direct_submit while
>>>> ring->ready is false
>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>
>>> Why exactly are you running into issues with that?
>>>
>>> Christian.
>>>
>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>      1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index e38e6db..7b75ac9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>      	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>      	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>      	if (direct_submit) {
>>>> +		WARN_ON(!ring->ready);
>>>>      		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>      				       NULL, fence);
>>>>      		job->fence = dma_fence_get(*fence);
>>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>      	struct amdgpu_job *job;
>>>>      	int r;
>>>>      
>>>> -	if (!ring->ready) {
>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>> -		return -EINVAL;
>>>> -	}
>>>> -
>>>>      	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>      		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>      		if (r)
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                     ` <BLUPR12MB04491577FB4B6D6A5BCC3B8784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-01  9:25                                       ` Liu, Monk
       [not found]                                         ` <BLUPR12MB044933DC340A46EADC1F7A2784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2018-03-01  9:37                                       ` Liu, Monk
  1 sibling, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-03-01  9:25 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Okay, I missed the suspend/resume/S3 part.

For SR-IOV we don't run into that logic, so it looks like I need to keep this in a
private branch until a better solution is raised.

/Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: 2018年3月1日 16:52
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.

As long as we are in gpu reset, we don't return error when ring not ready, cuz eventually it either success or failed and rescheduled by scheduler since it is kernel job What's your concern ?

/Monk


-----Original Message-----
From: Koenig, Christian
Sent: 2018年3月1日 16:41
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.

E.g. suspend works as this:
1. Evict all normal BOs from VRAM using the SDMA.
2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
4. Evict all remaining BOs from VRAM using CPU copies.

Resume works the other way around:
1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
2. Pin those BOs in VRAM and enable the hardware.
3. Use the SDMA to move back all other BOs into VRAM.

To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.

The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.

Something like renaming amdgpu_ttm_set_active_vram_size() into
amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.

BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.

Regards,
Christian.

Am 01.03.2018 um 09:23 schrieb Liu, Monk:
> Accel_working is only for GFX ring, I don't think test it is appropriate for SDMA jobs ....
>
> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
>
> In fact we did stress TDR test with this patch and it can really fix 
> couple over kill issues
>
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月1日 16:16
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Ah, crap yeah that won't work since we don't free the ring.
>
> Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.
>
> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>
> Christian.
>
> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年2月28日 20:46
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>> Good point, but in this case we need some other handling.
>>
>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>
>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>
>> Regards,
>> Christian.
>>
>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>> Because when SDMA was hang by like process A, and meanwhile another 
>>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>>
>>> Let it run and finally process B's job would fail and GPU recover 
>>> will repeat it again (since it is a kernel job)
>>>
>>> Without this solution other process will be greatly harmed by one 
>>> black sheep that triggering GPU recover
>>>
>>> /Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年2月28日 20:24
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>> because this time SDMA may under GPU RESET so its ring->ready may 
>>>> not true, keep going and GPU scheduler will reschedule this job if 
>>>> it failed.
>>>>
>>>> give a warning on copy_buffer when go through direct_submit while
>>>> ring->ready is false
>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>
>>> Why exactly are you running into issues with that?
>>>
>>> Christian.
>>>
>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>      1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index e38e6db..7b75ac9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>      	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>      	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>      	if (direct_submit) {
>>>> +		WARN_ON(!ring->ready);
>>>>      		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>      				       NULL, fence);
>>>>      		job->fence = dma_fence_get(*fence);
>>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>      	struct amdgpu_job *job;
>>>>      	int r;
>>>>      
>>>> -	if (!ring->ready) {
>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>> -		return -EINVAL;
>>>> -	}
>>>> -
>>>>      	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>      		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>      		if (r)
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                         ` <BLUPR12MB044933DC340A46EADC1F7A2784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-01  9:30                                           ` Christian König
  0 siblings, 0 replies; 32+ messages in thread
From: Christian König @ 2018-03-01  9:30 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Even on bare metal, changing the VRAM size in the memory manager because
we disable/enable the SDMA during GPU reset is quite a bug.

Going to work on this; a correct fix should actually be really trivial.

Regards,
Christian.

Am 01.03.2018 um 10:25 schrieb Liu, Monk:
> Okay I missed the part of suspend/resume/s3
>
> For SRIOV we don't run into that logic, looks I need to keep it in private branch till
> A better solution raised
>
> /Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: 2018年3月1日 16:52
> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
>> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.
> As long as we are in gpu reset, we don't return error when ring not ready, cuz eventually it either success or failed and rescheduled by scheduler since it is kernel job What's your concern ?
>
> /Monk
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月1日 16:41
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
>> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.
>
> E.g. suspend works as this:
> 1. Evict all normal BOs from VRAM using the SDMA.
> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
> 4. Evict all remaining BOs from VRAM using CPU copies.
>
> Resume works the other way around:
> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
> 2. Pin those BOs in VRAM and enable the hardware.
> 3. Use the SDMA to move back all other BOs into VRAM.
>
> To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.
>
> The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.
>
> Something like renaming amdgpu_ttm_set_active_vram_size() into
> amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.
>
> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.
>
> Regards,
> Christian.
>
> Am 01.03.2018 um 09:23 schrieb Liu, Monk:
>> Accel_working is only for GFX ring, I don't think test it is appropriate for SDMA jobs ....
>>
>> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
>>
>> In fact we did stress TDR test with this patch and it can really fix
>> couple over kill issues
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月1日 16:16
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>> for fill_buffer
>>
>> Ah, crap yeah that won't work since we don't free the ring.
>>
>> Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.
>>
>> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>>
>> Christian.
>>
>> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2018年2月28日 20:46
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>>> for fill_buffer
>>>
>>> Good point, but in this case we need some other handling.
>>>
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>
>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>>> Because when SDMA was hang by like process A, and meanwhile another
>>>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>>>
>>>> Let it run and finally process B's job would fail and GPU recover
>>>> will repeat it again (since it is a kernel job)
>>>>
>>>> Without this solution other process will be greatly harmed by one
>>>> black sheep that triggering GPU recover
>>>>
>>>> /Monk
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2018年2月28日 20:24
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not
>>>> ready for fill_buffer
>>>>
>>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>>> because this time SDMA may under GPU RESET so its ring->ready may
>>>>> not true, keep going and GPU scheduler will reschedule this job if
>>>>> it failed.
>>>>>
>>>>> give a warning on copy_buffer when go through direct_submit while
>>>>> ring->ready is false
>>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>>
>>>> Why exactly are you running into issues with that?
>>>>
>>>> Christian.
>>>>
>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>       1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> index e38e6db..7b75ac9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>>       	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>       	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>       	if (direct_submit) {
>>>>> +		WARN_ON(!ring->ready);
>>>>>       		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>       				       NULL, fence);
>>>>>       		job->fence = dma_fence_get(*fence);
>>>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>       	struct amdgpu_job *job;
>>>>>       	int r;
>>>>>       
>>>>> -	if (!ring->ready) {
>>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>> -		return -EINVAL;
>>>>> -	}
>>>>> -
>>>>>       	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>       		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>       		if (r)
>>> _______________________________________________
>>> amd-gfx mailing list
>>> amd-gfx@lists.freedesktop.org
>>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                     ` <BLUPR12MB04491577FB4B6D6A5BCC3B8784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  2018-03-01  9:25                                       ` Liu, Monk
@ 2018-03-01  9:37                                       ` Liu, Monk
       [not found]                                         ` <BLUPR12MB044905DB87968358166BDEF484C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  1 sibling, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-03-01  9:37 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Christian

During suspend/resume, which stage sets the SDMA's ring->ready to false? I can only find that in amdgpu_ring_fini(),
but I saw that it is only invoked in unload_kms(); it looks like device_suspend() will not set ring->ready to false.

/Monk

-----Original Message-----
From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
Sent: 2018年3月1日 16:52
To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.

As long as we are in gpu reset, we don't return error when ring not ready, cuz eventually it either success or failed and rescheduled by scheduler since it is kernel job What's your concern ?

/Monk


-----Original Message-----
From: Koenig, Christian
Sent: 2018年3月1日 16:41
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.

E.g. suspend works as this:
1. Evict all normal BOs from VRAM using the SDMA.
2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
4. Evict all remaining BOs from VRAM using CPU copies.

Resume works the other way around:
1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
2. Pin those BOs in VRAM and enable the hardware.
3. Use the SDMA to move back all other BOs into VRAM.

To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.

The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.

Something like renaming amdgpu_ttm_set_active_vram_size() into
amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.

BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.

Regards,
Christian.

Am 01.03.2018 um 09:23 schrieb Liu, Monk:
> Accel_working is only for GFX ring, I don't think test it is appropriate for SDMA jobs ....
>
> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
>
> In fact we did stress TDR test with this patch and it can really fix 
> couple over kill issues
>
>
> /Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月1日 16:16
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Ah, crap yeah that won't work since we don't free the ring.
>
> Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.
>
> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>
> Christian.
>
> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年2月28日 20:46
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>> Good point, but in this case we need some other handling.
>>
>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>
>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>
>> Regards,
>> Christian.
>>
>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>> Because when SDMA was hang by like process A, and meanwhile another 
>>> process B is already running into the code of fill_buffer() So just let process B continue, don't block it otherwise process B would fail by software reason .
>>>
>>> Let it run and finally process B's job would fail and GPU recover 
>>> will repeat it again (since it is a kernel job)
>>>
>>> Without this solution other process will be greatly harmed by one 
>>> black sheep that triggering GPU recover
>>>
>>> /Monk
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年2月28日 20:24
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>> because this time SDMA may under GPU RESET so its ring->ready may 
>>>> not true, keep going and GPU scheduler will reschedule this job if 
>>>> it failed.
>>>>
>>>> give a warning on copy_buffer when go through direct_submit while
>>>> ring->ready is false
>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>
>>> Why exactly are you running into issues with that?
>>>
>>> Christian.
>>>
>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>      1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> index e38e6db..7b75ac9 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>      	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>      	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>      	if (direct_submit) {
>>>> +		WARN_ON(!ring->ready);
>>>>      		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>      				       NULL, fence);
>>>>      		job->fence = dma_fence_get(*fence);
>>>> @@ -1692,11 +1693,6 @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>      	struct amdgpu_job *job;
>>>>      	int r;
>>>>      
>>>> -	if (!ring->ready) {
>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>> -		return -EINVAL;
>>>> -	}
>>>> -
>>>>      	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>      		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>      		if (r)
>> _______________________________________________
>> amd-gfx mailing list
>> amd-gfx@lists.freedesktop.org
>> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                         ` <BLUPR12MB044905DB87968358166BDEF484C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-01  9:39                                           ` Christian König
       [not found]                                             ` <a6812ba2-1c34-c6a4-d65a-09f924ea0940-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Christian König @ 2018-03-01  9:39 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Hi Monk,

Look at si_dma_stop() and/or the corresponding code for the other SDMA generations.
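
Roughly, the per-generation stop hook looks like the following sketch (plain C;
the struct and the comments about register programming are placeholders for
illustration, not copied from si_dma.c):

#include <stdbool.h>

struct sketch_ring {
	bool ready;	/* mirrors amdgpu's ring->ready */
};

/* Called from the SDMA IP block's suspend/hw_fini path. */
static void sketch_sdma_stop(struct sketch_ring *ring)
{
	/* the real hook also clears the ring-buffer enable bit in the SDMA
	 * control register; that part is omitted in this sketch */

	/* mark the ring as unusable so TTM falls back to CPU copies for
	 * the rest of the suspend sequence */
	ring->ready = false;
}

So it is the SDMA block's stop routine, reached from its suspend/hw_fini path
rather than amdgpu_ring_fini(), that clears ring->ready before the remaining
CPU-copy evictions run.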

Regards,
Christian.

Am 01.03.2018 um 10:37 schrieb Liu, Monk:
> Hi Christian
>
> During suspend/resume, which stage will set sdma's ring->ready to false ?? I can only find it in amdgpu_ring_fini()
> But I saw that it is only invoked in unload_kms(), looks device_suspnd() will not set ring->ready to false
>
> /Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf Of Liu, Monk
> Sent: 2018年3月1日 16:52
> To: Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
>> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.
> As long as we are in gpu reset, we don't return error when ring not ready, cuz eventually it either success or failed and rescheduled by scheduler since it is kernel job What's your concern ?
>
> /Monk
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月1日 16:41
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
>> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
> amdgpu_evict_flags() and amdgpu_bo_move() needs to know if the SDMA ring is available during suspend/resume.
>
> E.g. suspend works as this:
> 1. Evict all normal BOs from VRAM using the SDMA.
> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
> 4. Evict all remaining BOs from VRAM using CPU copies.
>
> Resume works the other way around:
> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
> 2. Pin those BOs in VRAM and enable the hardware.
> 3. Use the SDMA to move back all other BOs into VRAM.
>
> To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.
>
> The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.
>
> Something like renaming amdgpu_ttm_set_active_vram_size() into
> amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.
>
> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.
>
> Regards,
> Christian.
>
> Am 01.03.2018 um 09:23 schrieb Liu, Monk:
>> Accel_working is only for GFX ring, I don't think test it is appropriate for SDMA jobs ....
>>
>> If we are in gpu rest stage, we can let SDMA job live with the case of ring not ready,  can you point out where is the problem ?
>>
>> In fact we did stress TDR test with this patch and it can really fix
>> couple over kill issues
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月1日 16:16
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>> for fill_buffer
>>
>> Ah, crap yeah that won't work since we don't free the ring.
>>
>> Key point is we need to distinct between the ring doesn't work temporary because we are in a GPU reset and it doesn't work at all because we are missing firmware or stuff like that.
>>
>> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>>
>> Christian.
>>
>> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>> I don't understand how could fill_buffer() get run under the case that ring->ring_obj is not even allocated ... where is such case ?
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2018年2月28日 20:46
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>>> for fill_buffer
>>>
>>> Good point, but in this case we need some other handling.
>>>
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>
>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>
>>> Regards,
>>> Christian.
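
A sketch of what that suggested test could look like inside amdgpu_fill_buffer() (illustrative only, not the patch that was actually posted):

        /* Only refuse the fill when the ring has never been initialized
         * (no ring buffer object yet); a ring that is merely not ready
         * because of a GPU reset is let through, and the scheduler
         * re-runs the job if it fails.
         */
        if (!ring->ring_obj) {
                DRM_ERROR("Trying to clear memory before the SDMA ring is allocated.\n");
                return -EINVAL;
        }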
>>>
>>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>>> Because when the SDMA is hung by, say, process A, another
>>>> process B may already be running inside fill_buffer(). So just let process B continue; don't block it, otherwise process B would fail for a purely software reason.
>>>>
>>>> Let it run; if process B's job finally fails, GPU recovery
>>>> will resubmit it (since it is a kernel job)
>>>>
>>>> Without this, other processes would be greatly harmed by the one
>>>> black sheep that triggered GPU recovery
>>>>
>>>> /Monk
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2018年2月28日 20:24
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not
>>>> ready for fill_buffer
>>>>
>>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>>> because at this point the SDMA may be under GPU reset, so ring->ready may
>>>>> not be true; keep going and the GPU scheduler will reschedule this job if
>>>>> it fails.
>>>>>
>>>>> give a warning in copy_buffer when going through direct_submit while
>>>>> ring->ready is false
>>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>>
>>>> Why exactly are you running into issues with that?
>>>>
>>>> Christian.
>>>>
>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>       1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> index e38e6db..7b75ac9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>>       	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>       	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>       	if (direct_submit) {
>>>>> +		WARN_ON(!ring->ready);
>>>>>       		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>       				       NULL, fence);
>>>>>       		job->fence = dma_fence_get(*fence); @@ -1692,11 +1693,6 @@
>>>>> int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>       	struct amdgpu_job *job;
>>>>>       	int r;
>>>>>       
>>>>> -	if (!ring->ready) {
>>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>> -		return -EINVAL;
>>>>> -	}
>>>>> -
>>>>>       	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>       		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>       		if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                             ` <a6812ba2-1c34-c6a4-d65a-09f924ea0940-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2018-03-01  9:41                                               ` Liu, Monk
       [not found]                                                 ` <BLUPR12MB044923819BB9561C57EACA8584C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Liu, Monk @ 2018-03-01  9:41 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Oops, it looks like only SI calls this dma_stop() during hw_fini; I checked sdma 4.0/3.0/2.4 and found no such logic



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2018年3月1日 17:39
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Hi Monk,

look at si_dma_stop() and/or other SDMA generation code.

Regards,
Christian.

Am 01.03.2018 um 10:37 schrieb Liu, Monk:
> Hi Christian
>
> During suspend/resume, at which stage does the SDMA's ring->ready get set to
> false? I can only find it in amdgpu_ring_fini(), but that is only
> invoked from unload_kms(); it looks like device_suspend() will not set
> ring->ready to false
>
> /Monk
>
> -----Original Message-----
> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf 
> Of Liu, Monk
> Sent: 2018年3月1日 16:52
> To: Koenig, Christian <Christian.Koenig@amd.com>; 
> amd-gfx@lists.freedesktop.org
> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is available during suspend/resume.
> As long as we are in GPU reset we don't return an error when the ring is not ready, because eventually the job either succeeds, or fails and gets rescheduled by the scheduler since it is a kernel job. What's your concern?
>
> /Monk
>
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2018年3月1日 16:41
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
>> If we are in the GPU reset stage, we can let the SDMA job live with the ring not being ready; can you point out where the problem is?
> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is available during suspend/resume.
>
> E.g. suspend works as this:
> 1. Evict all normal BOs from VRAM using the SDMA.
> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
> 4. Evict all remaining BOs from VRAM using CPU copies.
>
> Resume works the other way around:
> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
> 2. Pin those BOs in VRAM and enable the hardware.
> 3. Use the SDMA to move back all other BOs into VRAM.
>
> To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.
>
> The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.
>
> Something like renaming amdgpu_ttm_set_active_vram_size() into
> amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.
>
> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.
>
> Regards,
> Christian.
>
> Am 01.03.2018 um 09:23 schrieb Liu, Monk:
>> Accel_working is only for the GFX ring; I don't think testing it is appropriate for SDMA jobs ....
>>
>> If we are in the GPU reset stage, we can let the SDMA job live with the ring not being ready; can you point out where the problem is?
>>
>> In fact we stress-tested TDR with this patch and it really fixes a
>> couple of over-kill issues
>>
>>
>> /Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2018年3月1日 16:16
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>> Ah, crap yeah that won't work since we don't free the ring.
>>
>> The key point is that we need to distinguish between the ring not working temporarily because we are in a GPU reset, and it not working at all because we are missing firmware or something like that.
>>
>> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>>
>> Christian.
>>
>> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>> I don't understand how fill_buffer() could run in a case where ring->ring_obj is not even allocated ... where does such a case exist?
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2018年2月28日 20:46
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Good point, but in this case we need some other handling.
>>>
>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>
>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>>> Because when the SDMA is hung by, say, process A, another
>>>> process B may already be running inside fill_buffer(). So just let process B continue; don't block it, otherwise process B would fail for a purely software reason.
>>>>
>>>> Let it run; if process B's job finally fails, GPU recovery
>>>> will resubmit it (since it is a kernel job)
>>>>
>>>> Without this, other processes would be greatly harmed by the one
>>>> black sheep that triggered GPU recovery
>>>>
>>>> /Monk
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2018年2月28日 20:24
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>>> ready for fill_buffer
>>>>
>>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>>> because at this point the SDMA may be under GPU reset, so ring->ready may
>>>>> not be true; keep going and the GPU scheduler will reschedule this job if
>>>>> it fails.
>>>>>
>>>>> give a warning in copy_buffer when going through direct_submit while
>>>>> ring->ready is false
>>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>>
>>>> Why exactly are you running into issues with that?
>>>>
>>>> Christian.
>>>>
>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>       1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> index e38e6db..7b75ac9 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>>       	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>       	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>       	if (direct_submit) {
>>>>> +		WARN_ON(!ring->ready);
>>>>>       		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>       				       NULL, fence);
>>>>>       		job->fence = dma_fence_get(*fence); @@ -1692,11 +1693,6 @@ 
>>>>> int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>       	struct amdgpu_job *job;
>>>>>       	int r;
>>>>>       
>>>>> -	if (!ring->ready) {
>>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>> -		return -EINVAL;
>>>>> -	}
>>>>> -
>>>>>       	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>       		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>       		if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                                 ` <BLUPR12MB044923819BB9561C57EACA8584C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2018-03-01  9:51                                                   ` Christian König
       [not found]                                                     ` <26eb2e4c-33dd-2160-5b83-a7ff9e3a2558-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 32+ messages in thread
From: Christian König @ 2018-03-01  9:51 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Look for amdgpu_ttm_set_active_vram_size(), setting ring->ready to false 
is directly beneath it.

Christian.
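
The pattern being pointed at looks roughly like this in the SDMA stop paths (paraphrased from memory, not a verbatim quote of si_dma.c; field names may differ between kernel trees):

static void si_dma_stop(struct amdgpu_device *adev)
{
        struct amdgpu_ring *ring;
        unsigned i;

        for (i = 0; i < adev->sdma.num_instances; i++) {
                ring = &adev->sdma.instance[i].ring;

                /* ... disable the ring buffer via the DMA_RB_CNTL register ... */

                if (adev->mman.buffer_funcs_ring == ring)
                        amdgpu_ttm_set_active_vram_size(adev,
                                adev->mc.visible_vram_size); /* adev->gmc on newer trees */
                /* this is where the ring gets marked unavailable */
                ring->ready = false;
        }
}

So ring->ready is indeed cleared during suspend, just from each SDMA generation's stop routine (called via hw_fini/suspend) rather than from amdgpu_ring_fini().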

Am 01.03.2018 um 10:41 schrieb Liu, Monk:
> Oops, it looks like only SI calls this dma_stop() during hw_fini; I checked sdma 4.0/3.0/2.4 and found no such logic
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月1日 17:39
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
>
> Hi Monk,
>
> look at si_dma_stop() and/or other SDMA generation code.
>
> Regards,
> Christian.
>
> Am 01.03.2018 um 10:37 schrieb Liu, Monk:
>> Hi Christian
>>
>> During suspend/resume, at which stage does the SDMA's ring->ready get set to
>> false? I can only find it in amdgpu_ring_fini(), but that is only
>> invoked from unload_kms(); it looks like device_suspend() will not set
>> ring->ready to false
>>
>> /Monk
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On Behalf
>> Of Liu, Monk
>> Sent: 2018年3月1日 16:52
>> To: Koenig, Christian <Christian.Koenig@amd.com>;
>> amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>> for fill_buffer
>>
>>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is available during suspend/resume.
>> As long as we are in GPU reset we don't return an error when the ring is not ready, because eventually the job either succeeds, or fails and gets rescheduled by the scheduler since it is a kernel job. What's your concern?
>>
>> /Monk
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年3月1日 16:41
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>> for fill_buffer
>>
>>> If we are in the GPU reset stage, we can let the SDMA job live with the ring not being ready; can you point out where the problem is?
>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is available during suspend/resume.
>>
>> E.g. suspend works as this:
>> 1. Evict all normal BOs from VRAM using the SDMA.
>> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
>> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
>> 4. Evict all remaining BOs from VRAM using CPU copies.
>>
>> Resume works the other way around:
>> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
>> 2. Pin those BOs in VRAM and enable the hardware.
>> 3. Use the SDMA to move back all other BOs into VRAM.
>>
>> To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.
>>
>> The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.
>>
>> Something like renaming amdgpu_ttm_set_active_vram_size() into
>> amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.
>>
>> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.
>>
>> Regards,
>> Christian.
>>
>> Am 01.03.2018 um 09:23 schrieb Liu, Monk:
>>> Accel_working is only for the GFX ring; I don't think testing it is appropriate for SDMA jobs ....
>>>
>>> If we are in the GPU reset stage, we can let the SDMA job live with the ring not being ready; can you point out where the problem is?
>>>
>>> In fact we stress-tested TDR with this patch and it really fixes a
>>> couple of over-kill issues
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年3月1日 16:16
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready
>>> for fill_buffer
>>>
>>> Ah, crap yeah that won't work since we don't free the ring.
>>>
>>> The key point is that we need to distinguish between the ring not working temporarily because we are in a GPU reset, and it not working at all because we are missing firmware or something like that.
>>>
>>> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>>>
>>> Christian.
>>>
>>> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>> I don't understand how fill_buffer() could run in a case where ring->ring_obj is not even allocated ... where does such a case exist?
>>>>
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian
>>>> Sent: 2018年2月28日 20:46
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not
>>>> ready for fill_buffer
>>>>
>>>> Good point, but in this case we need some other handling.
>>>>
>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>>
>>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>>>> Because when the SDMA is hung by, say, process A, another
>>>>> process B may already be running inside fill_buffer(). So just let process B continue; don't block it, otherwise process B would fail for a purely software reason.
>>>>>
>>>>> Let it run; if process B's job finally fails, GPU recovery
>>>>> will resubmit it (since it is a kernel job)
>>>>>
>>>>> Without this, other processes would be greatly harmed by the one
>>>>> black sheep that triggered GPU recovery
>>>>>
>>>>> /Monk
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>>> Sent: 2018年2月28日 20:24
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not
>>>>> ready for fill_buffer
>>>>>
>>>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>>>> because at this point the SDMA may be under GPU reset, so ring->ready may
>>>>>> not be true; keep going and the GPU scheduler will reschedule this job if
>>>>>> it fails.
>>>>>>
>>>>>> give a warning in copy_buffer when going through direct_submit while
>>>>>> ring->ready is false
>>>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>>>
>>>>> Why exactly are you running into issues with that?
>>>>>
>>>>> Christian.
>>>>>
>>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> ---
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>>        1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> index e38e6db..7b75ac9 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>>>        	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>>        	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>>        	if (direct_submit) {
>>>>>> +		WARN_ON(!ring->ready);
>>>>>>        		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>>        				       NULL, fence);
>>>>>>        		job->fence = dma_fence_get(*fence); @@ -1692,11 +1693,6 @@
>>>>>> int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>>        	struct amdgpu_job *job;
>>>>>>        	int r;
>>>>>>        
>>>>>> -	if (!ring->ready) {
>>>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>>> -		return -EINVAL;
>>>>>> -	}
>>>>>> -
>>>>>>        	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>>        		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>>        		if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

* RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer
       [not found]                                                     ` <26eb2e4c-33dd-2160-5b83-a7ff9e3a2558-5C7GfCeVMHo@public.gmane.org>
@ 2018-03-01 10:11                                                       ` Liu, Monk
  0 siblings, 0 replies; 32+ messages in thread
From: Liu, Monk @ 2018-03-01 10:11 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Oh yeah, I grepped for "ring->ready = false" and missed the others ...

-----Original Message-----
From: Koenig, Christian 
Sent: 2018年3月1日 17:51
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer

Look for amdgpu_ttm_set_active_vram_size(), setting ring->ready to false is directly beneath it.

Christian.

Am 01.03.2018 um 10:41 schrieb Liu, Monk:
> Oops, it looks like only SI calls this dma_stop() during hw_fini; I
> checked sdma 4.0/3.0/2.4 and found no such logic
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2018年3月1日 17:39
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
> for fill_buffer
>
> Hi Monk,
>
> look at si_dma_stop() and/or other SDMA generation code.
>
> Regards,
> Christian.
>
> Am 01.03.2018 um 10:37 schrieb Liu, Monk:
>> Hi Christian
>>
>> During suspend/resume, at which stage does the SDMA's ring->ready get set to
>> false? I can only find it in amdgpu_ring_fini(), but that is only
>> invoked from unload_kms(); it looks like device_suspend() will not set
>> ring->ready to false
>>
>> /Monk
>>
>> -----Original Message-----
>> From: amd-gfx [mailto:amd-gfx-bounces@lists.freedesktop.org] On 
>> Behalf Of Liu, Monk
>> Sent: 2018年3月1日 16:52
>> To: Koenig, Christian <Christian.Koenig@amd.com>; 
>> amd-gfx@lists.freedesktop.org
>> Subject: RE: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is available during suspend/resume.
>> As long as we are in GPU reset we don't return an error when the ring is not ready, because eventually the job either succeeds, or fails and gets rescheduled by the scheduler since it is a kernel job. What's your concern?
>>
>> /Monk
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2018年3月1日 16:41
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not ready 
>> for fill_buffer
>>
>>> If we are in the GPU reset stage, we can let the SDMA job live with the ring not being ready; can you point out where the problem is?
>> amdgpu_evict_flags() and amdgpu_bo_move() need to know if the SDMA ring is available during suspend/resume.
>>
>> E.g. suspend works as this:
>> 1. Evict all normal BOs from VRAM using the SDMA.
>> 2. Disable SDMA, GART, PSP etc... e.g. disable the hardware.
>> 3. Unpin all BOs in VRAM used by hardware engines, e.g. the GART table, firmware etc...
>> 4. Evict all remaining BOs from VRAM using CPU copies.
>>
>> Resume works the other way around:
>> 1. Move all BOs for GART table and firmware back into VRAM using CPU copies.
>> 2. Pin those BOs in VRAM and enable the hardware.
>> 3. Use the SDMA to move back all other BOs into VRAM.
>>
>> To figure out if we should use the CPU copy or the SDMA the ring->ready flag is used. So removing this check is really a no-go because it would break suspend/resume.
>>
>> The only other alternative I can see is to add a separate flag in amdgpu_mman if buffer_funcs should be used or not. See all the callers of amdgpu_ttm_set_active_vram_size() as well.
>>
>> Something like renaming amdgpu_ttm_set_active_vram_size() into
>> amdgpu_ttm_set_buffer_funcs_read() and ignoring any change if gpu_reset is set.
>>
>> BTW: Changing the active VRAM size in GPU reset is quite a bug you stumbled over here, so this really needs fixing anyway.
>>
>> Regards,
>> Christian.
>>
>> Am 01.03.2018 um 09:23 schrieb Liu, Monk:
>>> Accel_working is only for the GFX ring; I don't think testing it is appropriate for SDMA jobs ....
>>>
>>> If we are in the GPU reset stage, we can let the SDMA job live with the ring not being ready; can you point out where the problem is?
>>>
>>> In fact we stress-tested TDR with this patch and it really fixes a
>>> couple of over-kill issues
>>>
>>>
>>> /Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2018年3月1日 16:16
>>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
>>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>> ready for fill_buffer
>>>
>>> Ah, crap yeah that won't work since we don't free the ring.
>>>
>>> The key point is that we need to distinguish between the ring not working temporarily because we are in a GPU reset, and it not working at all because we are missing firmware or something like that.
>>>
>>> And no, checking the gpu_reset flag is totally racy and can't be done either. How about checking accel_working instead?
>>>
>>> Christian.
>>>
>>> Am 01.03.2018 um 07:01 schrieb Liu, Monk:
>>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>> I don't understand how fill_buffer() could run in a case where ring->ring_obj is not even allocated ... where does such a case exist?
>>>>
>>>>
>>>> /Monk
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian
>>>> Sent: 2018年2月28日 20:46
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>>> ready for fill_buffer
>>>>
>>>> Good point, but in this case we need some other handling.
>>>>
>>>> Please change the test to use ring->ring_obj instead, this way we still bail out if somebody tries to submit commands before the ring is even allocated.
>>>>
>>>> And you also need to fix a couple of other places in amdgpu_ttm.c.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> Am 28.02.2018 um 13:34 schrieb Liu, Monk:
>>>>> Because when the SDMA is hung by, say, process A, another
>>>>> process B may already be running inside fill_buffer(). So just let process B continue; don't block it, otherwise process B would fail for a purely software reason.
>>>>>
>>>>> Let it run; if process B's job finally fails, GPU recovery
>>>>> will resubmit it (since it is a kernel job)
>>>>>
>>>>> Without this, other processes would be greatly harmed by the one
>>>>> black sheep that triggered GPU recovery
>>>>>
>>>>> /Monk
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>>> Sent: 2018年2月28日 20:24
>>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>>> Subject: Re: [PATCH 3/4] drm/amdgpu: don't return when ring not 
>>>>> ready for fill_buffer
>>>>>
>>>>> Am 28.02.2018 um 08:21 schrieb Monk Liu:
>>>>>> because at this point the SDMA may be under GPU reset, so ring->ready may
>>>>>> not be true; keep going and the GPU scheduler will reschedule this job if
>>>>>> it fails.
>>>>>>
>>>>>> give a warning in copy_buffer when going through direct_submit while
>>>>>> ring->ready is false
>>>>> NAK, that test has already saved us quite a bunch of trouble with the fb layer.
>>>>>
>>>>> Why exactly are you running into issues with that?
>>>>>
>>>>> Christian.
>>>>>
>>>>>> Change-Id: Ife6cd55e0e843d99900e5bed5418499e88633685
>>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>>> ---
>>>>>>        drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c | 6 +-----
>>>>>>        1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> index e38e6db..7b75ac9 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ttm.c
>>>>>> @@ -1656,6 +1656,7 @@ int amdgpu_copy_buffer(struct amdgpu_ring *ring, uint64_t src_offset,
>>>>>>        	amdgpu_ring_pad_ib(ring, &job->ibs[0]);
>>>>>>        	WARN_ON(job->ibs[0].length_dw > num_dw);
>>>>>>        	if (direct_submit) {
>>>>>> +		WARN_ON(!ring->ready);
>>>>>>        		r = amdgpu_ib_schedule(ring, job->num_ibs, job->ibs,
>>>>>>        				       NULL, fence);
>>>>>>        		job->fence = dma_fence_get(*fence); @@ -1692,11 +1693,6 
>>>>>> @@ int amdgpu_fill_buffer(struct amdgpu_bo *bo,
>>>>>>        	struct amdgpu_job *job;
>>>>>>        	int r;
>>>>>>        
>>>>>> -	if (!ring->ready) {
>>>>>> -		DRM_ERROR("Trying to clear memory with ring turned off.\n");
>>>>>> -		return -EINVAL;
>>>>>> -	}
>>>>>> -
>>>>>>        	if (bo->tbo.mem.mem_type == TTM_PL_TT) {
>>>>>>        		r = amdgpu_ttm_alloc_gart(&bo->tbo);
>>>>>>        		if (r)

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2018-03-01 10:11 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-02-28  7:21 [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover Monk Liu
     [not found] ` <1519802463-9090-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-28  7:21   ` [PATCH 2/4] drm/amdgpu: cleanups for vram lost handling Monk Liu
     [not found]     ` <1519802463-9090-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-28 12:22       ` Christian König
2018-02-28 13:29       ` Andrey Grodzovsky
     [not found]         ` <13337cd9-78e9-df36-f2ab-749cf182177b-5C7GfCeVMHo@public.gmane.org>
2018-02-28 13:36           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449D9513F4798028569ED0884C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-28 13:41               ` Andrey Grodzovsky
2018-02-28 14:25       ` Alex Deucher
2018-02-28  7:21   ` [PATCH 3/4] drm/amdgpu: don't return when ring not ready for fill_buffer Monk Liu
     [not found]     ` <1519802463-9090-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-28 12:23       ` Christian König
     [not found]         ` <01879d0d-edea-680e-c9f2-1005d94f1dfd-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-28 12:34           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449BEDBC1FCE72C0021C88184C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-28 12:40               ` Liu, Monk
2018-02-28 12:45               ` Christian König
     [not found]                 ` <3a8945f9-848e-dd19-373d-5dddc69f76cb-5C7GfCeVMHo@public.gmane.org>
2018-03-01  6:01                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB04498FBA36C747652758D56E84C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  8:16                       ` Christian König
     [not found]                         ` <7af9b6fb-28e8-4e25-5d4a-5b566a00cbea-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-01  8:23                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB04490BF8390F6270B265969984C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  8:41                               ` Christian König
     [not found]                                 ` <6dedfc7f-cf69-676f-463d-be52cda1b1bb-5C7GfCeVMHo@public.gmane.org>
2018-03-01  8:51                                   ` Liu, Monk
     [not found]                                     ` <BLUPR12MB04491577FB4B6D6A5BCC3B8784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  9:25                                       ` Liu, Monk
     [not found]                                         ` <BLUPR12MB044933DC340A46EADC1F7A2784C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  9:30                                           ` Christian König
2018-03-01  9:37                                       ` Liu, Monk
     [not found]                                         ` <BLUPR12MB044905DB87968358166BDEF484C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  9:39                                           ` Christian König
     [not found]                                             ` <a6812ba2-1c34-c6a4-d65a-09f924ea0940-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-03-01  9:41                                               ` Liu, Monk
     [not found]                                                 ` <BLUPR12MB044923819BB9561C57EACA8584C60-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-03-01  9:51                                                   ` Christian König
     [not found]                                                     ` <26eb2e4c-33dd-2160-5b83-a7ff9e3a2558-5C7GfCeVMHo@public.gmane.org>
2018-03-01 10:11                                                       ` Liu, Monk
2018-02-28  7:21   ` [PATCH 4/4] drm/amdgpu: try again kiq access if not in IRQ Monk Liu
     [not found]     ` <1519802463-9090-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2018-02-28  7:41       ` Liu, Monk
2018-02-28 12:20   ` [PATCH 1/4] drm/amdgpu: stop all rings before doing gpu recover Christian König
     [not found]     ` <9e575c74-c6ce-76ef-a09c-1dec5a4807a3-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2018-02-28 13:30       ` Andrey Grodzovsky
     [not found]         ` <e4756f8b-d8f9-9849-aad4-a23193e367f6-5C7GfCeVMHo@public.gmane.org>
2018-02-28 13:31           ` Liu, Monk
     [not found]             ` <BLUPR12MB04491C15749E9DCAB3DEF1A684C70-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2018-02-28 13:42               ` Andrey Grodzovsky
2018-02-28 16:40               ` Andrey Grodzovsky
     [not found]                 ` <a54b5a4f-a370-87ec-7bac-33e6036107f9-5C7GfCeVMHo@public.gmane.org>
2018-02-28 17:42                   ` Andrey Grodzovsky
