[PATCH 1/4] drm/amdgpu: clear UVD VCPU buffer when err_event

* [PATCH 1/4] drm/amdgpu: clear UVD VCPU buffer when err_event_athub generated
@ 2019-10-28 11:31 ` Le Ma
  0 siblings, 0 replies; 18+ messages in thread
From: Le Ma @ 2019-10-28 11:31 UTC (permalink / raw)
  To: amd-gfx; +Cc: Le Ma

The err_event_athub error corrupts the UVD VCPU buffer and causes UVD resume to hang.

Change-Id: If17a2161fb9b1b52eac08de00d2e935191bdbf99
Signed-off-by: Le Ma <le.ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index b2c364b..b4dd89a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -39,6 +39,8 @@
 #include "cikd.h"
 #include "uvd/uvd_4_2_d.h"
 
+#include "amdgpu_ras.h"
+
 /* 1 second timeout */
 #define UVD_IDLE_TIMEOUT	msecs_to_jiffies(1000)
 
@@ -372,7 +374,13 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
 		if (!adev->uvd.inst[j].saved_bo)
 			return -ENOMEM;
 
-		memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
+		/* re-write 0 since err_event_athub will corrupt VCPU buffer */
+		if (amdgpu_ras_intr_triggered()) {
+			DRM_WARN("UVD VCPU state may be lost due to RAS ERREVENT_ATHUB_INTERRUPT\n");
+			memset(adev->uvd.inst[j].saved_bo, 0, size);
+		} else {
+			memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
+		}
 	}
 	return 0;
 }
-- 
2.7.4
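
A minimal userspace sketch of the save-or-zero decision this patch adds to amdgpu_uvd_suspend(), assuming a single global flag stands in for the RAS interrupt state. The names ras_in_intr, ras_intr_triggered and save_vcpu_state are hypothetical stand-ins, and plain memcpy() replaces the kernel's memcpy_fromio():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the driver's global RAS interrupt flag. */
static atomic_int ras_in_intr;

/* Hypothetical stand-in for amdgpu_ras_intr_triggered(). */
static bool ras_intr_triggered(void)
{
	return atomic_load(&ras_in_intr) != 0;
}

/*
 * Save `size` bytes of engine state into `saved`. If a fatal RAS event has
 * been flagged, the source is assumed to be corrupt, so the snapshot is
 * zeroed instead of copied. Returns true when real state was saved.
 */
static bool save_vcpu_state(void *saved, const void *vcpu, size_t size)
{
	assert(saved != NULL && vcpu != NULL);

	if (ras_intr_triggered()) {
		memset(saved, 0, size);
		return false;
	}
	memcpy(saved, vcpu, size);	/* the kernel patch uses memcpy_fromio() */
	return true;
}
```

Returning a bool is a choice made for this sketch only; the patch itself reports the lost state through DRM_WARN() and still returns 0 from the suspend path.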

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu recovery succeeded
@ 2019-10-28 11:31     ` Le Ma
  0 siblings, 0 replies; 18+ messages in thread
From: Le Ma @ 2019-10-28 11:31 UTC (permalink / raw)
  To: amd-gfx; +Cc: Le Ma

Otherwise the next err_event_athub error cannot trigger a GPU reset.

Change-Id: I5cd293f30f23876bf2a1860681bcb50f47713ecd
Signed-off-by: Le Ma <le.ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 676cad1..51d74bb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4089,6 +4089,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		}
 	}
 
+	if (!r && in_ras_intr)
+		atomic_set(&amdgpu_ras_in_intr, 0);
+
 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 		/*unlock kfd: SRIOV would do it separately */
-- 
2.7.4
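
Why a stale flag blocks the next reset can be sketched in userspace C, assuming the interrupt path starts a reset only when it is the one that flips the flag from 0 to 1. That gating, and the names on_fatal_error_event, finish_recovery and resets_started, are assumptions made for illustration and are not taken from the driver:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the global amdgpu_ras_in_intr flag. */
static atomic_int ras_in_intr;
static int resets_started;

/*
 * Called for every fatal error event. Only the caller that flips the flag
 * from 0 to 1 starts a reset; later events are dropped while it stays set.
 */
static bool on_fatal_error_event(void)
{
	int expected = 0;

	if (!atomic_compare_exchange_strong(&ras_in_intr, &expected, 1))
		return false;
	resets_started++;
	return true;
}

/* Mirrors the patch: clear the flag only if recovery succeeded (r == 0). */
static void finish_recovery(int r, bool in_ras_intr)
{
	if (!r && in_ras_intr)
		atomic_store(&ras_in_intr, 0);
}
```

In this model a second call to on_fatal_error_event() returns false until finish_recovery(0, true) has run, which is the situation the commit message describes.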

* [PATCH 3/4] drm/amdgpu: bypass some cleanup work after err_event_athub
@ 2019-10-28 11:31     ` Le Ma
  0 siblings, 0 replies; 18+ messages in thread
From: Le Ma @ 2019-10-28 11:31 UTC (permalink / raw)
  To: amd-gfx; +Cc: Le Ma

The PSP loses its connection when err_event_athub occurs, so this cleanup
work can be skipped during BACO reset.

Change-Id: If54a3735edd6ccbb58d40a5f8833392981f8ce37
Signed-off-by: Le Ma <le.ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c |  6 ++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    |  7 +++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 20 +++++++++++---------
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      |  6 ++++--
 4 files changed, 28 insertions(+), 11 deletions(-)
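
The suspend-path part of this change can be sketched as a loop that skips one block type while an error event is pending. This is a simplified userspace model: struct ip_block, enum block_type and suspend_blocks are hypothetical and carry only the fields the sketch needs, not the driver's real structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model of an IP block; not the driver's struct. */
enum block_type { BLOCK_GFX, BLOCK_PSP, BLOCK_UVD };

struct ip_block {
	enum block_type type;
	bool hw_active;		/* models status.hw */
	int suspend_calls;	/* counts calls to the suspend hook */
};

/*
 * Suspend every block. When a fatal RAS event is pending, the PSP is assumed
 * unreachable: it is marked inactive without calling its suspend hook.
 * Bookkeeping after a normal suspend is left out of this sketch.
 */
static void suspend_blocks(struct ip_block *blocks, size_t n, bool ras_intr)
{
	assert(blocks != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (ras_intr && blocks[i].type == BLOCK_PSP) {
			blocks[i].hw_active = false;
			continue;
		}
		blocks[i].suspend_calls++;	/* stands in for funcs->suspend() */
	}
}
```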

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 51d74bb..72d9892 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2274,6 +2274,12 @@ static int amdgpu_device_ip_suspend_phase2(struct amdgpu_device *adev)
 		/* displays are handled in phase1 */
 		if (adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_DCE)
 			continue;
+		/* The PSP loses its connection when err_event_athub occurs */
+		if (amdgpu_ras_intr_triggered() &&
+		    adev->ip_blocks[i].version->type == AMD_IP_BLOCK_TYPE_PSP) {
+			adev->ip_blocks[i].status.hw = false;
+			continue;
+		}
 		/* XXX handle errors */
 		r = adev->ip_blocks[i].version->funcs->suspend(adev);
 		/* XXX handle errors */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index fd7a73f..fce206f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -167,6 +167,13 @@ psp_cmd_submit_buf(struct psp_context *psp,
 	while (*((unsigned int *)psp->fence_buf) != index) {
 		if (--timeout == 0)
 			break;
+		/*
+		 * Don't wait for the timeout when err_event_athub occurs:
+		 * the GPU reset thread has been triggered, and the lock must
+		 * be released for the PSP resume sequence.
+		 */
+		if (amdgpu_ras_intr_triggered())
+			break;
 		msleep(1);
 		amdgpu_asic_invalidate_hdp(psp->adev, NULL);
 	}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 796326b..dab90c2 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -558,15 +558,17 @@ int amdgpu_ras_feature_enable(struct amdgpu_device *adev,
 	if (!(!!enable ^ !!amdgpu_ras_is_feature_enabled(adev, head)))
 		return 0;
 
-	ret = psp_ras_enable_features(&adev->psp, &info, enable);
-	if (ret) {
-		DRM_ERROR("RAS ERROR: %s %s feature failed ret %d\n",
-				enable ? "enable":"disable",
-				ras_block_str(head->block),
-				ret);
-		if (ret == TA_RAS_STATUS__RESET_NEEDED)
-			return -EAGAIN;
-		return -EINVAL;
+	if (!amdgpu_ras_intr_triggered()) {
+		ret = psp_ras_enable_features(&adev->psp, &info, enable);
+		if (ret) {
+			DRM_ERROR("RAS ERROR: %s %s feature failed ret %d\n",
+					enable ? "enable":"disable",
+					ras_block_str(head->block),
+					ret);
+			if (ret == TA_RAS_STATUS__RESET_NEEDED)
+				return -EAGAIN;
+			return -EINVAL;
+		}
 	}
 
 	/* setup the obj */
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 9fe95e7..9c2dba62 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -3736,8 +3736,10 @@ static int gfx_v9_0_hw_fini(void *handle)
 	amdgpu_irq_put(adev, &adev->gfx.priv_reg_irq, 0);
 	amdgpu_irq_put(adev, &adev->gfx.priv_inst_irq, 0);
 
-	/* disable KCQ to avoid CPC touch memory not valid anymore */
-	gfx_v9_0_kcq_disable(adev);
+	/* DF freeze and kcq disable will fail */
+	if (!amdgpu_ras_intr_triggered())
+		/* disable KCQ to avoid CPC touch memory not valid anymore */
+		gfx_v9_0_kcq_disable(adev);
 
 	if (amdgpu_sriov_vf(adev)) {
 		gfx_v9_0_cp_gfx_enable(adev, false);
-- 
2.7.4
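
The early exit added to the fence wait in psp_cmd_submit_buf() follows a common pattern: a bounded polling loop with an extra abort condition. A minimal userspace sketch, assuming an atomic flag models amdgpu_ras_intr_triggered() and leaving out the msleep(1) and HDP invalidate between polls; wait_for_fence and enum wait_result are hypothetical names:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

enum wait_result { WAIT_DONE, WAIT_TIMEOUT, WAIT_ABORTED };

/*
 * Poll `*fence` until it equals `index`. Give up after `timeout` polls, or
 * as soon as `*abort_flag` is set, so the caller can release its lock
 * instead of waiting out the full timeout.
 */
static enum wait_result wait_for_fence(const volatile unsigned int *fence,
				       unsigned int index, int timeout,
				       atomic_bool *abort_flag)
{
	assert(fence != NULL && abort_flag != NULL && timeout > 0);

	while (*fence != index) {
		if (--timeout == 0)
			return WAIT_TIMEOUT;
		if (atomic_load(abort_flag))
			return WAIT_ABORTED;
		/* the kernel loop sleeps 1 ms and invalidates HDP here */
	}
	return WAIT_DONE;
}
```

The patch simply breaks out of the loop; returning three distinct results is a choice made here so that a caller can tell a timeout from an abort.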

* [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler
@ 2019-10-28 11:31     ` Le Ma
  0 siblings, 0 replies; 18+ messages in thread
From: Le Ma @ 2019-10-28 11:31 UTC (permalink / raw)
  To: amd-gfx; +Cc: Le Ma

From: Le Ma <Le.Ma@amd.com>

Change-Id: Ia8a61a4b3bd529f0f691e43e69b299d7d151c0c2
Signed-off-by: Le Ma <Le.Ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index 0db458f..876690a 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -324,7 +324,11 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
 						RAS_CNTLR_INTERRUPT_CLEAR, 1);
 		WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
 
-		amdgpu_ras_global_ras_isr(adev);
+		/*
+		 * ras_controller_int is dedicated for nbif ras error,
+		 * not the global interrupt for sync flood
+		 */
+		amdgpu_ras_reset_gpu(adev, true);
 	}
 }
 
-- 
2.7.4

* RE: [PATCH 1/4] drm/amdgpu: clear UVD VCPU buffer when err_event_athub generated
@ 2019-10-28 11:53     ` Zhang, Hawking
  0 siblings, 0 replies; 18+ messages in thread
From: Zhang, Hawking @ 2019-10-28 11:53 UTC (permalink / raw)
  To: Ma, Le, amd-gfx; +Cc: Ma, Le

We should hold off on patch #2 and patch #4 until BACO-based RAS recovery works, since these two patches change the current RAS recovery policy.

Other than that, the series is
Reviewed-by: Hawking Zhang <Hawking.Zhang@amd.com>

Regards,
Hawking
-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Le Ma
Sent: October 28, 2019 19:31
To: amd-gfx@lists.freedesktop.org
Cc: Ma, Le <Le.Ma@amd.com>
Subject: [PATCH 1/4] drm/amdgpu: clear UVD VCPU buffer when err_event_athub generated

The err_event_athub error will mess up the buffer and cause UVD resume hang.

Change-Id: If17a2161fb9b1b52eac08de00d2e935191bdbf99
Signed-off-by: Le Ma <le.ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
index b2c364b..b4dd89a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_uvd.c
@@ -39,6 +39,8 @@
 #include "cikd.h"
 #include "uvd/uvd_4_2_d.h"
 
+#include "amdgpu_ras.h"
+
 /* 1 second timeout */
 #define UVD_IDLE_TIMEOUT	msecs_to_jiffies(1000)
 
@@ -372,7 +374,13 @@ int amdgpu_uvd_suspend(struct amdgpu_device *adev)
 		if (!adev->uvd.inst[j].saved_bo)
 			return -ENOMEM;
 
-		memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
+		/* re-write 0 since err_event_athub will corrupt VCPU buffer */
+		if (amdgpu_ras_intr_triggered()) {
+			DRM_WARN("UVD VCPU state may lost due to RAS ERREVENT_ATHUB_INTERRUPT\n");
+			memset(adev->uvd.inst[j].saved_bo, 0, size);
+		} else {
+			memcpy_fromio(adev->uvd.inst[j].saved_bo, ptr, size);
+		}
 	}
 	return 0;
 }
-- 
2.7.4

* RE: [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu recovery succeeded
@ 2019-10-29  1:27         ` Chen, Guchun
  0 siblings, 0 replies; 18+ messages in thread
From: Chen, Guchun @ 2019-10-29  1:27 UTC (permalink / raw)
  To: Ma, Le, amd-gfx; +Cc: Ma, Le



Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Le Ma
Sent: Monday, October 28, 2019 7:31 PM
To: amd-gfx@lists.freedesktop.org
Cc: Ma, Le <Le.Ma@amd.com>
Subject: [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu recovery succeeded

Otherwise next err_event_athub error cannot call gpu reset.

Change-Id: I5cd293f30f23876bf2a1860681bcb50f47713ecd
Signed-off-by: Le Ma <le.ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 676cad1..51d74bb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4089,6 +4089,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		}
 	}
 
+	if (!r && in_ras_intr)
+		atomic_set(&amdgpu_ras_in_intr, 0);
[Guchun] Rather than accessing this atomic variable directly, it may be better to first add a helper (a reset or clear function) in amdgpu_ras.h or .c and call it here, as we already do with amdgpu_ras_intr_triggered in this same function. That would help keep the RAS driver modular.

 skip_sched_resume:
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 		/*unlock kfd: SRIOV would do it separately */
-- 
2.7.4
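
The encapsulation suggested in the comment above might look like the following userspace sketch, in which callers never touch the flag directly. Only amdgpu_ras_intr_triggered is named in the thread; ras_intr_set, ras_intr_clear and finish_gpu_recover are hypothetical names, and C11 atomics stand in for the kernel's atomic_t helpers:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Kept private to the RAS code in this sketch; callers use the helpers. */
static atomic_int ras_in_intr;

/* Modelled on amdgpu_ras_intr_triggered(), which the series already calls. */
static inline bool ras_intr_triggered(void)
{
	return atomic_load(&ras_in_intr) != 0;
}

/* Hypothetical hook that marks an error event as pending. */
static inline void ras_intr_set(void)
{
	atomic_store(&ras_in_intr, 1);
}

/* Hypothetical counterpart proposed in the review: clear the flag. */
static inline void ras_intr_clear(void)
{
	atomic_store(&ras_in_intr, 0);
}

/* Recovery tail: direct access to the variable is no longer needed. */
static int finish_gpu_recover(int r, bool in_ras_intr)
{
	if (!r && in_ras_intr)
		ras_intr_clear();
	return r;
}
```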

* RE: [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler
@ 2019-10-29  1:36         ` Chen, Guchun
  0 siblings, 0 replies; 18+ messages in thread
From: Chen, Guchun @ 2019-10-29  1:36 UTC (permalink / raw)
  To: Ma, Le, amd-gfx; +Cc: Ma, Le




Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Le Ma
Sent: Monday, October 28, 2019 7:31 PM
To: amd-gfx@lists.freedesktop.org
Cc: Ma, Le <Le.Ma@amd.com>
Subject: [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler

From: Le Ma <Le.Ma@amd.com>

Change-Id: Ia8a61a4b3bd529f0f691e43e69b299d7d151c0c2
Signed-off-by: Le Ma <Le.Ma@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index 0db458f..876690a 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -324,7 +324,11 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
 						RAS_CNTLR_INTERRUPT_CLEAR, 1);
 		WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
 
-		amdgpu_ras_global_ras_isr(adev);
+		/*
+		 * ras_controller_int is dedicated for nbif ras error,
+		 * not the global interrupt for sync flood
+		 */
+		amdgpu_ras_reset_gpu(adev, true);
[Guchun] We should add a log message here stating who is resetting the GPU and why. Also, the removed global RAS ISR handler, amdgpu_ras_global_ras_isr, called amdgpu_ras_reset_gpu with the is_baco parameter set to "false"; why is "true" used here?
 	}
 }
 
-- 
2.7.4
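
One way to read the logging request above is a single message that names the interrupt source, the reason, and the reset method. A hedged userspace sketch follows: format_reset_notice, its parameters and the message wording are all hypothetical, and a kernel version would log through DRM_WARN() or a similar macro rather than snprintf():

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Format the notice a handler could log before requesting a reset.
 * Returns the snprintf() result: the length the full message needs.
 */
static int format_reset_notice(char *buf, size_t len, const char *source,
			       const char *reason, bool is_baco)
{
	assert(buf != NULL && source != NULL && reason != NULL);

	return snprintf(buf, len, "%s: resetting GPU (%s), method: %s",
			source, reason, is_baco ? "BACO" : "non-BACO");
}
```

Putting the is_baco choice into the message would also make the difference from the old global handler visible in the log.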

* RE: [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu recovery succeeded
@ 2019-10-29  7:27             ` Ma, Le
  0 siblings, 0 replies; 18+ messages in thread
From: Ma, Le @ 2019-10-29  7:27 UTC (permalink / raw)
  To: Chen, Guchun, amd-gfx


> -----Original Message-----

> From: Chen, Guchun <Guchun.Chen@amd.com>

> Sent: Tuesday, October 29, 2019 9:28 AM

> To: Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org

> Cc: Ma, Le <Le.Ma@amd.com>

> Subject: RE: [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu

> recovery succeeded

>

>

>

> Regards,

> Guchun

>

> -----Original Message-----

> From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org<mailto:amd-gfx-bounces@lists.freedesktop.org>> On Behalf Of Le Ma

> Sent: Monday, October 28, 2019 7:31 PM

> To: amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

> Cc: Ma, Le <Le.Ma@amd.com<mailto:Le.Ma@amd.com>>

> Subject: [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu recovery

> succeeded

>

> Otherwise next err_event_athub error cannot call gpu reset.

>

> Change-Id: I5cd293f30f23876bf2a1860681bcb50f47713ecd

> Signed-off-by: Le Ma <le.ma@amd.com<mailto:le.ma@amd.com>>

> ---

>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 3 +++

>  1 file changed, 3 insertions(+)

>

> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> index 676cad1..51d74bb 100644

> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c

> @@ -4089,6 +4089,9 @@ int amdgpu_device_gpu_recover(struct

> amdgpu_device *adev,

>                      }

>          }

>

> +      if (!r && in_ras_intr)

> +                  atomic_set(&amdgpu_ras_in_intr, 0);

> [Guchun]To access this atomic variable, maybe it's better we create a new

> function like reset or clear in amdgpu_ras.h or .c first, then we can call that

> function here, like we we do to amdgpu_ras_intr_triggered in this same

> function. This will do assist to modularity of ras driver.

> [Le] Agree with you. We could make it paired with amdgpu_ras_intr_triggered.



>  skip_sched_resume:

>          list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {

>                      /*unlock kfd: SRIOV would do it separately */

> --

> 2.7.4

>

> _______________________________________________

> amd-gfx mailing list

> amd-gfx@lists.freedesktop.org<mailto:amd-gfx@lists.freedesktop.org>

> https://lists.freedesktop.org/mailman/listinfo/amd-gfx
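
The helper Guchun suggests and Le agrees to could look roughly like the sketch below. This is a userspace model only, not driver code: C11 atomics stand in for the kernel's atomic_t, the name amdgpu_ras_intr_cleared is a hypothetical choice for the paired helper, and finish_recovery is a hypothetical stand-in for the tail of amdgpu_device_gpu_recover.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the driver's global RAS interrupt flag. */
static atomic_int amdgpu_ras_in_intr;

/* Query helper, modelled on the one referenced in the review. */
static inline bool amdgpu_ras_intr_triggered(void)
{
	return atomic_load(&amdgpu_ras_in_intr) != 0;
}

/* Hypothetical paired helper: clear the flag in one named place. */
static inline void amdgpu_ras_intr_cleared(void)
{
	atomic_store(&amdgpu_ras_in_intr, 0);
}

/*
 * Hypothetical stand-in for the end of the recovery path: r is the
 * recovery result (0 on success) and in_ras_intr records whether the
 * recovery was started by err_event_athub.
 */
static int finish_recovery(int r, bool in_ras_intr)
{
	if (!r && in_ras_intr)
		amdgpu_ras_intr_cleared();
	return r;
}
```

With this shape the recovery code no longer touches the atomic variable directly, and the flag stays set when recovery fails, so the failure remains visible to later checks.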


^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler
@ 2019-10-29  7:37             ` Ma, Le
  0 siblings, 0 replies; 18+ messages in thread
From: Ma, Le @ 2019-10-29  7:37 UTC (permalink / raw)
  To: Chen, Guchun, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW


[-- Attachment #1.1: Type: text/plain, Size: 2690 bytes --]

-----Original Message-----
From: Chen, Guchun <Guchun.Chen@amd.com>
Sent: Tuesday, October 29, 2019 9:37 AM
To: Ma, Le <Le.Ma@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Ma, Le <Le.Ma@amd.com>
Subject: RE: [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler

Regards,
Guchun

-----Original Message-----
From: amd-gfx <amd-gfx-bounces@lists.freedesktop.org> On Behalf Of Le Ma
Sent: Monday, October 28, 2019 7:31 PM
To: amd-gfx@lists.freedesktop.org
Cc: Ma, Le <Le.Ma@amd.com>
Subject: [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler

From: Le Ma <Le.Ma@amd.com>

Change-Id: Ia8a61a4b3bd529f0f691e43e69b299d7d151c0c2
Signed-off-by: Le Ma <Le.Ma@amd.com>
---
drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index 0db458f..876690a 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -324,7 +324,11 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
                                                                       RAS_CNTLR_INTERRUPT_CLEAR, 1);
                       WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
-                       amdgpu_ras_global_ras_isr(adev);
+                      /*
+                      * ras_controller_int is dedicated for nbif ras error,
+                      * not the global interrupt for sync flood
+                      */
+                      amdgpu_ras_reset_gpu(adev, true);

[Guchun] We need to add a print here to tell the reader who resets the gpu and why. Moreover, the removed global ras isr handler, amdgpu_ras_global_ras_isr, calls amdgpu_ras_reset_gpu with the is_baco parameter set to "false", but here we pass "true"?

[Le] We may consider adding a print here to indicate that this is a ras controller interrupt issue. The is_baco parameter is unused and has no effect. Anyway, I will revise patches #2 and #4 and hold them until baco based RAS recovery fully works, as Hawking commented.

           }
}
--
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
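
The two review points above (log who requests the reset and why, and the is_baco argument having no effect) can be illustrated with a small userspace sketch. Everything here is a hypothetical stand-in: struct fake_device, fake_ras_reset_gpu and fake_handle_ras_controller_intr only mimic the shape of the driver functions named in the patch, and fprintf replaces whatever kernel logging call the real handler would use.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for struct amdgpu_device. */
struct fake_device {
	int reset_requests;
};

/*
 * Stand-in for amdgpu_ras_reset_gpu(). The is_baco argument is accepted
 * but ignored, matching Le's remark that it currently has no effect.
 */
static int fake_ras_reset_gpu(struct fake_device *adev, bool is_baco)
{
	(void)is_baco;
	adev->reset_requests++;
	return 0;
}

/*
 * Stand-in for the ras_controller_int handler: say who asks for the
 * reset and why, then request it.
 */
static void fake_handle_ras_controller_intr(struct fake_device *adev)
{
	fprintf(stderr,
		"RAS controller interrupt (nbif ras error): requesting gpu reset\n");
	fake_ras_reset_gpu(adev, false);
}
```

Because the flag is ignored in this model, passing "true" or "false" leads to the same reset request, which is the behaviour Le describes for the code under review.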


^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2019-10-29  7:37 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-10-28 11:31 [PATCH 1/4] drm/amdgpu: clear UVD VCPU buffer when err_event_athub generated Le Ma
2019-10-28 11:31 ` Le Ma
     [not found] ` <1572262269-14985-1-git-send-email-le.ma-5C7GfCeVMHo@public.gmane.org>
2019-10-28 11:31   ` [PATCH 2/4] drm/amdgpu: reset err_event_athub flag if gpu recovery succeeded Le Ma
2019-10-28 11:31     ` Le Ma
     [not found]     ` <1572262269-14985-2-git-send-email-le.ma-5C7GfCeVMHo@public.gmane.org>
2019-10-29  1:27       ` Chen, Guchun
2019-10-29  1:27         ` Chen, Guchun
     [not found]         ` <BYAPR12MB280615A3803ADC31A4AE9C8EF1610-ZGDeBxoHBPk0CuAkIMgl3QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-10-29  7:27           ` Ma, Le
2019-10-29  7:27             ` Ma, Le
2019-10-28 11:31   ` [PATCH 3/4] drm/amdgpu: bypass some cleanup work after err_event_athub Le Ma
2019-10-28 11:31     ` Le Ma
2019-10-28 11:31   ` [PATCH 4/4] drm/amdgpu: remove ras global recovery handling from ras_controller_int handler Le Ma
2019-10-28 11:31     ` Le Ma
     [not found]     ` <1572262269-14985-4-git-send-email-le.ma-5C7GfCeVMHo@public.gmane.org>
2019-10-29  1:36       ` Chen, Guchun
2019-10-29  1:36         ` Chen, Guchun
     [not found]         ` <BYAPR12MB2806A8C355785EFB07D88E2EF1610-ZGDeBxoHBPk0CuAkIMgl3QdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-10-29  7:37           ` Ma, Le
2019-10-29  7:37             ` Ma, Le
2019-10-28 11:53   ` [PATCH 1/4] drm/amdgpu: clear UVD VCPU buffer when err_event_athub generated Zhang, Hawking
2019-10-28 11:53     ` Zhang, Hawking

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.