* [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.
@ 2019-08-30 16:39 Andrey Grodzovsky
       [not found] ` <1567183153-11014-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2019-08-30 16:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Andrey Grodzovsky, ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w,
	Felix.Kuehling-5C7GfCeVMHo, Tao.Zhou1-5C7GfCeVMHo,
	alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Hawking.Zhang-5C7GfCeVMHo

Issue 1:
In the XGMI case amdgpu_device_lock_adev for the other devices in the hive
was called too late, after their respective schedulers were already accessed.
So relocate the lock to the beginning of accessing the other devs.

Issue 2:
Using amdgpu_device_ip_need_full_reset to switch the device list from
all devices in the hive to the single 'master' device that owns this reset
call is wrong because when stopping schedulers we iterate all the devices
in the hive but when restarting we would only reactivate the 'master' device.
Also, in case amdgpu_device_pre_asic_reset concludes that a full reset IS
needed, we then have to stop schedulers for all devices in the hive and not
only the 'master', but with amdgpu_device_ip_need_full_reset we have
already missed the opportunity to do so. So just remove this logic and
always stop and start all schedulers for all devices in the hive.

Also minor cleanup and print fix.
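
For reference, the per-device handling in amdgpu_device_gpu_recover() after this
patch, condensed to a sketch (illustration only, not a literal copy of the code):

	/* Sketch: order matters - lock each device before touching its state. */
	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
		/* take the per-device reset lock first */
		if (tmp_adev != adev)
			amdgpu_device_lock_adev(tmp_adev, false);

		/* mark the ASIC as untracked; it is re-registered after reset */
		amdgpu_unregister_gpu_instance(tmp_adev);

		/* only now stop all of this device's schedulers */
		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
			struct amdgpu_ring *ring = tmp_adev->rings[i];

			if (!ring || !ring->sched.thread)
				continue;

			drm_sched_stop(&ring->sched, job ? &job->base : NULL);
		}
	}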

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a5daccc..19f6624 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3814,15 +3814,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		device_list_handle = &device_list;
 	}
 
-	/*
-	 * Mark these ASICs to be reseted as untracked first
-	 * And add them back after reset completed
-	 */
-	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
-		amdgpu_unregister_gpu_instance(tmp_adev);
-
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
+		if (tmp_adev != adev)
+			amdgpu_device_lock_adev(tmp_adev, false);
+		/*
+		 * Mark these ASICs to be reseted as untracked first
+		 * And add them back after reset completed
+		 */
+		amdgpu_unregister_gpu_instance(tmp_adev);
+
 		/* disable ras on ALL IPs */
 		if (amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
@@ -3848,9 +3849,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	    dma_fence_is_signaled(job->base.s_fence->parent))
 		job_signaled = true;
 
-	if (!amdgpu_device_ip_need_full_reset(adev))
-		device_list_handle = &device_list;
-
 	if (job_signaled) {
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
 		goto skip_hw_reset;
@@ -3869,10 +3867,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 retry:	/* Rest of adevs pre asic reset from XGMI hive. */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
 
-		if (tmp_adev == adev)
+		if(tmp_adev == adev)
 			continue;
 
-		amdgpu_device_lock_adev(tmp_adev, false);
 		r = amdgpu_device_pre_asic_reset(tmp_adev,
 						 NULL,
 						 &need_full_reset);
@@ -3921,10 +3918,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 		if (r) {
 			/* bad news, how to tell it to userspace ? */
-			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&adev->gpu_reset_counter));
+			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
 			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
 		} else {
-			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&adev->gpu_reset_counter));
+			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
 		}
 
 		amdgpu_device_unlock_adev(tmp_adev);
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 2/3] dmr/amdgpu: Avoid HW GPU reset for RAS.
       [not found] ` <1567183153-11014-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2019-08-30 16:39   ` Andrey Grodzovsky
       [not found]     ` <1567183153-11014-2-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-08-30 16:39   ` [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS Andrey Grodzovsky
  2019-08-30 20:08   ` [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case Kuehling, Felix
  2 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2019-08-30 16:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Andrey Grodzovsky, ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w,
	Felix.Kuehling-5C7GfCeVMHo, Tao.Zhou1-5C7GfCeVMHo,
	alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Hawking.Zhang-5C7GfCeVMHo

Problem:
Under certain conditions, when some IP blocks take a RAS error,
we can get into a situation where a GPU reset is not possible
due to issues in RAS in SMU/PSP.

Temporary fix until a proper solution in PSP/SMU is ready:
When an uncorrectable error happens the DF will unconditionally
broadcast error event packets to all its clients/slaves upon
receiving the fatal error event and freeze all its outbound queues,
and the err_event_athub interrupt will be triggered.
In such a case we use this interrupt
to issue a GPU reset. The GPU reset code is modified for this case to avoid a HW
reset: it only stops the schedulers, detaches all in-progress and not yet scheduled
jobs' fences, sets an error code on them and signals them.
Also reject any new incoming job submissions from user space.
All this is done to notify the applications of the problem.
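
Condensed, the new flow looks roughly like the sketch below (illustration only;
the real changes are spread across the hunks that follow):

	/* 1. The NBIO RAS interrupt handlers latch a global flag. */
	void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
	{
		if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0)
			DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
	}

	/* 2. amdgpu_device_gpu_recover() checks the flag: schedulers are stopped
	 * as usual, every pending job's fence is failed with -EHWPOISON, and the
	 * HW reset plus scheduler-resume steps are skipped. */
	if (in_ras_intr)
		amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
	/* ... later ... */
	if (in_ras_intr)
		goto skip_sched_resume;

	/* 3. New submissions from user space are rejected while the flag is set. */
	if (amdgpu_ras_intr_triggered())
		return -EHWPOISON;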

v2:
Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
Remove print param from amdgpu_ras_query_error_count

v3:
Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
for other XGMI hive members.

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  4 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++--------
 drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  5 ++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 38 ++++++++++++++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  3 +++
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  6 +++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 22 +++++++++++++++--
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    | 10 ++++++++
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++----
 drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 24 ++++++++++---------
 drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c     |  5 ++++
 drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c     | 32 +++++++++++++------------
 12 files changed, 155 insertions(+), 42 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
index d860170..494c384 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
@@ -38,6 +38,7 @@
 #include "amdgpu_gmc.h"
 #include "amdgpu_gem.h"
 #include "amdgpu_display.h"
+#include "amdgpu_ras.h"
 
 static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
 				      struct drm_amdgpu_cs_chunk_fence *data,
@@ -1438,6 +1439,9 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
 	bool reserved_buffers = false;
 	int i, r;
 
+	if (amdgpu_ras_intr_triggered())
+		return -EHWPOISON;
+
 	if (!adev->accel_working)
 		return -EBUSY;
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 19f6624..c9825ae 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3727,25 +3727,18 @@ static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, bool trylock)
 		adev->mp1_state = PP_MP1_STATE_NONE;
 		break;
 	}
-	/* Block kfd: SRIOV would do it separately */
-	if (!amdgpu_sriov_vf(adev))
-                amdgpu_amdkfd_pre_reset(adev);
 
 	return true;
 }
 
 static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
 {
-	/*unlock kfd: SRIOV would do it separately */
-	if (!amdgpu_sriov_vf(adev))
-                amdgpu_amdkfd_post_reset(adev);
 	amdgpu_vf_error_trans_all(adev);
 	adev->mp1_state = PP_MP1_STATE_NONE;
 	adev->in_gpu_reset = 0;
 	mutex_unlock(&adev->lock_reset);
 }
 
-
 /**
  * amdgpu_device_gpu_recover - reset the asic and recover scheduler
  *
@@ -3765,11 +3758,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	struct amdgpu_hive_info *hive = NULL;
 	struct amdgpu_device *tmp_adev = NULL;
 	int i, r = 0;
+	bool in_ras_intr = amdgpu_ras_intr_triggered();
 
 	need_full_reset = job_signaled = false;
 	INIT_LIST_HEAD(&device_list);
 
-	dev_info(adev->dev, "GPU reset begin!\n");
+	dev_info(adev->dev, "GPU %s begin!\n", in_ras_intr ? "jobs stop":"reset");
 
 	cancel_delayed_work_sync(&adev->delayed_init_work);
 
@@ -3796,9 +3790,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		return 0;
 	}
 
+	/* Block kfd: SRIOV would do it separately */
+	if (!amdgpu_sriov_vf(adev))
+                amdgpu_amdkfd_pre_reset(adev);
+
 	/* Build list of devices to reset */
 	if  (adev->gmc.xgmi.num_physical_nodes > 1) {
 		if (!hive) {
+			/*unlock kfd: SRIOV would do it separately */
+			if (!amdgpu_sriov_vf(adev))
+		                amdgpu_amdkfd_post_reset(adev);
 			amdgpu_device_unlock_adev(adev);
 			return -ENODEV;
 		}
@@ -3816,8 +3817,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	/* block all schedulers and reset given job's ring */
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
-		if (tmp_adev != adev)
+		if (tmp_adev != adev) {
 			amdgpu_device_lock_adev(tmp_adev, false);
+			if (!amdgpu_sriov_vf(tmp_adev))
+			                amdgpu_amdkfd_pre_reset(tmp_adev);
+		}
+
 		/*
 		 * Mark these ASICs to be reseted as untracked first
 		 * And add them back after reset completed
@@ -3825,7 +3830,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		amdgpu_unregister_gpu_instance(tmp_adev);
 
 		/* disable ras on ALL IPs */
-		if (amdgpu_device_ip_need_full_reset(tmp_adev))
+		if (!in_ras_intr && amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
 
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
@@ -3835,10 +3840,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 				continue;
 
 			drm_sched_stop(&ring->sched, job ? &job->base : NULL);
+
+			if (in_ras_intr)
+				amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
 		}
 	}
 
 
+	if (in_ras_intr)
+		goto skip_sched_resume;
+
 	/*
 	 * Must check guilty signal here since after this point all old
 	 * HW fences are force signaled.
@@ -3897,6 +3908,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 
 	/* Post ASIC reset for all devs .*/
 	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
+
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
@@ -3923,7 +3935,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		} else {
 			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
 		}
+	}
 
+skip_sched_resume:
+	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
+		/*unlock kfd: SRIOV would do it separately */
+		if (!in_ras_intr && !amdgpu_sriov_vf(tmp_adev))
+	                amdgpu_amdkfd_post_reset(tmp_adev);
 		amdgpu_device_unlock_adev(tmp_adev);
 	}
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
index 7679fe8..c73d26a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
@@ -40,6 +40,8 @@
 
 #include "amdgpu_amdkfd.h"
 
+#include "amdgpu_ras.h"
+
 /*
  * KMS wrapper.
  * - 3.0.0 - initial driver
@@ -1180,6 +1182,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
 	struct drm_device *dev = pci_get_drvdata(pdev);
 	struct amdgpu_device *adev = dev->dev_private;
 
+	if (amdgpu_ras_intr_triggered())
+		return;
+
 	/* if we are running in a VM, make sure the device
 	 * torn down properly on reboot/shutdown.
 	 * unfortunately we can't detect certain
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index 4d67b77..b12981e 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -250,6 +250,44 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
 	return fence;
 }
 
+#define to_drm_sched_job(sched_job)		\
+		container_of((sched_job), struct drm_sched_job, queue_node)
+
+void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
+{
+	struct drm_sched_job *s_job;
+	struct drm_sched_entity *s_entity = NULL;
+	int i;
+
+	/* Signal all jobs not yet scheduled */
+	for (i = DRM_SCHED_PRIORITY_MAX - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
+		struct drm_sched_rq *rq = &sched->sched_rq[i];
+
+		if (!rq)
+			continue;
+
+		spin_lock(&rq->lock);
+		list_for_each_entry(s_entity, &rq->entities, list) {
+			while ((s_job = to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
+				struct drm_sched_fence *s_fence = s_job->s_fence;
+
+				dma_fence_signal(&s_fence->scheduled);
+				dma_fence_set_error(&s_fence->finished, -EHWPOISON);
+				dma_fence_signal(&s_fence->finished);
+			}
+		}
+		spin_unlock(&rq->lock);
+	}
+
+	/* Signal all jobs already scheduled to HW */
+	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
+		struct drm_sched_fence *s_fence = s_job->s_fence;
+
+		dma_fence_set_error(&s_fence->finished, -EHWPOISON);
+		dma_fence_signal(&s_fence->finished);
+	}
+}
+
 const struct drm_sched_backend_ops amdgpu_sched_ops = {
 	.dependency = amdgpu_job_dependency,
 	.run_job = amdgpu_job_run,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
index 51e6250..dc7ee93 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
@@ -76,4 +76,7 @@ int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
 		      void *owner, struct dma_fence **f);
 int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
 			     struct dma_fence **fence);
+
+void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
+
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 35a0866..535f690 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -1046,6 +1046,12 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
 	/* Ensure IB tests are run on ring */
 	flush_delayed_work(&adev->delayed_init_work);
 
+
+	if (amdgpu_ras_intr_triggered()) {
+		DRM_ERROR("RAS Intr triggered, device disabled!!");
+		return -EHWPOISON;
+	}
+
 	file_priv->driver_priv = NULL;
 
 	r = pm_runtime_get_sync(dev->dev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index aa51c00..1cc34de 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -24,6 +24,8 @@
 #include <linux/debugfs.h>
 #include <linux/list.h>
 #include <linux/module.h>
+#include <linux/reboot.h>
+#include <linux/syscalls.h>
 #include "amdgpu.h"
 #include "amdgpu_ras.h"
 #include "amdgpu_atomfirmware.h"
@@ -64,6 +66,9 @@ const char *ras_block_string[] = {
 /* inject address is 52 bits */
 #define	RAS_UMC_INJECT_ADDR_LIMIT	(0x1ULL << 52)
 
+
+atomic_t amdgpu_ras_in_intr = ATOMIC_INIT(0);
+
 static int amdgpu_ras_reserve_vram(struct amdgpu_device *adev,
 		uint64_t offset, uint64_t size,
 		struct amdgpu_bo **bo_ptr);
@@ -188,6 +193,10 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 
 	return 0;
 }
+
+static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
+		struct ras_common_if *head);
+
 /**
  * DOC: AMDGPU RAS debugfs control interface
  *
@@ -627,12 +636,14 @@ int amdgpu_ras_error_query(struct amdgpu_device *adev,
 	info->ue_count = obj->err_data.ue_count;
 	info->ce_count = obj->err_data.ce_count;
 
-	if (err_data.ce_count)
+	if (err_data.ce_count) {
 		dev_info(adev->dev, "%ld correctable errors detected in %s block\n",
 			 obj->err_data.ce_count, ras_block_str(info->head.block));
-	if (err_data.ue_count)
+	}
+	if (err_data.ue_count) {
 		dev_info(adev->dev, "%ld uncorrectable errors detected in %s block\n",
 			 obj->err_data.ue_count, ras_block_str(info->head.block));
+	}
 
 	return 0;
 }
@@ -1718,3 +1729,10 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
 
 	return 0;
 }
+
+void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
+{
+	if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
+		DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
+	}
+}
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index fc4fb0f..3ec2a87 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -600,4 +600,14 @@ int amdgpu_ras_interrupt_remove_handler(struct amdgpu_device *adev,
 
 int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
 		struct ras_dispatch_if *info);
+
+extern atomic_t amdgpu_ras_in_intr;
+
+static inline bool amdgpu_ras_intr_triggered(void)
+{
+	return !!atomic_read(&amdgpu_ras_in_intr);
+}
+
+void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev);
+
 #endif
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 93e3e89..817997b 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -5676,10 +5676,12 @@ static int gfx_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
 		struct amdgpu_iv_entry *entry)
 {
 	/* TODO ue will trigger an interrupt. */
-	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
-	if (adev->gfx.funcs->query_ras_error_count)
-		adev->gfx.funcs->query_ras_error_count(adev, err_data);
-	amdgpu_ras_reset_gpu(adev, 0);
+	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
+		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+		if (adev->gfx.funcs->query_ras_error_count)
+			adev->gfx.funcs->query_ras_error_count(adev, err_data);
+		amdgpu_ras_reset_gpu(adev, 0);
+	}
 	return AMDGPU_RAS_SUCCESS;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
index 5eb17c7..2a6ac60 100644
--- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
@@ -243,18 +243,20 @@ static int gmc_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
 		struct ras_err_data *err_data,
 		struct amdgpu_iv_entry *entry)
 {
-	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
-	if (adev->umc.funcs->query_ras_error_count)
-		adev->umc.funcs->query_ras_error_count(adev, err_data);
-	/* umc query_ras_error_address is also responsible for clearing
-	 * error status
-	 */
-	if (adev->umc.funcs->query_ras_error_address)
-		adev->umc.funcs->query_ras_error_address(adev, err_data);
+	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
+		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+		if (adev->umc.funcs->query_ras_error_count)
+			adev->umc.funcs->query_ras_error_count(adev, err_data);
+		/* umc query_ras_error_address is also responsible for clearing
+		 * error status
+		 */
+		if (adev->umc.funcs->query_ras_error_address)
+			adev->umc.funcs->query_ras_error_address(adev, err_data);
 
-	/* only uncorrectable error needs gpu reset */
-	if (err_data->ue_count)
-		amdgpu_ras_reset_gpu(adev, 0);
+		/* only uncorrectable error needs gpu reset */
+		if (err_data->ue_count)
+			amdgpu_ras_reset_gpu(adev, 0);
+	}
 
 	return AMDGPU_RAS_SUCCESS;
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
index 367f9d6..545990c 100644
--- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
+++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
@@ -30,6 +30,7 @@
 #include "nbio/nbio_7_4_0_smn.h"
 #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
 #include <uapi/linux/kfd_ioctl.h>
+#include "amdgpu_ras.h"
 
 #define smnNBIF_MGCG_CTRL_LCLK	0x1013a21c
 
@@ -329,6 +330,8 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
 						BIF_DOORBELL_INT_CNTL,
 						RAS_CNTLR_INTERRUPT_CLEAR, 1);
 		WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
+
+		amdgpu_ras_global_ras_isr(adev);
 	}
 }
 
@@ -344,6 +347,8 @@ static void nbio_v7_4_handle_ras_err_event_athub_intr_no_bifring(struct amdgpu_d
 						BIF_DOORBELL_INT_CNTL,
 						RAS_ATHUB_ERR_EVENT_INTERRUPT_CLEAR, 1);
 		WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
+
+		amdgpu_ras_global_ras_isr(adev);
 	}
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
index b3ed533..b05428f 100644
--- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
@@ -1972,24 +1972,26 @@ static int sdma_v4_0_process_ras_data_cb(struct amdgpu_device *adev,
 	uint32_t err_source;
 	int instance;
 
-	instance = sdma_v4_0_irq_id_to_seq(entry->client_id);
-	if (instance < 0)
-		return 0;
+	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
+		instance = sdma_v4_0_irq_id_to_seq(entry->client_id);
+		if (instance < 0)
+			return 0;
 
-	switch (entry->src_id) {
-	case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
-		err_source = 0;
-		break;
-	case SDMA0_4_0__SRCID__SDMA_ECC:
-		err_source = 1;
-		break;
-	default:
-		return 0;
-	}
+		switch (entry->src_id) {
+		case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
+			err_source = 0;
+			break;
+		case SDMA0_4_0__SRCID__SDMA_ECC:
+			err_source = 1;
+			break;
+		default:
+			return 0;
+		}
 
-	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
+		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
 
-	amdgpu_ras_reset_gpu(adev, 0);
+		amdgpu_ras_reset_gpu(adev, 0);
+	}
 
 	return AMDGPU_RAS_SUCCESS;
 }
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS.
       [not found] ` <1567183153-11014-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-08-30 16:39   ` [PATCH v3 2/3] dmr/amdgpu: Avoid HW GPU reset for RAS Andrey Grodzovsky
@ 2019-08-30 16:39   ` Andrey Grodzovsky
       [not found]     ` <1567183153-11014-3-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-08-30 20:08   ` [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case Kuehling, Felix
  2 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2019-08-30 16:39 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: Andrey Grodzovsky, ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w,
	Felix.Kuehling-5C7GfCeVMHo, Tao.Zhou1-5C7GfCeVMHo,
	alexdeucher-Re5JQEeQqe8AvxtiuMwx3w, Hawking.Zhang-5C7GfCeVMHo

In case of a RAS error, allow the user to configure an automatic system
reboot through ras_ctrl.
This is also part of the temporary workaround for the RAS
hang problem.
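
With this, user space can opt in through the existing RAS debugfs control file,
e.g. something like the following (the exact debugfs path depends on the DRM
minor number and is shown here only as an example):

	echo "reboot umc" > /sys/kernel/debug/dri/0/ras/ras_ctrl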

Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 18 ++++++++++++++++++
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 +++++++++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c9825ae..e26f2e9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3760,6 +3760,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	int i, r = 0;
 	bool in_ras_intr = amdgpu_ras_intr_triggered();
 
+	/*
+	 * Flush RAM to disk so that after reboot
+	 * the user can read log and see why the system rebooted.
+	 *
+	 * Using user mode app call instead of kernel APIs such as
+	 * ksys_sync_helper for backward comparability with earlier
+	 * kernels into which this is also intended.
+	 */
+	if (in_ras_intr && amdgpu_ras_get_context(adev)->reboot) {
+		char *envp[] = { "HOME=/", NULL };
+		char *argv[] = { "/bin/sync", NULL };
+
+		DRM_WARN("Emergency reboot.");
+
+		call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
+		emergency_restart();
+	}
+
 	need_full_reset = job_signaled = false;
 	INIT_LIST_HEAD(&device_list);
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
index 1cc34de..bbcfb4f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
@@ -30,6 +30,7 @@
 #include "amdgpu_ras.h"
 #include "amdgpu_atomfirmware.h"
 #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
+#include <linux/kmod.h>
 
 const char *ras_error_string[] = {
 	"none",
@@ -154,6 +155,8 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
 		op = 1;
 	else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
 		op = 2;
+	else if (sscanf(str, "reboot %32s", block_name) == 1)
+		op = 3;
 	else if (str[0] && str[1] && str[2] && str[3])
 		/* ascii string, but commands are not matched. */
 		return -EINVAL;
@@ -287,6 +290,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
 		/* data.inject.address is offset instead of absolute gpu address */
 		ret = amdgpu_ras_error_inject(adev, &data.inject);
 		break;
+	case 3:
+		amdgpu_ras_get_context(adev)->reboot = true;
+		break;
 	default:
 		ret = -EINVAL;
 		break;
@@ -1733,6 +1739,8 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
 void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
 {
 	if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
-		DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
+		DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected!\n");
+
+		amdgpu_ras_reset_gpu(adev, false);
 	}
 }
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
index 3ec2a87..a83ec99 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
@@ -333,6 +333,7 @@ struct amdgpu_ras {
 	struct mutex recovery_lock;
 
 	uint32_t flags;
+	bool reboot;
 };
 
 struct ras_fs_data {
-- 
2.7.4

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS.
       [not found]     ` <1567183153-11014-3-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2019-08-30 19:55       ` Alex Deucher
  2019-08-31  0:21         ` Grodzovsky, Andrey
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Deucher @ 2019-08-30 19:55 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Christian König, Kuehling, Felix, Tao Zhou, amd-gfx list,
	Hawking Zhang

On Fri, Aug 30, 2019 at 12:39 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> In case of a RAS error, allow the user to configure an automatic system
> reboot through ras_ctrl.
> This is also part of the temporary workaround for the RAS
> hang problem.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Typo in title: dmr -> drm

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 18 ++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 +++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
>  3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c9825ae..e26f2e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3760,6 +3760,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>         int i, r = 0;
>         bool in_ras_intr = amdgpu_ras_intr_triggered();
>
> +       /*
> +        * Flush RAM to disk so that after reboot
> +        * the user can read log and see why the system rebooted.
> +        *
> +        * Using user mode app call instead of kernel APIs such as
> +        * ksys_sync_helper for backward comparability with earlier
> +        * kernels into which this is also intended.
> +        */
> +       if (in_ras_intr && amdgpu_ras_get_context(adev)->reboot) {
> +               char *envp[] = { "HOME=/", NULL };
> +               char *argv[] = { "/bin/sync", NULL };
> +
> +               DRM_WARN("Emergency reboot.");
> +
> +               call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
> +               emergency_restart();
> +       }
> +

This is fine for dkms, but for upstream/amd-staging, we probably want
to call the appropriate APIs directly.
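
E.g. something along these lines for upstream (sketch only, assuming
ksys_sync_helper() is available and exported in the target kernel):

	if (in_ras_intr && amdgpu_ras_get_context(adev)->reboot) {
		DRM_WARN("Emergency reboot.");

		/* in-kernel replacement for the /bin/sync usermode helper */
		ksys_sync_helper();
		emergency_restart();
	}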

>         need_full_reset = job_signaled = false;
>         INIT_LIST_HEAD(&device_list);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1cc34de..bbcfb4f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -30,6 +30,7 @@
>  #include "amdgpu_ras.h"
>  #include "amdgpu_atomfirmware.h"
>  #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
> +#include <linux/kmod.h>
>
>  const char *ras_error_string[] = {
>         "none",
> @@ -154,6 +155,8 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>                 op = 1;
>         else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
>                 op = 2;
> +       else if (sscanf(str, "reboot %32s", block_name) == 1)
> +               op = 3;
>         else if (str[0] && str[1] && str[2] && str[3])
>                 /* ascii string, but commands are not matched. */
>                 return -EINVAL;
> @@ -287,6 +290,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>                 /* data.inject.address is offset instead of absolute gpu address */
>                 ret = amdgpu_ras_error_inject(adev, &data.inject);
>                 break;
> +       case 3:
> +               amdgpu_ras_get_context(adev)->reboot = true;
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -1733,6 +1739,8 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
>  void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
>  {
>         if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
> -               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
> +               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected!\n");
> +
> +               amdgpu_ras_reset_gpu(adev, false);
>         }
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 3ec2a87..a83ec99 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -333,6 +333,7 @@ struct amdgpu_ras {
>         struct mutex recovery_lock;
>
>         uint32_t flags;
> +       bool reboot;
>  };
>
>  struct ras_fs_data {
> --
> 2.7.4
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 2/3] dmr/amdgpu: Avoid HW GPU reset for RAS.
       [not found]     ` <1567183153-11014-2-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
@ 2019-08-30 19:58       ` Alex Deucher
  2019-08-30 20:29       ` Kuehling, Felix
  1 sibling, 0 replies; 9+ messages in thread
From: Alex Deucher @ 2019-08-30 19:58 UTC (permalink / raw)
  To: Andrey Grodzovsky
  Cc: Christian König, Kuehling, Felix, Tao Zhou, amd-gfx list,
	Hawking Zhang

On Fri, Aug 30, 2019 at 12:39 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> Problem:
> Under certain conditions, when some IP blocks take a RAS error,
> we can get into a situation where a GPU reset is not possible
> due to issues in RAS in SMU/PSP.
>
> Temporary fix until a proper solution in PSP/SMU is ready:
> When an uncorrectable error happens the DF will unconditionally
> broadcast error event packets to all its clients/slaves upon
> receiving the fatal error event and freeze all its outbound queues,
> and the err_event_athub interrupt will be triggered.
> In such a case we use this interrupt
> to issue a GPU reset. The GPU reset code is modified for this case to avoid a HW
> reset: it only stops the schedulers, detaches all in-progress and not yet scheduled
> jobs' fences, sets an error code on them and signals them.
> Also reject any new incoming job submissions from user space.
> All this is done to notify the applications of the problem.
>
> v2:
> Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
> Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
> Remove print param from amdgpu_ras_query_error_count
>
> v3:
> Update based on previous bug fixing patch to properly call amdgpu_amdkfd_pre_reset
> for other XGMI hive members.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  4 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++--------
>  drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  5 ++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 38 ++++++++++++++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  3 +++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  6 +++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 22 +++++++++++++++--
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    | 10 ++++++++
>  drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++----
>  drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 24 ++++++++++---------
>  drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c     |  5 ++++
>  drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c     | 32 +++++++++++++------------
>  12 files changed, 155 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index d860170..494c384 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -38,6 +38,7 @@
>  #include "amdgpu_gmc.h"
>  #include "amdgpu_gem.h"
>  #include "amdgpu_display.h"
> +#include "amdgpu_ras.h"
>
>  static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
>                                       struct drm_amdgpu_cs_chunk_fence *data,
> @@ -1438,6 +1439,9 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>         bool reserved_buffers = false;
>         int i, r;
>
> +       if (amdgpu_ras_intr_triggered())
> +               return -EHWPOISON;
> +
>         if (!adev->accel_working)
>                 return -EBUSY;
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 19f6624..c9825ae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3727,25 +3727,18 @@ static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, bool trylock)
>                 adev->mp1_state = PP_MP1_STATE_NONE;
>                 break;
>         }
> -       /* Block kfd: SRIOV would do it separately */
> -       if (!amdgpu_sriov_vf(adev))
> -                amdgpu_amdkfd_pre_reset(adev);
>
>         return true;
>  }
>
>  static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
>  {
> -       /*unlock kfd: SRIOV would do it separately */
> -       if (!amdgpu_sriov_vf(adev))
> -                amdgpu_amdkfd_post_reset(adev);
>         amdgpu_vf_error_trans_all(adev);
>         adev->mp1_state = PP_MP1_STATE_NONE;
>         adev->in_gpu_reset = 0;
>         mutex_unlock(&adev->lock_reset);
>  }
>
> -
>  /**
>   * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>   *
> @@ -3765,11 +3758,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>         struct amdgpu_hive_info *hive = NULL;
>         struct amdgpu_device *tmp_adev = NULL;
>         int i, r = 0;
> +       bool in_ras_intr = amdgpu_ras_intr_triggered();
>
>         need_full_reset = job_signaled = false;
>         INIT_LIST_HEAD(&device_list);
>
> -       dev_info(adev->dev, "GPU reset begin!\n");
> +       dev_info(adev->dev, "GPU %s begin!\n", in_ras_intr ? "jobs stop":"reset");
>
>         cancel_delayed_work_sync(&adev->delayed_init_work);
>
> @@ -3796,9 +3790,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                 return 0;
>         }
>
> +       /* Block kfd: SRIOV would do it separately */
> +       if (!amdgpu_sriov_vf(adev))
> +                amdgpu_amdkfd_pre_reset(adev);
> +
>         /* Build list of devices to reset */
>         if  (adev->gmc.xgmi.num_physical_nodes > 1) {
>                 if (!hive) {
> +                       /*unlock kfd: SRIOV would do it separately */
> +                       if (!amdgpu_sriov_vf(adev))
> +                               amdgpu_amdkfd_post_reset(adev);
>                         amdgpu_device_unlock_adev(adev);
>                         return -ENODEV;
>                 }
> @@ -3816,8 +3817,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>
>         /* block all schedulers and reset given job's ring */
>         list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -               if (tmp_adev != adev)
> +               if (tmp_adev != adev) {
>                         amdgpu_device_lock_adev(tmp_adev, false);
> +                       if (!amdgpu_sriov_vf(tmp_adev))
> +                                       amdgpu_amdkfd_pre_reset(tmp_adev);
> +               }
> +
>                 /*
>                  * Mark these ASICs to be reseted as untracked first
>                  * And add them back after reset completed
> @@ -3825,7 +3830,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                 amdgpu_unregister_gpu_instance(tmp_adev);
>
>                 /* disable ras on ALL IPs */
> -               if (amdgpu_device_ip_need_full_reset(tmp_adev))
> +               if (!in_ras_intr && amdgpu_device_ip_need_full_reset(tmp_adev))
>                         amdgpu_ras_suspend(tmp_adev);
>
>                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> @@ -3835,10 +3840,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                                 continue;
>
>                         drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> +
> +                       if (in_ras_intr)
> +                               amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>                 }
>         }
>
>
> +       if (in_ras_intr)
> +               goto skip_sched_resume;
> +
>         /*
>          * Must check guilty signal here since after this point all old
>          * HW fences are force signaled.
> @@ -3897,6 +3908,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>
>         /* Post ASIC reset for all devs .*/
>         list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +
>                 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>                         struct amdgpu_ring *ring = tmp_adev->rings[i];
>
> @@ -3923,7 +3935,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>                 } else {
>                         dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
>                 }
> +       }
>
> +skip_sched_resume:
> +       list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +               /*unlock kfd: SRIOV would do it separately */
> +               if (!in_ras_intr && !amdgpu_sriov_vf(tmp_adev))
> +                       amdgpu_amdkfd_post_reset(tmp_adev);
>                 amdgpu_device_unlock_adev(tmp_adev);
>         }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 7679fe8..c73d26a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -40,6 +40,8 @@
>
>  #include "amdgpu_amdkfd.h"
>
> +#include "amdgpu_ras.h"
> +
>  /*
>   * KMS wrapper.
>   * - 3.0.0 - initial driver
> @@ -1180,6 +1182,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>         struct drm_device *dev = pci_get_drvdata(pdev);
>         struct amdgpu_device *adev = dev->dev_private;
>
> +       if (amdgpu_ras_intr_triggered())
> +               return;
> +
>         /* if we are running in a VM, make sure the device
>          * torn down properly on reboot/shutdown.
>          * unfortunately we can't detect certain
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4d67b77..b12981e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -250,6 +250,44 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>         return fence;
>  }
>
> +#define to_drm_sched_job(sched_job)            \
> +               container_of((sched_job), struct drm_sched_job, queue_node)
> +
> +void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
> +{
> +       struct drm_sched_job *s_job;
> +       struct drm_sched_entity *s_entity = NULL;
> +       int i;
> +
> +       /* Signal all jobs not yet scheduled */
> +       for (i = DRM_SCHED_PRIORITY_MAX - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> +               struct drm_sched_rq *rq = &sched->sched_rq[i];
> +
> +               if (!rq)
> +                       continue;
> +
> +               spin_lock(&rq->lock);
> +               list_for_each_entry(s_entity, &rq->entities, list) {
> +                       while ((s_job = to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
> +                               struct drm_sched_fence *s_fence = s_job->s_fence;
> +
> +                               dma_fence_signal(&s_fence->scheduled);
> +                               dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> +                               dma_fence_signal(&s_fence->finished);
> +                       }
> +               }
> +               spin_unlock(&rq->lock);
> +       }
> +
> +       /* Signal all jobs already scheduled to HW */
> +       list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
> +               struct drm_sched_fence *s_fence = s_job->s_fence;
> +
> +               dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> +               dma_fence_signal(&s_fence->finished);
> +       }
> +}
> +
>  const struct drm_sched_backend_ops amdgpu_sched_ops = {
>         .dependency = amdgpu_job_dependency,
>         .run_job = amdgpu_job_run,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index 51e6250..dc7ee93 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -76,4 +76,7 @@ int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
>                       void *owner, struct dma_fence **f);
>  int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>                              struct dma_fence **fence);
> +
> +void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
> +
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 35a0866..535f690 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -1046,6 +1046,12 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>         /* Ensure IB tests are run on ring */
>         flush_delayed_work(&adev->delayed_init_work);
>
> +
> +       if (amdgpu_ras_intr_triggered()) {
> +               DRM_ERROR("RAS Intr triggered, device disabled!!");
> +               return -EHWPOISON;
> +       }
> +
>         file_priv->driver_priv = NULL;
>
>         r = pm_runtime_get_sync(dev->dev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index aa51c00..1cc34de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -24,6 +24,8 @@
>  #include <linux/debugfs.h>
>  #include <linux/list.h>
>  #include <linux/module.h>
> +#include <linux/reboot.h>
> +#include <linux/syscalls.h>
>  #include "amdgpu.h"
>  #include "amdgpu_ras.h"
>  #include "amdgpu_atomfirmware.h"
> @@ -64,6 +66,9 @@ const char *ras_block_string[] = {
>  /* inject address is 52 bits */
>  #define        RAS_UMC_INJECT_ADDR_LIMIT       (0x1ULL << 52)
>
> +
> +atomic_t amdgpu_ras_in_intr = ATOMIC_INIT(0);
> +
>  static int amdgpu_ras_reserve_vram(struct amdgpu_device *adev,
>                 uint64_t offset, uint64_t size,
>                 struct amdgpu_bo **bo_ptr);
> @@ -188,6 +193,10 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>
>         return 0;
>  }
> +
> +static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
> +               struct ras_common_if *head);
> +
>  /**
>   * DOC: AMDGPU RAS debugfs control interface
>   *
> @@ -627,12 +636,14 @@ int amdgpu_ras_error_query(struct amdgpu_device *adev,
>         info->ue_count = obj->err_data.ue_count;
>         info->ce_count = obj->err_data.ce_count;
>
> -       if (err_data.ce_count)
> +       if (err_data.ce_count) {
>                 dev_info(adev->dev, "%ld correctable errors detected in %s block\n",
>                          obj->err_data.ce_count, ras_block_str(info->head.block));
> -       if (err_data.ue_count)
> +       }
> +       if (err_data.ue_count) {
>                 dev_info(adev->dev, "%ld uncorrectable errors detected in %s block\n",
>                          obj->err_data.ue_count, ras_block_str(info->head.block));
> +       }
>
>         return 0;
>  }
> @@ -1718,3 +1729,10 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
>
>         return 0;
>  }
> +
> +void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
> +{
> +       if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
> +               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
> +       }
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index fc4fb0f..3ec2a87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -600,4 +600,14 @@ int amdgpu_ras_interrupt_remove_handler(struct amdgpu_device *adev,
>
>  int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
>                 struct ras_dispatch_if *info);
> +
> +extern atomic_t amdgpu_ras_in_intr;
> +
> +static inline bool amdgpu_ras_intr_triggered(void)
> +{
> +       return !!atomic_read(&amdgpu_ras_in_intr);
> +}
> +
> +void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev);
> +
>  #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 93e3e89..817997b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -5676,10 +5676,12 @@ static int gfx_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
>                 struct amdgpu_iv_entry *entry)
>  {
>         /* TODO ue will trigger an interrupt. */
> -       kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> -       if (adev->gfx.funcs->query_ras_error_count)
> -               adev->gfx.funcs->query_ras_error_count(adev, err_data);
> -       amdgpu_ras_reset_gpu(adev, 0);
> +       if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {

I think we have to check for AMDGPU_RAS_BLOCK__MMHUB as well since it
will also cause a sync flood.
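
I.e. presumably something like this (sketch only; the same would apply to the
gmc_v9_0 and sdma_v4_0 call sites below):

	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX) &&
	    !amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__MMHUB)) {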

> +               kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +               if (adev->gfx.funcs->query_ras_error_count)
> +                       adev->gfx.funcs->query_ras_error_count(adev, err_data);
> +               amdgpu_ras_reset_gpu(adev, 0);
> +       }
>         return AMDGPU_RAS_SUCCESS;
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 5eb17c7..2a6ac60 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -243,18 +243,20 @@ static int gmc_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
>                 struct ras_err_data *err_data,
>                 struct amdgpu_iv_entry *entry)
>  {
> -       kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> -       if (adev->umc.funcs->query_ras_error_count)
> -               adev->umc.funcs->query_ras_error_count(adev, err_data);
> -       /* umc query_ras_error_address is also responsible for clearing
> -        * error status
> -        */
> -       if (adev->umc.funcs->query_ras_error_address)
> -               adev->umc.funcs->query_ras_error_address(adev, err_data);
> +       if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {

Check for MMHUB as well.

> +               kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +               if (adev->umc.funcs->query_ras_error_count)
> +                       adev->umc.funcs->query_ras_error_count(adev, err_data);
> +               /* umc query_ras_error_address is also responsible for clearing
> +                * error status
> +                */
> +               if (adev->umc.funcs->query_ras_error_address)
> +                       adev->umc.funcs->query_ras_error_address(adev, err_data);
>
> -       /* only uncorrectable error needs gpu reset */
> -       if (err_data->ue_count)
> -               amdgpu_ras_reset_gpu(adev, 0);
> +               /* only uncorrectable error needs gpu reset */
> +               if (err_data->ue_count)
> +                       amdgpu_ras_reset_gpu(adev, 0);
> +       }
>
>         return AMDGPU_RAS_SUCCESS;
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index 367f9d6..545990c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -30,6 +30,7 @@
>  #include "nbio/nbio_7_4_0_smn.h"
>  #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
>  #include <uapi/linux/kfd_ioctl.h>
> +#include "amdgpu_ras.h"
>
>  #define smnNBIF_MGCG_CTRL_LCLK 0x1013a21c
>
> @@ -329,6 +330,8 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
>                                                 BIF_DOORBELL_INT_CNTL,
>                                                 RAS_CNTLR_INTERRUPT_CLEAR, 1);
>                 WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
> +
> +               amdgpu_ras_global_ras_isr(adev);
>         }
>  }
>
> @@ -344,6 +347,8 @@ static void nbio_v7_4_handle_ras_err_event_athub_intr_no_bifring(struct amdgpu_d
>                                                 BIF_DOORBELL_INT_CNTL,
>                                                 RAS_ATHUB_ERR_EVENT_INTERRUPT_CLEAR, 1);
>                 WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
> +
> +               amdgpu_ras_global_ras_isr(adev);
>         }
>  }
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index b3ed533..b05428f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -1972,24 +1972,26 @@ static int sdma_v4_0_process_ras_data_cb(struct amdgpu_device *adev,
>         uint32_t err_source;
>         int instance;
>
> -       instance = sdma_v4_0_irq_id_to_seq(entry->client_id);
> -       if (instance < 0)
> -               return 0;
> +       if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {

Check for MMHUB as well.

> +               instance = sdma_v4_0_irq_id_to_seq(entry->client_id);
> +               if (instance < 0)
> +                       return 0;
>
> -       switch (entry->src_id) {
> -       case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
> -               err_source = 0;
> -               break;
> -       case SDMA0_4_0__SRCID__SDMA_ECC:
> -               err_source = 1;
> -               break;
> -       default:
> -               return 0;
> -       }
> +               switch (entry->src_id) {
> +               case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
> +                       err_source = 0;
> +                       break;
> +               case SDMA0_4_0__SRCID__SDMA_ECC:
> +                       err_source = 1;
> +                       break;
> +               default:
> +                       return 0;
> +               }
>
> -       kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +               kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
>
> -       amdgpu_ras_reset_gpu(adev, 0);
> +               amdgpu_ras_reset_gpu(adev, 0);
> +       }
>
>         return AMDGPU_RAS_SUCCESS;
>  }
> --
> 2.7.4
>
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case.
       [not found] ` <1567183153-11014-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-08-30 16:39   ` [PATCH v3 2/3] dmr/amdgpu: Avoid HW GPU reset for RAS Andrey Grodzovsky
  2019-08-30 16:39   ` [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS Andrey Grodzovsky
@ 2019-08-30 20:08   ` Kuehling, Felix
  2 siblings, 0 replies; 9+ messages in thread
From: Kuehling, Felix @ 2019-08-30 20:08 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w,
	ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w, Zhou1, Tao, Zhang,
	Hawking

On 2019-08-30 12:39 p.m., Andrey Grodzovsky wrote:
> Issue 1:
> In the XGMI case amdgpu_device_lock_adev for the other devices in the hive
> was called too late, after their respective schedulers were already accessed.
> So relocate the lock to the beginning of accessing the other devs.
>
> Issue 2:
> Using amdgpu_device_ip_need_full_reset to switch the device list from
> all devices in the hive to the single 'master' device that owns this reset
> call is wrong because when stopping schedulers we iterate all the devices
> in the hive but when restarting we would only reactivate the 'master' device.
> Also, in case amdgpu_device_pre_asic_reset concludes that a full reset IS
> needed, we then have to stop schedulers for all devices in the hive and not
> only the 'master', but with amdgpu_device_ip_need_full_reset we have
> already missed the opportunity to do so. So just remove this logic and
> always stop and start all schedulers for all devices in the hive.
>
> Also minor cleanup and print fix.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Minor nit-pick inline. With that fixed this patch is Acked-by: Felix 
Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 25 +++++++++++--------------
>   1 file changed, 11 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a5daccc..19f6624 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3814,15 +3814,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		device_list_handle = &device_list;
>   	}
>   
> -	/*
> -	 * Mark these ASICs to be reseted as untracked first
> -	 * And add them back after reset completed
> -	 */
> -	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head)
> -		amdgpu_unregister_gpu_instance(tmp_adev);
> -
>   	/* block all schedulers and reset given job's ring */
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +		if (tmp_adev != adev)
> +			amdgpu_device_lock_adev(tmp_adev, false);
> +		/*
> +		 * Mark these ASICs to be reseted as untracked first
> +		 * And add them back after reset completed
> +		 */
> +		amdgpu_unregister_gpu_instance(tmp_adev);
> +
>   		/* disable ras on ALL IPs */
>   		if (amdgpu_device_ip_need_full_reset(tmp_adev))
>   			amdgpu_ras_suspend(tmp_adev);
> @@ -3848,9 +3849,6 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	    dma_fence_is_signaled(job->base.s_fence->parent))
>   		job_signaled = true;
>   
> -	if (!amdgpu_device_ip_need_full_reset(adev))
> -		device_list_handle = &device_list;
> -
>   	if (job_signaled) {
>   		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
>   		goto skip_hw_reset;
> @@ -3869,10 +3867,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   retry:	/* Rest of adevs pre asic reset from XGMI hive. */
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
>   
> -		if (tmp_adev == adev)
> +		if(tmp_adev == adev)

The space before the ( was correct coding style; removing it will trigger a
checkpatch error or warning.
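
I.e. the hunk should just keep the original form here:

	if (tmp_adev == adev)
		continue;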


>   			continue;
>   
> -		amdgpu_device_lock_adev(tmp_adev, false);
>   		r = amdgpu_device_pre_asic_reset(tmp_adev,
>   						 NULL,
>   						 &need_full_reset);
> @@ -3921,10 +3918,10 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   		if (r) {
>   			/* bad news, how to tell it to userspace ? */
> -			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&adev->gpu_reset_counter));
> +			dev_info(tmp_adev->dev, "GPU reset(%d) failed\n", atomic_read(&tmp_adev->gpu_reset_counter));
>   			amdgpu_vf_error_put(tmp_adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
>   		} else {
> -			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&adev->gpu_reset_counter));
> +			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
>   		}
>   
>   		amdgpu_device_unlock_adev(tmp_adev);

* Re: [PATCH v3 2/3] dmr/amdgpu: Avoid HW GPU reset for RAS.
       [not found]     ` <1567183153-11014-2-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
  2019-08-30 19:58       ` Alex Deucher
@ 2019-08-30 20:29       ` Kuehling, Felix
  1 sibling, 0 replies; 9+ messages in thread
From: Kuehling, Felix @ 2019-08-30 20:29 UTC (permalink / raw)
  To: Grodzovsky, Andrey, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW
  Cc: alexdeucher-Re5JQEeQqe8AvxtiuMwx3w,
	ckoenig.leichtzumerken-Re5JQEeQqe8AvxtiuMwx3w, Zhou1, Tao, Zhang,
	Hawking

On 2019-08-30 12:39 p.m., Andrey Grodzovsky wrote:
> Problem:
> Under certain conditions, when some IP blocks take a RAS error,
> we can get into a situation where a GPU reset is not possible
> due to issues in RAS in SMU/PSP.
>
> Temporary fix until a proper solution in PSP/SMU is ready:
> When an uncorrectable error happens, the DF will unconditionally
> broadcast error event packets to all its clients/slaves upon
> receiving the fatal error event and freeze all its outbound queues,
> and the err_event_athub interrupt will be triggered.
> In such a case we use this interrupt
> to issue a GPU reset. The GPU reset code is modified for this case to avoid a HW
> reset: it only stops the schedulers, detaches all in-progress and not-yet-scheduled
> jobs' fences, sets an error code on them and signals them.
> Also reject any new incoming job submissions from user space.
> All this is done to notify applications of the problem.
>
> v2:
> Extract amdgpu_amdkfd_pre/post_reset from amdgpu_device_lock/unlock_adev
> Move amdgpu_job_stop_all_jobs_on_sched to amdgpu_job.c
> Remove print param from amdgpu_ras_query_error_count
>
> v3:
> Update based on the previous bug-fix patch to properly call amdgpu_amdkfd_pre_reset
> for the other XGMI hive members.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

The KFD part looks good to me. Acked-by: Felix Kuehling 
<Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c     |  4 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 38 ++++++++++++++++++++++--------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c    |  5 ++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    | 38 ++++++++++++++++++++++++++++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.h    |  3 +++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c    |  6 +++++
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 22 +++++++++++++++--
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    | 10 ++++++++
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 10 ++++----
>   drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c      | 24 ++++++++++---------
>   drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c     |  5 ++++
>   drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c     | 32 +++++++++++++------------
>   12 files changed, 155 insertions(+), 42 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> index d860170..494c384 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_cs.c
> @@ -38,6 +38,7 @@
>   #include "amdgpu_gmc.h"
>   #include "amdgpu_gem.h"
>   #include "amdgpu_display.h"
> +#include "amdgpu_ras.h"
>   
>   static int amdgpu_cs_user_fence_chunk(struct amdgpu_cs_parser *p,
>   				      struct drm_amdgpu_cs_chunk_fence *data,
> @@ -1438,6 +1439,9 @@ int amdgpu_cs_ioctl(struct drm_device *dev, void *data, struct drm_file *filp)
>   	bool reserved_buffers = false;
>   	int i, r;
>   
> +	if (amdgpu_ras_intr_triggered())
> +		return -EHWPOISON;
> +
>   	if (!adev->accel_working)
>   		return -EBUSY;
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 19f6624..c9825ae 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3727,25 +3727,18 @@ static bool amdgpu_device_lock_adev(struct amdgpu_device *adev, bool trylock)
>   		adev->mp1_state = PP_MP1_STATE_NONE;
>   		break;
>   	}
> -	/* Block kfd: SRIOV would do it separately */
> -	if (!amdgpu_sriov_vf(adev))
> -                amdgpu_amdkfd_pre_reset(adev);
>   
>   	return true;
>   }
>   
>   static void amdgpu_device_unlock_adev(struct amdgpu_device *adev)
>   {
> -	/*unlock kfd: SRIOV would do it separately */
> -	if (!amdgpu_sriov_vf(adev))
> -                amdgpu_amdkfd_post_reset(adev);
>   	amdgpu_vf_error_trans_all(adev);
>   	adev->mp1_state = PP_MP1_STATE_NONE;
>   	adev->in_gpu_reset = 0;
>   	mutex_unlock(&adev->lock_reset);
>   }
>   
> -
>   /**
>    * amdgpu_device_gpu_recover - reset the asic and recover scheduler
>    *
> @@ -3765,11 +3758,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	struct amdgpu_hive_info *hive = NULL;
>   	struct amdgpu_device *tmp_adev = NULL;
>   	int i, r = 0;
> +	bool in_ras_intr = amdgpu_ras_intr_triggered();
>   
>   	need_full_reset = job_signaled = false;
>   	INIT_LIST_HEAD(&device_list);
>   
> -	dev_info(adev->dev, "GPU reset begin!\n");
> +	dev_info(adev->dev, "GPU %s begin!\n", in_ras_intr ? "jobs stop":"reset");
>   
>   	cancel_delayed_work_sync(&adev->delayed_init_work);
>   
> @@ -3796,9 +3790,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		return 0;
>   	}
>   
> +	/* Block kfd: SRIOV would do it separately */
> +	if (!amdgpu_sriov_vf(adev))
> +                amdgpu_amdkfd_pre_reset(adev);
> +
>   	/* Build list of devices to reset */
>   	if  (adev->gmc.xgmi.num_physical_nodes > 1) {
>   		if (!hive) {
> +			/*unlock kfd: SRIOV would do it separately */
> +			if (!amdgpu_sriov_vf(adev))
> +		                amdgpu_amdkfd_post_reset(adev);
>   			amdgpu_device_unlock_adev(adev);
>   			return -ENODEV;
>   		}
> @@ -3816,8 +3817,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   	/* block all schedulers and reset given job's ring */
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> -		if (tmp_adev != adev)
> +		if (tmp_adev != adev) {
>   			amdgpu_device_lock_adev(tmp_adev, false);
> +			if (!amdgpu_sriov_vf(tmp_adev))
> +			                amdgpu_amdkfd_pre_reset(tmp_adev);
> +		}
> +
>   		/*
>   		 * Mark these ASICs to be reseted as untracked first
>   		 * And add them back after reset completed
> @@ -3825,7 +3830,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		amdgpu_unregister_gpu_instance(tmp_adev);
>   
>   		/* disable ras on ALL IPs */
> -		if (amdgpu_device_ip_need_full_reset(tmp_adev))
> +		if (!in_ras_intr && amdgpu_device_ip_need_full_reset(tmp_adev))
>   			amdgpu_ras_suspend(tmp_adev);
>   
>   		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> @@ -3835,10 +3840,16 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   				continue;
>   
>   			drm_sched_stop(&ring->sched, job ? &job->base : NULL);
> +
> +			if (in_ras_intr)
> +				amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
>   		}
>   	}
>   
>   
> +	if (in_ras_intr)
> +		goto skip_sched_resume;
> +
>   	/*
>   	 * Must check guilty signal here since after this point all old
>   	 * HW fences are force signaled.
> @@ -3897,6 +3908,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   
>   	/* Post ASIC reset for all devs .*/
>   	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +
>   		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   			struct amdgpu_ring *ring = tmp_adev->rings[i];
>   
> @@ -3923,7 +3935,13 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		} else {
>   			dev_info(tmp_adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&tmp_adev->gpu_reset_counter));
>   		}
> +	}
>   
> +skip_sched_resume:
> +	list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
> +		/*unlock kfd: SRIOV would do it separately */
> +		if (!in_ras_intr && !amdgpu_sriov_vf(tmp_adev))
> +	                amdgpu_amdkfd_post_reset(tmp_adev);
>   		amdgpu_device_unlock_adev(tmp_adev);
>   	}
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> index 7679fe8..c73d26a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_drv.c
> @@ -40,6 +40,8 @@
>   
>   #include "amdgpu_amdkfd.h"
>   
> +#include "amdgpu_ras.h"
> +
>   /*
>    * KMS wrapper.
>    * - 3.0.0 - initial driver
> @@ -1180,6 +1182,9 @@ amdgpu_pci_shutdown(struct pci_dev *pdev)
>   	struct drm_device *dev = pci_get_drvdata(pdev);
>   	struct amdgpu_device *adev = dev->dev_private;
>   
> +	if (amdgpu_ras_intr_triggered())
> +		return;
> +
>   	/* if we are running in a VM, make sure the device
>   	 * torn down properly on reboot/shutdown.
>   	 * unfortunately we can't detect certain
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index 4d67b77..b12981e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -250,6 +250,44 @@ static struct dma_fence *amdgpu_job_run(struct drm_sched_job *sched_job)
>   	return fence;
>   }
>   
> +#define to_drm_sched_job(sched_job)		\
> +		container_of((sched_job), struct drm_sched_job, queue_node)
> +
> +void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched)
> +{
> +	struct drm_sched_job *s_job;
> +	struct drm_sched_entity *s_entity = NULL;
> +	int i;
> +
> +	/* Signal all jobs not yet scheduled */
> +	for (i = DRM_SCHED_PRIORITY_MAX - 1; i >= DRM_SCHED_PRIORITY_MIN; i--) {
> +		struct drm_sched_rq *rq = &sched->sched_rq[i];
> +
> +		if (!rq)
> +			continue;
> +
> +		spin_lock(&rq->lock);
> +		list_for_each_entry(s_entity, &rq->entities, list) {
> +			while ((s_job = to_drm_sched_job(spsc_queue_pop(&s_entity->job_queue)))) {
> +				struct drm_sched_fence *s_fence = s_job->s_fence;
> +
> +				dma_fence_signal(&s_fence->scheduled);
> +				dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> +				dma_fence_signal(&s_fence->finished);
> +			}
> +		}
> +		spin_unlock(&rq->lock);
> +	}
> +
> +	/* Signal all jobs already scheduled to HW */
> +	list_for_each_entry(s_job, &sched->ring_mirror_list, node) {
> +		struct drm_sched_fence *s_fence = s_job->s_fence;
> +
> +		dma_fence_set_error(&s_fence->finished, -EHWPOISON);
> +		dma_fence_signal(&s_fence->finished);
> +	}
> +}
> +
>   const struct drm_sched_backend_ops amdgpu_sched_ops = {
>   	.dependency = amdgpu_job_dependency,
>   	.run_job = amdgpu_job_run,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> index 51e6250..dc7ee93 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.h
> @@ -76,4 +76,7 @@ int amdgpu_job_submit(struct amdgpu_job *job, struct drm_sched_entity *entity,
>   		      void *owner, struct dma_fence **f);
>   int amdgpu_job_submit_direct(struct amdgpu_job *job, struct amdgpu_ring *ring,
>   			     struct dma_fence **fence);
> +
> +void amdgpu_job_stop_all_jobs_on_sched(struct drm_gpu_scheduler *sched);
> +
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 35a0866..535f690 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -1046,6 +1046,12 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   	/* Ensure IB tests are run on ring */
>   	flush_delayed_work(&adev->delayed_init_work);
>   
> +
> +	if (amdgpu_ras_intr_triggered()) {
> +		DRM_ERROR("RAS Intr triggered, device disabled!!");
> +		return -EHWPOISON;
> +	}
> +
>   	file_priv->driver_priv = NULL;
>   
>   	r = pm_runtime_get_sync(dev->dev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index aa51c00..1cc34de 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -24,6 +24,8 @@
>   #include <linux/debugfs.h>
>   #include <linux/list.h>
>   #include <linux/module.h>
> +#include <linux/reboot.h>
> +#include <linux/syscalls.h>
>   #include "amdgpu.h"
>   #include "amdgpu_ras.h"
>   #include "amdgpu_atomfirmware.h"
> @@ -64,6 +66,9 @@ const char *ras_block_string[] = {
>   /* inject address is 52 bits */
>   #define	RAS_UMC_INJECT_ADDR_LIMIT	(0x1ULL << 52)
>   
> +
> +atomic_t amdgpu_ras_in_intr = ATOMIC_INIT(0);
> +
>   static int amdgpu_ras_reserve_vram(struct amdgpu_device *adev,
>   		uint64_t offset, uint64_t size,
>   		struct amdgpu_bo **bo_ptr);
> @@ -188,6 +193,10 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>   
>   	return 0;
>   }
> +
> +static struct ras_manager *amdgpu_ras_find_obj(struct amdgpu_device *adev,
> +		struct ras_common_if *head);
> +
>   /**
>    * DOC: AMDGPU RAS debugfs control interface
>    *
> @@ -627,12 +636,14 @@ int amdgpu_ras_error_query(struct amdgpu_device *adev,
>   	info->ue_count = obj->err_data.ue_count;
>   	info->ce_count = obj->err_data.ce_count;
>   
> -	if (err_data.ce_count)
> +	if (err_data.ce_count) {
>   		dev_info(adev->dev, "%ld correctable errors detected in %s block\n",
>   			 obj->err_data.ce_count, ras_block_str(info->head.block));
> -	if (err_data.ue_count)
> +	}
> +	if (err_data.ue_count) {
>   		dev_info(adev->dev, "%ld uncorrectable errors detected in %s block\n",
>   			 obj->err_data.ue_count, ras_block_str(info->head.block));
> +	}
>   
>   	return 0;
>   }
> @@ -1718,3 +1729,10 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
>   
>   	return 0;
>   }
> +
> +void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
> +{
> +	if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
> +		DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
> +	}
> +}
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index fc4fb0f..3ec2a87 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -600,4 +600,14 @@ int amdgpu_ras_interrupt_remove_handler(struct amdgpu_device *adev,
>   
>   int amdgpu_ras_interrupt_dispatch(struct amdgpu_device *adev,
>   		struct ras_dispatch_if *info);
> +
> +extern atomic_t amdgpu_ras_in_intr;
> +
> +static inline bool amdgpu_ras_intr_triggered(void)
> +{
> +	return !!atomic_read(&amdgpu_ras_in_intr);
> +}
> +
> +void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev);
> +
>   #endif
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 93e3e89..817997b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -5676,10 +5676,12 @@ static int gfx_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
>   		struct amdgpu_iv_entry *entry)
>   {
>   	/* TODO ue will trigger an interrupt. */
> -	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> -	if (adev->gfx.funcs->query_ras_error_count)
> -		adev->gfx.funcs->query_ras_error_count(adev, err_data);
> -	amdgpu_ras_reset_gpu(adev, 0);
> +	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
> +		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +		if (adev->gfx.funcs->query_ras_error_count)
> +			adev->gfx.funcs->query_ras_error_count(adev, err_data);
> +		amdgpu_ras_reset_gpu(adev, 0);
> +	}
>   	return AMDGPU_RAS_SUCCESS;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> index 5eb17c7..2a6ac60 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gmc_v9_0.c
> @@ -243,18 +243,20 @@ static int gmc_v9_0_process_ras_data_cb(struct amdgpu_device *adev,
>   		struct ras_err_data *err_data,
>   		struct amdgpu_iv_entry *entry)
>   {
> -	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> -	if (adev->umc.funcs->query_ras_error_count)
> -		adev->umc.funcs->query_ras_error_count(adev, err_data);
> -	/* umc query_ras_error_address is also responsible for clearing
> -	 * error status
> -	 */
> -	if (adev->umc.funcs->query_ras_error_address)
> -		adev->umc.funcs->query_ras_error_address(adev, err_data);
> +	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
> +		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +		if (adev->umc.funcs->query_ras_error_count)
> +			adev->umc.funcs->query_ras_error_count(adev, err_data);
> +		/* umc query_ras_error_address is also responsible for clearing
> +		 * error status
> +		 */
> +		if (adev->umc.funcs->query_ras_error_address)
> +			adev->umc.funcs->query_ras_error_address(adev, err_data);
>   
> -	/* only uncorrectable error needs gpu reset */
> -	if (err_data->ue_count)
> -		amdgpu_ras_reset_gpu(adev, 0);
> +		/* only uncorrectable error needs gpu reset */
> +		if (err_data->ue_count)
> +			amdgpu_ras_reset_gpu(adev, 0);
> +	}
>   
>   	return AMDGPU_RAS_SUCCESS;
>   }
> diff --git a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> index 367f9d6..545990c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> +++ b/drivers/gpu/drm/amd/amdgpu/nbio_v7_4.c
> @@ -30,6 +30,7 @@
>   #include "nbio/nbio_7_4_0_smn.h"
>   #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
>   #include <uapi/linux/kfd_ioctl.h>
> +#include "amdgpu_ras.h"
>   
>   #define smnNBIF_MGCG_CTRL_LCLK	0x1013a21c
>   
> @@ -329,6 +330,8 @@ static void nbio_v7_4_handle_ras_controller_intr_no_bifring(struct amdgpu_device
>   						BIF_DOORBELL_INT_CNTL,
>   						RAS_CNTLR_INTERRUPT_CLEAR, 1);
>   		WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
> +
> +		amdgpu_ras_global_ras_isr(adev);
>   	}
>   }
>   
> @@ -344,6 +347,8 @@ static void nbio_v7_4_handle_ras_err_event_athub_intr_no_bifring(struct amdgpu_d
>   						BIF_DOORBELL_INT_CNTL,
>   						RAS_ATHUB_ERR_EVENT_INTERRUPT_CLEAR, 1);
>   		WREG32_SOC15(NBIO, 0, mmBIF_DOORBELL_INT_CNTL, bif_doorbell_intr_cntl);
> +
> +		amdgpu_ras_global_ras_isr(adev);
>   	}
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> index b3ed533..b05428f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sdma_v4_0.c
> @@ -1972,24 +1972,26 @@ static int sdma_v4_0_process_ras_data_cb(struct amdgpu_device *adev,
>   	uint32_t err_source;
>   	int instance;
>   
> -	instance = sdma_v4_0_irq_id_to_seq(entry->client_id);
> -	if (instance < 0)
> -		return 0;
> +	if (!amdgpu_ras_is_supported(adev, AMDGPU_RAS_BLOCK__GFX)) {
> +		instance = sdma_v4_0_irq_id_to_seq(entry->client_id);
> +		if (instance < 0)
> +			return 0;
>   
> -	switch (entry->src_id) {
> -	case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
> -		err_source = 0;
> -		break;
> -	case SDMA0_4_0__SRCID__SDMA_ECC:
> -		err_source = 1;
> -		break;
> -	default:
> -		return 0;
> -	}
> +		switch (entry->src_id) {
> +		case SDMA0_4_0__SRCID__SDMA_SRAM_ECC:
> +			err_source = 0;
> +			break;
> +		case SDMA0_4_0__SRCID__SDMA_ECC:
> +			err_source = 1;
> +			break;
> +		default:
> +			return 0;
> +		}
>   
> -	kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
> +		kgd2kfd_set_sram_ecc_flag(adev->kfd.dev);
>   
> -	amdgpu_ras_reset_gpu(adev, 0);
> +		amdgpu_ras_reset_gpu(adev, 0);
> +	}
>   
>   	return AMDGPU_RAS_SUCCESS;
>   }

* Re: [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS.
  2019-08-30 19:55       ` Alex Deucher
@ 2019-08-31  0:21         ` Grodzovsky, Andrey
       [not found]           ` <MWHPR12MB14530CA84252B13E5150743BEABC0-Gy0DoCVfaSWZBIDmKHdw+wdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 9+ messages in thread
From: Grodzovsky, Andrey @ 2019-08-31  0:21 UTC (permalink / raw)
  To: Alex Deucher
  Cc: Christian König, Kuehling, Felix, Zhou1, Tao, amd-gfx list,
	Zhang, Hawking

But I am not the one cherry-picking to DKMS. Should I just let that person know that this is the DKMS code to use when the appropriate API doesn't exist?

Andrey

________________________________________
From: Alex Deucher <alexdeucher@gmail.com>
Sent: 30 August 2019 15:55:03
To: Grodzovsky, Andrey
Cc: amd-gfx list; Zhang, Hawking; Christian König; Zhou1, Tao; Kuehling, Felix
Subject: Re: [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS.

On Fri, Aug 30, 2019 at 12:39 PM Andrey Grodzovsky
<andrey.grodzovsky@amd.com> wrote:
>
> In case of a RAS error, allow the user to configure an automatic system
> reboot through ras_ctrl.
> This is also part of the temporary workaround for the RAS
> hang problem.
>
> Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Typo in title: dmr -> drm
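
Also, to spell out how this gets exercised: judging by the parser change below, the new option would presumably be armed from user space via the existing RAS debugfs control file, e.g. something like (path and block name are only an example):

    echo "reboot umc" > /sys/kernel/debug/dri/0/ras/ras_ctrl

and the reboot then happens on the next RAS fatal interrupt.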

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 18 ++++++++++++++++++
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 +++++++++-
>  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
>  3 files changed, 28 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c9825ae..e26f2e9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3760,6 +3760,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>         int i, r = 0;
>         bool in_ras_intr = amdgpu_ras_intr_triggered();
>
> +       /*
> +        * Flush RAM to disk so that after reboot
> +        * the user can read log and see why the system rebooted.
> +        *
> +        * Using user mode app call instead of kernel APIs such as
> +        * ksys_sync_helper for backward compatibility with earlier
> +        * kernels into which this is also intended.
> +        */
> +       if (in_ras_intr && amdgpu_ras_get_context(adev)->reboot) {
> +               char *envp[] = { "HOME=/", NULL };
> +               char *argv[] = { "/bin/sync", NULL };
> +
> +               DRM_WARN("Emergency reboot.");
> +
> +               call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
> +               emergency_restart();
> +       }
> +

This is fine for dkms, but for upstream/amd-staging, we probably want
to call the appropriate APIs directly.
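
As a rough sketch of what the upstream variant could look like (just a sketch; it assumes ksys_sync_helper() from the commit message is callable here, plus emergency_restart() which the patch already uses):

	if (in_ras_intr && amdgpu_ras_get_context(adev)->reboot) {
		DRM_WARN("Emergency reboot.");

		/* flush dirty page cache to disk so the logs survive the restart */
		ksys_sync_helper();
		emergency_restart();
	}

i.e. the only difference would be dropping the /bin/sync usermode helper call.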

>         need_full_reset = job_signaled = false;
>         INIT_LIST_HEAD(&device_list);
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> index 1cc34de..bbcfb4f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> @@ -30,6 +30,7 @@
>  #include "amdgpu_ras.h"
>  #include "amdgpu_atomfirmware.h"
>  #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
> +#include <linux/kmod.h>
>
>  const char *ras_error_string[] = {
>         "none",
> @@ -154,6 +155,8 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
>                 op = 1;
>         else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
>                 op = 2;
> +       else if (sscanf(str, "reboot %32s", block_name) == 1)
> +               op = 3;
>         else if (str[0] && str[1] && str[2] && str[3])
>                 /* ascii string, but commands are not matched. */
>                 return -EINVAL;
> @@ -287,6 +290,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
>                 /* data.inject.address is offset instead of absolute gpu address */
>                 ret = amdgpu_ras_error_inject(adev, &data.inject);
>                 break;
> +       case 3:
> +               amdgpu_ras_get_context(adev)->reboot = true;
> +               break;
>         default:
>                 ret = -EINVAL;
>                 break;
> @@ -1733,6 +1739,8 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
>  void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
>  {
>         if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
> -               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
> +               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected!\n");
> +
> +               amdgpu_ras_reset_gpu(adev, false);
>         }
>  }
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> index 3ec2a87..a83ec99 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> @@ -333,6 +333,7 @@ struct amdgpu_ras {
>         struct mutex recovery_lock;
>
>         uint32_t flags;
> +       bool reboot;
>  };
>
>  struct ras_fs_data {
> --
> 2.7.4
>

* Re: [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS.
       [not found]           ` <MWHPR12MB14530CA84252B13E5150743BEABC0-Gy0DoCVfaSWZBIDmKHdw+wdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2019-09-01 17:19             ` Alex Deucher
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Deucher @ 2019-09-01 17:19 UTC (permalink / raw)
  To: Grodzovsky, Andrey
  Cc: Christian König, Kuehling, Felix, Zhou1, Tao, amd-gfx list,
	Zhang, Hawking

yeah, that's fine.

Alex

On Fri, Aug 30, 2019 at 8:21 PM Grodzovsky, Andrey
<Andrey.Grodzovsky@amd.com> wrote:
>
> But I am not the one cherry-picking to DKMS. Should I just let that person know that this is the DKMS code to use when the appropriate API doesn't exist?
>
> Andrey
>
> ________________________________________
> From: Alex Deucher <alexdeucher@gmail.com>
> Sent: 30 August 2019 15:55:03
> To: Grodzovsky, Andrey
> Cc: amd-gfx list; Zhang, Hawking; Christian König; Zhou1, Tao; Kuehling, Felix
> Subject: Re: [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS.
>
> On Fri, Aug 30, 2019 at 12:39 PM Andrey Grodzovsky
> <andrey.grodzovsky@amd.com> wrote:
> >
> > In case of a RAS error, allow the user to configure an automatic system
> > reboot through ras_ctrl.
> > This is also part of the temporary workaround for the RAS
> > hang problem.
> >
> > Signed-off-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
>
> Typo in title: dmr -> drm
>
> > ---
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 18 ++++++++++++++++++
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c    | 10 +++++++++-
> >  drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h    |  1 +
> >  3 files changed, 28 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > index c9825ae..e26f2e9 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> > @@ -3760,6 +3760,24 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
> >         int i, r = 0;
> >         bool in_ras_intr = amdgpu_ras_intr_triggered();
> >
> > +       /*
> > +        * Flush RAM to disk so that after reboot
> > +        * the user can read log and see why the system rebooted.
> > +        *
> > +        * Using user mode app call instead of kernel APIs such as
> > +        * ksys_sync_helper for backward compatibility with earlier
> > +        * kernels into which this is also intended.
> > +        */
> > +       if (in_ras_intr && amdgpu_ras_get_context(adev)->reboot) {
> > +               char *envp[] = { "HOME=/", NULL };
> > +               char *argv[] = { "/bin/sync", NULL };
> > +
> > +               DRM_WARN("Emergency reboot.");
> > +
> > +               call_usermodehelper(argv[0], argv, envp, UMH_WAIT_PROC);
> > +               emergency_restart();
> > +       }
> > +
>
> This is fine for dkms, but for upstream/amd-staging, we probably want
> to call the appropriate APIs directly.
>
> >         need_full_reset = job_signaled = false;
> >         INIT_LIST_HEAD(&device_list);
> >
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > index 1cc34de..bbcfb4f 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.c
> > @@ -30,6 +30,7 @@
> >  #include "amdgpu_ras.h"
> >  #include "amdgpu_atomfirmware.h"
> >  #include "ivsrcid/nbio/irqsrcs_nbif_7_4.h"
> > +#include <linux/kmod.h>
> >
> >  const char *ras_error_string[] = {
> >         "none",
> > @@ -154,6 +155,8 @@ static int amdgpu_ras_debugfs_ctrl_parse_data(struct file *f,
> >                 op = 1;
> >         else if (sscanf(str, "inject %32s %8s", block_name, err) == 2)
> >                 op = 2;
> > +       else if (sscanf(str, "reboot %32s", block_name) == 1)
> > +               op = 3;
> >         else if (str[0] && str[1] && str[2] && str[3])
> >                 /* ascii string, but commands are not matched. */
> >                 return -EINVAL;
> > @@ -287,6 +290,9 @@ static ssize_t amdgpu_ras_debugfs_ctrl_write(struct file *f, const char __user *
> >                 /* data.inject.address is offset instead of absolute gpu address */
> >                 ret = amdgpu_ras_error_inject(adev, &data.inject);
> >                 break;
> > +       case 3:
> > +               amdgpu_ras_get_context(adev)->reboot = true;
> > +               break;
> >         default:
> >                 ret = -EINVAL;
> >                 break;
> > @@ -1733,6 +1739,8 @@ int amdgpu_ras_fini(struct amdgpu_device *adev)
> >  void amdgpu_ras_global_ras_isr(struct amdgpu_device *adev)
> >  {
> >         if (atomic_cmpxchg(&amdgpu_ras_in_intr, 0, 1) == 0) {
> > -               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected! Stopping all GPU jobs.\n");
> > +               DRM_WARN("RAS event of type ERREVENT_ATHUB_INTERRUPT detected!\n");
> > +
> > +               amdgpu_ras_reset_gpu(adev, false);
> >         }
> >  }
> > diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > index 3ec2a87..a83ec99 100644
> > --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ras.h
> > @@ -333,6 +333,7 @@ struct amdgpu_ras {
> >         struct mutex recovery_lock;
> >
> >         uint32_t flags;
> > +       bool reboot;
> >  };
> >
> >  struct ras_fs_data {
> > --
> > 2.7.4
> >

end of thread, other threads:[~2019-09-01 17:19 UTC | newest]

Thread overview: 9+ messages
2019-08-30 16:39 [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case Andrey Grodzovsky
     [not found] ` <1567183153-11014-1-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2019-08-30 16:39   ` [PATCH v3 2/3] dmr/amdgpu: Avoid HW GPU reset for RAS Andrey Grodzovsky
     [not found]     ` <1567183153-11014-2-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2019-08-30 19:58       ` Alex Deucher
2019-08-30 20:29       ` Kuehling, Felix
2019-08-30 16:39   ` [PATCH v3 3/3] dmr/amdgpu: Add system auto reboot to RAS Andrey Grodzovsky
     [not found]     ` <1567183153-11014-3-git-send-email-andrey.grodzovsky-5C7GfCeVMHo@public.gmane.org>
2019-08-30 19:55       ` Alex Deucher
2019-08-31  0:21         ` Grodzovsky, Andrey
     [not found]           ` <MWHPR12MB14530CA84252B13E5150743BEABC0-Gy0DoCVfaSWZBIDmKHdw+wdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2019-09-01 17:19             ` Alex Deucher
2019-08-30 20:08   ` [PATCH v3 1/3] drm/amdgpu: Fix bugs in amdgpu_device_gpu_recover in XGMI case Kuehling, Felix
