* [PATCH 0/9] *** patch series for new gpu recover***
@ 2017-10-25  9:22 Monk Liu
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

*** BLURB HERE ***

Monk Liu (9):
  drm/amdgpu:cleanup hw fence get&put
  amd/scheduler:imple job skip feature
  drm/amdgpu:implement new GPU recover
  drm/amdgpu:replace deprecated gpu reset
  drm/amdgpu:cleanup in_sriov_reset and lock_reset
  drm/amdgpu:cleanup ucode_init_bo
  drm/amdgpu:block kms open during gpu_reset
  drm/amdgpu/sriov:fix memory leak in psp_load_fw
  drm/amdgpu:fix random missing of FLR NOTIFY

 drivers/gpu/drm/amd/amdgpu/amdgpu.h           |   9 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c    | 311 ++++++++++++--------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c     |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c       |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  14 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c       |   3 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c       |  22 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c     |   3 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c      |   2 -
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h      |   2 -
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c         |   6 +-
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c         |   6 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c         |  16 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c         |   2 +-
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c |  46 ++--
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |   2 +-
 16 files changed, 230 insertions(+), 230 deletions(-)

-- 
2.7.4


* [PATCH 1/9] drm/amdgpu:cleanup hw fence get&put
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 2/9] amd/scheduler:imple job skip feature Monk Liu
                     ` (7 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

In amdgpu_fence_process() we should not put the signaled hw fence and set
the RCU pointer to NULL; the more reasonable sequence is to leave the RCU
slot untouched, and:
1) either fence_driver_fini() puts this RCU fence, or
2) emitting a new hw fence takes over the RCU slot and puts the old hw fence.

The mapping between get & put on the hw fence is:

Get                                       Put
fence_emit:fence_init                ---> free_job:put job->fence
                                     ---> run_job:put job->fence        // GPU RESET case

fence_emit:rcu_assign_pointer        ---> fence_driver_fini:put ring->fence_drv.fences[j]
                                     ---> fence_emit:put(old)           // after fence_wait(old)

run_job:job->fence=fence_get()       ---> sched_main:put(fence)
                                     ---> job_recovery:put(fence)       // for GPU RESET case

sched_main:parent=fence_get()        ---> sched_fence_free:put(fence->parent)
sched_job_recovery:parent=fence_get() --> sched_hw_job_reset:put(s_job->s_fence->parent) // for GPU RESET case
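
With the hunks below applied, the emit path in amdgpu_fence_emit() roughly
becomes the following (a condensed sketch, not the full function; the exact
check around the wait is paraphrased, "ptr" and "old" are the existing
locals):

	old = rcu_dereference_protected(*ptr, 1);
	if (old && !dma_fence_is_signaled(old))
		dma_fence_wait(old, false);

	/* drop the reference the slot held on the previous hw fence;
	 * dma_fence_put(NULL) is a no-op */
	dma_fence_put(old);

	rcu_assign_pointer(*ptr, dma_fence_get(&fence->base));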

Change-Id: I623167c1e3143233f4e1be7a400a9b698b4a1355
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index f43319c..d7374cf 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -161,6 +161,8 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f)
 		dma_fence_wait(old, false);
 	}
 
+	dma_fence_put(old);
+
 	rcu_assign_pointer(*ptr, dma_fence_get(&fence->base));
 
 	*f = &fence->base;
@@ -246,7 +248,6 @@ void amdgpu_fence_process(struct amdgpu_ring *ring)
 
 		/* There is always exactly one thread signaling this fence slot */
 		fence = rcu_dereference_protected(*ptr, 1);
-		RCU_INIT_POINTER(*ptr, NULL);
 
 		if (!fence)
 			continue;
@@ -257,7 +258,6 @@ void amdgpu_fence_process(struct amdgpu_ring *ring)
 		else
 			BUG();
 
-		dma_fence_put(fence);
 	} while (last_seq != seq);
 }
 
-- 
2.7.4


* [PATCH 2/9] amd/scheduler:imple job skip feature
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 1/9] drm/amdgpu:cleanup hw fence get&put Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 3/9] drm/amdgpu:implement new GPU recover(v2) Monk Liu
                     ` (6 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Jobs are skipped in two cases:

1) When the entity behind a job is marked guilty, the job popped from that
entity's queue is dropped in the sched_main loop.

2) In job_recovery(), the job being rescheduled is skipped if its karma is
above the limit, and other jobs sharing the same fence context are skipped
as well. This approach is taken because job_recovery() cannot access
job->entity, since the entity may already be dead.

With this feature we can introduce the new gpu recover feature.
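
A condensed sketch of the two skip points this patch introduces (simplified
from the diffs below; locking, queue handling and error paths are omitted):

	/* 1) normal path, amd_sched_entity_pop_job(): drop jobs whose
	 * entity was already marked guilty */
	if (entity->guilty && atomic_read(entity->guilty))
		amd_sched_job_fake_signal(sched_job);

	/* 2) recovery path, amd_sched_job_recovery(): skip the hang job and
	 * every job sharing its fence context */
	skip = found_guilty &&
	       s_job->s_fence->scheduled.context == guilty_context;
	fence = sched->ops->run_job(s_job, skip);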

Change-Id: I268b1c752c94e6ecd4ea78c87eb226ea3f52908a
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  9 +++---
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 46 +++++++++++++++++++--------
 drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +-
 3 files changed, 39 insertions(+), 18 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index a58e3c5..f08fde9 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -177,7 +177,7 @@ static struct dma_fence *amdgpu_job_dependency(struct amd_sched_job *sched_job)
 	return fence;
 }
 
-static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job)
+static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job, bool skip)
 {
 	struct dma_fence *fence = NULL;
 	struct amdgpu_device *adev;
@@ -194,10 +194,11 @@ static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job)
 	BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
 
 	trace_amdgpu_sched_run_job(job);
-	/* skip ib schedule when vram is lost */
-	if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
+
+	if (skip || job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
+		/* skip IB scheduling when requested or when VRAM was lost, and set the error */
 		dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
-		DRM_ERROR("Skip scheduling IBs!\n");
+		DRM_INFO("Skip scheduling IBs!\n");
 	} else {
 		r = amdgpu_ib_schedule(job->ring, job->num_ibs, job->ibs, job,
 				       &fence);
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
index 9cbeade..29841ba 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
@@ -330,6 +330,20 @@ static bool amd_sched_entity_add_dependency_cb(struct amd_sched_entity *entity)
 	return false;
 }
 
+static void amd_sched_job_fake_signal(struct amd_sched_job *job)
+{
+	struct amd_sched_fence *s_fence = job->s_fence;
+
+	dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
+	/* fake signaling the scheduled fence */
+	amd_sched_fence_scheduled(s_fence);
+	/* fake signaling the finished fence */
+	dma_fence_put(&job->s_fence->finished);
+	job->sched->ops->free_job(job);
+	/* call CB on the s_fence, e.g. wake up WAIT_CS */
+	amd_sched_fence_finished(s_fence);
+}
+
 static struct amd_sched_job *
 amd_sched_entity_pop_job(struct amd_sched_entity *entity)
 {
@@ -345,6 +359,12 @@ amd_sched_entity_pop_job(struct amd_sched_entity *entity)
 			return NULL;
 
 	sched_job->s_entity = NULL;
+
+	if (entity->guilty && atomic_read(entity->guilty)) {
+		amd_sched_job_fake_signal(sched_job);
+		sched_job = NULL;
+	}
+
 	spsc_queue_pop(&entity->job_queue);
 	return sched_job;
 }
@@ -441,13 +461,6 @@ static void amd_sched_job_timedout(struct work_struct *work)
 	job->sched->ops->timedout_job(job);
 }
 
-static void amd_sched_set_guilty(struct amd_sched_job *s_job)
-{
-	if (atomic_inc_return(&s_job->karma) > s_job->sched->hang_limit)
-		if (s_job->s_entity->guilty)
-			atomic_set(s_job->s_entity->guilty, 1);
-}
-
 void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_job *bad)
 {
 	struct amd_sched_job *s_job;
@@ -466,7 +479,7 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
 	}
 	spin_unlock(&sched->job_list_lock);
 
-	if (bad) {
+	if (bad && atomic_inc_return(&bad->karma) > bad->sched->hang_limit) {
 		bool found = false;
 
 		for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++ ) {
@@ -476,7 +489,8 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
 			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
 				if (bad->s_fence->scheduled.context == entity->fence_context) {
 					found = true;
-					amd_sched_set_guilty(bad);
+					if (entity->guilty)
+						atomic_set(entity->guilty, 1);
 					break;
 				}
 			}
@@ -510,9 +524,17 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
 	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
 		struct amd_sched_fence *s_fence = s_job->s_fence;
 		struct dma_fence *fence;
+		bool found_guilty = false, skip;
+		uint64_t guilty_context;
+
+		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
+			found_guilty = true;
+			guilty_context = s_job->s_fence->scheduled.context;
+		}
 
+		skip = (found_guilty && s_job->s_fence->scheduled.context == guilty_context);
 		spin_unlock(&sched->job_list_lock);
-		fence = sched->ops->run_job(s_job);
+		fence = sched->ops->run_job(s_job, skip);
 		atomic_inc(&sched->hw_rq_count);
 		if (fence) {
 			s_fence->parent = dma_fence_get(fence);
@@ -525,7 +547,6 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
 					  r);
 			dma_fence_put(fence);
 		} else {
-			DRM_ERROR("Failed to run job!\n");
 			amd_sched_process_job(NULL, &s_fence->cb);
 		}
 		spin_lock(&sched->job_list_lock);
@@ -650,7 +671,7 @@ static int amd_sched_main(void *param)
 		atomic_inc(&sched->hw_rq_count);
 		amd_sched_job_begin(sched_job);
 
-		fence = sched->ops->run_job(sched_job);
+		fence = sched->ops->run_job(sched_job, false);
 		amd_sched_fence_scheduled(s_fence);
 
 		if (fence) {
@@ -664,7 +685,6 @@ static int amd_sched_main(void *param)
 					  r);
 			dma_fence_put(fence);
 		} else {
-			DRM_ERROR("Failed to run job!\n");
 			amd_sched_process_job(NULL, &s_fence->cb);
 		}
 
diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
index f9e3a83..a8e5a7d 100644
--- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
+++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
@@ -126,7 +126,7 @@ static inline bool amd_sched_invalidate_job(struct amd_sched_job *s_job, int thr
 */
 struct amd_sched_backend_ops {
 	struct dma_fence *(*dependency)(struct amd_sched_job *sched_job);
-	struct dma_fence *(*run_job)(struct amd_sched_job *sched_job);
+	struct dma_fence *(*run_job)(struct amd_sched_job *sched_job, bool skip);
 	void (*timedout_job)(struct amd_sched_job *sched_job);
 	void (*free_job)(struct amd_sched_job *sched_job);
 };
-- 
2.7.4


* [PATCH 3/9] drm/amdgpu:implement new GPU recover(v2)
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 1/9] drm/amdgpu:cleanup hw fence get&put Monk Liu
  2017-10-25  9:22   ` [PATCH 2/9] amd/scheduler:imple job skip feature Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset Monk Liu
                     ` (5 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1) The new implementation is named amdgpu_gpu_recover, which gives a better
hint of what it does compared with gpu_reset.

2) gpu_recover unifies bare-metal and SR-IOV; only the ASIC reset part is
implemented differently.

3) gpu_recover increases the hang job's karma and marks its entity/context
as guilty if the karma exceeds the limit.

V2:

4) In the scheduler main routine, a job from a guilty context is immediately
fake-signaled after it is popped from the queue, and its fence is set with
the -ECANCELED error.

5) In the scheduler recovery routine, all jobs from the guilty entity are
dropped.

6) In the run_job() routine, the real IB submission is skipped if the @skip
parameter is true or VRAM loss occurred.
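
The overall control flow of the new entry point, condensed from the
implementation below (DC handling, shadow-BO recovery and error paths are
left out of this sketch):

	int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
	{
		uint64_t reset_flags = 0;
		int r;

		mutex_lock(&adev->virt.lock_reset);

		/* park scheduler threads, amd_sched_hw_job_reset(),
		 * amdgpu_fence_driver_force_completion() per ring */

		if (amdgpu_sriov_vf(adev))
			r = amdgpu_reset_sriov(adev, &reset_flags, job ? false : true);
		else
			r = amdgpu_reset(adev, &reset_flags);

		/* on success: recover VRAM from shadow BOs if needed, then
		 * amd_sched_job_recovery() and unpark the scheduler threads */

		mutex_unlock(&adev->virt.lock_reset);
		return r;
	}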

Change-Id: I30d30924eeb512e8f3243a8af727d061ed41f800
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 +
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 262 +++++++++++++++++++++++++++++
 2 files changed, 266 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index ba1ab97..003668f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -178,6 +178,10 @@ extern int amdgpu_cik_support;
 #define CIK_CURSOR_WIDTH 128
 #define CIK_CURSOR_HEIGHT 128
 
+/* GPU RESET flags */
+#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0)
+#define AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
+
 struct amdgpu_device;
 struct amdgpu_ib;
 struct amdgpu_cs_parser;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a07544d..0db3b3c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -3116,6 +3116,268 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
 	return r;
 }
 
+static int amdgpu_reset(struct amdgpu_device *adev, uint64_t* reset_flags)
+{
+	int r;
+	bool need_full_reset, vram_lost = 0;
+
+	need_full_reset = amdgpu_need_full_reset(adev);
+
+	if (!need_full_reset) {
+		amdgpu_pre_soft_reset(adev);
+		r = amdgpu_soft_reset(adev);
+		amdgpu_post_soft_reset(adev);
+		if (r || amdgpu_check_soft_reset(adev)) {
+			DRM_INFO("soft reset failed, will fallback to full reset!\n");
+			need_full_reset = true;
+		}
+
+	}
+
+	if (need_full_reset) {
+		r = amdgpu_suspend(adev);
+
+retry:
+		amdgpu_atombios_scratch_regs_save(adev);
+		r = amdgpu_asic_reset(adev);
+		amdgpu_atombios_scratch_regs_restore(adev);
+		/* post card */
+		amdgpu_atom_asic_init(adev->mode_info.atom_context);
+
+		if (!r) {
+			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
+			r = amdgpu_resume_phase1(adev);
+			if (r)
+				goto out;
+
+			vram_lost = amdgpu_check_vram_lost(adev);
+			if (vram_lost) {
+				DRM_ERROR("VRAM is lost!\n");
+				atomic_inc(&adev->vram_lost_counter);
+			}
+
+			r = amdgpu_ttm_recover_gart(adev);
+			if (r)
+				goto out;
+
+			r = amdgpu_resume_phase2(adev);
+			if (r)
+				goto out;
+
+			if (vram_lost)
+				amdgpu_fill_reset_magic(adev);
+		}
+	}
+
+out:
+	if (!r) {
+		amdgpu_irq_gpu_reset_resume_helper(adev);
+		r = amdgpu_ib_ring_tests(adev);
+		if (r) {
+			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
+			r = amdgpu_suspend(adev);
+			need_full_reset = true;
+			goto retry;
+		}
+	}
+
+	if (reset_flags) {
+		if (vram_lost)
+			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
+
+		if (need_full_reset)
+			(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
+	}
+
+	return r;
+}
+
+static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags, bool from_hypervisor)
+{
+	int r;
+
+	if (from_hypervisor)
+		r = amdgpu_virt_request_full_gpu(adev, true);
+	else
+		r = amdgpu_virt_reset_gpu(adev);
+	if (r)
+		return r;
+
+	/* Resume IP prior to SMC */
+	r = amdgpu_sriov_reinit_early(adev);
+	if (r)
+		goto error;
+
+	/* we need recover gart prior to run SMC/CP/SDMA resume */
+	amdgpu_ttm_recover_gart(adev);
+
+	/* now we are okay to resume SMC/CP/SDMA */
+	r = amdgpu_sriov_reinit_late(adev);
+	if (r)
+		goto error;
+
+	amdgpu_irq_gpu_reset_resume_helper(adev);
+	r = amdgpu_ib_ring_tests(adev);
+	if (r)
+		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
+
+error:
+	/* release full control of GPU after ib test */
+	amdgpu_virt_release_full_gpu(adev, true);
+
+	if (reset_flags) {
+		/* we will get vram_lost from GIM in the future; for now every
+		 * reset request is treated as having lost VRAM
+		 */
+		(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
+		atomic_inc(&adev->vram_lost_counter);
+
+		/* VF FLR or hotlink reset is always full-reset */
+		(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
+	}
+
+	return r;
+}
+
+/**
+ * amdgpu_gpu_recover - reset the asic and recover scheduler
+ *
+ * @adev: amdgpu device pointer
+ * @job: which job trigger hang
+ *
+ * Attempt to reset the GPU if it has hung (all asics).
+ * Returns 0 for success or an error on failure.
+ */
+int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
+{
+	struct drm_atomic_state *state = NULL;
+	uint64_t reset_flags = 0;
+	int i, r, resched;
+
+	if (!amdgpu_check_soft_reset(adev)) {
+		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
+		return 0;
+	}
+
+	dev_info(adev->dev, "GPU reset begin!\n");
+
+	mutex_lock(&adev->virt.lock_reset);
+	atomic_inc(&adev->gpu_reset_counter);
+	adev->in_sriov_reset = 1;
+
+	/* block TTM */
+	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
+	/* store modesetting */
+	if (amdgpu_device_has_dc_support(adev))
+		state = drm_atomic_helper_suspend(adev->ddev);
+
+	/* block scheduler */
+	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+		struct amdgpu_ring *ring = adev->rings[i];
+
+		if (!ring || !ring->sched.thread)
+			continue;
+
+		/* only focus on the ring that hit the timeout if @job is not NULL */
+		if (job && job->ring->idx != i)
+			continue;
+
+		kthread_park(ring->sched.thread);
+		amd_sched_hw_job_reset(&ring->sched, job ? &job->base : NULL);
+
+		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
+		amdgpu_fence_driver_force_completion(ring);
+	}
+
+	if (amdgpu_sriov_vf(adev))
+		r = amdgpu_reset_sriov(adev, &reset_flags, job ? false : true);
+	else
+		r = amdgpu_reset(adev, &reset_flags);
+
+	if (!r) {
+		if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
+			(reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
+			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
+			struct amdgpu_bo *bo, *tmp;
+			struct dma_fence *fence = NULL, *next = NULL;
+
+			DRM_INFO("recover vram bo from shadow\n");
+			mutex_lock(&adev->shadow_list_lock);
+			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
+				next = NULL;
+				amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
+				if (fence) {
+					r = dma_fence_wait(fence, false);
+					if (r) {
+						WARN(r, "recovery from shadow isn't completed\n");
+						break;
+					}
+				}
+
+				dma_fence_put(fence);
+				fence = next;
+			}
+			mutex_unlock(&adev->shadow_list_lock);
+			if (fence) {
+				r = dma_fence_wait(fence, false);
+				if (r)
+					WARN(r, "recovery from shadow isn't completed\n");
+			}
+			dma_fence_put(fence);
+		}
+
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			/* only focus on the ring that hit the timeout if @job is not NULL */
+			if (job && job->ring->idx != i)
+				continue;
+
+			amd_sched_job_recovery(&ring->sched);
+			kthread_unpark(ring->sched.thread);
+		}
+	} else {
+		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
+			struct amdgpu_ring *ring = adev->rings[i];
+
+			if (!ring || !ring->sched.thread)
+				continue;
+
+			/* only focus on the ring that hit the timeout if @job is not NULL */
+			if (job && job->ring->idx != i)
+				continue;
+
+			kthread_unpark(adev->rings[i]->sched.thread);
+		}
+	}
+
+	if (amdgpu_device_has_dc_support(adev)) {
+		if (drm_atomic_helper_resume(adev->ddev, state))
+			dev_info(adev->dev, "drm resume failed\n");
+		amdgpu_dm_display_resume(adev);
+	} else {
+		drm_helper_resume_force_mode(adev->ddev);
+	}
+
+	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
+
+	if (r) {
+		/* bad news, how to tell it to userspace ? */
+		dev_info(adev->dev, "GPU reset(%d) failed\n", atomic_read(&adev->gpu_reset_counter));
+		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
+	} else {
+		dev_info(adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&adev->gpu_reset_counter));
+	}
+
+	amdgpu_vf_error_trans_all(adev);
+	adev->in_sriov_reset = 0;
+	mutex_unlock(&adev->virt.lock_reset);
+	return r;
+}
+
 void amdgpu_get_pcie_info(struct amdgpu_device *adev)
 {
 	u32 mask;
-- 
2.7.4


* [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (2 preceding siblings ...)
  2017-10-25  9:22   ` [PATCH 3/9] drm/amdgpu:implement new GPU recover(v2) Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-5-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 5/9] drm/amdgpu:cleanup in_sriov_reset and lock_reset Monk Liu
                     ` (4 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Switch all callers to the new gpu recover path; the deprecated
amdgpu_gpu_reset() and amdgpu_sriov_gpu_reset() implementations are removed.

Change-Id: Ieccd25772c47c0e710ad81537a3dd0c1767585a1
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 298 -----------------------------
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  10 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |   2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   5 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |   2 +-
 drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |   2 +-
 8 files changed, 10 insertions(+), 312 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 003668f..335df11 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1844,7 +1844,7 @@ amdgpu_get_sdma_instance(struct amdgpu_ring *ring)
 #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
 
 /* Common functions */
-int amdgpu_gpu_reset(struct amdgpu_device *adev);
+int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job);
 bool amdgpu_need_backup(struct amdgpu_device *adev);
 void amdgpu_pci_config_reset(struct amdgpu_device *adev);
 bool amdgpu_need_post(struct amdgpu_device *adev);
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 0db3b3c..a2f9a7f 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2818,304 +2818,6 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
 	return r;
 }
 
-/**
- * amdgpu_sriov_gpu_reset - reset the asic
- *
- * @adev: amdgpu device pointer
- * @job: which job trigger hang
- *
- * Attempt the reset the GPU if it has hung (all asics).
- * for SRIOV case.
- * Returns 0 for success or an error on failure.
- */
-int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
-{
-	int i, j, r = 0;
-	int resched;
-	struct amdgpu_bo *bo, *tmp;
-	struct amdgpu_ring *ring;
-	struct dma_fence *fence = NULL, *next = NULL;
-
-	mutex_lock(&adev->virt.lock_reset);
-	atomic_inc(&adev->gpu_reset_counter);
-	adev->in_sriov_reset = true;
-
-	/* block TTM */
-	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
-
-	/* we start from the ring trigger GPU hang */
-	j = job ? job->ring->idx : 0;
-
-	/* block scheduler */
-	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
-		ring = adev->rings[i % AMDGPU_MAX_RINGS];
-		if (!ring || !ring->sched.thread)
-			continue;
-
-		kthread_park(ring->sched.thread);
-
-		if (job && j != i)
-			continue;
-
-		/* here give the last chance to check if job removed from mirror-list
-		 * since we already pay some time on kthread_park */
-		if (job && list_empty(&job->base.node)) {
-			kthread_unpark(ring->sched.thread);
-			goto give_up_reset;
-		}
-
-		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
-			amd_sched_job_kickout(&job->base);
-
-		/* only do job_reset on the hang ring if @job not NULL */
-		amd_sched_hw_job_reset(&ring->sched, NULL);
-
-		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
-		amdgpu_fence_driver_force_completion(ring);
-	}
-
-	/* request to take full control of GPU before re-initialization  */
-	if (job)
-		amdgpu_virt_reset_gpu(adev);
-	else
-		amdgpu_virt_request_full_gpu(adev, true);
-
-
-	/* Resume IP prior to SMC */
-	amdgpu_sriov_reinit_early(adev);
-
-	/* we need recover gart prior to run SMC/CP/SDMA resume */
-	amdgpu_ttm_recover_gart(adev);
-
-	/* now we are okay to resume SMC/CP/SDMA */
-	amdgpu_sriov_reinit_late(adev);
-
-	amdgpu_irq_gpu_reset_resume_helper(adev);
-
-	if (amdgpu_ib_ring_tests(adev))
-		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
-
-	/* release full control of GPU after ib test */
-	amdgpu_virt_release_full_gpu(adev, true);
-
-	DRM_INFO("recover vram bo from shadow\n");
-
-	ring = adev->mman.buffer_funcs_ring;
-	mutex_lock(&adev->shadow_list_lock);
-	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
-		next = NULL;
-		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
-		if (fence) {
-			r = dma_fence_wait(fence, false);
-			if (r) {
-				WARN(r, "recovery from shadow isn't completed\n");
-				break;
-			}
-		}
-
-		dma_fence_put(fence);
-		fence = next;
-	}
-	mutex_unlock(&adev->shadow_list_lock);
-
-	if (fence) {
-		r = dma_fence_wait(fence, false);
-		if (r)
-			WARN(r, "recovery from shadow isn't completed\n");
-	}
-	dma_fence_put(fence);
-
-	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
-		ring = adev->rings[i % AMDGPU_MAX_RINGS];
-		if (!ring || !ring->sched.thread)
-			continue;
-
-		if (job && j != i) {
-			kthread_unpark(ring->sched.thread);
-			continue;
-		}
-
-		amd_sched_job_recovery(&ring->sched);
-		kthread_unpark(ring->sched.thread);
-	}
-
-	drm_helper_resume_force_mode(adev->ddev);
-give_up_reset:
-	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
-	if (r) {
-		/* bad news, how to tell it to userspace ? */
-		dev_info(adev->dev, "GPU reset failed\n");
-	} else {
-		dev_info(adev->dev, "GPU reset successed!\n");
-	}
-
-	adev->in_sriov_reset = false;
-	mutex_unlock(&adev->virt.lock_reset);
-	return r;
-}
-
-/**
- * amdgpu_gpu_reset - reset the asic
- *
- * @adev: amdgpu device pointer
- *
- * Attempt the reset the GPU if it has hung (all asics).
- * Returns 0 for success or an error on failure.
- */
-int amdgpu_gpu_reset(struct amdgpu_device *adev)
-{
-	struct drm_atomic_state *state = NULL;
-	int i, r;
-	int resched;
-	bool need_full_reset, vram_lost = false;
-
-	if (!amdgpu_check_soft_reset(adev)) {
-		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
-		return 0;
-	}
-
-	atomic_inc(&adev->gpu_reset_counter);
-
-	/* block TTM */
-	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
-	/* store modesetting */
-	if (amdgpu_device_has_dc_support(adev))
-		state = drm_atomic_helper_suspend(adev->ddev);
-
-	/* block scheduler */
-	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-		struct amdgpu_ring *ring = adev->rings[i];
-
-		if (!ring || !ring->sched.thread)
-			continue;
-		kthread_park(ring->sched.thread);
-		amd_sched_hw_job_reset(&ring->sched, NULL);
-		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
-		amdgpu_fence_driver_force_completion(ring);
-	}
-
-	need_full_reset = amdgpu_need_full_reset(adev);
-
-	if (!need_full_reset) {
-		amdgpu_pre_soft_reset(adev);
-		r = amdgpu_soft_reset(adev);
-		amdgpu_post_soft_reset(adev);
-		if (r || amdgpu_check_soft_reset(adev)) {
-			DRM_INFO("soft reset failed, will fallback to full reset!\n");
-			need_full_reset = true;
-		}
-	}
-
-	if (need_full_reset) {
-		r = amdgpu_suspend(adev);
-
-retry:
-		amdgpu_atombios_scratch_regs_save(adev);
-		r = amdgpu_asic_reset(adev);
-		amdgpu_atombios_scratch_regs_restore(adev);
-		/* post card */
-		amdgpu_atom_asic_init(adev->mode_info.atom_context);
-
-		if (!r) {
-			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
-			r = amdgpu_resume_phase1(adev);
-			if (r)
-				goto out;
-			vram_lost = amdgpu_check_vram_lost(adev);
-			if (vram_lost) {
-				DRM_ERROR("VRAM is lost!\n");
-				atomic_inc(&adev->vram_lost_counter);
-			}
-			r = amdgpu_ttm_recover_gart(adev);
-			if (r)
-				goto out;
-			r = amdgpu_resume_phase2(adev);
-			if (r)
-				goto out;
-			if (vram_lost)
-				amdgpu_fill_reset_magic(adev);
-		}
-	}
-out:
-	if (!r) {
-		amdgpu_irq_gpu_reset_resume_helper(adev);
-		r = amdgpu_ib_ring_tests(adev);
-		if (r) {
-			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
-			r = amdgpu_suspend(adev);
-			need_full_reset = true;
-			goto retry;
-		}
-		/**
-		 * recovery vm page tables, since we cannot depend on VRAM is
-		 * consistent after gpu full reset.
-		 */
-		if (need_full_reset && amdgpu_need_backup(adev)) {
-			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
-			struct amdgpu_bo *bo, *tmp;
-			struct dma_fence *fence = NULL, *next = NULL;
-
-			DRM_INFO("recover vram bo from shadow\n");
-			mutex_lock(&adev->shadow_list_lock);
-			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
-				next = NULL;
-				amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
-				if (fence) {
-					r = dma_fence_wait(fence, false);
-					if (r) {
-						WARN(r, "recovery from shadow isn't completed\n");
-						break;
-					}
-				}
-
-				dma_fence_put(fence);
-				fence = next;
-			}
-			mutex_unlock(&adev->shadow_list_lock);
-			if (fence) {
-				r = dma_fence_wait(fence, false);
-				if (r)
-					WARN(r, "recovery from shadow isn't completed\n");
-			}
-			dma_fence_put(fence);
-		}
-		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-			struct amdgpu_ring *ring = adev->rings[i];
-
-			if (!ring || !ring->sched.thread)
-				continue;
-
-			amd_sched_job_recovery(&ring->sched);
-			kthread_unpark(ring->sched.thread);
-		}
-	} else {
-		dev_err(adev->dev, "asic resume failed (%d).\n", r);
-		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
-			if (adev->rings[i] && adev->rings[i]->sched.thread) {
-				kthread_unpark(adev->rings[i]->sched.thread);
-			}
-		}
-	}
-
-	if (amdgpu_device_has_dc_support(adev)) {
-		r = drm_atomic_helper_resume(adev->ddev, state);
-		amdgpu_dm_display_resume(adev);
-	} else
-		drm_helper_resume_force_mode(adev->ddev);
-
-	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
-	if (r) {
-		/* bad news, how to tell it to userspace ? */
-		dev_info(adev->dev, "GPU reset failed\n");
-	}
-	else {
-		dev_info(adev->dev, "GPU reset successed!\n");
-	}
-
-	amdgpu_vf_error_trans_all(adev);
-	return r;
-}
-
 static int amdgpu_reset(struct amdgpu_device *adev, uint64_t* reset_flags)
 {
 	int r;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index d7374cf..9d67bcb 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -694,25 +694,25 @@ static int amdgpu_debugfs_fence_info(struct seq_file *m, void *data)
 }
 
 /**
- * amdgpu_debugfs_gpu_reset - manually trigger a gpu reset
+ * amdgpu_debugfs_gpu_recover - manually trigger a gpu reset & recover
  *
  * Manually trigger a gpu reset at the next fence wait.
  */
-static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
+static int amdgpu_debugfs_gpu_recover(struct seq_file *m, void *data)
 {
 	struct drm_info_node *node = (struct drm_info_node *) m->private;
 	struct drm_device *dev = node->minor->dev;
 	struct amdgpu_device *adev = dev->dev_private;
 
-	seq_printf(m, "gpu reset\n");
-	amdgpu_gpu_reset(adev);
+	seq_printf(m, "gpu recover\n");
+	amdgpu_gpu_recover(adev, NULL);
 
 	return 0;
 }
 
 static const struct drm_info_list amdgpu_debugfs_fence_list[] = {
 	{"amdgpu_fence_info", &amdgpu_debugfs_fence_info, 0, NULL},
-	{"amdgpu_gpu_reset", &amdgpu_debugfs_gpu_reset, 0, NULL}
+	{"amdgpu_gpu_recover", &amdgpu_debugfs_gpu_recover, 0, NULL}
 };
 
 static const struct drm_info_list amdgpu_debugfs_fence_list_sriov[] = {
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
index 47c5ce9..9c2f87c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
@@ -88,7 +88,7 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
 						  reset_work);
 
 	if (!amdgpu_sriov_vf(adev))
-		amdgpu_gpu_reset(adev);
+		amdgpu_gpu_recover(adev, NULL);
 }
 
 /* Disable *all* interrupts */
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
index f08fde9..ac6a4f3 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
@@ -37,10 +37,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
 		  atomic_read(&job->ring->fence_drv.last_seq),
 		  job->ring->fence_drv.sync_seq);
 
-	if (amdgpu_sriov_vf(job->adev))
-		amdgpu_sriov_gpu_reset(job->adev, job);
-	else
-		amdgpu_gpu_reset(job->adev);
+	amdgpu_gpu_recover(job->adev, job);
 }
 
 int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index b89d37f..3a661aa 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -285,7 +285,6 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v);
 int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
 int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
-int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
 int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
 void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
 int amdgpu_virt_fw_reserve_get_checksum(void *obj, unsigned long obj_size,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index b4906d2..f8522a0 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -254,7 +254,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery due to world switch failure */
-	amdgpu_sriov_gpu_reset(adev, NULL);
+	amdgpu_gpu_recover(adev, NULL);
 }
 
 static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
index c25a831..dae6d3a 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
@@ -514,7 +514,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
 	}
 
 	/* Trigger recovery due to world switch failure */
-	amdgpu_sriov_gpu_reset(adev, NULL);
+	amdgpu_gpu_recover(adev, NULL);
 }
 
 static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
-- 
2.7.4


* [PATCH 5/9] drm/amdgpu:cleanup in_sriov_reset and lock_reset
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (3 preceding siblings ...)
  2017-10-25  9:22   ` [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-6-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 6/9] drm/amdgpu:cleanup ucode_init_bo Monk Liu
                     ` (3 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Since gpu reset is now unified in gpu_recover for both bare-metal and
SR-IOV:

1) rename in_sriov_reset to in_gpu_reset
2) move lock_reset from adev->virt to adev

Change-Id: I9f4dbab9a4c916fbc156f669824d15ddcd0f2322
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 3 ++-
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  | 2 +-
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 2 --
 drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 1 -
 drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 6 +++---
 drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 6 +++---
 8 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
index 335df11..6e89be5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
@@ -1650,7 +1650,8 @@ struct amdgpu_device {
 
 	/* record last mm index being written through WREG32*/
 	unsigned long last_mm_index;
-	bool                            in_sriov_reset;
+	bool                            in_gpu_reset;
+	struct mutex  lock_reset;
 };
 
 static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index a2f9a7f..4cf1146 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -2161,6 +2161,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
 	mutex_init(&adev->mn_lock);
 	mutex_init(&adev->virt.vf_errors.lock);
 	hash_init(adev->mn_hash);
+	mutex_init(&adev->lock_reset);
 
 	amdgpu_check_arguments(adev);
 
@@ -2963,9 +2964,9 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 
 	dev_info(adev->dev, "GPU reset begin!\n");
 
-	mutex_lock(&adev->virt.lock_reset);
+	mutex_lock(&adev->lock_reset);
 	atomic_inc(&adev->gpu_reset_counter);
-	adev->in_sriov_reset = 1;
+	adev->in_gpu_reset = 1;
 
 	/* block TTM */
 	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
@@ -3075,8 +3076,8 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
 	}
 
 	amdgpu_vf_error_trans_all(adev);
-	adev->in_sriov_reset = 0;
-	mutex_unlock(&adev->virt.lock_reset);
+	adev->in_gpu_reset = 0;
+	mutex_unlock(&adev->lock_reset);
 	return r;
 }
 
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 447d446..76f531b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -264,7 +264,7 @@ static int psp_hw_start(struct psp_context *psp)
 	struct amdgpu_device *adev = psp->adev;
 	int ret;
 
-	if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
+	if (!amdgpu_sriov_vf(adev) || !adev->in_gpu_reset) {
 		ret = psp_bootloader_load_sysdrv(psp);
 		if (ret)
 			return ret;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index 6564902..edc37cc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -370,7 +370,7 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 		return 0;
 	}
 
-	if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
+	if (!amdgpu_sriov_vf(adev) || !adev->in_gpu_reset) {
 		err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
 					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
 					AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
index e97f80f..c249725 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
@@ -107,8 +107,6 @@ void amdgpu_virt_init_setting(struct amdgpu_device *adev)
 	adev->enable_virtual_display = true;
 	adev->cg_flags = 0;
 	adev->pg_flags = 0;
-
-	mutex_init(&adev->virt.lock_reset);
 }
 
 uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
index 3a661aa..a710384 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
@@ -238,7 +238,6 @@ struct amdgpu_virt {
 	uint64_t			csa_vmid0_addr;
 	bool chained_ib_support;
 	uint32_t			reg_val_offs;
-	struct mutex                    lock_reset;
 	struct amdgpu_irq_src		ack_irq;
 	struct amdgpu_irq_src		rcv_irq;
 	struct work_struct		flr_work;
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
index b8002ac..72ddb6e 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
@@ -4807,7 +4807,7 @@ static int gfx_v8_0_kiq_init_queue(struct amdgpu_ring *ring)
 
 	gfx_v8_0_kiq_setting(ring);
 
-	if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	if (adev->in_gpu_reset) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct vi_mqd_allocation));
@@ -4844,7 +4844,7 @@ static int gfx_v8_0_kcq_init_queue(struct amdgpu_ring *ring)
 	struct vi_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!adev->in_sriov_reset && !adev->gfx.in_suspend) {
+	if (!adev->in_gpu_reset && !adev->gfx.in_suspend) {
 		memset((void *)mqd, 0, sizeof(struct vi_mqd_allocation));
 		((struct vi_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
 		((struct vi_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
@@ -4856,7 +4856,7 @@ static int gfx_v8_0_kcq_init_queue(struct amdgpu_ring *ring)
 
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct vi_mqd_allocation));
-	} else if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	} else if (adev->in_gpu_reset) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct vi_mqd_allocation));
diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
index 8738b13..d598d78 100644
--- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
+++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
@@ -2732,7 +2732,7 @@ static int gfx_v9_0_kiq_init_queue(struct amdgpu_ring *ring)
 
 	gfx_v9_0_kiq_setting(ring);
 
-	if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	if (adev->in_gpu_reset) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct v9_mqd_allocation));
@@ -2770,7 +2770,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
 	struct v9_mqd *mqd = ring->mqd_ptr;
 	int mqd_idx = ring - &adev->gfx.compute_ring[0];
 
-	if (!adev->in_sriov_reset && !adev->gfx.in_suspend) {
+	if (!adev->in_gpu_reset && !adev->gfx.in_suspend) {
 		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
 		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
 		((struct v9_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
@@ -2782,7 +2782,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
 
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct v9_mqd_allocation));
-	} else if (adev->in_sriov_reset) { /* for GPU_RESET case */
+	} else if (adev->in_gpu_reset) { /* for GPU_RESET case */
 		/* reset MQD to a clean status */
 		if (adev->gfx.mec.mqd_backup[mqd_idx])
 			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct v9_mqd_allocation));
-- 
2.7.4


* [PATCH 6/9] drm/amdgpu:cleanup ucode_init_bo
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (4 preceding siblings ...)
  2017-10-25  9:22   ` [PATCH 5/9] drm/amdgpu:cleanup in_sriov_reset and lock_reset Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset Monk Liu
                     ` (2 subsequent siblings)
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

1) No SR-IOV check is needed, since gpu recover is unified.
2) The CPU_ACCESS_REQUIRED flag is needed for VRAM under SR-IOV, because
otherwise the first allocated VRAM BO is wasted after the following pin,
due to how the TTM manager handles it.

Change-Id: I4d029f2da8bb463942c7861d3e52f309bdba9576
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
index edc37cc..4939e5a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
@@ -370,9 +370,10 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
 		return 0;
 	}
 
-	if (!amdgpu_sriov_vf(adev) || !adev->in_gpu_reset) {
+	if (!adev->in_gpu_reset) {
 		err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
 					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
+					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS|AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED :
 					AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
 					NULL, NULL, 0, bo);
 		if (err) {
-- 
2.7.4


* [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (5 preceding siblings ...)
  2017-10-25  9:22   ` [PATCH 6/9] drm/amdgpu:cleanup ucode_init_bo Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 8/9] drm/amdgpu/sriov:fix memory leak in psp_load_fw Monk Liu
  2017-10-25  9:22   ` [PATCH 9/9] drm/amdgpu:fix random missing of FLR NOTIFY Monk Liu
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
index 4a9f749..c155ce4 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
@@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
 	if (r < 0)
 		return r;
 
+	if (adev->in_gpu_reset)
+		return -ENODEV;
+
 	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
 	if (unlikely(!fpriv)) {
 		r = -ENOMEM;
-- 
2.7.4


* [PATCH 8/9] drm/amdgpu/sriov:fix memory leak in psp_load_fw
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (6 preceding siblings ...)
  2017-10-25  9:22   ` [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  2017-10-25  9:22   ` [PATCH 9/9] drm/amdgpu:fix random missing of FLR NOTIFY Monk Liu
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

For SR-IOV, this routine should not allocate resources again when doing a
gpu reset; otherwise the buffers allocated on the first load are leaked.
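
A condensed sketch of the resulting flow in psp_load_fw() (labels and calls
follow the diff below; the buffer allocations are summarized in a comment):

	if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset)
		goto skip_memalloc;	/* buffers from the first load are still valid */

	/* first load only: allocate psp->cmd, fw_pri_bo, fence_buf_bo, ... */

skip_memalloc:
	ret = psp_hw_start(psp);
	if (ret)
		goto failed_mem;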

Change-Id: I25da3a5b475196c75c7e639adc40751754625968
Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 20 ++++++++++++--------
 1 file changed, 12 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
index 76f531b..2157d45 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
@@ -334,23 +334,26 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	int ret;
 	struct psp_context *psp = &adev->psp;
 
+	if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset != 0)
+		goto skip_memalloc;
+
 	psp->cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
 	if (!psp->cmd)
 		return -ENOMEM;
 
 	ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
-				      AMDGPU_GEM_DOMAIN_GTT,
-				      &psp->fw_pri_bo,
-				      &psp->fw_pri_mc_addr,
-				      &psp->fw_pri_buf);
+					AMDGPU_GEM_DOMAIN_GTT,
+					&psp->fw_pri_bo,
+					&psp->fw_pri_mc_addr,
+					&psp->fw_pri_buf);
 	if (ret)
 		goto failed;
 
 	ret = amdgpu_bo_create_kernel(adev, PSP_FENCE_BUFFER_SIZE, PAGE_SIZE,
-				      AMDGPU_GEM_DOMAIN_VRAM,
-				      &psp->fence_buf_bo,
-				      &psp->fence_buf_mc_addr,
-				      &psp->fence_buf);
+					AMDGPU_GEM_DOMAIN_VRAM,
+					&psp->fence_buf_bo,
+					&psp->fence_buf_mc_addr,
+					&psp->fence_buf);
 	if (ret)
 		goto failed_mem2;
 
@@ -375,6 +378,7 @@ static int psp_load_fw(struct amdgpu_device *adev)
 	if (ret)
 		goto failed_mem;
 
+skip_memalloc:
 	ret = psp_hw_start(psp);
 	if (ret)
 		goto failed_mem;
-- 
2.7.4


* [PATCH 9/9] drm/amdgpu:fix random missing of FLR NOTIFY
       [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
                     ` (7 preceding siblings ...)
  2017-10-25  9:22   ` [PATCH 8/9] drm/amdgpu/sriov:fix memory leak in psp_load_fw Monk Liu
@ 2017-10-25  9:22   ` Monk Liu
       [not found]     ` <1508923343-24404-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
  8 siblings, 1 reply; 30+ messages in thread
From: Monk Liu @ 2017-10-25  9:22 UTC (permalink / raw)
  To: amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW; +Cc: Monk Liu

Signed-off-by: Monk Liu <Monk.Liu@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
index f8522a0..a43cffb 100644
--- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
+++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
@@ -282,9 +282,17 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
 		/* see what event we get */
 		r = xgpu_ai_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
 
-		/* only handle FLR_NOTIFY now */
-		if (!r)
-			schedule_work(&adev->virt.flr_work);
+		/* sometimes the interrupt delivery to the VM is delayed; in that
+		 * case the IDH_FLR_NOTIFICATION is overwritten by the VF FLR from
+		 * the GIM side, so the receive above can fail. Schedule the
+		 * flr_work anyway.
+		 */
+		if (r) {
+			DRM_ERROR("FLR_NOTIFICATION is missed\n");
+			xgpu_ai_mailbox_send_ack(adev);
+		}
+
+		schedule_work(&adev->virt.flr_work);
 	}
 
 	return 0;
-- 
2.7.4


* Re: [PATCH 1/9] drm/amdgpu:cleanup hw fence get&put
       [not found]     ` <1508923343-24404-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-25 11:38       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-25 11:38 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

NAK, we have tried this before.

This way we can accidentally try to signal a fence twice when the fence 
value is somehow corrupted.

Christian.

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> In amdgpu_fence_process it shouldn't put the signaled hw fence and set
> RCU pointer to NULL, instead the more reasonable sequence is just leave
> the RCU be there untouched, and:
> 1) Either driver_fini() should put this RCU fence or
> 2) New hw fence emitting should take the RCU slot and put the old hw fence.
>
> mapping between get & put on hw fence is:
>
> Get						Put
> fence_emit:fence_init	--->	free_job:put job->fence
> 	       		--->	run_job:put job->fence //GPU RESET case
>
> fence_emit:rcu_assign_pointer	---> fence_driver_fini:put ring->fence_drv.fences[j]
> 				---> fence_emit:put(old) //after fence_wait(old)
>
> run_job:job->fence=fence_get()	--> sched_main:put(fence)
> 				--> job_recovery:put(fence) //for GPU RESET case
>
> sched_main:parent=fence_get()	--> sched_fence_free:put(fence->parent)
> sched_job_recovery:parent=fence_get() --> sched_hw_job_reset:put(s_job->s_fence->parent) // for GPU RESET case
>
> Change-Id: I623167c1e3143233f4e1be7a400a9b698b4a1355
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index f43319c..d7374cf 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -161,6 +161,8 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f)
>   		dma_fence_wait(old, false);
>   	}
>   
> +	dma_fence_put(old);
> +
>   	rcu_assign_pointer(*ptr, dma_fence_get(&fence->base));
>   
>   	*f = &fence->base;
> @@ -246,7 +248,6 @@ void amdgpu_fence_process(struct amdgpu_ring *ring)
>   
>   		/* There is always exactly one thread signaling this fence slot */
>   		fence = rcu_dereference_protected(*ptr, 1);
> -		RCU_INIT_POINTER(*ptr, NULL);
>   
>   		if (!fence)
>   			continue;
> @@ -257,7 +258,6 @@ void amdgpu_fence_process(struct amdgpu_ring *ring)
>   		else
>   			BUG();
>   
> -		dma_fence_put(fence);
>   	} while (last_seq != seq);
>   }
>   



* Re: [PATCH 2/9] amd/scheduler:imple job skip feature
       [not found]     ` <1508923343-24404-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:06       ` Christian König
       [not found]         ` <79a7f228-3d63-e043-58c1-2986379b6296-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2017-10-26  7:06 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> jobs are skipped in two cases:
> 1) when the entity behind this job is marked guilty, the job
> popped from this entity's queue will be dropped in the sched_main loop.
>
> 2) in job_recovery(), the job is skipped if its karma exceeds the
> limit, and so are all other jobs sharing the same fence context.
> This approach is used because job_recovery() cannot access
> job->entity, since the entity may already be dead.
>
> with this feature we can introduce the new gpu recover feature.

Sorry for the delay, totally swamped once more at the moment.

>
> Change-Id: I268b1c752c94e6ecd4ea78c87eb226ea3f52908a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  9 +++---
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 46 +++++++++++++++++++--------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +-
>   3 files changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index a58e3c5..f08fde9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -177,7 +177,7 @@ static struct dma_fence *amdgpu_job_dependency(struct amd_sched_job *sched_job)
>   	return fence;
>   }
>   
> -static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job)
> +static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job, bool skip)
>   {
>   	struct dma_fence *fence = NULL;
>   	struct amdgpu_device *adev;
> @@ -194,10 +194,11 @@ static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job)
>   	BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
>   
>   	trace_amdgpu_sched_run_job(job);
> -	/* skip ib schedule when vram is lost */
> -	if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
> +
> +	if (skip || job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
> +		/* skip ib schedule if looks needed, and set error */
>   		dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
> -		DRM_ERROR("Skip scheduling IBs!\n");
> +		DRM_INFO("Skip scheduling IBs!\n");

Not really needed, we could just use amd_sched_job_fake_signal() here as 
well.

>   	} else {
>   		r = amdgpu_ib_schedule(job->ring, job->num_ibs, job->ibs, job,
>   				       &fence);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 9cbeade..29841ba 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -330,6 +330,20 @@ static bool amd_sched_entity_add_dependency_cb(struct amd_sched_entity *entity)
>   	return false;
>   }
>   
> +static void amd_sched_job_fake_signal(struct amd_sched_job *job)
> +{
> +	struct amd_sched_fence *s_fence = job->s_fence;
> +
> +	dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
> +	/* fake signaling the scheduled fence */
> +	amd_sched_fence_scheduled(s_fence);
> +	/* fake signaling the finished fence */
> +	dma_fence_put(&job->s_fence->finished);
> +	job->sched->ops->free_job(job);

Well signaling the finished fence will free the job as well, won't it?

> +	/* call CB on the s_fence, e.g. wake up WAIT_CS */
> +	amd_sched_fence_finished(s_fence);
> +}
> +
>   static struct amd_sched_job *
>   amd_sched_entity_pop_job(struct amd_sched_entity *entity)
>   {
> @@ -345,6 +359,12 @@ amd_sched_entity_pop_job(struct amd_sched_entity *entity)
>   			return NULL;
>   
>   	sched_job->s_entity = NULL;
> +
> +	if (entity->guilty && atomic_read(entity->guilty)) {
> +		amd_sched_job_fake_signal(sched_job);
> +		sched_job = NULL;
> +	}
> +

This must come after the spsc_queue_pop() below, otherwise you mess up 
the spsc queue.
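A minimal sketch of the ordering being asked for here (illustrative only; pop first, then decide whether to drop the job):

	sched_job->s_entity = NULL;
	spsc_queue_pop(&entity->job_queue);

	/* only after the job has left the queue may it be fake-signaled/freed */
	if (entity->guilty && atomic_read(entity->guilty)) {
		amd_sched_job_fake_signal(sched_job);
		sched_job = NULL;
	}

	return sched_job;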

>   	spsc_queue_pop(&entity->job_queue);
>   	return sched_job;
>   }
> @@ -441,13 +461,6 @@ static void amd_sched_job_timedout(struct work_struct *work)
>   	job->sched->ops->timedout_job(job);
>   }
>   
> -static void amd_sched_set_guilty(struct amd_sched_job *s_job)
> -{
> -	if (atomic_inc_return(&s_job->karma) > s_job->sched->hang_limit)
> -		if (s_job->s_entity->guilty)
> -			atomic_set(s_job->s_entity->guilty, 1);
> -}
> -
>   void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_job *bad)
>   {
>   	struct amd_sched_job *s_job;
> @@ -466,7 +479,7 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
>   	}
>   	spin_unlock(&sched->job_list_lock);
>   
> -	if (bad) {
> +	if (bad && atomic_inc_return(&bad->karma) > bad->sched->hang_limit) {
>   		bool found = false;
>   
>   		for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++ ) {
> @@ -476,7 +489,8 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>   				if (bad->s_fence->scheduled.context == entity->fence_context) {
>   					found = true;
> -					amd_sched_set_guilty(bad);
> +					if (entity->guilty)
> +						atomic_set(entity->guilty, 1);
>   					break;
>   				}
>   			}
> @@ -510,9 +524,17 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>   		struct amd_sched_fence *s_fence = s_job->s_fence;
>   		struct dma_fence *fence;
> +		bool found_guilty = false, skip;
> +		uint64_t guilty_context;

Well if you want the variables to have an effect you should probably 
move them outside the loop. Otherwise the code doesn't make much sense 
to me.

Regards,
Christian.
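A sketch of what hoisting those declarations out of the loop might look like (illustrative only, not the final patch):

	struct amd_sched_job *s_job, *tmp;
	bool found_guilty = false;	/* must survive across iterations so     */
	uint64_t guilty_context = 0;	/* all jobs of the guilty context are hit */

	spin_lock(&sched->job_list_lock);
	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
		struct amd_sched_fence *s_fence = s_job->s_fence;
		bool skip;

		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
			found_guilty = true;
			guilty_context = s_fence->scheduled.context;
		}

		skip = found_guilty &&
		       s_fence->scheduled.context == guilty_context;

		/* ... run or skip the job as in the patch ... */
	}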

> +
> +		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
> +			found_guilty = true;
> +			guilty_context = s_job->s_fence->scheduled.context;
> +		}
>   
> +		skip = (found_guilty && s_job->s_fence->scheduled.context == guilty_context);
>   		spin_unlock(&sched->job_list_lock);
> -		fence = sched->ops->run_job(s_job);
> +		fence = sched->ops->run_job(s_job, skip);
>   		atomic_inc(&sched->hw_rq_count);
>   		if (fence) {
>   			s_fence->parent = dma_fence_get(fence);
> @@ -525,7 +547,6 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>   					  r);
>   			dma_fence_put(fence);
>   		} else {
> -			DRM_ERROR("Failed to run job!\n");
>   			amd_sched_process_job(NULL, &s_fence->cb);
>   		}
>   		spin_lock(&sched->job_list_lock);
> @@ -650,7 +671,7 @@ static int amd_sched_main(void *param)
>   		atomic_inc(&sched->hw_rq_count);
>   		amd_sched_job_begin(sched_job);
>   
> -		fence = sched->ops->run_job(sched_job);
> +		fence = sched->ops->run_job(sched_job, false);
>   		amd_sched_fence_scheduled(s_fence);
>   
>   		if (fence) {
> @@ -664,7 +685,6 @@ static int amd_sched_main(void *param)
>   					  r);
>   			dma_fence_put(fence);
>   		} else {
> -			DRM_ERROR("Failed to run job!\n");
>   			amd_sched_process_job(NULL, &s_fence->cb);
>   		}
>   
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index f9e3a83..a8e5a7d 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -126,7 +126,7 @@ static inline bool amd_sched_invalidate_job(struct amd_sched_job *s_job, int thr
>   */
>   struct amd_sched_backend_ops {
>   	struct dma_fence *(*dependency)(struct amd_sched_job *sched_job);
> -	struct dma_fence *(*run_job)(struct amd_sched_job *sched_job);
> +	struct dma_fence *(*run_job)(struct amd_sched_job *sched_job, bool skip);
>   	void (*timedout_job)(struct amd_sched_job *sched_job);
>   	void (*free_job)(struct amd_sched_job *sched_job);
>   };


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 3/9] drm/amdgpu:implement new GPU recover(v2)
       [not found]     ` <1508923343-24404-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:11       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-26  7:11 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> 1, the new implementation is named amdgpu_gpu_recover, which gives a
> better hint on what it does compared with gpu_reset
>
> 2, gpu_recover unifies bare-metal and SR-IOV; only the asic reset
> part is implemented differently
>
> 3, gpu_recover will increase the hung job's karma and mark its entity/context
> as guilty if it exceeds the limit
>
> V2:
>
> 4, in the scheduler main routine a job from a guilty context will be immediately
> fake signaled after it is popped from the queue, and its fence set with
> the "-ECANCELED" error
>
> 5, in the scheduler recovery routine all jobs from the guilty entity will be
> dropped
>
> 6, in the run_job() routine the real IB submission will be skipped if the @skip
> parameter equals true or VRAM was lost.
>
> Change-Id: I30d30924eeb512e8f3243a8af727d061ed41f800
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   4 +
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 262 +++++++++++++++++++++++++++++
>   2 files changed, 266 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index ba1ab97..003668f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -178,6 +178,10 @@ extern int amdgpu_cik_support;
>   #define CIK_CURSOR_WIDTH 128
>   #define CIK_CURSOR_HEIGHT 128
>   
> +/* GPU RESET flags */
> +#define AMDGPU_RESET_INFO_VRAM_LOST  (1 << 0)
> +#define AMDGPU_RESET_INFO_FULLRESET  (1 << 1)
> +
>   struct amdgpu_device;
>   struct amdgpu_ib;
>   struct amdgpu_cs_parser;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a07544d..0db3b3c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -3116,6 +3116,268 @@ int amdgpu_gpu_reset(struct amdgpu_device *adev)
>   	return r;
>   }
>   
> +static int amdgpu_reset(struct amdgpu_device *adev, uint64_t* reset_flags)
> +{
> +	int r;
> +	bool need_full_reset, vram_lost = 0;

In general reverse tree order please. E.g. should look like this:
> bool need_full_reset, vram_lost = 0;
> int r;



> +
> +	need_full_reset = amdgpu_need_full_reset(adev);
> +
> +	if (!need_full_reset) {
> +		amdgpu_pre_soft_reset(adev);
> +		r = amdgpu_soft_reset(adev);
> +		amdgpu_post_soft_reset(adev);
> +		if (r || amdgpu_check_soft_reset(adev)) {
> +			DRM_INFO("soft reset failed, will fallback to full reset!\n");
> +			need_full_reset = true;
> +		}
> +
> +	}
> +
> +	if (need_full_reset) {
> +		r = amdgpu_suspend(adev);
> +
> +retry:
> +		amdgpu_atombios_scratch_regs_save(adev);
> +		r = amdgpu_asic_reset(adev);
> +		amdgpu_atombios_scratch_regs_restore(adev);
> +		/* post card */
> +		amdgpu_atom_asic_init(adev->mode_info.atom_context);
> +
> +		if (!r) {
> +			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
> +			r = amdgpu_resume_phase1(adev);
> +			if (r)
> +				goto out;
> +
> +			vram_lost = amdgpu_check_vram_lost(adev);
> +			if (vram_lost) {
> +				DRM_ERROR("VRAM is lost!\n");
> +				atomic_inc(&adev->vram_lost_counter);
> +			}
> +
> +			r = amdgpu_ttm_recover_gart(adev);
> +			if (r)
> +				goto out;
> +
> +			r = amdgpu_resume_phase2(adev);
> +			if (r)
> +				goto out;
> +
> +			if (vram_lost)
> +				amdgpu_fill_reset_magic(adev);
> +		}
> +	}
> +
> +out:

Would probably be better to move the full reset into a separate function 
instead of the retry/out gotos here.

Apart from that it looks like that should work.

Regards,
Christian.
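One possible shape for that split, purely as a sketch (the helper name amdgpu_do_full_reset() is made up here for illustration):

	static int amdgpu_do_full_reset(struct amdgpu_device *adev, bool *vram_lost)
	{
		int r;

		amdgpu_suspend(adev);

		amdgpu_atombios_scratch_regs_save(adev);
		r = amdgpu_asic_reset(adev);
		amdgpu_atombios_scratch_regs_restore(adev);
		/* post card */
		amdgpu_atom_asic_init(adev->mode_info.atom_context);
		if (r)
			return r;

		dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
		r = amdgpu_resume_phase1(adev);
		if (r)
			return r;

		*vram_lost = amdgpu_check_vram_lost(adev);
		if (*vram_lost) {
			DRM_ERROR("VRAM is lost!\n");
			atomic_inc(&adev->vram_lost_counter);
		}

		r = amdgpu_ttm_recover_gart(adev);
		if (r)
			return r;

		r = amdgpu_resume_phase2(adev);
		if (r)
			return r;

		if (*vram_lost)
			amdgpu_fill_reset_magic(adev);

		return 0;
	}

The caller would then simply retry this helper when the IB ring test fails, instead of jumping back to the retry label.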

> +	if (!r) {
> +		amdgpu_irq_gpu_reset_resume_helper(adev);
> +		r = amdgpu_ib_ring_tests(adev);
> +		if (r) {
> +			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
> +			r = amdgpu_suspend(adev);
> +			need_full_reset = true;
> +			goto retry;
> +		}
> +	}
> +
> +	if (reset_flags) {
> +		if (vram_lost)
> +			(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> +
> +		if (need_full_reset)
> +			(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> +	}
> +
> +	return r;
> +}
> +
> +static int amdgpu_reset_sriov(struct amdgpu_device *adev, uint64_t *reset_flags, bool from_hypervisor)
> +{
> +	int r;
> +
> +	if (from_hypervisor)
> +		r = amdgpu_virt_request_full_gpu(adev, true);
> +	else
> +		r = amdgpu_virt_reset_gpu(adev);
> +	if (r)
> +		return r;
> +
> +	/* Resume IP prior to SMC */
> +	r = amdgpu_sriov_reinit_early(adev);
> +	if (r)
> +		goto error;
> +
> +	/* we need recover gart prior to run SMC/CP/SDMA resume */
> +	amdgpu_ttm_recover_gart(adev);
> +
> +	/* now we are okay to resume SMC/CP/SDMA */
> +	r = amdgpu_sriov_reinit_late(adev);
> +	if (r)
> +		goto error;
> +
> +	amdgpu_irq_gpu_reset_resume_helper(adev);
> +	r = amdgpu_ib_ring_tests(adev);
> +	if (r)
> +		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> +
> +error:
> +	/* release full control of GPU after ib test */
> +	amdgpu_virt_release_full_gpu(adev, true);
> +
> +	if (reset_flags) {
> +		/* will get vram_lost from GIM in future, now all
> +		 * reset request considered VRAM LOST
> +		 */
> +		(*reset_flags) |= AMDGPU_RESET_INFO_VRAM_LOST;
> +		atomic_inc(&adev->vram_lost_counter);
> +
> +		/* VF FLR or hotlink reset is always full-reset */
> +		(*reset_flags) |= AMDGPU_RESET_INFO_FULLRESET;
> +	}
> +
> +	return r;
> +}
> +
> +/**
> + * amdgpu_gpu_recover - reset the asic and recover scheduler
> + *
> + * @adev: amdgpu device pointer
> + * @job: which job trigger hang
> + *
> + * Attempt to reset the GPU if it has hung (all asics).
> + * Returns 0 for success or an error on failure.
> + */
> +int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
> +{
> +	struct drm_atomic_state *state = NULL;
> +	uint64_t reset_flags = 0;
> +	int i, r, resched;
> +
> +	if (!amdgpu_check_soft_reset(adev)) {
> +		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
> +		return 0;
> +	}
> +
> +	dev_info(adev->dev, "GPU reset begin!\n");
> +
> +	mutex_lock(&adev->virt.lock_reset);
> +	atomic_inc(&adev->gpu_reset_counter);
> +	adev->in_sriov_reset = 1;
> +
> +	/* block TTM */
> +	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> +	/* store modesetting */
> +	if (amdgpu_device_has_dc_support(adev))
> +		state = drm_atomic_helper_suspend(adev->ddev);
> +
> +	/* block scheduler */
> +	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +		struct amdgpu_ring *ring = adev->rings[i];
> +
> +		if (!ring || !ring->sched.thread)
> +			continue;
> +
> +		/* only focus on the ring hit timeout if &job not NULL */
> +		if (job && job->ring->idx != i)
> +			continue;
> +
> +		kthread_park(ring->sched.thread);
> +		amd_sched_hw_job_reset(&ring->sched, &job->base);
> +
> +		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> +		amdgpu_fence_driver_force_completion(ring);
> +	}
> +
> +	if (amdgpu_sriov_vf(adev))
> +		r = amdgpu_reset_sriov(adev, &reset_flags, job ? false : true);
> +	else
> +		r = amdgpu_reset(adev, &reset_flags);
> +
> +	if (!r) {
> +		if (((reset_flags & AMDGPU_RESET_INFO_FULLRESET) && !(adev->flags & AMD_IS_APU)) ||
> +			(reset_flags & AMDGPU_RESET_INFO_VRAM_LOST)) {
> +			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> +			struct amdgpu_bo *bo, *tmp;
> +			struct dma_fence *fence = NULL, *next = NULL;
> +
> +			DRM_INFO("recover vram bo from shadow\n");
> +			mutex_lock(&adev->shadow_list_lock);
> +			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> +				next = NULL;
> +				amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> +				if (fence) {
> +					r = dma_fence_wait(fence, false);
> +					if (r) {
> +						WARN(r, "recovery from shadow isn't completed\n");
> +						break;
> +					}
> +				}
> +
> +				dma_fence_put(fence);
> +				fence = next;
> +			}
> +			mutex_unlock(&adev->shadow_list_lock);
> +			if (fence) {
> +				r = dma_fence_wait(fence, false);
> +				if (r)
> +					WARN(r, "recovery from shadow isn't completed\n");
> +			}
> +			dma_fence_put(fence);
> +		}
> +
> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +			struct amdgpu_ring *ring = adev->rings[i];
> +
> +			if (!ring || !ring->sched.thread)
> +				continue;
> +
> +			/* only focus on the ring hit timeout if &job not NULL */
> +			if (job && job->ring->idx != i)
> +				continue;
> +
> +			amd_sched_job_recovery(&ring->sched);
> +			kthread_unpark(ring->sched.thread);
> +		}
> +	} else {
> +		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> +			struct amdgpu_ring *ring = adev->rings[i];
> +
> +			if (!ring || !ring->sched.thread)
> +				continue;
> +
> +			/* only focus on the ring hit timeout if &job not NULL */
> +			if (job && job->ring->idx != i)
> +				continue;
> +
> +			kthread_unpark(adev->rings[i]->sched.thread);
> +		}
> +	}
> +
> +	if (amdgpu_device_has_dc_support(adev)) {
> +		if (drm_atomic_helper_resume(adev->ddev, state))
> +			dev_info(adev->dev, "drm resume failed:%d\n", r);
> +		amdgpu_dm_display_resume(adev);
> +	} else {
> +		drm_helper_resume_force_mode(adev->ddev);
> +	}
> +
> +	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> +
> +	if (r) {
> +		/* bad news, how to tell it to userspace ? */
> +		dev_info(adev->dev, "GPU reset(%d) failed\n", atomic_read(&adev->gpu_reset_counter));
> +		amdgpu_vf_error_put(adev, AMDGIM_ERROR_VF_GPU_RESET_FAIL, 0, r);
> +	} else {
> +		dev_info(adev->dev, "GPU reset(%d) succeeded!\n", atomic_read(&adev->gpu_reset_counter));
> +	}
> +
> +	amdgpu_vf_error_trans_all(adev);
> +	adev->in_sriov_reset = 0;
> +	mutex_unlock(&adev->virt.lock_reset);
> +	return r;
> +}
> +
>   void amdgpu_get_pcie_info(struct amdgpu_device *adev)
>   {
>   	u32 mask;


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset
       [not found]     ` <1508923343-24404-5-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:13       ` Christian König
       [not found]         ` <13ad16b1-6357-1f0e-3cdd-ddc42d4462b0-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2017-10-26  7:13 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> now use the new gpu recover

Might be better to squash together with the previous patch.

This one doesn't introduce new functionality, but only removes the old 
code and switches over to the new one.



>
> Change-Id: Ieccd25772c47c0e710ad81537a3dd0c1767585a1
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 298 -----------------------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |   2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |   2 +-
>   8 files changed, 10 insertions(+), 312 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 003668f..335df11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1844,7 +1844,7 @@ amdgpu_get_sdma_instance(struct amdgpu_ring *ring)
>   #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
>   
>   /* Common functions */
> -int amdgpu_gpu_reset(struct amdgpu_device *adev);
> +int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job);
>   bool amdgpu_need_backup(struct amdgpu_device *adev);
>   void amdgpu_pci_config_reset(struct amdgpu_device *adev);
>   bool amdgpu_need_post(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0db3b3c..a2f9a7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2818,304 +2818,6 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> -/**
> - * amdgpu_sriov_gpu_reset - reset the asic
> - *
> - * @adev: amdgpu device pointer
> - * @job: which job trigger hang
> - *
> - * Attempt the reset the GPU if it has hung (all asics).
> - * for SRIOV case.
> - * Returns 0 for success or an error on failure.
> - */
> -int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> -{
> -	int i, j, r = 0;
> -	int resched;
> -	struct amdgpu_bo *bo, *tmp;
> -	struct amdgpu_ring *ring;
> -	struct dma_fence *fence = NULL, *next = NULL;
> -
> -	mutex_lock(&adev->virt.lock_reset);
> -	atomic_inc(&adev->gpu_reset_counter);
> -	adev->in_sriov_reset = true;
> -
> -	/* block TTM */
> -	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> -
> -	/* we start from the ring trigger GPU hang */
> -	j = job ? job->ring->idx : 0;
> -
> -	/* block scheduler */
> -	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> -		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> -		if (!ring || !ring->sched.thread)
> -			continue;
> -
> -		kthread_park(ring->sched.thread);
> -
> -		if (job && j != i)
> -			continue;
> -
> -		/* here give the last chance to check if job removed from mirror-list
> -		 * since we already pay some time on kthread_park */
> -		if (job && list_empty(&job->base.node)) {
> -			kthread_unpark(ring->sched.thread);
> -			goto give_up_reset;
> -		}
> -
> -		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
> -			amd_sched_job_kickout(&job->base);
> -
> -		/* only do job_reset on the hang ring if @job not NULL */
> -		amd_sched_hw_job_reset(&ring->sched, NULL);
> -
> -		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> -		amdgpu_fence_driver_force_completion(ring);
> -	}
> -
> -	/* request to take full control of GPU before re-initialization  */
> -	if (job)
> -		amdgpu_virt_reset_gpu(adev);
> -	else
> -		amdgpu_virt_request_full_gpu(adev, true);
> -
> -
> -	/* Resume IP prior to SMC */
> -	amdgpu_sriov_reinit_early(adev);
> -
> -	/* we need recover gart prior to run SMC/CP/SDMA resume */
> -	amdgpu_ttm_recover_gart(adev);
> -
> -	/* now we are okay to resume SMC/CP/SDMA */
> -	amdgpu_sriov_reinit_late(adev);
> -
> -	amdgpu_irq_gpu_reset_resume_helper(adev);
> -
> -	if (amdgpu_ib_ring_tests(adev))
> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> -
> -	/* release full control of GPU after ib test */
> -	amdgpu_virt_release_full_gpu(adev, true);
> -
> -	DRM_INFO("recover vram bo from shadow\n");
> -
> -	ring = adev->mman.buffer_funcs_ring;
> -	mutex_lock(&adev->shadow_list_lock);
> -	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -		next = NULL;
> -		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -		if (fence) {
> -			r = dma_fence_wait(fence, false);
> -			if (r) {
> -				WARN(r, "recovery from shadow isn't completed\n");
> -				break;
> -			}
> -		}
> -
> -		dma_fence_put(fence);
> -		fence = next;
> -	}
> -	mutex_unlock(&adev->shadow_list_lock);
> -
> -	if (fence) {
> -		r = dma_fence_wait(fence, false);
> -		if (r)
> -			WARN(r, "recovery from shadow isn't completed\n");
> -	}
> -	dma_fence_put(fence);
> -
> -	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> -		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> -		if (!ring || !ring->sched.thread)
> -			continue;
> -
> -		if (job && j != i) {
> -			kthread_unpark(ring->sched.thread);
> -			continue;
> -		}
> -
> -		amd_sched_job_recovery(&ring->sched);
> -		kthread_unpark(ring->sched.thread);
> -	}
> -
> -	drm_helper_resume_force_mode(adev->ddev);
> -give_up_reset:
> -	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> -	if (r) {
> -		/* bad news, how to tell it to userspace ? */
> -		dev_info(adev->dev, "GPU reset failed\n");
> -	} else {
> -		dev_info(adev->dev, "GPU reset successed!\n");
> -	}
> -
> -	adev->in_sriov_reset = false;
> -	mutex_unlock(&adev->virt.lock_reset);
> -	return r;
> -}
> -
> -/**
> - * amdgpu_gpu_reset - reset the asic
> - *
> - * @adev: amdgpu device pointer
> - *
> - * Attempt the reset the GPU if it has hung (all asics).
> - * Returns 0 for success or an error on failure.
> - */
> -int amdgpu_gpu_reset(struct amdgpu_device *adev)
> -{
> -	struct drm_atomic_state *state = NULL;
> -	int i, r;
> -	int resched;
> -	bool need_full_reset, vram_lost = false;
> -
> -	if (!amdgpu_check_soft_reset(adev)) {
> -		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
> -		return 0;
> -	}
> -
> -	atomic_inc(&adev->gpu_reset_counter);
> -
> -	/* block TTM */
> -	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> -	/* store modesetting */
> -	if (amdgpu_device_has_dc_support(adev))
> -		state = drm_atomic_helper_suspend(adev->ddev);
> -
> -	/* block scheduler */
> -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> -		struct amdgpu_ring *ring = adev->rings[i];
> -
> -		if (!ring || !ring->sched.thread)
> -			continue;
> -		kthread_park(ring->sched.thread);
> -		amd_sched_hw_job_reset(&ring->sched, NULL);
> -		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> -		amdgpu_fence_driver_force_completion(ring);
> -	}
> -
> -	need_full_reset = amdgpu_need_full_reset(adev);
> -
> -	if (!need_full_reset) {
> -		amdgpu_pre_soft_reset(adev);
> -		r = amdgpu_soft_reset(adev);
> -		amdgpu_post_soft_reset(adev);
> -		if (r || amdgpu_check_soft_reset(adev)) {
> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
> -			need_full_reset = true;
> -		}
> -	}
> -
> -	if (need_full_reset) {
> -		r = amdgpu_suspend(adev);
> -
> -retry:
> -		amdgpu_atombios_scratch_regs_save(adev);
> -		r = amdgpu_asic_reset(adev);
> -		amdgpu_atombios_scratch_regs_restore(adev);
> -		/* post card */
> -		amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -
> -		if (!r) {
> -			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
> -			r = amdgpu_resume_phase1(adev);
> -			if (r)
> -				goto out;
> -			vram_lost = amdgpu_check_vram_lost(adev);
> -			if (vram_lost) {
> -				DRM_ERROR("VRAM is lost!\n");
> -				atomic_inc(&adev->vram_lost_counter);
> -			}
> -			r = amdgpu_ttm_recover_gart(adev);
> -			if (r)
> -				goto out;
> -			r = amdgpu_resume_phase2(adev);
> -			if (r)
> -				goto out;
> -			if (vram_lost)
> -				amdgpu_fill_reset_magic(adev);
> -		}
> -	}
> -out:
> -	if (!r) {
> -		amdgpu_irq_gpu_reset_resume_helper(adev);
> -		r = amdgpu_ib_ring_tests(adev);
> -		if (r) {
> -			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
> -			r = amdgpu_suspend(adev);
> -			need_full_reset = true;
> -			goto retry;
> -		}
> -		/**
> -		 * recovery vm page tables, since we cannot depend on VRAM is
> -		 * consistent after gpu full reset.
> -		 */
> -		if (need_full_reset && amdgpu_need_backup(adev)) {
> -			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> -			struct amdgpu_bo *bo, *tmp;
> -			struct dma_fence *fence = NULL, *next = NULL;
> -
> -			DRM_INFO("recover vram bo from shadow\n");
> -			mutex_lock(&adev->shadow_list_lock);
> -			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -				next = NULL;
> -				amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -				if (fence) {
> -					r = dma_fence_wait(fence, false);
> -					if (r) {
> -						WARN(r, "recovery from shadow isn't completed\n");
> -						break;
> -					}
> -				}
> -
> -				dma_fence_put(fence);
> -				fence = next;
> -			}
> -			mutex_unlock(&adev->shadow_list_lock);
> -			if (fence) {
> -				r = dma_fence_wait(fence, false);
> -				if (r)
> -					WARN(r, "recovery from shadow isn't completed\n");
> -			}
> -			dma_fence_put(fence);
> -		}
> -		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> -			struct amdgpu_ring *ring = adev->rings[i];
> -
> -			if (!ring || !ring->sched.thread)
> -				continue;
> -
> -			amd_sched_job_recovery(&ring->sched);
> -			kthread_unpark(ring->sched.thread);
> -		}
> -	} else {
> -		dev_err(adev->dev, "asic resume failed (%d).\n", r);
> -		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> -			if (adev->rings[i] && adev->rings[i]->sched.thread) {
> -				kthread_unpark(adev->rings[i]->sched.thread);
> -			}
> -		}
> -	}
> -
> -	if (amdgpu_device_has_dc_support(adev)) {
> -		r = drm_atomic_helper_resume(adev->ddev, state);
> -		amdgpu_dm_display_resume(adev);
> -	} else
> -		drm_helper_resume_force_mode(adev->ddev);
> -
> -	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> -	if (r) {
> -		/* bad news, how to tell it to userspace ? */
> -		dev_info(adev->dev, "GPU reset failed\n");
> -	}
> -	else {
> -		dev_info(adev->dev, "GPU reset successed!\n");
> -	}
> -
> -	amdgpu_vf_error_trans_all(adev);
> -	return r;
> -}
> -
>   static int amdgpu_reset(struct amdgpu_device *adev, uint64_t* reset_flags)
>   {
>   	int r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index d7374cf..9d67bcb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -694,25 +694,25 @@ static int amdgpu_debugfs_fence_info(struct seq_file *m, void *data)
>   }
>   
>   /**
> - * amdgpu_debugfs_gpu_reset - manually trigger a gpu reset
> + * amdgpu_debugfs_gpu_recover - manually trigger a gpu reset & recover
>    *
>    * Manually trigger a gpu reset at the next fence wait.
>    */
> -static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
> +static int amdgpu_debugfs_gpu_recover(struct seq_file *m, void *data)
>   {
>   	struct drm_info_node *node = (struct drm_info_node *) m->private;
>   	struct drm_device *dev = node->minor->dev;
>   	struct amdgpu_device *adev = dev->dev_private;
>   
> -	seq_printf(m, "gpu reset\n");
> -	amdgpu_gpu_reset(adev);
> +	seq_printf(m, "gpu recover\n");
> +	amdgpu_gpu_recover(adev, NULL);
>   
>   	return 0;
>   }
>   
>   static const struct drm_info_list amdgpu_debugfs_fence_list[] = {
>   	{"amdgpu_fence_info", &amdgpu_debugfs_fence_info, 0, NULL},
> -	{"amdgpu_gpu_reset", &amdgpu_debugfs_gpu_reset, 0, NULL}
> +	{"amdgpu_gpu_recover", &amdgpu_debugfs_gpu_recover, 0, NULL}
>   };
>   
>   static const struct drm_info_list amdgpu_debugfs_fence_list_sriov[] = {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index 47c5ce9..9c2f87c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -88,7 +88,7 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>   						  reset_work);
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		amdgpu_gpu_reset(adev);
> +		amdgpu_gpu_recover(adev, NULL);
>   }
>   
>   /* Disable *all* interrupts */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index f08fde9..ac6a4f3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -37,10 +37,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   		  atomic_read(&job->ring->fence_drv.last_seq),
>   		  job->ring->fence_drv.sync_seq);
>   
> -	if (amdgpu_sriov_vf(job->adev))
> -		amdgpu_sriov_gpu_reset(job->adev, job);
> -	else
> -		amdgpu_gpu_reset(job->adev);
> +	amdgpu_gpu_recover(job->adev, job);
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index b89d37f..3a661aa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -285,7 +285,6 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v);
>   int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
>   int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
>   int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
> -int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
>   int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>   void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
>   int amdgpu_virt_fw_reserve_get_checksum(void *obj, unsigned long obj_size,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index b4906d2..f8522a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -254,7 +254,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	}
>   
>   	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index c25a831..dae6d3a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -514,7 +514,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   	}
>   
>   	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 5/9] drm/amdgpu:cleanup in_sriov_reset and lock_reset
       [not found]     ` <1508923343-24404-6-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:14       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-26  7:14 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> since gpu reset is now unified into gpu_recover
> for both bare-metal and SR-IOV:
>
> 1)rename in_sriov_reset to in_gpu_reset
> 2)move lock_reset from adev->virt to adev
>
> Change-Id: I9f4dbab9a4c916fbc156f669824d15ddcd0f2322
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

Reviewed-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        | 3 ++-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 9 +++++----
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c    | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c  | 2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c   | 2 --
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   | 1 -
>   drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c      | 6 +++---
>   drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c      | 6 +++---
>   8 files changed, 15 insertions(+), 16 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 335df11..6e89be5 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1650,7 +1650,8 @@ struct amdgpu_device {
>   
>   	/* record last mm index being written through WREG32*/
>   	unsigned long last_mm_index;
> -	bool                            in_sriov_reset;
> +	bool                            in_gpu_reset;
> +	struct mutex  lock_reset;
>   };
>   
>   static inline struct amdgpu_device *amdgpu_ttm_adev(struct ttm_bo_device *bdev)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index a2f9a7f..4cf1146 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2161,6 +2161,7 @@ int amdgpu_device_init(struct amdgpu_device *adev,
>   	mutex_init(&adev->mn_lock);
>   	mutex_init(&adev->virt.vf_errors.lock);
>   	hash_init(adev->mn_hash);
> +	mutex_init(&adev->lock_reset);
>   
>   	amdgpu_check_arguments(adev);
>   
> @@ -2963,9 +2964,9 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   
>   	dev_info(adev->dev, "GPU reset begin!\n");
>   
> -	mutex_lock(&adev->virt.lock_reset);
> +	mutex_lock(&adev->lock_reset);
>   	atomic_inc(&adev->gpu_reset_counter);
> -	adev->in_sriov_reset = 1;
> +	adev->in_gpu_reset = 1;
>   
>   	/* block TTM */
>   	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> @@ -3075,8 +3076,8 @@ int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
>   	}
>   
>   	amdgpu_vf_error_trans_all(adev);
> -	adev->in_sriov_reset = 0;
> -	mutex_unlock(&adev->virt.lock_reset);
> +	adev->in_gpu_reset = 0;
> +	mutex_unlock(&adev->lock_reset);
>   	return r;
>   }
>   
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 447d446..76f531b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -264,7 +264,7 @@ static int psp_hw_start(struct psp_context *psp)
>   	struct amdgpu_device *adev = psp->adev;
>   	int ret;
>   
> -	if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
> +	if (!amdgpu_sriov_vf(adev) || !adev->in_gpu_reset) {
>   		ret = psp_bootloader_load_sysdrv(psp);
>   		if (ret)
>   			return ret;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> index 6564902..edc37cc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> @@ -370,7 +370,7 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>   		return 0;
>   	}
>   
> -	if (!amdgpu_sriov_vf(adev) || !adev->in_sriov_reset) {
> +	if (!amdgpu_sriov_vf(adev) || !adev->in_gpu_reset) {
>   		err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
>   					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
>   					AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> index e97f80f..c249725 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.c
> @@ -107,8 +107,6 @@ void amdgpu_virt_init_setting(struct amdgpu_device *adev)
>   	adev->enable_virtual_display = true;
>   	adev->cg_flags = 0;
>   	adev->pg_flags = 0;
> -
> -	mutex_init(&adev->virt.lock_reset);
>   }
>   
>   uint32_t amdgpu_virt_kiq_rreg(struct amdgpu_device *adev, uint32_t reg)
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index 3a661aa..a710384 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -238,7 +238,6 @@ struct amdgpu_virt {
>   	uint64_t			csa_vmid0_addr;
>   	bool chained_ib_support;
>   	uint32_t			reg_val_offs;
> -	struct mutex                    lock_reset;
>   	struct amdgpu_irq_src		ack_irq;
>   	struct amdgpu_irq_src		rcv_irq;
>   	struct work_struct		flr_work;
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> index b8002ac..72ddb6e 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v8_0.c
> @@ -4807,7 +4807,7 @@ static int gfx_v8_0_kiq_init_queue(struct amdgpu_ring *ring)
>   
>   	gfx_v8_0_kiq_setting(ring);
>   
> -	if (adev->in_sriov_reset) { /* for GPU_RESET case */
> +	if (adev->in_gpu_reset) { /* for GPU_RESET case */
>   		/* reset MQD to a clean status */
>   		if (adev->gfx.mec.mqd_backup[mqd_idx])
>   			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct vi_mqd_allocation));
> @@ -4844,7 +4844,7 @@ static int gfx_v8_0_kcq_init_queue(struct amdgpu_ring *ring)
>   	struct vi_mqd *mqd = ring->mqd_ptr;
>   	int mqd_idx = ring - &adev->gfx.compute_ring[0];
>   
> -	if (!adev->in_sriov_reset && !adev->gfx.in_suspend) {
> +	if (!adev->in_gpu_reset && !adev->gfx.in_suspend) {
>   		memset((void *)mqd, 0, sizeof(struct vi_mqd_allocation));
>   		((struct vi_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
>   		((struct vi_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
> @@ -4856,7 +4856,7 @@ static int gfx_v8_0_kcq_init_queue(struct amdgpu_ring *ring)
>   
>   		if (adev->gfx.mec.mqd_backup[mqd_idx])
>   			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct vi_mqd_allocation));
> -	} else if (adev->in_sriov_reset) { /* for GPU_RESET case */
> +	} else if (adev->in_gpu_reset) { /* for GPU_RESET case */
>   		/* reset MQD to a clean status */
>   		if (adev->gfx.mec.mqd_backup[mqd_idx])
>   			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct vi_mqd_allocation));
> diff --git a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> index 8738b13..d598d78 100644
> --- a/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> +++ b/drivers/gpu/drm/amd/amdgpu/gfx_v9_0.c
> @@ -2732,7 +2732,7 @@ static int gfx_v9_0_kiq_init_queue(struct amdgpu_ring *ring)
>   
>   	gfx_v9_0_kiq_setting(ring);
>   
> -	if (adev->in_sriov_reset) { /* for GPU_RESET case */
> +	if (adev->in_gpu_reset) { /* for GPU_RESET case */
>   		/* reset MQD to a clean status */
>   		if (adev->gfx.mec.mqd_backup[mqd_idx])
>   			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct v9_mqd_allocation));
> @@ -2770,7 +2770,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
>   	struct v9_mqd *mqd = ring->mqd_ptr;
>   	int mqd_idx = ring - &adev->gfx.compute_ring[0];
>   
> -	if (!adev->in_sriov_reset && !adev->gfx.in_suspend) {
> +	if (!adev->in_gpu_reset && !adev->gfx.in_suspend) {
>   		memset((void *)mqd, 0, sizeof(struct v9_mqd_allocation));
>   		((struct v9_mqd_allocation *)mqd)->dynamic_cu_mask = 0xFFFFFFFF;
>   		((struct v9_mqd_allocation *)mqd)->dynamic_rb_mask = 0xFFFFFFFF;
> @@ -2782,7 +2782,7 @@ static int gfx_v9_0_kcq_init_queue(struct amdgpu_ring *ring)
>   
>   		if (adev->gfx.mec.mqd_backup[mqd_idx])
>   			memcpy(adev->gfx.mec.mqd_backup[mqd_idx], mqd, sizeof(struct v9_mqd_allocation));
> -	} else if (adev->in_sriov_reset) { /* for GPU_RESET case */
> +	} else if (adev->in_gpu_reset) { /* for GPU_RESET case */
>   		/* reset MQD to a clean status */
>   		if (adev->gfx.mec.mqd_backup[mqd_idx])
>   			memcpy(mqd, adev->gfx.mec.mqd_backup[mqd_idx], sizeof(struct v9_mqd_allocation));


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 6/9] drm/amdgpu:cleanup ucode_init_bo
       [not found]     ` <1508923343-24404-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:17       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-26  7:17 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> 1, no sriov check, since gpu recover is unified
> 2, the CPU_ACCESS_REQUIRED flag is needed for VRAM under SRIOV,
> because otherwise the first allocated VRAM bo is wasted after the
> following pin, due to how the TTM manager places it.
>
> Change-Id: I4d029f2da8bb463942c7861d3e52f309bdba9576
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c | 3 ++-
>   1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> index edc37cc..4939e5a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_ucode.c
> @@ -370,9 +370,10 @@ int amdgpu_ucode_init_bo(struct amdgpu_device *adev)
>   		return 0;
>   	}
>   
> -	if (!amdgpu_sriov_vf(adev) || !adev->in_gpu_reset) {
> +	if (!adev->in_gpu_reset) {
>   		err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
>   					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM : AMDGPU_GEM_DOMAIN_GTT,
> +					amdgpu_sriov_vf(adev) ? AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS|AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED :

Just set the AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED unconditionally here.
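i.e. roughly (a sketch of that suggestion, not the committed code):

	err = amdgpu_bo_create(adev, adev->firmware.fw_size, PAGE_SIZE, true,
			       amdgpu_sriov_vf(adev) ? AMDGPU_GEM_DOMAIN_VRAM
						     : AMDGPU_GEM_DOMAIN_GTT,
			       AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS |
			       AMDGPU_GEM_CREATE_CPU_ACCESS_REQUIRED,
			       NULL, NULL, 0, bo);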

Apart from that feel free to add my Acked-by.

Christian.

>   					AMDGPU_GEM_CREATE_VRAM_CONTIGUOUS,
>   					NULL, NULL, 0, bo);
>   		if (err) {


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]     ` <1508923343-24404-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:18       ` Christian König
       [not found]         ` <46bc9872-0bd7-c2b7-4879-6cab5789f252-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2017-10-26  7:18 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

NAK, why the heck should we do this? It would just block all new 
processes from using the device.

Christian.

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 4a9f749..c155ce4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   	if (r < 0)
>   		return r;
>   
> +	if (adev->in_gpu_reset)
> +		return -ENODEV;
> +
>   	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>   	if (unlikely(!fpriv)) {
>   		r = -ENOMEM;


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 8/9] drm/amdgpu/sriov:fix memory leak in psp_load_fw
       [not found]     ` <1508923343-24404-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:21       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-26  7:21 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> for SR-IOV, when doing a gpu reset this routine shouldn't allocate
> the resources again, otherwise the memory is leaked

Can't judge if this patch is correct or not, but fixing the whitespaces 
should be a separate patch.

Christian.

>
> Change-Id: I25da3a5b475196c75c7e639adc40751754625968
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c | 20 ++++++++++++--------
>   1 file changed, 12 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> index 76f531b..2157d45 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_psp.c
> @@ -334,23 +334,26 @@ static int psp_load_fw(struct amdgpu_device *adev)
>   	int ret;
>   	struct psp_context *psp = &adev->psp;
>   
> +	if (amdgpu_sriov_vf(adev) && adev->in_gpu_reset != 0)
> +		goto skip_memalloc;
> +
>   	psp->cmd = kzalloc(sizeof(struct psp_gfx_cmd_resp), GFP_KERNEL);
>   	if (!psp->cmd)
>   		return -ENOMEM;
>   
>   	ret = amdgpu_bo_create_kernel(adev, PSP_1_MEG, PSP_1_MEG,
> -				      AMDGPU_GEM_DOMAIN_GTT,
> -				      &psp->fw_pri_bo,
> -				      &psp->fw_pri_mc_addr,
> -				      &psp->fw_pri_buf);
> +					AMDGPU_GEM_DOMAIN_GTT,
> +					&psp->fw_pri_bo,
> +					&psp->fw_pri_mc_addr,
> +					&psp->fw_pri_buf);
>   	if (ret)
>   		goto failed;
>   
>   	ret = amdgpu_bo_create_kernel(adev, PSP_FENCE_BUFFER_SIZE, PAGE_SIZE,
> -				      AMDGPU_GEM_DOMAIN_VRAM,
> -				      &psp->fence_buf_bo,
> -				      &psp->fence_buf_mc_addr,
> -				      &psp->fence_buf);
> +					AMDGPU_GEM_DOMAIN_VRAM,
> +					&psp->fence_buf_bo,
> +					&psp->fence_buf_mc_addr,
> +					&psp->fence_buf);
>   	if (ret)
>   		goto failed_mem2;
>   
> @@ -375,6 +378,7 @@ static int psp_load_fw(struct amdgpu_device *adev)
>   	if (ret)
>   		goto failed_mem;
>   
> +skip_memalloc:
>   	ret = psp_hw_start(psp);
>   	if (ret)
>   		goto failed_mem;


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 9/9] drm/amdgpu:fix random missing of FLR NOTIFY
       [not found]     ` <1508923343-24404-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26  7:22       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-26  7:22 UTC (permalink / raw)
  To: Monk Liu, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>

No idea if this is correct or not, but it looks like an important fix 
which should go into the branch first.

Feel free to add my Acked-by: Christian König <christian.koenig@amd.com>.

Regards,
Christian.

> ---
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c | 14 +++++++++++---
>   1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index f8522a0..a43cffb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -282,9 +282,17 @@ static int xgpu_ai_mailbox_rcv_irq(struct amdgpu_device *adev,
>   		/* see what event we get */
>   		r = xgpu_ai_mailbox_rcv_msg(adev, IDH_FLR_NOTIFICATION);
>   
> -		/* only handle FLR_NOTIFY now */
> -		if (!r)
> -			schedule_work(&adev->virt.flr_work);
> +		/* sometimes the interrupt injection into the VM is delayed; in that case
> +		 * the IDH_FLR_NOTIFICATION is overwritten by the VF FLR from the GIM side,
> +		 * so the receive above can fail, and we should schedule the flr_work
> +		 * anyway
> +		 */
> +		if (r) {
> +			DRM_ERROR("FLR_NOTIFICATION is missed\n");
> +			xgpu_ai_mailbox_send_ack(adev);
> +		}
> +
> +		schedule_work(&adev->virt.flr_work);
>   	}
>   
>   	return 0;


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 2/9] amd/scheduler:imple job skip feature
       [not found]         ` <79a7f228-3d63-e043-58c1-2986379b6296-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-26  8:11           ` Liu, Monk
  0 siblings, 0 replies; 30+ messages in thread
From: Liu, Monk @ 2017-10-26  8:11 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Christian

Looks like there is some misunderstanding about this patch:
> -	if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
> +
> +	if (skip || job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
> +		/* skip ib schedule if looks needed, and set error */
>   		dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
> -		DRM_ERROR("Skip scheduling IBs!\n");
> +		DRM_INFO("Skip scheduling IBs!\n");

Not really needed, we could just use amd_sched_job_fake_signal() here as well.

[ml] the @skip parameter is needed, and we cannot use "fake signal" here, because "fake signal" may already have been called in the entity_pop_job() stage.
We skip a job in two places: one is in entity_pop_job(), if the scheduler can already detect at that point that the job is guilty;
the other is in run_job(), via this @skip, for job_recovery().

> @@ -510,9 +524,17 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>   		struct amd_sched_fence *s_fence = s_job->s_fence;
>   		struct dma_fence *fence;
> +		bool found_guilty = false, skip;
> +		uint64_t guilty_context;

Well if you want the variables to have an effect you should probably move them outside the loop. Otherwise the code doesn't make much sense to me.

[ML] yeah, that's an error, will fix it !



> +static void amd_sched_job_fake_signal(struct amd_sched_job *job) {
> +	struct amd_sched_fence *s_fence = job->s_fence;
> +
> +	dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
> +	/* fake signaling the scheduled fence */
> +	amd_sched_fence_scheduled(s_fence);
> +	/* fake signaling the finished fence */
> +	dma_fence_put(&job->s_fence->finished);
> +	job->sched->ops->free_job(job);

Well signaling the finished fence will free the job as well, won't it?

[ML] no, fake signal is invoked before begin_job(), so the job_finish_cb callback hasn't been hooked on the "finished" fence yet; that's why
I manually call "free_job()". But keep in mind that WAIT_CS may already have been called, so we still rely on signaling the "finished" fence to wake up
the WAIT_CS.
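For context, the ordering described here looks roughly like this in the scheduler (a sketch from memory of the code of that era, not verified against the exact tree): the callback that eventually frees the job is only installed in amd_sched_job_begin(), which a fake-signaled job never reaches:

	static void amd_sched_job_begin(struct amd_sched_job *s_job)
	{
		/* from here on, signaling "finished" triggers finish_work,
		 * which in turn calls ops->free_job()
		 */
		dma_fence_add_callback(&s_job->s_fence->finished, &s_job->finish_cb,
				       amd_sched_job_finish_cb);
		/* ... add to ring_mirror_list, arm the timeout, ... */
	}

A job dropped in amd_sched_entity_pop_job() was never passed to amd_sched_job_begin(), so fake-signaling it has to call free_job() by hand, while still signaling "finished" for any waiter in WAIT_CS.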


> +	if (entity->guilty && atomic_read(entity->guilty)) {
> +		amd_sched_job_fake_signal(sched_job);
> +		sched_job = NULL;
> +	}
> +

This must come after the spsc_queue_pop() below, otherwise you mess up the spsc queue.

[ML] yes, I also noticed that error, will fix the sequence !

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: October 26, 2017 15:06
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 2/9] amd/scheduler:imple job skip feature

Am 25.10.2017 um 11:22 schrieb Monk Liu:
> jobs are skipped in two cases:
> 1) when the entity behind this job is marked guilty, the job popped from
> this entity's queue will be dropped in the sched_main loop.
>
> 2) in job_recovery(), the job is skipped if its karma exceeds the
> limit, and so are all other jobs sharing the same fence context.
> This approach is used because job_recovery() cannot access
> job->entity, since the entity may already be dead.
>
> with this feature we can introduce the new gpu recover feature.

Sorry for the delay, totally swamped once more at the moment.

>
> Change-Id: I268b1c752c94e6ecd4ea78c87eb226ea3f52908a
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c       |  9 +++---
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.c | 46 +++++++++++++++++++--------
>   drivers/gpu/drm/amd/scheduler/gpu_scheduler.h |  2 +-
>   3 files changed, 39 insertions(+), 18 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index a58e3c5..f08fde9 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -177,7 +177,7 @@ static struct dma_fence *amdgpu_job_dependency(struct amd_sched_job *sched_job)
>   	return fence;
>   }
>   
> -static struct dma_fence *amdgpu_job_run(struct amd_sched_job 
> *sched_job)
> +static struct dma_fence *amdgpu_job_run(struct amd_sched_job 
> +*sched_job, bool skip)
>   {
>   	struct dma_fence *fence = NULL;
>   	struct amdgpu_device *adev;
> @@ -194,10 +194,11 @@ static struct dma_fence *amdgpu_job_run(struct amd_sched_job *sched_job)
>   	BUG_ON(amdgpu_sync_peek_fence(&job->sync, NULL));
>   
>   	trace_amdgpu_sched_run_job(job);
> -	/* skip ib schedule when vram is lost */
> -	if (job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
> +
> +	if (skip || job->vram_lost_counter != atomic_read(&adev->vram_lost_counter)) {
> +		/* skip ib schedule if looks needed, and set error */
>   		dma_fence_set_error(&job->base.s_fence->finished, -ECANCELED);
> -		DRM_ERROR("Skip scheduling IBs!\n");
> +		DRM_INFO("Skip scheduling IBs!\n");

Not really needed, we could just use amd_sched_job_fake_signal() here as well.

>   	} else {
>   		r = amdgpu_ib_schedule(job->ring, job->num_ibs, job->ibs, job,
>   				       &fence);
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c 
> b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> index 9cbeade..29841ba 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.c
> @@ -330,6 +330,20 @@ static bool amd_sched_entity_add_dependency_cb(struct amd_sched_entity *entity)
>   	return false;
>   }
>   
> +static void amd_sched_job_fake_signal(struct amd_sched_job *job) {
> +	struct amd_sched_fence *s_fence = job->s_fence;
> +
> +	dma_fence_set_error(&job->s_fence->finished, -ECANCELED);
> +	/* fake signaling the scheduled fence */
> +	amd_sched_fence_scheduled(s_fence);
> +	/* fake signaling the finished fence */
> +	dma_fence_put(&job->s_fence->finished);
> +	job->sched->ops->free_job(job);

Well signaling the finished fence will free the job as well, won't it?

> +	/* call CB on the s_fence, e.g. wake up WAIT_CS */
> +	amd_sched_fence_finished(s_fence);
> +}
> +
>   static struct amd_sched_job *
>   amd_sched_entity_pop_job(struct amd_sched_entity *entity)
>   {
> @@ -345,6 +359,12 @@ amd_sched_entity_pop_job(struct amd_sched_entity *entity)
>   			return NULL;
>   
>   	sched_job->s_entity = NULL;
> +
> +	if (entity->guilty && atomic_read(entity->guilty)) {
> +		amd_sched_job_fake_signal(sched_job);
> +		sched_job = NULL;
> +	}
> +

This must come after the spsc_queue_pop() below, otherwise you mess up the spsc queue.

>   	spsc_queue_pop(&entity->job_queue);
>   	return sched_job;
>   }
> @@ -441,13 +461,6 @@ static void amd_sched_job_timedout(struct work_struct *work)
>   	job->sched->ops->timedout_job(job);
>   }
>   
> -static void amd_sched_set_guilty(struct amd_sched_job *s_job) -{
> -	if (atomic_inc_return(&s_job->karma) > s_job->sched->hang_limit)
> -		if (s_job->s_entity->guilty)
> -			atomic_set(s_job->s_entity->guilty, 1);
> -}
> -
>   void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_job *bad)
>   {
>   	struct amd_sched_job *s_job;
> @@ -466,7 +479,7 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
>   	}
>   	spin_unlock(&sched->job_list_lock);
>   
> -	if (bad) {
> +	if (bad && atomic_inc_return(&bad->karma) > bad->sched->hang_limit) {
>   		bool found = false;
>   
>   		for (i = AMD_SCHED_PRIORITY_MIN; i < AMD_SCHED_PRIORITY_MAX; i++ ) {
> @@ -476,7 +489,8 @@ void amd_sched_hw_job_reset(struct amd_gpu_scheduler *sched, struct amd_sched_jo
>   			list_for_each_entry_safe(entity, tmp, &rq->entities, list) {
>   				if (bad->s_fence->scheduled.context == entity->fence_context) {
>   					found = true;
> -					amd_sched_set_guilty(bad);
> +					if (entity->guilty)
> +						atomic_set(entity->guilty, 1);
>   					break;
>   				}
>   			}
> @@ -510,9 +524,17 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>   	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
>   		struct amd_sched_fence *s_fence = s_job->s_fence;
>   		struct dma_fence *fence;
> +		bool found_guilty = false, skip;
> +		uint64_t guilty_context;

Well if you want the variables to have an effect you should probably move them outside the loop. Otherwise the code doesn't make much sense to me.
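
Presumably that ends up looking something like this (untested fragment; only the declarations move out of the loop, the rest of amd_sched_job_recovery() stays as in the patch):

	bool found_guilty = false;
	uint64_t guilty_context = 0;
	...
	list_for_each_entry_safe(s_job, tmp, &sched->ring_mirror_list, node) {
		struct amd_sched_fence *s_fence = s_job->s_fence;
		struct dma_fence *fence;
		bool skip;

		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
			found_guilty = true;
			guilty_context = s_job->s_fence->scheduled.context;
		}

		skip = found_guilty &&
		       s_job->s_fence->scheduled.context == guilty_context;

		spin_unlock(&sched->job_list_lock);
		fence = sched->ops->run_job(s_job, skip);
		/* ... unchanged from here on ... */
	}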

Regards,
Christian.

> +
> +		if (!found_guilty && atomic_read(&s_job->karma) > sched->hang_limit) {
> +			found_guilty = true;
> +			guilty_context = s_job->s_fence->scheduled.context;
> +		}
>   
> +		skip = (found_guilty && s_job->s_fence->scheduled.context == guilty_context);
>   		spin_unlock(&sched->job_list_lock);
> -		fence = sched->ops->run_job(s_job);
> +		fence = sched->ops->run_job(s_job, skip);
>   		atomic_inc(&sched->hw_rq_count);
>   		if (fence) {
>   			s_fence->parent = dma_fence_get(fence);
> @@ -525,7 +547,6 @@ void amd_sched_job_recovery(struct amd_gpu_scheduler *sched)
>   					  r);
>   			dma_fence_put(fence);
>   		} else {
> -			DRM_ERROR("Failed to run job!\n");
>   			amd_sched_process_job(NULL, &s_fence->cb);
>   		}
>   		spin_lock(&sched->job_list_lock);
> @@ -650,7 +671,7 @@ static int amd_sched_main(void *param)
>   		atomic_inc(&sched->hw_rq_count);
>   		amd_sched_job_begin(sched_job);
>   
> -		fence = sched->ops->run_job(sched_job);
> +		fence = sched->ops->run_job(sched_job, false);
>   		amd_sched_fence_scheduled(s_fence);
>   
>   		if (fence) {
> @@ -664,7 +685,6 @@ static int amd_sched_main(void *param)
>   					  r);
>   			dma_fence_put(fence);
>   		} else {
> -			DRM_ERROR("Failed to run job!\n");
>   			amd_sched_process_job(NULL, &s_fence->cb);
>   		}
>   
> diff --git a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> index f9e3a83..a8e5a7d 100644
> --- a/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> +++ b/drivers/gpu/drm/amd/scheduler/gpu_scheduler.h
> @@ -126,7 +126,7 @@ static inline bool amd_sched_invalidate_job(struct amd_sched_job *s_job, int thr
>   */
>   struct amd_sched_backend_ops {
>   	struct dma_fence *(*dependency)(struct amd_sched_job *sched_job);
> -	struct dma_fence *(*run_job)(struct amd_sched_job *sched_job);
> +	struct dma_fence *(*run_job)(struct amd_sched_job *sched_job, bool skip);
>   	void (*timedout_job)(struct amd_sched_job *sched_job);
>   	void (*free_job)(struct amd_sched_job *sched_job);
>   };


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset
       [not found]         ` <13ad16b1-6357-1f0e-3cdd-ddc42d4462b0-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-26  8:13           ` Liu, Monk
       [not found]             ` <BLUPR12MB04499B9B8F278BE639D144A184450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2017-10-26  8:13 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

Yeah, I know I can squash them,

I split the patch for reviewing convenience.

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017-10-26 15:13
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset

On 25.10.2017 at 11:22, Monk Liu wrote:
> now use new gpu recover

Might be better to squash together with the previous patch.

This one doesn't introduce new functionality, but only removes the old code and switches over to the new one.



>
> Change-Id: Ieccd25772c47c0e710ad81537a3dd0c1767585a1
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 298 -----------------------------
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  10 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |   2 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   5 +-
>   drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
>   drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |   2 +-
>   drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |   2 +-
>   8 files changed, 10 insertions(+), 312 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> index 003668f..335df11 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
> @@ -1844,7 +1844,7 @@ amdgpu_get_sdma_instance(struct amdgpu_ring *ring)
>   #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
>   
>   /* Common functions */
> -int amdgpu_gpu_reset(struct amdgpu_device *adev);
> +int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job);
>   bool amdgpu_need_backup(struct amdgpu_device *adev);
>   void amdgpu_pci_config_reset(struct amdgpu_device *adev);
>   bool amdgpu_need_post(struct amdgpu_device *adev);
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 0db3b3c..a2f9a7f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -2818,304 +2818,6 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
>   	return r;
>   }
>   
> -/**
> - * amdgpu_sriov_gpu_reset - reset the asic
> - *
> - * @adev: amdgpu device pointer
> - * @job: which job trigger hang
> - *
> - * Attempt the reset the GPU if it has hung (all asics).
> - * for SRIOV case.
> - * Returns 0 for success or an error on failure.
> - */
> -int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
> -{
> -	int i, j, r = 0;
> -	int resched;
> -	struct amdgpu_bo *bo, *tmp;
> -	struct amdgpu_ring *ring;
> -	struct dma_fence *fence = NULL, *next = NULL;
> -
> -	mutex_lock(&adev->virt.lock_reset);
> -	atomic_inc(&adev->gpu_reset_counter);
> -	adev->in_sriov_reset = true;
> -
> -	/* block TTM */
> -	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> -
> -	/* we start from the ring trigger GPU hang */
> -	j = job ? job->ring->idx : 0;
> -
> -	/* block scheduler */
> -	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> -		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> -		if (!ring || !ring->sched.thread)
> -			continue;
> -
> -		kthread_park(ring->sched.thread);
> -
> -		if (job && j != i)
> -			continue;
> -
> -		/* here give the last chance to check if job removed from mirror-list
> -		 * since we already pay some time on kthread_park */
> -		if (job && list_empty(&job->base.node)) {
> -			kthread_unpark(ring->sched.thread);
> -			goto give_up_reset;
> -		}
> -
> -		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
> -			amd_sched_job_kickout(&job->base);
> -
> -		/* only do job_reset on the hang ring if @job not NULL */
> -		amd_sched_hw_job_reset(&ring->sched, NULL);
> -
> -		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> -		amdgpu_fence_driver_force_completion(ring);
> -	}
> -
> -	/* request to take full control of GPU before re-initialization  */
> -	if (job)
> -		amdgpu_virt_reset_gpu(adev);
> -	else
> -		amdgpu_virt_request_full_gpu(adev, true);
> -
> -
> -	/* Resume IP prior to SMC */
> -	amdgpu_sriov_reinit_early(adev);
> -
> -	/* we need recover gart prior to run SMC/CP/SDMA resume */
> -	amdgpu_ttm_recover_gart(adev);
> -
> -	/* now we are okay to resume SMC/CP/SDMA */
> -	amdgpu_sriov_reinit_late(adev);
> -
> -	amdgpu_irq_gpu_reset_resume_helper(adev);
> -
> -	if (amdgpu_ib_ring_tests(adev))
> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
> -
> -	/* release full control of GPU after ib test */
> -	amdgpu_virt_release_full_gpu(adev, true);
> -
> -	DRM_INFO("recover vram bo from shadow\n");
> -
> -	ring = adev->mman.buffer_funcs_ring;
> -	mutex_lock(&adev->shadow_list_lock);
> -	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -		next = NULL;
> -		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -		if (fence) {
> -			r = dma_fence_wait(fence, false);
> -			if (r) {
> -				WARN(r, "recovery from shadow isn't completed\n");
> -				break;
> -			}
> -		}
> -
> -		dma_fence_put(fence);
> -		fence = next;
> -	}
> -	mutex_unlock(&adev->shadow_list_lock);
> -
> -	if (fence) {
> -		r = dma_fence_wait(fence, false);
> -		if (r)
> -			WARN(r, "recovery from shadow isn't completed\n");
> -	}
> -	dma_fence_put(fence);
> -
> -	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
> -		ring = adev->rings[i % AMDGPU_MAX_RINGS];
> -		if (!ring || !ring->sched.thread)
> -			continue;
> -
> -		if (job && j != i) {
> -			kthread_unpark(ring->sched.thread);
> -			continue;
> -		}
> -
> -		amd_sched_job_recovery(&ring->sched);
> -		kthread_unpark(ring->sched.thread);
> -	}
> -
> -	drm_helper_resume_force_mode(adev->ddev);
> -give_up_reset:
> -	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> -	if (r) {
> -		/* bad news, how to tell it to userspace ? */
> -		dev_info(adev->dev, "GPU reset failed\n");
> -	} else {
> -		dev_info(adev->dev, "GPU reset successed!\n");
> -	}
> -
> -	adev->in_sriov_reset = false;
> -	mutex_unlock(&adev->virt.lock_reset);
> -	return r;
> -}
> -
> -/**
> - * amdgpu_gpu_reset - reset the asic
> - *
> - * @adev: amdgpu device pointer
> - *
> - * Attempt the reset the GPU if it has hung (all asics).
> - * Returns 0 for success or an error on failure.
> - */
> -int amdgpu_gpu_reset(struct amdgpu_device *adev)
> -{
> -	struct drm_atomic_state *state = NULL;
> -	int i, r;
> -	int resched;
> -	bool need_full_reset, vram_lost = false;
> -
> -	if (!amdgpu_check_soft_reset(adev)) {
> -		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
> -		return 0;
> -	}
> -
> -	atomic_inc(&adev->gpu_reset_counter);
> -
> -	/* block TTM */
> -	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
> -	/* store modesetting */
> -	if (amdgpu_device_has_dc_support(adev))
> -		state = drm_atomic_helper_suspend(adev->ddev);
> -
> -	/* block scheduler */
> -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> -		struct amdgpu_ring *ring = adev->rings[i];
> -
> -		if (!ring || !ring->sched.thread)
> -			continue;
> -		kthread_park(ring->sched.thread);
> -		amd_sched_hw_job_reset(&ring->sched, NULL);
> -		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
> -		amdgpu_fence_driver_force_completion(ring);
> -	}
> -
> -	need_full_reset = amdgpu_need_full_reset(adev);
> -
> -	if (!need_full_reset) {
> -		amdgpu_pre_soft_reset(adev);
> -		r = amdgpu_soft_reset(adev);
> -		amdgpu_post_soft_reset(adev);
> -		if (r || amdgpu_check_soft_reset(adev)) {
> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
> -			need_full_reset = true;
> -		}
> -	}
> -
> -	if (need_full_reset) {
> -		r = amdgpu_suspend(adev);
> -
> -retry:
> -		amdgpu_atombios_scratch_regs_save(adev);
> -		r = amdgpu_asic_reset(adev);
> -		amdgpu_atombios_scratch_regs_restore(adev);
> -		/* post card */
> -		amdgpu_atom_asic_init(adev->mode_info.atom_context);
> -
> -		if (!r) {
> -			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
> -			r = amdgpu_resume_phase1(adev);
> -			if (r)
> -				goto out;
> -			vram_lost = amdgpu_check_vram_lost(adev);
> -			if (vram_lost) {
> -				DRM_ERROR("VRAM is lost!\n");
> -				atomic_inc(&adev->vram_lost_counter);
> -			}
> -			r = amdgpu_ttm_recover_gart(adev);
> -			if (r)
> -				goto out;
> -			r = amdgpu_resume_phase2(adev);
> -			if (r)
> -				goto out;
> -			if (vram_lost)
> -				amdgpu_fill_reset_magic(adev);
> -		}
> -	}
> -out:
> -	if (!r) {
> -		amdgpu_irq_gpu_reset_resume_helper(adev);
> -		r = amdgpu_ib_ring_tests(adev);
> -		if (r) {
> -			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
> -			r = amdgpu_suspend(adev);
> -			need_full_reset = true;
> -			goto retry;
> -		}
> -		/**
> -		 * recovery vm page tables, since we cannot depend on VRAM is
> -		 * consistent after gpu full reset.
> -		 */
> -		if (need_full_reset && amdgpu_need_backup(adev)) {
> -			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
> -			struct amdgpu_bo *bo, *tmp;
> -			struct dma_fence *fence = NULL, *next = NULL;
> -
> -			DRM_INFO("recover vram bo from shadow\n");
> -			mutex_lock(&adev->shadow_list_lock);
> -			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
> -				next = NULL;
> -				amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
> -				if (fence) {
> -					r = dma_fence_wait(fence, false);
> -					if (r) {
> -						WARN(r, "recovery from shadow isn't completed\n");
> -						break;
> -					}
> -				}
> -
> -				dma_fence_put(fence);
> -				fence = next;
> -			}
> -			mutex_unlock(&adev->shadow_list_lock);
> -			if (fence) {
> -				r = dma_fence_wait(fence, false);
> -				if (r)
> -					WARN(r, "recovery from shadow isn't completed\n");
> -			}
> -			dma_fence_put(fence);
> -		}
> -		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> -			struct amdgpu_ring *ring = adev->rings[i];
> -
> -			if (!ring || !ring->sched.thread)
> -				continue;
> -
> -			amd_sched_job_recovery(&ring->sched);
> -			kthread_unpark(ring->sched.thread);
> -		}
> -	} else {
> -		dev_err(adev->dev, "asic resume failed (%d).\n", r);
> -		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
> -			if (adev->rings[i] && adev->rings[i]->sched.thread) {
> -				kthread_unpark(adev->rings[i]->sched.thread);
> -			}
> -		}
> -	}
> -
> -	if (amdgpu_device_has_dc_support(adev)) {
> -		r = drm_atomic_helper_resume(adev->ddev, state);
> -		amdgpu_dm_display_resume(adev);
> -	} else
> -		drm_helper_resume_force_mode(adev->ddev);
> -
> -	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
> -	if (r) {
> -		/* bad news, how to tell it to userspace ? */
> -		dev_info(adev->dev, "GPU reset failed\n");
> -	}
> -	else {
> -		dev_info(adev->dev, "GPU reset successed!\n");
> -	}
> -
> -	amdgpu_vf_error_trans_all(adev);
> -	return r;
> -}
> -
>   static int amdgpu_reset(struct amdgpu_device *adev, uint64_t* reset_flags)
>   {
>   	int r;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index d7374cf..9d67bcb 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -694,25 +694,25 @@ static int amdgpu_debugfs_fence_info(struct seq_file *m, void *data)
>   }
>   
>   /**
> - * amdgpu_debugfs_gpu_reset - manually trigger a gpu reset
> + * amdgpu_debugfs_gpu_recover - manually trigger a gpu reset & recover
>    *
>    * Manually trigger a gpu reset at the next fence wait.
>    */
> -static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
> +static int amdgpu_debugfs_gpu_recover(struct seq_file *m, void *data)
>   {
>   	struct drm_info_node *node = (struct drm_info_node *) m->private;
>   	struct drm_device *dev = node->minor->dev;
>   	struct amdgpu_device *adev = dev->dev_private;
>   
> -	seq_printf(m, "gpu reset\n");
> -	amdgpu_gpu_reset(adev);
> +	seq_printf(m, "gpu recover\n");
> +	amdgpu_gpu_recover(adev, NULL);
>   
>   	return 0;
>   }
>   
>   static const struct drm_info_list amdgpu_debugfs_fence_list[] = {
>   	{"amdgpu_fence_info", &amdgpu_debugfs_fence_info, 0, NULL},
> -	{"amdgpu_gpu_reset", &amdgpu_debugfs_gpu_reset, 0, NULL}
> +	{"amdgpu_gpu_recover", &amdgpu_debugfs_gpu_recover, 0, NULL}
>   };
>   
>   static const struct drm_info_list amdgpu_debugfs_fence_list_sriov[] = {
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> index 47c5ce9..9c2f87c 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
> @@ -88,7 +88,7 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>   						  reset_work);
>   
>   	if (!amdgpu_sriov_vf(adev))
> -		amdgpu_gpu_reset(adev);
> +		amdgpu_gpu_recover(adev, NULL);
>   }
>   
>   /* Disable *all* interrupts */
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> index f08fde9..ac6a4f3 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
> @@ -37,10 +37,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>   		  atomic_read(&job->ring->fence_drv.last_seq),
>   		  job->ring->fence_drv.sync_seq);
>   
> -	if (amdgpu_sriov_vf(job->adev))
> -		amdgpu_sriov_gpu_reset(job->adev, job);
> -	else
> -		amdgpu_gpu_reset(job->adev);
> +	amdgpu_gpu_recover(job->adev, job);
>   }
>   
>   int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> index b89d37f..3a661aa 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
> @@ -285,7 +285,6 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v);
>   int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
>   int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
>   int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
> -int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
>   int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>   void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
>   int amdgpu_virt_fw_reserve_get_checksum(void *obj, unsigned long obj_size,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> index b4906d2..f8522a0 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
> @@ -254,7 +254,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>   	}
>   
>   	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> index c25a831..dae6d3a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
> @@ -514,7 +514,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>   	}
>   
>   	/* Trigger recovery due to world switch failure */
> -	amdgpu_sriov_gpu_reset(adev, NULL);
> +	amdgpu_gpu_recover(adev, NULL);
>   }
>   
>   static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset
       [not found]             ` <BLUPR12MB04499B9B8F278BE639D144A184450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-26  8:16               ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-26  8:16 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> I spit the patch due to convenience for reviewing
Thought so, but this way we can't see the delta for reviewing.

Splitting up into multiple patches only makes sense if you can limit
each patch to one component or functionality change at a time.

Christian.

On 26.10.2017 at 10:13, Liu, Monk wrote:
> Yeah I know I can squash them,
>
> I spit the patch due to convenience for reviewing
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月26日 15:13
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset
>
> Am 25.10.2017 um 11:22 schrieb Monk Liu:
>> now use new gpu recover
> Might be better to squash together with the previous patch.
>
> This one doesn't introduce new functionality, but only removes the old code and switches over to the new one.
>
>
>
>> Change-Id: Ieccd25772c47c0e710ad81537a3dd0c1767585a1
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu.h        |   2 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 298 -----------------------------
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  10 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c    |   2 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_job.c    |   5 +-
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h   |   1 -
>>    drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c      |   2 +-
>>    drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c      |   2 +-
>>    8 files changed, 10 insertions(+), 312 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu.h b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> index 003668f..335df11 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu.h
>> @@ -1844,7 +1844,7 @@ amdgpu_get_sdma_instance(struct amdgpu_ring *ring)
>>    #define amdgpu_psp_check_fw_loading_status(adev, i) (adev)->firmware.funcs->check_fw_loading_status((adev), (i))
>>    
>>    /* Common functions */
>> -int amdgpu_gpu_reset(struct amdgpu_device *adev);
>> +int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job* job);
>>    bool amdgpu_need_backup(struct amdgpu_device *adev);
>>    void amdgpu_pci_config_reset(struct amdgpu_device *adev);
>>    bool amdgpu_need_post(struct amdgpu_device *adev);
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 0db3b3c..a2f9a7f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -2818,304 +2818,6 @@ static int amdgpu_recover_vram_from_shadow(struct amdgpu_device *adev,
>>    	return r;
>>    }
>>    
>> -/**
>> - * amdgpu_sriov_gpu_reset - reset the asic
>> - *
>> - * @adev: amdgpu device pointer
>> - * @job: which job trigger hang
>> - *
>> - * Attempt the reset the GPU if it has hung (all asics).
>> - * for SRIOV case.
>> - * Returns 0 for success or an error on failure.
>> - */
>> -int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job)
>> -{
>> -	int i, j, r = 0;
>> -	int resched;
>> -	struct amdgpu_bo *bo, *tmp;
>> -	struct amdgpu_ring *ring;
>> -	struct dma_fence *fence = NULL, *next = NULL;
>> -
>> -	mutex_lock(&adev->virt.lock_reset);
>> -	atomic_inc(&adev->gpu_reset_counter);
>> -	adev->in_sriov_reset = true;
>> -
>> -	/* block TTM */
>> -	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>> -
>> -	/* we start from the ring trigger GPU hang */
>> -	j = job ? job->ring->idx : 0;
>> -
>> -	/* block scheduler */
>> -	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>> -		ring = adev->rings[i % AMDGPU_MAX_RINGS];
>> -		if (!ring || !ring->sched.thread)
>> -			continue;
>> -
>> -		kthread_park(ring->sched.thread);
>> -
>> -		if (job && j != i)
>> -			continue;
>> -
>> -		/* here give the last chance to check if job removed from mirror-list
>> -		 * since we already pay some time on kthread_park */
>> -		if (job && list_empty(&job->base.node)) {
>> -			kthread_unpark(ring->sched.thread);
>> -			goto give_up_reset;
>> -		}
>> -
>> -		if (amd_sched_invalidate_job(&job->base, amdgpu_job_hang_limit))
>> -			amd_sched_job_kickout(&job->base);
>> -
>> -		/* only do job_reset on the hang ring if @job not NULL */
>> -		amd_sched_hw_job_reset(&ring->sched, NULL);
>> -
>> -		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>> -		amdgpu_fence_driver_force_completion(ring);
>> -	}
>> -
>> -	/* request to take full control of GPU before re-initialization  */
>> -	if (job)
>> -		amdgpu_virt_reset_gpu(adev);
>> -	else
>> -		amdgpu_virt_request_full_gpu(adev, true);
>> -
>> -
>> -	/* Resume IP prior to SMC */
>> -	amdgpu_sriov_reinit_early(adev);
>> -
>> -	/* we need recover gart prior to run SMC/CP/SDMA resume */
>> -	amdgpu_ttm_recover_gart(adev);
>> -
>> -	/* now we are okay to resume SMC/CP/SDMA */
>> -	amdgpu_sriov_reinit_late(adev);
>> -
>> -	amdgpu_irq_gpu_reset_resume_helper(adev);
>> -
>> -	if (amdgpu_ib_ring_tests(adev))
>> -		dev_err(adev->dev, "[GPU_RESET] ib ring test failed (%d).\n", r);
>> -
>> -	/* release full control of GPU after ib test */
>> -	amdgpu_virt_release_full_gpu(adev, true);
>> -
>> -	DRM_INFO("recover vram bo from shadow\n");
>> -
>> -	ring = adev->mman.buffer_funcs_ring;
>> -	mutex_lock(&adev->shadow_list_lock);
>> -	list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
>> -		next = NULL;
>> -		amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
>> -		if (fence) {
>> -			r = dma_fence_wait(fence, false);
>> -			if (r) {
>> -				WARN(r, "recovery from shadow isn't completed\n");
>> -				break;
>> -			}
>> -		}
>> -
>> -		dma_fence_put(fence);
>> -		fence = next;
>> -	}
>> -	mutex_unlock(&adev->shadow_list_lock);
>> -
>> -	if (fence) {
>> -		r = dma_fence_wait(fence, false);
>> -		if (r)
>> -			WARN(r, "recovery from shadow isn't completed\n");
>> -	}
>> -	dma_fence_put(fence);
>> -
>> -	for (i = j; i < j + AMDGPU_MAX_RINGS; ++i) {
>> -		ring = adev->rings[i % AMDGPU_MAX_RINGS];
>> -		if (!ring || !ring->sched.thread)
>> -			continue;
>> -
>> -		if (job && j != i) {
>> -			kthread_unpark(ring->sched.thread);
>> -			continue;
>> -		}
>> -
>> -		amd_sched_job_recovery(&ring->sched);
>> -		kthread_unpark(ring->sched.thread);
>> -	}
>> -
>> -	drm_helper_resume_force_mode(adev->ddev);
>> -give_up_reset:
>> -	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
>> -	if (r) {
>> -		/* bad news, how to tell it to userspace ? */
>> -		dev_info(adev->dev, "GPU reset failed\n");
>> -	} else {
>> -		dev_info(adev->dev, "GPU reset successed!\n");
>> -	}
>> -
>> -	adev->in_sriov_reset = false;
>> -	mutex_unlock(&adev->virt.lock_reset);
>> -	return r;
>> -}
>> -
>> -/**
>> - * amdgpu_gpu_reset - reset the asic
>> - *
>> - * @adev: amdgpu device pointer
>> - *
>> - * Attempt the reset the GPU if it has hung (all asics).
>> - * Returns 0 for success or an error on failure.
>> - */
>> -int amdgpu_gpu_reset(struct amdgpu_device *adev)
>> -{
>> -	struct drm_atomic_state *state = NULL;
>> -	int i, r;
>> -	int resched;
>> -	bool need_full_reset, vram_lost = false;
>> -
>> -	if (!amdgpu_check_soft_reset(adev)) {
>> -		DRM_INFO("No hardware hang detected. Did some blocks stall?\n");
>> -		return 0;
>> -	}
>> -
>> -	atomic_inc(&adev->gpu_reset_counter);
>> -
>> -	/* block TTM */
>> -	resched = ttm_bo_lock_delayed_workqueue(&adev->mman.bdev);
>> -	/* store modesetting */
>> -	if (amdgpu_device_has_dc_support(adev))
>> -		state = drm_atomic_helper_suspend(adev->ddev);
>> -
>> -	/* block scheduler */
>> -	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> -		struct amdgpu_ring *ring = adev->rings[i];
>> -
>> -		if (!ring || !ring->sched.thread)
>> -			continue;
>> -		kthread_park(ring->sched.thread);
>> -		amd_sched_hw_job_reset(&ring->sched, NULL);
>> -		/* after all hw jobs are reset, hw fence is meaningless, so force_completion */
>> -		amdgpu_fence_driver_force_completion(ring);
>> -	}
>> -
>> -	need_full_reset = amdgpu_need_full_reset(adev);
>> -
>> -	if (!need_full_reset) {
>> -		amdgpu_pre_soft_reset(adev);
>> -		r = amdgpu_soft_reset(adev);
>> -		amdgpu_post_soft_reset(adev);
>> -		if (r || amdgpu_check_soft_reset(adev)) {
>> -			DRM_INFO("soft reset failed, will fallback to full reset!\n");
>> -			need_full_reset = true;
>> -		}
>> -	}
>> -
>> -	if (need_full_reset) {
>> -		r = amdgpu_suspend(adev);
>> -
>> -retry:
>> -		amdgpu_atombios_scratch_regs_save(adev);
>> -		r = amdgpu_asic_reset(adev);
>> -		amdgpu_atombios_scratch_regs_restore(adev);
>> -		/* post card */
>> -		amdgpu_atom_asic_init(adev->mode_info.atom_context);
>> -
>> -		if (!r) {
>> -			dev_info(adev->dev, "GPU reset succeeded, trying to resume\n");
>> -			r = amdgpu_resume_phase1(adev);
>> -			if (r)
>> -				goto out;
>> -			vram_lost = amdgpu_check_vram_lost(adev);
>> -			if (vram_lost) {
>> -				DRM_ERROR("VRAM is lost!\n");
>> -				atomic_inc(&adev->vram_lost_counter);
>> -			}
>> -			r = amdgpu_ttm_recover_gart(adev);
>> -			if (r)
>> -				goto out;
>> -			r = amdgpu_resume_phase2(adev);
>> -			if (r)
>> -				goto out;
>> -			if (vram_lost)
>> -				amdgpu_fill_reset_magic(adev);
>> -		}
>> -	}
>> -out:
>> -	if (!r) {
>> -		amdgpu_irq_gpu_reset_resume_helper(adev);
>> -		r = amdgpu_ib_ring_tests(adev);
>> -		if (r) {
>> -			dev_err(adev->dev, "ib ring test failed (%d).\n", r);
>> -			r = amdgpu_suspend(adev);
>> -			need_full_reset = true;
>> -			goto retry;
>> -		}
>> -		/**
>> -		 * recovery vm page tables, since we cannot depend on VRAM is
>> -		 * consistent after gpu full reset.
>> -		 */
>> -		if (need_full_reset && amdgpu_need_backup(adev)) {
>> -			struct amdgpu_ring *ring = adev->mman.buffer_funcs_ring;
>> -			struct amdgpu_bo *bo, *tmp;
>> -			struct dma_fence *fence = NULL, *next = NULL;
>> -
>> -			DRM_INFO("recover vram bo from shadow\n");
>> -			mutex_lock(&adev->shadow_list_lock);
>> -			list_for_each_entry_safe(bo, tmp, &adev->shadow_list, shadow_list) {
>> -				next = NULL;
>> -				amdgpu_recover_vram_from_shadow(adev, ring, bo, &next);
>> -				if (fence) {
>> -					r = dma_fence_wait(fence, false);
>> -					if (r) {
>> -						WARN(r, "recovery from shadow isn't completed\n");
>> -						break;
>> -					}
>> -				}
>> -
>> -				dma_fence_put(fence);
>> -				fence = next;
>> -			}
>> -			mutex_unlock(&adev->shadow_list_lock);
>> -			if (fence) {
>> -				r = dma_fence_wait(fence, false);
>> -				if (r)
>> -					WARN(r, "recovery from shadow isn't completed\n");
>> -			}
>> -			dma_fence_put(fence);
>> -		}
>> -		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> -			struct amdgpu_ring *ring = adev->rings[i];
>> -
>> -			if (!ring || !ring->sched.thread)
>> -				continue;
>> -
>> -			amd_sched_job_recovery(&ring->sched);
>> -			kthread_unpark(ring->sched.thread);
>> -		}
>> -	} else {
>> -		dev_err(adev->dev, "asic resume failed (%d).\n", r);
>> -		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>> -			if (adev->rings[i] && adev->rings[i]->sched.thread) {
>> -				kthread_unpark(adev->rings[i]->sched.thread);
>> -			}
>> -		}
>> -	}
>> -
>> -	if (amdgpu_device_has_dc_support(adev)) {
>> -		r = drm_atomic_helper_resume(adev->ddev, state);
>> -		amdgpu_dm_display_resume(adev);
>> -	} else
>> -		drm_helper_resume_force_mode(adev->ddev);
>> -
>> -	ttm_bo_unlock_delayed_workqueue(&adev->mman.bdev, resched);
>> -	if (r) {
>> -		/* bad news, how to tell it to userspace ? */
>> -		dev_info(adev->dev, "GPU reset failed\n");
>> -	}
>> -	else {
>> -		dev_info(adev->dev, "GPU reset successed!\n");
>> -	}
>> -
>> -	amdgpu_vf_error_trans_all(adev);
>> -	return r;
>> -}
>> -
>>    static int amdgpu_reset(struct amdgpu_device *adev, uint64_t* reset_flags)
>>    {
>>    	int r;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index d7374cf..9d67bcb 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -694,25 +694,25 @@ static int amdgpu_debugfs_fence_info(struct seq_file *m, void *data)
>>    }
>>    
>>    /**
>> - * amdgpu_debugfs_gpu_reset - manually trigger a gpu reset
>> + * amdgpu_debugfs_gpu_recover - manually trigger a gpu reset & recover
>>     *
>>     * Manually trigger a gpu reset at the next fence wait.
>>     */
>> -static int amdgpu_debugfs_gpu_reset(struct seq_file *m, void *data)
>> +static int amdgpu_debugfs_gpu_recover(struct seq_file *m, void *data)
>>    {
>>    	struct drm_info_node *node = (struct drm_info_node *) m->private;
>>    	struct drm_device *dev = node->minor->dev;
>>    	struct amdgpu_device *adev = dev->dev_private;
>>    
>> -	seq_printf(m, "gpu reset\n");
>> -	amdgpu_gpu_reset(adev);
>> +	seq_printf(m, "gpu recover\n");
>> +	amdgpu_gpu_recover(adev, NULL);
>>    
>>    	return 0;
>>    }
>>    
>>    static const struct drm_info_list amdgpu_debugfs_fence_list[] = {
>>    	{"amdgpu_fence_info", &amdgpu_debugfs_fence_info, 0, NULL},
>> -	{"amdgpu_gpu_reset", &amdgpu_debugfs_gpu_reset, 0, NULL}
>> +	{"amdgpu_gpu_recover", &amdgpu_debugfs_gpu_recover, 0, NULL}
>>    };
>>    
>>    static const struct drm_info_list amdgpu_debugfs_fence_list_sriov[] = {
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> index 47c5ce9..9c2f87c 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_irq.c
>> @@ -88,7 +88,7 @@ static void amdgpu_irq_reset_work_func(struct work_struct *work)
>>    						  reset_work);
>>    
>>    	if (!amdgpu_sriov_vf(adev))
>> -		amdgpu_gpu_reset(adev);
>> +		amdgpu_gpu_recover(adev, NULL);
>>    }
>>    
>>    /* Disable *all* interrupts */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> index f08fde9..ac6a4f3 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c
>> @@ -37,10 +37,7 @@ static void amdgpu_job_timedout(struct amd_sched_job *s_job)
>>    		  atomic_read(&job->ring->fence_drv.last_seq),
>>    		  job->ring->fence_drv.sync_seq);
>>    
>> -	if (amdgpu_sriov_vf(job->adev))
>> -		amdgpu_sriov_gpu_reset(job->adev, job);
>> -	else
>> -		amdgpu_gpu_reset(job->adev);
>> +	amdgpu_gpu_recover(job->adev, job);
>>    }
>>    
>>    int amdgpu_job_alloc(struct amdgpu_device *adev, unsigned num_ibs,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>> index b89d37f..3a661aa 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_virt.h
>> @@ -285,7 +285,6 @@ void amdgpu_virt_kiq_wreg(struct amdgpu_device *adev, uint32_t reg, uint32_t v);
>>    int amdgpu_virt_request_full_gpu(struct amdgpu_device *adev, bool init);
>>    int amdgpu_virt_release_full_gpu(struct amdgpu_device *adev, bool init);
>>    int amdgpu_virt_reset_gpu(struct amdgpu_device *adev);
>> -int amdgpu_sriov_gpu_reset(struct amdgpu_device *adev, struct amdgpu_job *job);
>>    int amdgpu_virt_alloc_mm_table(struct amdgpu_device *adev);
>>    void amdgpu_virt_free_mm_table(struct amdgpu_device *adev);
>>    int amdgpu_virt_fw_reserve_get_checksum(void *obj, unsigned long obj_size,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> index b4906d2..f8522a0 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_ai.c
>> @@ -254,7 +254,7 @@ static void xgpu_ai_mailbox_flr_work(struct work_struct *work)
>>    	}
>>    
>>    	/* Trigger recovery due to world switch failure */
>> -	amdgpu_sriov_gpu_reset(adev, NULL);
>> +	amdgpu_gpu_recover(adev, NULL);
>>    }
>>    
>>    static int xgpu_ai_set_mailbox_rcv_irq(struct amdgpu_device *adev,
>> diff --git a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>> index c25a831..dae6d3a 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/mxgpu_vi.c
>> @@ -514,7 +514,7 @@ static void xgpu_vi_mailbox_flr_work(struct work_struct *work)
>>    	}
>>    
>>    	/* Trigger recovery due to world switch failure */
>> -	amdgpu_sriov_gpu_reset(adev, NULL);
>> +	amdgpu_gpu_recover(adev, NULL);
>>    }
>>    
>>    static int xgpu_vi_set_mailbox_rcv_irq(struct amdgpu_device *adev,
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]         ` <46bc9872-0bd7-c2b7-4879-6cab5789f252-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-26  8:17           ` Liu, Monk
       [not found]             ` <BLUPR12MB0449D707ECAE76FFAA13C46684450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2017-10-26  8:17 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

When the amdgpu_gpu_recover() routine is in flight, we shouldn't let UMD open our device, otherwise the VM init would be ruined by gpu_recover().

E.g. VM init needs to create page tables, but keep in mind that gpu_recover() calls ASIC RESET,

so if we don't block device open while the GPU is doing recovery, the VM init (SDMA working on page table creation) would be ruined by the ASIC RESET.

Do you have any good solution? The key point is to avoid/delay/push back HW activities from the UMD side while we are running in the gpu_recover() function.

BR Monk

-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017-10-26 15:18
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

NAK, why the heck should we do this? It would just block all new processes from using the device.

Christian.

On 25.10.2017 at 11:22, Monk Liu wrote:
> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> index 4a9f749..c155ce4 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>   	if (r < 0)
>   		return r;
>   
> +	if (adev->in_gpu_reset)
> +		return -ENODEV;
> +
>   	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>   	if (unlikely(!fpriv)) {
>   		r = -ENOMEM;


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]             ` <BLUPR12MB0449D707ECAE76FFAA13C46684450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-26 10:54               ` Christian König
       [not found]                 ` <06a765e3-971e-698c-6d07-1dee69c1e0c8-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2017-10-26 10:54 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> if we don't block device open while gpu doing recover, the vm init (SDMA working on page table creating) would be ruined by ASIC RESET
That is not a problem at all. SDMA just does some clear operation on the 
page tables and those are either recovered from the shadow or run after 
the reset.

Regards,
Christian.

On 26.10.2017 at 10:17, Liu, Monk wrote:
> When amdgpu_gpu_recover() routine is in the fly, we shouldn't let UMD open our device, otherwise the VM init would be ruined by gpu_recover().
>
> e.g. VM init need to create page table, but keep In mind that gpu_recover() calls ASIC RESET,
>
> if we don't block device open while gpu doing recover, the vm init (SDMA working on page table creating) would be ruined by ASIC RESET
>
> do you have any good solution ? the key point is avoid/delay/push_back hw activities from UMD side when we are running in gpu_recover() function
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月26日 15:18
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
> NAK, why the heck should we do this? It would just block all new processes from using the device.
>
> Christian.
>
> Am 25.10.2017 um 11:22 schrieb Monk Liu:
>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>    1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 4a9f749..c155ce4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>    	if (r < 0)
>>    		return r;
>>    
>> +	if (adev->in_gpu_reset)
>> +		return -ENODEV;
>> +
>>    	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>    	if (unlikely(!fpriv)) {
>>    		r = -ENOMEM;
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]                 ` <06a765e3-971e-698c-6d07-1dee69c1e0c8-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-26 11:08                   ` Liu, Monk
       [not found]                     ` <BLUPR12MB044997E32FD02ACDBD978C7084450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2017-10-26 11:08 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

"Clear operation on the page table" is some kind of SDMA activity, right? What if the ASIC RESET from amdgpu_gpu_recover() interrupts this activity in flight?

BR Monk

-----Original Message-----
From: Koenig, Christian 
Sent: 2017-10-26 18:54
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

> if we don't block device open while gpu doing recover, the vm init 
> (SDMA working on page table creating) would be ruined by ASIC RESET
That is not a problem at all. SDMA just does some clear operation on the page tables and those are either recovered from the shadow or run after the reset.

Regards,
Christian.

On 26.10.2017 at 10:17, Liu, Monk wrote:
> When amdgpu_gpu_recover() routine is in the fly, we shouldn't let UMD open our device, otherwise the VM init would be ruined by gpu_recover().
>
> e.g. VM init need to create page table, but keep In mind that 
> gpu_recover() calls ASIC RESET,
>
> if we don't block device open while gpu doing recover, the vm init 
> (SDMA working on page table creating) would be ruined by ASIC RESET
>
> do you have any good solution ? the key point is avoid/delay/push_back 
> hw activities from UMD side when we are running in gpu_recover() 
> function
>
> BR Monk
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017年10月26日 15:18
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
> NAK, why the heck should we do this? It would just block all new processes from using the device.
>
> Christian.
>
> Am 25.10.2017 um 11:22 schrieb Monk Liu:
>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>    1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> index 4a9f749..c155ce4 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>    	if (r < 0)
>>    		return r;
>>    
>> +	if (adev->in_gpu_reset)
>> +		return -ENODEV;
>> +
>>    	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>    	if (unlikely(!fpriv)) {
>>    		r = -ENOMEM;
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]                     ` <BLUPR12MB044997E32FD02ACDBD978C7084450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-26 15:15                       ` Christian König
       [not found]                         ` <46a5317f-e606-5cfe-82c9-46df64818435-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2017-10-26 15:15 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

On 26.10.2017 at 13:08, Liu, Monk wrote:
> "Clear operation on the page table " is some kind of SDMA activity right? What if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
I can't see any difference between the handling of existing VMs and
newly created ones.

Either we have correct handling and can redo the activity or we have 
corrupted VM page tables and crash again immediately.

So we need to handle this gracefully anyway,
Christian.

>
> BR Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月26日 18:54
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
>> if we don't block device open while gpu doing recover, the vm init
>> (SDMA working on page table creating) would be ruined by ASIC RESET
> That is not a problem at all. SDMA just does some clear operation on the page tables and those are either recovered from the shadow or run after the reset.
>
> Regards,
> Christian.
>
> Am 26.10.2017 um 10:17 schrieb Liu, Monk:
>> When amdgpu_gpu_recover() routine is in the fly, we shouldn't let UMD open our device, otherwise the VM init would be ruined by gpu_recover().
>>
>> e.g. VM init need to create page table, but keep In mind that
>> gpu_recover() calls ASIC RESET,
>>
>> if we don't block device open while gpu doing recover, the vm init
>> (SDMA working on page table creating) would be ruined by ASIC RESET
>>
>> do you have any good solution ? the key point is avoid/delay/push_back
>> hw activities from UMD side when we are running in gpu_recover()
>> function
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2017年10月26日 15:18
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>
>> NAK, why the heck should we do this? It would just block all new processes from using the device.
>>
>> Christian.
>>
>> Am 25.10.2017 um 11:22 schrieb Monk Liu:
>>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>     1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 4a9f749..c155ce4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>     	if (r < 0)
>>>     		return r;
>>>     
>>> +	if (adev->in_gpu_reset)
>>> +		return -ENODEV;
>>> +
>>>     	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>>     	if (unlikely(!fpriv)) {
>>>     		r = -ENOMEM;
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]                         ` <46a5317f-e606-5cfe-82c9-46df64818435-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2017-10-30  3:47                           ` Liu, Monk
       [not found]                             ` <BLUPR12MB0449DD842E46A942CF53484084590-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2017-10-30  3:47 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> I can't see any difference between the handling of existing VMs and new created ones.
I know, for existing VMs we still have similar problems; I'm not saying this patch can solve the existing VM problem ...

My earliest patch series actually used a way that can 100% avoid such problems: take an RW lock in drm_ioctl() and gpu_recover(); drm_ioctl() takes the
read lock, and gpu_recover() takes the write lock.
But you gave a NAK on this approach, so I want to hear your idea.
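
Roughly, that rejected approach looked like this (sketch from memory; the rw_semaphore name reset_sem is just an example):

/* every IOCTL is a reader */
long amdgpu_drm_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
	struct drm_file *file_priv = filp->private_data;
	struct amdgpu_device *adev = file_priv->minor->dev->dev_private;
	long ret;

	down_read(&adev->reset_sem);
	ret = drm_ioctl(filp, cmd, arg);
	up_read(&adev->reset_sem);
	return ret;
}

/* the reset path is the only writer */
int amdgpu_gpu_recover(struct amdgpu_device *adev, struct amdgpu_job *job)
{
	down_write(&adev->reset_sem);
	/* ... park schedulers, reset ASIC, resume IPs, recover VRAM ... */
	up_write(&adev->reset_sem);
	return 0;
}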

>Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
The thing is, some VM activity does not go through the GPU scheduler (it is submitted directly); if it is interrupted by gpu_recover() it's not going to be re-scheduled again ...


> So we need to handle this gracefully anyway, Christian.
Yeah, I'd like to hear how.


BR Monk



-----Original Message-----
From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com] 
Sent: 2017-10-26 23:15
To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

On 26.10.2017 at 13:08, Liu, Monk wrote:
> "Clear operation on the page table " is some kind of SDMA activity right? What if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
I can't see any difference between the handling of existing VMs and new created ones.

Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.

So we need to handle this gracefully anyway, Christian.

>
> BR Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017年10月26日 18:54
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
>> if we don't block device open while gpu doing recover, the vm init 
>> (SDMA working on page table creating) would be ruined by ASIC RESET
> That is not a problem at all. SDMA just does some clear operation on the page tables and those are either recovered from the shadow or run after the reset.
>
> Regards,
> Christian.
>
> Am 26.10.2017 um 10:17 schrieb Liu, Monk:
>> When amdgpu_gpu_recover() routine is in the fly, we shouldn't let UMD open our device, otherwise the VM init would be ruined by gpu_recover().
>>
>> e.g. VM init need to create page table, but keep In mind that
>> gpu_recover() calls ASIC RESET,
>>
>> if we don't block device open while gpu doing recover, the vm init 
>> (SDMA working on page table creating) would be ruined by ASIC RESET
>>
>> do you have any good solution ? the key point is 
>> avoid/delay/push_back hw activities from UMD side when we are running 
>> in gpu_recover() function
>>
>> BR Monk
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2017年10月26日 15:18
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>
>> NAK, why the heck should we do this? It would just block all new processes from using the device.
>>
>> Christian.
>>
>> Am 25.10.2017 um 11:22 schrieb Monk Liu:
>>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>     1 file changed, 3 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> index 4a9f749..c155ce4 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>     	if (r < 0)
>>>     		return r;
>>>     
>>> +	if (adev->in_gpu_reset)
>>> +		return -ENODEV;
>>> +
>>>     	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>>     	if (unlikely(!fpriv)) {
>>>     		r = -ENOMEM;
> _______________________________________________
> amd-gfx mailing list
> amd-gfx@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]                             ` <BLUPR12MB0449DD842E46A942CF53484084590-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-30  7:37                               ` Christian König
       [not found]                                 ` <f2213f71-9280-bb51-5283-5349878956f7-5C7GfCeVMHo@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Christian König @ 2017-10-30  7:37 UTC (permalink / raw)
  To: Liu, Monk, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

>> I can't see any difference between the handling of existing VMs and new created ones.
> I know, for existing VMs we still have similar problems, I'm not saying this patch can save existing VM problem ...
>
> My eldest patch series actually use a way can 100% avoid such problem: use RW mlock on drm_ioctl and gpu_recover(), drm_ioctl() take the
> Read lock, and gpu_recover() take the write lock.
> But you gave NAK on this approach, so I want to hear your idea.
Well, what I wanted to say is we simply don't have a problem here. The 
handling for new VMs is exactly the same as for existing VMs.

Regarding the lock, that is still a big NAK; we just fixed the
(hopefully) last deadlock between IOCTLs and GPU reset. Introducing any
new locks there would cause problems with that once more.

What we can do is use a struct completion to prevent new changes from 
piling up while we are in the GPU reset code.
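
Something like the following (untested sketch; the field name reset_done and the exact hook points are only for illustration):

	/* once at device init */
	init_completion(&adev->reset_done);
	complete_all(&adev->reset_done);	/* not in reset, nobody blocks */

	/* in amdgpu_gpu_recover() */
	reinit_completion(&adev->reset_done);	/* newcomers wait from here on */
	/* ... park schedulers, reset the ASIC, resume IPs ... */
	complete_all(&adev->reset_done);	/* release them again */

	/* in entry points that must not race with the reset, e.g. open */
	wait_for_completion(&adev->reset_done);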

> The thing is some VM activity is not go through GPU scheduler (direct), if it is interrupted by gpu_recover() it's not going to be re-scheduled again ...
That is not correct. All VM activity goes through the GPU scheduler as 
well. So we are certainly going to re-schedule everything.

>> So we need to handle this gracefully anyway, Christian.
> Yeah I'd like to hear
As I said we don't need to do anything, the handling is correct as it is 
right now.

E.g. we have a shadow copy of the page tables in system memory to 
recover after VRAM is lost, and we reschedule any pending operation 
after a GPU reset.
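
Roughly, the shadow mechanism works like the following sketch. The types and helpers below are hypothetical, not the real amdgpu implementation: each page-table BO in VRAM keeps a shadow BO in system memory, and after a reset that lost VRAM the shadows are copied back before anything is rescheduled.

#include <linux/list.h>

/* Hypothetical types and helpers, for illustration only. */
struct pt_bo_sketch {
	struct list_head vm_list;
	struct pt_bo_sketch *shadow;	/* copy kept in system memory */
};

struct vm_sketch {
	struct list_head pt_bos;	/* all page-table BOs of this VM */
};

int sketch_copy_bo(struct pt_bo_sketch *src, struct pt_bo_sketch *dst);

static int sketch_recover_page_tables(struct vm_sketch *vm)
{
	struct pt_bo_sketch *bo;
	int r;

	list_for_each_entry(bo, &vm->pt_bos, vm_list) {
		if (!bo->shadow)
			continue;
		/* restore the VRAM page table from its system-memory shadow */
		r = sketch_copy_bo(bo->shadow, bo);
		if (r)
			return r;
	}
	return 0;
}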

That should be sufficient to handle all existing VM cases, and even when 
it's not, we need to work on the general handling and not add checks 
to the IOCTL code that prevent applications from loading the driver.

Regards,
Christian.

On 30.10.2017 at 04:47, Liu, Monk wrote:
>> I can't see any difference between the handling of existing VMs and newly created ones.
> I know; for existing VMs we still have similar problems. I'm not saying this patch solves the existing-VM problem ...
>
> My earliest patch series actually used an approach that can 100% avoid such problems: take an RW lock around drm_ioctl() and gpu_recover(); drm_ioctl() takes the
> read lock, and gpu_recover() takes the write lock.
> But you gave a NAK on that approach, so I want to hear your idea.
>
>> Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
> The thing is, some VM activity does not go through the GPU scheduler (it is submitted directly); if it is interrupted by gpu_recover() it is not going to be re-scheduled ...
>
>
>> So we need to handle this gracefully anyway, Christian.
> Yeah, I'd like to hear it.
>
>
> BR Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017-10-26 23:15
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
> On 26.10.2017 at 13:08, Liu, Monk wrote:
>> "Clear operation on the page table " is some kind of SDMA activity right? What if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
> I can't see any difference between the handling of existing VMs and newly created ones.
>
> Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
>
> So we need to handle this gracefully anyway, Christian.
>
>> BR Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2017-10-26 18:54
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>
>>> if we don't block device open while the GPU is doing recovery, the VM init
>>> (SDMA working on page table creation) would be ruined by the ASIC RESET
>> That is not a problem at all. SDMA just does some clear operation on the page tables and those are either recovered from the shadow or run after the reset.
>>
>> Regards,
>> Christian.
>>
>> On 26.10.2017 at 10:17, Liu, Monk wrote:
>>> When the amdgpu_gpu_recover() routine is in flight, we shouldn't let UMD open our device; otherwise the VM init would be ruined by gpu_recover().
>>>
>>> e.g. VM init needs to create page tables, but keep in mind that
>>> gpu_recover() calls ASIC RESET,
>>>
>>> if we don't block device open while the GPU is doing recovery, the VM init
>>> (SDMA working on page table creation) would be ruined by the ASIC RESET
>>>
>>> do you have any good solution? the key point is to
>>> avoid/delay/push back HW activities from the UMD side while we are running
>>> in the gpu_recover() function
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2017-10-26 15:18
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>>
>>> NAK, why the heck should we do this? It would just block all new processes from using the device.
>>>
>>> Christian.
>>>
>>> On 25.10.2017 at 11:22, Monk Liu wrote:
>>>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>>      1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 4a9f749..c155ce4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>>      	if (r < 0)
>>>>      		return r;
>>>>      
>>>> +	if (adev->in_gpu_reset)
>>>> +		return -ENODEV;
>>>> +
>>>>      	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>>>      	if (unlikely(!fpriv)) {
>>>>      		r = -ENOMEM;
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* RE: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]                                 ` <f2213f71-9280-bb51-5283-5349878956f7-5C7GfCeVMHo@public.gmane.org>
@ 2017-10-30  7:54                                   ` Liu, Monk
       [not found]                                     ` <SN1PR12MB04629FAC65DFD5F133D2E34484590-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
  0 siblings, 1 reply; 30+ messages in thread
From: Liu, Monk @ 2017-10-30  7:54 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

> That is not correct. All VM activity goes through the GPU scheduler as well. So we are certainly going to re-schedule everything.

I'll check on that. If everything goes through the scheduler, that will be fine; in that case simply not increasing the karma of jobs from the kernel RQ is enough.
To prevent VM jobs from being corrupted by a GPU reset, we can re-schedule those jobs if they are not signaled (but that needs a non-zero timeout).
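
As a rough illustration of what "don't increase the karma of kernel-RQ jobs" could look like -- the types and field names below are hypothetical, not the actual amd scheduler code -- only user-queue jobs accumulate guilt while a hang is being handled, so kernel-queue VM work is always resubmitted after the reset.

#include <linux/types.h>

/* Sketch only -- not the real gpu_scheduler.c code. */
struct sched_job_sketch {
	bool from_kernel_rq;	/* submitted via the kernel run queue? */
	unsigned int karma;	/* how often this job was caught in a hang */
	bool signaled;		/* hw fence already signaled? */
};

#define SKETCH_HANG_LIMIT 2

/* Called for each pending job while handling a GPU reset; returns true if
 * the job should be skipped instead of re-run. */
static bool sketch_job_skip_on_recovery(struct sched_job_sketch *job)
{
	if (job->signaled)
		return true;			/* finished, nothing to redo */
	if (!job->from_kernel_rq)
		job->karma++;			/* only user jobs become guilty */
	return job->karma > SKETCH_HANG_LIMIT;	/* guilty jobs are dropped */
}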

BR  Monk

-----Original Message-----
From: Koenig, Christian 
Sent: 2017-10-30 15:37
To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset

>> I can't see any difference between the handling of existing VMs and newly created ones.
> I know; for existing VMs we still have similar problems. I'm not saying this patch solves the existing-VM problem ...
>
> My earliest patch series actually used an approach that can 100% avoid such problems:
> take an RW lock around drm_ioctl() and gpu_recover(); drm_ioctl() takes the read lock, and gpu_recover() takes the write lock.
> But you gave a NAK on that approach, so I want to hear your idea.
Well, what I wanted to say is we simply don't have a problem here. The handling for new VMs is exactly the same as for existing VMs.

Regarding the lock that is still a big NAK, we just fixed the
(hopefully) last deadlock between IOCTLs and GPU reset. Introducing any locks would cause problems with that once more.

What we can do is use a struct completion to prevent new changes from piling up while we are in the GPU reset code.

> The thing is, some VM activity does not go through the GPU scheduler (it is submitted directly); if it is interrupted by gpu_recover() it is not going to be re-scheduled ...
That is not correct. All VM activity goes through the GPU scheduler as well. So we are certainly going to re-schedule everything.

>> So we need to handle this gracefully anyway, Christian.
> Yeah, I'd like to hear it.
As I said we don't need to do anything, the handling is correct as it is right now.

E.g. we have a shadow copy of the page tables in system memory to recover after VRAM lost and reschedule any pending operation after GPU reset.

That should be sufficient to handle all existing VM cases and even when it's not we need to work on the general handling and not add any checks to the IOCTL code which prevents applications from loading the driver.

Regards,
Christian.

On 30.10.2017 at 04:47, Liu, Monk wrote:
>> I can't see any difference between the handling of existing VMs and newly created ones.
> I know; for existing VMs we still have similar problems. I'm not saying this patch solves the existing-VM problem ...
>
> My earliest patch series actually used an approach that can 100% avoid such problems:
> take an RW lock around drm_ioctl() and gpu_recover(); drm_ioctl() takes the read lock, and gpu_recover() takes the write lock.
> But you gave a NAK on that approach, so I want to hear your idea.
>
>> Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
> The thing is, some VM activity does not go through the GPU scheduler (it is submitted directly); if it is interrupted by gpu_recover() it is not going to be re-scheduled ...
>
>
>> So we need to handle this gracefully anyway, Christian.
> Yeah, I'd like to hear it.
>
>
> BR Monk
>
>
>
> -----Original Message-----
> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
> Sent: 2017-10-26 23:15
> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian 
> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
> On 26.10.2017 at 13:08, Liu, Monk wrote:
>> "Clear operation on the page table " is some kind of SDMA activity right? What if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
> I can't see any difference between the handling of existing VMs and newly created ones.
>
> Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
>
> So we need to handle this gracefully anyway, Christian.
>
>> BR Monk
>>
>> -----Original Message-----
>> From: Koenig, Christian
>> Sent: 2017-10-26 18:54
>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>
>>> if we don't block device open while the GPU is doing recovery, the VM init
>>> (SDMA working on page table creation) would be ruined by the ASIC RESET
>> That is not a problem at all. SDMA just does some clear operation on the page tables and those are either recovered from the shadow or run after the reset.
>>
>> Regards,
>> Christian.
>>
>> On 26.10.2017 at 10:17, Liu, Monk wrote:
>>> When the amdgpu_gpu_recover() routine is in flight, we shouldn't let UMD open our device; otherwise the VM init would be ruined by gpu_recover().
>>>
>>> e.g. VM init needs to create page tables, but keep in mind that
>>> gpu_recover() calls ASIC RESET,
>>>
>>> if we don't block device open while the GPU is doing recovery, the VM init
>>> (SDMA working on page table creation) would be ruined by the ASIC RESET
>>>
>>> do you have any good solution? the key point is to
>>> avoid/delay/push back HW activities from the UMD side while we are
>>> running in the gpu_recover() function
>>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>> Sent: 2017-10-26 15:18
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>>
>>> NAK, why the heck should we do this? It would just block all new processes from using the device.
>>>
>>> Christian.
>>>
>>> On 25.10.2017 at 11:22, Monk Liu wrote:
>>>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>> ---
>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>>      1 file changed, 3 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> index 4a9f749..c155ce4 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>>      	if (r < 0)
>>>>      		return r;
>>>>      
>>>> +	if (adev->in_gpu_reset)
>>>> +		return -ENODEV;
>>>> +
>>>>      	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>>>      	if (unlikely(!fpriv)) {
>>>>      		r = -ENOMEM;
>

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
       [not found]                                     ` <SN1PR12MB04629FAC65DFD5F133D2E34484590-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
@ 2017-10-30  9:20                                       ` Christian König
  0 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2017-10-30  9:20 UTC (permalink / raw)
  To: Liu, Monk, Koenig, Christian, amd-gfx-PD4FTy7X32lNgt0PjOBp9y5qC8QIuHrW

The only things that are not pushed through the scheduler are the ring 
and IB tests, and those only because we want to use them directly in the 
GPU reset code to check whether the hardware works properly again.

The VM updates have a separate queue for each VM, but the initial clear 
goes through the kernel queue which is also used for buffer moves during 
eviction. But both always go through the scheduler.
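
To illustrate the point with a sketch (the names are made up, not the real amdgpu functions): whether a VM update uses the per-VM entity or the kernel entity used for clears and buffer moves, it is still pushed to the scheduler rather than submitted directly, which is why it can be rescheduled after a reset.

/* Illustration only -- entity, field and function names are hypothetical. */
struct entity_sketch;
struct job_sketch;

int sketch_push_job(struct entity_sketch *entity, struct job_sketch *job);

struct dev_sketch {
	struct entity_sketch *kernel_entity;	/* clears, eviction moves */
};

struct vm_entity_sketch {
	struct entity_sketch *entity;		/* per-VM update queue */
};

static int sketch_submit_vm_update(struct dev_sketch *adev,
				   struct vm_entity_sketch *vm,
				   struct job_sketch *job, bool initial_clear)
{
	struct entity_sketch *e = initial_clear ? adev->kernel_entity : vm->entity;

	/* either way the job goes through the scheduler, never directly to HW */
	return sketch_push_job(e, job);
}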

Originally I used the VM queue for the initial clear as well, but I 
changed that because I thought we could save some complexity. It might be 
a good idea to revert that, because it is not as obvious and it also 
caused trouble with the ATC support on VG10.

Regards,
Christian.

On 30.10.2017 at 08:54, Liu, Monk wrote:
>> That is not correct. All VM activity goes through the GPU scheduler as well. So we are certainly going to re-schedule everything.
> I'll check on that. If everything goes through the scheduler, that will be fine; in that case simply not increasing the karma of jobs from the kernel RQ is enough.
> To prevent VM jobs from being corrupted by a GPU reset, we can re-schedule those jobs if they are not signaled (but that needs a non-zero timeout).
>
> BR  Monk
>
> -----Original Message-----
> From: Koenig, Christian
> Sent: 2017-10-30 15:37
> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>
>>> I can't see any difference between the handling of existing VMs and newly created ones.
>> I know; for existing VMs we still have similar problems. I'm not saying this patch solves the existing-VM problem ...
>>
>> My earliest patch series actually used an approach that can 100% avoid such problems:
>> take an RW lock around drm_ioctl() and gpu_recover(); drm_ioctl() takes the read lock, and gpu_recover() takes the write lock.
>> But you gave a NAK on that approach, so I want to hear your idea.
> Well, what I wanted to say is we simply don't have a problem here. The handling for new VMs is exactly the same as for existing VMs.
>
> Regarding the lock that is still a big NAK, we just fixed the
> (hopefully) last deadlock between IOCTLs and GPU reset. Introducing any locks would cause problems with that once more.
>
> What we can do is use a struct completion to prevent new changes from piling up while we are in the GPU reset code.
>
>> The thing is, some VM activity does not go through the GPU scheduler (it is submitted directly); if it is interrupted by gpu_recover() it is not going to be re-scheduled ...
> That is not correct. All VM activity goes through the GPU scheduler as well. So we are certainly going to re-schedule everything.
>
>>> So we need to handle this gracefully anyway, Christian.
>> Yeah, I'd like to hear it.
> As I said we don't need to do anything, the handling is correct as it is right now.
>
> E.g. we have a shadow copy of the page tables in system memory to recover after VRAM lost and reschedule any pending operation after GPU reset.
>
> That should be sufficient to handle all existing VM cases and even when it's not we need to work on the general handling and not add any checks to the IOCTL code which prevents applications from loading the driver.
>
> Regards,
> Christian.
>
> On 30.10.2017 at 04:47, Liu, Monk wrote:
>>> I can't see any difference between the handling of existing VMs and newly created ones.
>> I know; for existing VMs we still have similar problems. I'm not saying this patch solves the existing-VM problem ...
>>
>> My earliest patch series actually used an approach that can 100% avoid such problems:
>> take an RW lock around drm_ioctl() and gpu_recover(); drm_ioctl() takes the read lock, and gpu_recover() takes the write lock.
>> But you gave a NAK on that approach, so I want to hear your idea.
>>
>>> Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
>> The thing is, some VM activity does not go through the GPU scheduler (it is submitted directly); if it is interrupted by gpu_recover() it is not going to be re-scheduled ...
>>
>>
>>> So we need to handle this gracefully anyway, Christian.
>> Yeah, I'd like to hear it.
>>
>>
>> BR Monk
>>
>>
>>
>> -----Original Message-----
>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>> Sent: 2017-10-26 23:15
>> To: Liu, Monk <Monk.Liu@amd.com>; Koenig, Christian
>> <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>
>> On 26.10.2017 at 13:08, Liu, Monk wrote:
>>> "Clear operation on the page table " is some kind of SDMA activity right? What if ASIC RESET from amd_gpu_recover() interrupted this activity in fly ???
>> I can't see any difference between the handling of existing VMs and newly created ones.
>>
>> Either we have correct handling and can redo the activity or we have corrupted VM page tables and crash again immediately.
>>
>> So we need to handle this gracefully anyway, Christian.
>>
>>> BR Monk
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian
>>> Sent: 2017-10-26 18:54
>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>>
>>>> if we don't block device open while the GPU is doing recovery, the VM init
>>>> (SDMA working on page table creation) would be ruined by the ASIC RESET
>>> That is not a problem at all. SDMA just does some clear operation on the page tables and those are either recovered from the shadow or run after the reset.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 26.10.2017 at 10:17, Liu, Monk wrote:
>>>> When the amdgpu_gpu_recover() routine is in flight, we shouldn't let UMD open our device; otherwise the VM init would be ruined by gpu_recover().
>>>>
>>>> e.g. VM init needs to create page tables, but keep in mind that
>>>> gpu_recover() calls ASIC RESET,
>>>>
>>>> if we don't block device open while the GPU is doing recovery, the VM init
>>>> (SDMA working on page table creation) would be ruined by the ASIC RESET
>>>>
>>>> do you have any good solution? the key point is to
>>>> avoid/delay/push back HW activities from the UMD side while we are
>>>> running in the gpu_recover() function
>>>>
>>>> BR Monk
>>>>
>>>> -----Original Message-----
>>>> From: Christian König [mailto:ckoenig.leichtzumerken@gmail.com]
>>>> Sent: 2017-10-26 15:18
>>>> To: Liu, Monk <Monk.Liu@amd.com>; amd-gfx@lists.freedesktop.org
>>>> Subject: Re: [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset
>>>>
>>>> NAK, why the heck should we do this? It would just block all new processes from using the device.
>>>>
>>>> Christian.
>>>>
>>>> On 25.10.2017 at 11:22, Monk Liu wrote:
>>>>> Change-Id: Ibdb0ea9e3769d572fbbc13bbf1ef73f1af2ab7be
>>>>> Signed-off-by: Monk Liu <Monk.Liu@amd.com>
>>>>> ---
>>>>>       drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c | 3 +++
>>>>>       1 file changed, 3 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> index 4a9f749..c155ce4 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_kms.c
>>>>> @@ -813,6 +813,9 @@ int amdgpu_driver_open_kms(struct drm_device *dev, struct drm_file *file_priv)
>>>>>       	if (r < 0)
>>>>>       		return r;
>>>>>       
>>>>> +	if (adev->in_gpu_reset)
>>>>> +		return -ENODEV;
>>>>> +
>>>>>       	fpriv = kzalloc(sizeof(*fpriv), GFP_KERNEL);
>>>>>       	if (unlikely(!fpriv)) {
>>>>>       		r = -ENOMEM;


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2017-10-30  9:20 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-25  9:22 [PATCH 0/9] *** patch series for new gpu recover*** Monk Liu
     [not found] ` <1508923343-24404-1-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-25  9:22   ` [PATCH 1/9] drm/amdgpu:cleanup hw fence get&put Monk Liu
     [not found]     ` <1508923343-24404-2-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-25 11:38       ` Christian König
2017-10-25  9:22   ` [PATCH 2/9] amd/scheduler:imple job skip feature Monk Liu
     [not found]     ` <1508923343-24404-3-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:06       ` Christian König
     [not found]         ` <79a7f228-3d63-e043-58c1-2986379b6296-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-26  8:11           ` Liu, Monk
2017-10-25  9:22   ` [PATCH 3/9] drm/amdgpu:implement new GPU recover(v2) Monk Liu
     [not found]     ` <1508923343-24404-4-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:11       ` Christian König
2017-10-25  9:22   ` [PATCH 4/9] drm/amdgpu:replace deprecated gpu reset Monk Liu
     [not found]     ` <1508923343-24404-5-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:13       ` Christian König
     [not found]         ` <13ad16b1-6357-1f0e-3cdd-ddc42d4462b0-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-26  8:13           ` Liu, Monk
     [not found]             ` <BLUPR12MB04499B9B8F278BE639D144A184450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-26  8:16               ` Christian König
2017-10-25  9:22   ` [PATCH 5/9] drm/amdgpu:cleanup in_sriov_reset and lock_reset Monk Liu
     [not found]     ` <1508923343-24404-6-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:14       ` Christian König
2017-10-25  9:22   ` [PATCH 6/9] drm/amdgpu:cleanup ucode_init_bo Monk Liu
     [not found]     ` <1508923343-24404-7-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:17       ` Christian König
2017-10-25  9:22   ` [PATCH 7/9] drm/amdgpu:block kms open during gpu_reset Monk Liu
     [not found]     ` <1508923343-24404-8-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:18       ` Christian König
     [not found]         ` <46bc9872-0bd7-c2b7-4879-6cab5789f252-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-26  8:17           ` Liu, Monk
     [not found]             ` <BLUPR12MB0449D707ECAE76FFAA13C46684450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-26 10:54               ` Christian König
     [not found]                 ` <06a765e3-971e-698c-6d07-1dee69c1e0c8-5C7GfCeVMHo@public.gmane.org>
2017-10-26 11:08                   ` Liu, Monk
     [not found]                     ` <BLUPR12MB044997E32FD02ACDBD978C7084450-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-26 15:15                       ` Christian König
     [not found]                         ` <46a5317f-e606-5cfe-82c9-46df64818435-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2017-10-30  3:47                           ` Liu, Monk
     [not found]                             ` <BLUPR12MB0449DD842E46A942CF53484084590-7LeqcoF/hwpTIQvHjXdJlwdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-30  7:37                               ` Christian König
     [not found]                                 ` <f2213f71-9280-bb51-5283-5349878956f7-5C7GfCeVMHo@public.gmane.org>
2017-10-30  7:54                                   ` Liu, Monk
     [not found]                                     ` <SN1PR12MB04629FAC65DFD5F133D2E34484590-z7L1TMIYDg7VaWpRXmIMygdYzm3356FpvxpqHgZTriW3zl9H0oFU5g@public.gmane.org>
2017-10-30  9:20                                       ` Christian König
2017-10-25  9:22   ` [PATCH 8/9] drm/amdgpu/sriov:fix memory leak in psp_load_fw Monk Liu
     [not found]     ` <1508923343-24404-9-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:21       ` Christian König
2017-10-25  9:22   ` [PATCH 9/9] drm/amdgpu:fix random missing of FLR NOTIFY Monk Liu
     [not found]     ` <1508923343-24404-10-git-send-email-Monk.Liu-5C7GfCeVMHo@public.gmane.org>
2017-10-26  7:22       ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.