* [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
@ 2022-09-14 10:10 Victor Zhao
  2022-09-14 10:10 ` [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume Victor Zhao
  2022-09-15  4:02 ` [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Zhao, Victor
  0 siblings, 2 replies; 21+ messages in thread
From: Victor Zhao @ 2022-09-14 10:10 UTC (permalink / raw)
  To: amd-gfx; +Cc: emily.deng, Victor Zhao, Andrey.Grodzovsky

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute),
jobs from another ring (e.g. gfx) may continue signaling during
the drm_sched_stop stage. The signaled bit will not be cleared.

At the resubmit stage after recovery, a job whose hw fence has the
signaled bit set will call job done directly instead of going through
fence processing. This decreases hw_rq_count but leaves the rcu fence
pointer uncleared.

The fence driver slots then overflow, and some jobs may be skipped,
leaving their rcu pointers uncleared, which causes an infinite wait
for the slot when the next fence is emitted.

This infinite wait causes a job timeout on the emitting job, and the
driver gets stuck at its sched stop step because kthread_park cannot
complete.

[how]
1. move amdgpu_fence_driver_isr_toggle earlier to disable the interrupt
before drm sched stop
2. handle all fences in fence processing to avoid skips when overflow
happens
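
As a standalone sketch of the skip in point 2 (illustration only, with
made-up numbers; plain C, not the driver code itself): assume an 8-slot
fence ring (num_fences_mask == 7) and 10 fences completing in one
interrupt.

#include <stdio.h>

int main(void)
{
	const unsigned int mask = 7;		/* 8 fence slots */
	unsigned int last_seq = 5, seq = 15;	/* 10 fences signaled at once */
	unsigned int handled = 0;

	/* current scheme: mask both counters up front and walk the masked
	 * values; the walk stops after (7 - 5) = 2 steps */
	unsigned int l = last_seq & mask, s = seq & mask;
	do {
		l = (l + 1) & mask;
		handled++;
	} while (l != s);
	printf("masked walk handled %u of 10 fences\n", handled);	/* 2 */

	/* patched scheme: walk the raw sequence numbers and mask only when
	 * indexing the slot array; nothing is skipped */
	handled = 0;
	do {
		++last_seq;	/* would signal fences[last_seq & mask] */
		handled++;
	} while (last_seq != seq);
	printf("raw walk handled %u of 10 fences\n", handled);		/* 10 */
	return 0;
}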

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
 	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
 
+		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		atomic_inc(&tmp_adev->gpu_reset_counter);
 	}
 
-	if (need_emergency_restart)
+	if (need_emergency_restart) {
+		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+		}
 		goto skip_sched_resume;
+	}
 
 	/*
 	 * Must check guilty signal here since after this point all old
@@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (job && dma_fence_is_signaled(&job->hw_fence)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
+		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+		}
 		goto skip_hw_reset;
 	}
 
@@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r && r == -EAGAIN) {
 			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
 			adev->asic_reset_res = 0;
+			amdgpu_fence_driver_isr_toggle(adev, true);
 			goto retry;
 		}
 	}
@@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
 	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
 
+	amdgpu_fence_driver_isr_toggle(adev, true);
+
 	adev->no_hw_access = true;
 	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
 	adev->no_hw_access = false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..65a877e1a7fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
 	if (unlikely(seq == last_seq))
 		return false;
 
-	last_seq &= drv->num_fences_mask;
-	seq &= drv->num_fences_mask;
-
 	do {
 		struct dma_fence *fence, **ptr;
 
 		++last_seq;
-		last_seq &= drv->num_fences_mask;
-		ptr = &drv->fences[last_seq];
+		ptr = &drv->fences[last_seq & drv->num_fences_mask];
 
 		/* There is always exactly one thread signaling this fence slot */
 		fence = rcu_dereference_protected(*ptr, 1);
-- 
2.25.1



* [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume
  2022-09-14 10:10 [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Victor Zhao
@ 2022-09-14 10:10 ` Victor Zhao
  2022-09-15  5:57   ` Lazar, Lijo
  2022-09-15  4:02 ` [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Zhao, Victor
  1 sibling, 1 reply; 21+ messages in thread
From: Victor Zhao @ 2022-09-14 10:10 UTC (permalink / raw)
  To: amd-gfx; +Cc: emily.deng, Victor Zhao, Andrey.Grodzovsky

[background]
On the current sienna cichlid mode2 reset, in the slow job hang cases,
since the page table context is reverted to completely stop the gpu,
a page fault interrupt will be generated.

Since the irqs are enabled during the recovery stage, if this interrupt
is still being processed during the ih resume step (incrementing the ih
ring rptr) while ih resume sets rptr and wptr to 0, this may cause rptr
to become greater than wptr. Such a case is not handled in ih
processing, and it causes rptr to keep increasing until it reaches the
max, which makes the fence fallback situation happen.

[how]
Move the irq enable to after ih is resumed and before the ib test,
and adjust the position of the irq enable on the other reset paths
accordingly.
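
A toy model of the rptr/wptr race described above (illustration only,
with hypothetical numbers; plain C, not the actual amdgpu IH code):

#include <stdio.h>

#define RING_ENTRIES	1024u	/* made-up ih ring size */
#define BATCH		32u	/* made-up batch of processed entries */

int main(void)
{
	unsigned int rptr, wptr, steps = 0;

	/* ih resume: both ring pointers are reset to 0 */
	rptr = 0;
	wptr = 0;

	/* late write-back from the interrupt that was mid-processing */
	rptr = BATCH;	/* now rptr > wptr */

	/* the consumer only catches up with wptr again after walking the
	 * rest of the ring and wrapping at the maximum */
	while (rptr != wptr) {
		rptr = (rptr + BATCH) % RING_ENTRIES;
		steps++;
	}
	printf("walked %u stale batches before wrapping back to wptr\n",
	       steps);	/* 31 with these numbers */
	return 0;
}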

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
 drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index c0cfae52f12b..0b658225e9ef 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_fence_driver_force_completion(ring);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, false);
-
 	if (job && job->vm)
 		drm_sched_increase_karma(&job->base);
 
@@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
 		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
 	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, &reset_context->flags);
 
+	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+	}
+
 	/*
 	 * ASIC reset has to be done on all XGMI hive nodes ASAP
 	 * to allow proper links negotiation in FW (within 1 sec)
@@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
 			/* Clear this failed job from fence array */
 			amdgpu_fence_driver_clear_job_fences(ring);
 
-			amdgpu_fence_driver_isr_toggle(adev, false);
-
 			/* Since the job won't signal and we go for
 			 * another resubmit drop this parent pointer
 			 */
diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
index 7aa570c1ce4a..953036482d1f 100644
--- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
+++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
@@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
 	* Add this ASIC as tracked as reset was already
 	* complete successfully.
 	*/
+	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
 	amdgpu_register_gpu_instance(tmp_adev);
 
 	/* Resume RAS */
-- 
2.25.1



* RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-14 10:10 [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Victor Zhao
  2022-09-14 10:10 ` [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume Victor Zhao
@ 2022-09-15  4:02 ` Zhao, Victor
  2022-09-15  6:31   ` Christian König
  1 sibling, 1 reply; 21+ messages in thread
From: Zhao, Victor @ 2022-09-15  4:02 UTC (permalink / raw)
  To: Zhao, Victor, amd-gfx, Grodzovsky, Andrey, Koenig, Christian; +Cc: Deng, Emily

[AMD Official Use Only - General]

Ping.

Hi @Koenig, Christian and @Grodzovsky, Andrey,

We found some reset-related issues during stress tests on this sequence. Please help give some comments.


Thanks,
Victor



-----Original Message-----
From: Victor Zhao <Victor.Zhao@amd.com> 
Sent: Wednesday, September 14, 2022 6:10 PM
To: amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

[background]
For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. The signaled bit will not be cleared.

At the resubmit stage after recovery, a job whose hw fence has the signaled bit set will call job done directly instead of going through fence processing.
This decreases hw_rq_count but leaves the rcu fence pointer uncleared.

The fence driver slots then overflow, and some jobs may be skipped, leaving their rcu pointers uncleared, which causes an infinite wait for the slot when the next fence is emitted.

This infinite wait causes a job timeout on the emitting job, and the driver gets stuck at its sched stop step because kthread_park cannot complete.

[how]
1. move amdgpu_fence_driver_isr_toggle earlier to disable the interrupt before drm sched stop
2. handle all fences in fence processing to avoid skips when overflow happens

Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
 2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 943c9e750575..c0cfae52f12b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
 		amdgpu_virt_fini_data_exchange(adev);
 	}
 
-	amdgpu_fence_driver_isr_toggle(adev, true);
-
 	/* block all schedulers and reset given job's ring */
 	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		      amdgpu_device_ip_need_full_reset(tmp_adev))
 			amdgpu_ras_suspend(tmp_adev);
 
+		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
+
 		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
 			struct amdgpu_ring *ring = tmp_adev->rings[i];
 
@@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		atomic_inc(&tmp_adev->gpu_reset_counter);
 	}
 
-	if (need_emergency_restart)
+	if (need_emergency_restart) {
+		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+		}
 		goto skip_sched_resume;
+	}
 
 	/*
 	 * Must check guilty signal here since after this point all old
@@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 	if (job && dma_fence_is_signaled(&job->hw_fence)) {
 		job_signaled = true;
 		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
+		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
+			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
+		}
 		goto skip_hw_reset;
 	}
 
@@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
 		if (r && r == -EAGAIN) {
 			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
 			adev->asic_reset_res = 0;
+			amdgpu_fence_driver_isr_toggle(adev, true);
 			goto retry;
 		}
 	}
@@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
 	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
 	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
 
+	amdgpu_fence_driver_isr_toggle(adev, true);
+
 	adev->no_hw_access = true;
 	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
 	adev->no_hw_access = false;
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..65a877e1a7fc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
 	if (unlikely(seq == last_seq))
 		return false;
 
-	last_seq &= drv->num_fences_mask;
-	seq &= drv->num_fences_mask;
-
 	do {
 		struct dma_fence *fence, **ptr;
 
 		++last_seq;
-		last_seq &= drv->num_fences_mask;
-		ptr = &drv->fences[last_seq];
+		ptr = &drv->fences[last_seq & drv->num_fences_mask];
 
 		/* There is always exactly one thread signaling this fence slot */
 		fence = rcu_dereference_protected(*ptr, 1);
--
2.25.1


* Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume
  2022-09-14 10:10 ` [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume Victor Zhao
@ 2022-09-15  5:57   ` Lazar, Lijo
  2022-09-15  6:38     ` Zhao, Victor
  0 siblings, 1 reply; 21+ messages in thread
From: Lazar, Lijo @ 2022-09-15  5:57 UTC (permalink / raw)
  To: Victor Zhao, amd-gfx; +Cc: emily.deng, Andrey.Grodzovsky



On 9/14/2022 3:40 PM, Victor Zhao wrote:
> [background]
> On the current sienna cichlid mode2 reset, in the slow job hang cases,
> since the page table context is reverted to completely stop the gpu,
> a page fault interrupt will be generated.
> 
> Since the irqs are enabled during the recovery stage, if this interrupt
> is still being processed during the ih resume step (incrementing the ih
> ring rptr) while ih resume sets rptr and wptr to 0, this may cause

AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before 
mode-2. Why do you resume IH after mode-2 when it is not suspended? Is 
it a special case for virtualization?

Thanks,
Lijo

> rptr to become greater than wptr. Such a case is not handled in ih
> processing, and it causes rptr to keep increasing until it reaches the
> max, which makes the fence fallback situation happen.
> 
> [how]
> Move the irq enable to after ih is resumed and before the ib test,
> and adjust the position of the irq enable on the other reset paths
> accordingly.
> 
> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
>   drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
>   2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c0cfae52f12b..0b658225e9ef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_fence_driver_force_completion(ring);
>   	}
>   
> -	amdgpu_fence_driver_isr_toggle(adev, false);
> -
>   	if (job && job->vm)
>   		drm_sched_increase_karma(&job->base);
>   
> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>   		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>   	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, &reset_context->flags);
>   
> +	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +	}
> +
>   	/*
>   	 * ASIC reset has to be done on all XGMI hive nodes ASAP
>   	 * to allow proper links negotiation in FW (within 1 sec)
> @@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>   			/* Clear this failed job from fence array */
>   			amdgpu_fence_driver_clear_job_fences(ring);
>   
> -			amdgpu_fence_driver_isr_toggle(adev, false);
> -
>   			/* Since the job won't signal and we go for
>   			 * another resubmit drop this parent pointer
>   			 */
> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> index 7aa570c1ce4a..953036482d1f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
>   	* Add this ASIC as tracked as reset was already
>   	* complete successfully.
>   	*/
> +	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>   	amdgpu_register_gpu_instance(tmp_adev);
>   
>   	/* Resume RAS */
> 


* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15  4:02 ` [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Zhao, Victor
@ 2022-09-15  6:31   ` Christian König
  2022-09-15 10:09     ` Zhao, Victor
  0 siblings, 1 reply; 21+ messages in thread
From: Christian König @ 2022-09-15  6:31 UTC (permalink / raw)
  To: Zhao, Victor, amd-gfx, Grodzovsky, Andrey; +Cc: Deng, Emily



On 15.09.22 at 06:02, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Ping.
>
> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>
> We found some reset-related issues during stress tests on this sequence. Please help give some comments.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Victor Zhao <Victor.Zhao@amd.com>
> Sent: Wednesday, September 14, 2022 6:10 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
> [background]
> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. The signaled bit will not be cleared.
>
> At the resubmit stage after recovery, a job whose hw fence has the signaled bit set will call job done directly instead of going through fence processing.
> This decreases hw_rq_count but leaves the rcu fence pointer uncleared.
>
> The fence driver slots then overflow, and some jobs may be skipped, leaving their rcu pointers uncleared, which causes an infinite wait for the slot when the next fence is emitted.
>
> This infinite wait causes a job timeout on the emitting job, and the driver gets stuck at its sched stop step because kthread_park cannot complete.
>
> [how]
> 1. move amdgpu_fence_driver_isr_toggle earlier to disable the interrupt before drm sched stop
> 2. handle all fences in fence processing to avoid skips when overflow happens
>
> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>   2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 943c9e750575..c0cfae52f12b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_virt_fini_data_exchange(adev);
>   	}
>   
> -	amdgpu_fence_driver_isr_toggle(adev, true);
> -
>   	/* block all schedulers and reset given job's ring */
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];
> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		      amdgpu_device_ip_need_full_reset(tmp_adev))
>   			amdgpu_ras_suspend(tmp_adev);
>   
> +		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
> +
>   		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   			struct amdgpu_ring *ring = tmp_adev->rings[i];
>   
> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		atomic_inc(&tmp_adev->gpu_reset_counter);
>   	}
>   
> -	if (need_emergency_restart)
> +	if (need_emergency_restart) {
> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +		}
>   		goto skip_sched_resume;
> +	}
>   
>   	/*
>   	 * Must check guilty signal here since after this point all old
> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	if (job && dma_fence_is_signaled(&job->hw_fence)) {
>   		job_signaled = true;
>   		dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +		}
>   		goto skip_hw_reset;
>   	}
>   
> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		if (r && r == -EAGAIN) {
>   			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>   			adev->asic_reset_res = 0;
> +			amdgpu_fence_driver_isr_toggle(adev, true);
>   			goto retry;
>   		}
>   	}
> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>   	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>   	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>   
> +	amdgpu_fence_driver_isr_toggle(adev, true);
> +
>   	adev->no_hw_access = true;
>   	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>   	adev->no_hw_access = false;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 8adeb7469f1e..65a877e1a7fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>   	if (unlikely(seq == last_seq))
>   		return false;
>   
> -	last_seq &= drv->num_fences_mask;
> -	seq &= drv->num_fences_mask;
> -
>   	do {
>   		struct dma_fence *fence, **ptr;
>   
>   		++last_seq;
> -		last_seq &= drv->num_fences_mask;
> -		ptr = &drv->fences[last_seq];
> +		ptr = &drv->fences[last_seq & drv->num_fences_mask];
>   
>   		/* There is always exactly one thread signaling this fence slot */
>   		fence = rcu_dereference_protected(*ptr, 1);

Those changes here don't seem to make sense. Please explain further
why they are necessary.

Christian.

> --
> 2.25.1



* RE: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume
  2022-09-15  5:57   ` Lazar, Lijo
@ 2022-09-15  6:38     ` Zhao, Victor
  2022-09-15  7:59       ` Lazar, Lijo
  0 siblings, 1 reply; 21+ messages in thread
From: Zhao, Victor @ 2022-09-15  6:38 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Deng, Emily, Grodzovsky, Andrey

[AMD Official Use Only - General]

Hi Lijo,

IH resume was added to resolve an issue found during mode2 bring-up on sienna cichlid:
- close down mode2 reset and do a mode1 reset first
- open mode2 reset and do a mode2 reset. Mode2 reset was found to fail in this case.

Resuming IH helps in this case.


Thanks,
Victor



-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@amd.com> 
Sent: Thursday, September 15, 2022 1:58 PM
To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume



On 9/14/2022 3:40 PM, Victor Zhao wrote:
> [background]
> On the current sienna cichlid mode2 reset, in the slow job hang cases,
> since the page table context is reverted to completely stop the gpu,
> a page fault interrupt will be generated.
> 
> Since the irqs are enabled during the recovery stage, if this interrupt
> is still being processed during the ih resume step (incrementing the ih
> ring rptr) while ih resume sets rptr and wptr to 0, this may cause

AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization?

Thanks,
Lijo

> rptr to become greater than wptr. Such a case is not handled in ih
> processing, and it causes rptr to keep increasing until it reaches the
> max, which makes the fence fallback situation happen.
> 
> [how]
> Move the irq enable to after ih is resumed and before the ib test,
> and adjust the position of the irq enable on the other reset paths
> accordingly.
> 
> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
>   drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
>   2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index c0cfae52f12b..0b658225e9ef 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_fence_driver_force_completion(ring);
>   	}
>   
> -	amdgpu_fence_driver_isr_toggle(adev, false);
> -
>   	if (job && job->vm)
>   		drm_sched_increase_karma(&job->base);
>   
> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>   		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>   	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, 
> &reset_context->flags);
>   
> +	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +	}
> +
>   	/*
>   	 * ASIC reset has to be done on all XGMI hive nodes ASAP
>   	 * to allow proper links negotiation in FW (within 1 sec)
> @@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>   			/* Clear this failed job from fence array */
>   			amdgpu_fence_driver_clear_job_fences(ring);
>   
> -			amdgpu_fence_driver_isr_toggle(adev, false);
> -
>   			/* Since the job won't signal and we go for
>   			 * another resubmit drop this parent pointer
>   			 */
> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c 
> b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> index 7aa570c1ce4a..953036482d1f 100644
> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
>   	* Add this ASIC as tracked as reset was already
>   	* complete successfully.
>   	*/
> +	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>   	amdgpu_register_gpu_instance(tmp_adev);
>   
>   	/* Resume RAS */
> 


* Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume
  2022-09-15  6:38     ` Zhao, Victor
@ 2022-09-15  7:59       ` Lazar, Lijo
  2022-09-15 10:11         ` Zhao, Victor
  0 siblings, 1 reply; 21+ messages in thread
From: Lazar, Lijo @ 2022-09-15  7:59 UTC (permalink / raw)
  To: Zhao, Victor, amd-gfx; +Cc: Deng, Emily, Grodzovsky, Andrey



On 9/15/2022 12:08 PM, Zhao, Victor wrote:
> [AMD Official Use Only - General]
> 
> Hi Lijo,
> 
> IH resume was added to resolve an issue found during mode2 bring-up on sienna cichlid:
> - close down mode2 reset and do a mode1 reset first
> - open mode2 reset and do a mode2 reset. Mode2 reset was found to fail in this case.
> 
> Resuming IH helps in this case.
> 

Sorry, what do you mean by 'close down' mode2 /'open mode2 reset'? Do 
you mean if mode-1 reset is done first, a subsequent mode-2 reset 
doesn't work without IH resume?

Thanks,
Lijo

> 
> Thanks,
> Victor
> 
> 
> 
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Thursday, September 15, 2022 1:58 PM
> To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume
> 
> 
> 
> On 9/14/2022 3:40 PM, Victor Zhao wrote:
>> [background]
>> On the current sienna cichlid mode2 reset, in the slow job hang cases,
>> since the page table context is reverted to completely stop the gpu,
>> a page fault interrupt will be generated.
>>
>> Since the irqs are enabled during the recovery stage, if this interrupt
>> is still being processed during the ih resume step (incrementing the ih
>> ring rptr) while ih resume sets rptr and wptr to 0, this may cause
> 
> AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization?
> 
> Thanks,
> Lijo
> 
>> rptr to become greater than wptr. Such a case is not handled in ih
>> processing, and it causes rptr to keep increasing until it reaches the
>> max, which makes the fence fallback situation happen.
>>
>> [how]
>> Move the irq enable to after ih is resumed and before the ib test,
>> and adjust the position of the irq enable on the other reset paths
>> accordingly.
>>
>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
>>    drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
>>    2 files changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index c0cfae52f12b..0b658225e9ef 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>    		amdgpu_fence_driver_force_completion(ring);
>>    	}
>>    
>> -	amdgpu_fence_driver_isr_toggle(adev, false);
>> -
>>    	if (job && job->vm)
>>    		drm_sched_increase_karma(&job->base);
>>    
>> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>    		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>>    	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET,
>> &reset_context->flags);
>>    
>> +	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +	}
>> +
>>    	/*
>>    	 * ASIC reset has to be done on all XGMI hive nodes ASAP
>>    	 * to allow proper links negotiation in FW (within 1 sec)
>> @@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>    			/* Clear this failed job from fence array */
>>    			amdgpu_fence_driver_clear_job_fences(ring);
>>    
>> -			amdgpu_fence_driver_isr_toggle(adev, false);
>> -
>>    			/* Since the job won't signal and we go for
>>    			 * another resubmit drop this parent pointer
>>    			 */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> index 7aa570c1ce4a..953036482d1f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
>>    	* Add this ASIC as tracked as reset was already
>>    	* complete successfully.
>>    	*/
>> +	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>    	amdgpu_register_gpu_instance(tmp_adev);
>>    
>>    	/* Resume RAS */
>>


* RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15  6:31   ` Christian König
@ 2022-09-15 10:09     ` Zhao, Victor
  2022-09-15 11:29       ` Christian König
  2022-09-15 18:29       ` Andrey Grodzovsky
  0 siblings, 2 replies; 21+ messages in thread
From: Zhao, Victor @ 2022-09-15 10:09 UTC (permalink / raw)
  To: Koenig, Christian, amd-gfx, Grodzovsky,  Andrey; +Cc: Deng, Emily

[AMD Official Use Only - General]

Hi Christian,

The test sequence is executing a compute engine hang while running a lot of containers submitting gfx jobs. We have advanced tdr mode and mode2 reset enabled on the driver.
When a compute hang job timeout happens, the 2 jobs on the gfx pending list may be signaled after drm_sched_stop, so they will not be removed from the pending list but will have DMA_FENCE_FLAG_SIGNALED_BIT set.
At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and removed from the pending list.
At the resubmit step, the second job (with the signaled bit) will be resubmitted. Since it still has the signaled bit, drm_sched_job_done will be called directly. This decreases hw_rq_count, which allows more jobs to be emitted, but does not clean the fence_drv rcu ptr.
This results in an overflow in the fence_drv. Since we use num_fences_mask in amdgpu_fence_process, when overflow happens, the signaling of some jobs will be skipped, which results in an infinite wait for the fence_drv rcu ptr.

So closing the irq before sched_stop avoids jobs signaling after drm_sched_stop, and signaling jobs one by one in fence_process instead of using a mask handles the overflow situation.

Another fix could be to skip submitting jobs which have already signaled during the resubmit stage, which may look cleaner.

Please help give some advice.


Thanks,
Victor



-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com> 
Sent: Thursday, September 15, 2022 2:32 PM
To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Cc: Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow



On 15.09.22 at 06:02, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Ping.
>
> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>
> We found some reset-related issues during stress tests on this sequence. Please help give some comments.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Victor Zhao <Victor.Zhao@amd.com>
> Sent: Wednesday, September 14, 2022 6:10 PM
> To: amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
> [background]
> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. The signaled bit will not be cleared.
>
> At the resubmit stage after recovery, a job whose hw fence has the signaled bit set will call job done directly instead of going through fence processing.
> This decreases hw_rq_count but leaves the rcu fence pointer uncleared.
>
> The fence driver slots then overflow, and some jobs may be skipped, leaving their rcu pointers uncleared, which causes an infinite wait for the slot when the next fence is emitted.
>
> This infinite wait causes a job timeout on the emitting job, and the driver gets stuck at its sched stop step because kthread_park cannot complete.
>
> [how]
> 1. move amdgpu_fence_driver_isr_toggle earlier to disable the interrupt
> before drm sched stop
> 2. handle all fences in fence processing to avoid skips when overflow happens
>
> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>   2 files changed, 14 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 943c9e750575..c0cfae52f12b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>   		amdgpu_virt_fini_data_exchange(adev);
>   	}
>   
> -	amdgpu_fence_driver_isr_toggle(adev, true);
> -
>   	/* block all schedulers and reset given job's ring */
>   	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   		struct amdgpu_ring *ring = adev->rings[i];
> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		      amdgpu_device_ip_need_full_reset(tmp_adev))
>   			amdgpu_ras_suspend(tmp_adev);
>   
> +		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
> +
>   		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>   			struct amdgpu_ring *ring = tmp_adev->rings[i];
>   
> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		atomic_inc(&tmp_adev->gpu_reset_counter);
>   	}
>   
> -	if (need_emergency_restart)
> +	if (need_emergency_restart) {
> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +		}
>   		goto skip_sched_resume;
> +	}
>   
>   	/*
>   	 * Must check guilty signal here since after this point all old
> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   	if (job && dma_fence_is_signaled(&job->hw_fence)) {
>   		job_signaled = true;
>   		dev_info(adev->dev, "Guilty job already signaled, skipping HW 
> reset");
> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
> +		}
>   		goto skip_hw_reset;
>   	}
>   
> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>   		if (r && r == -EAGAIN) {
>   			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>   			adev->asic_reset_res = 0;
> +			amdgpu_fence_driver_isr_toggle(adev, true);
>   			goto retry;
>   		}
>   	}
> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>   	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>   	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>   
> +	amdgpu_fence_driver_isr_toggle(adev, true);
> +
>   	adev->no_hw_access = true;
>   	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>   	adev->no_hw_access = false;
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 8adeb7469f1e..65a877e1a7fc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>   	if (unlikely(seq == last_seq))
>   		return false;
>   
> -	last_seq &= drv->num_fences_mask;
> -	seq &= drv->num_fences_mask;
> -
>   	do {
>   		struct dma_fence *fence, **ptr;
>   
>   		++last_seq;
> -		last_seq &= drv->num_fences_mask;
> -		ptr = &drv->fences[last_seq];
> +		ptr = &drv->fences[last_seq & drv->num_fences_mask];
>   
>   		/* There is always exactly one thread signaling this fence slot */
>   		fence = rcu_dereference_protected(*ptr, 1);

Those changes here don't seem to make sense. Please explain further why they are necessary.

Christian.

> --
> 2.25.1


* RE: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume
  2022-09-15  7:59       ` Lazar, Lijo
@ 2022-09-15 10:11         ` Zhao, Victor
  0 siblings, 0 replies; 21+ messages in thread
From: Zhao, Victor @ 2022-09-15 10:11 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Deng, Emily, Grodzovsky, Andrey

[AMD Official Use Only - General]

Hi Lijo,

Yes, this is what I observed in sienna cichlid. 


Thanks,
Victor



-----Original Message-----
From: Lazar, Lijo <Lijo.Lazar@amd.com> 
Sent: Thursday, September 15, 2022 4:00 PM
To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume



On 9/15/2022 12:08 PM, Zhao, Victor wrote:
> [AMD Official Use Only - General]
> 
> Hi Lijo,
> 
> IH resume was added to resolve an issue found during mode2 bring-up on sienna cichlid:
> - close down mode2 reset and do a mode1 reset first
> - open mode2 reset and do a mode2 reset. Mode2 reset was found to fail in this case.
> 
> Resuming IH helps in this case.
> 

Sorry, what do you mean by 'close down' mode2 /'open mode2 reset'? Do you mean if mode-1 reset is done first, a subsequent mode-2 reset doesn't work without IH resume?

Thanks,
Lijo

> 
> Thanks,
> Victor
> 
> 
> 
> -----Original Message-----
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: Thursday, September 15, 2022 1:58 PM
> To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey 
> <Andrey.Grodzovsky@amd.com>
> Subject: Re: [PATCH 2/2] drm/amdgpu: move enable irq later to avoid 
> race with ih resume
> 
> 
> 
> On 9/14/2022 3:40 PM, Victor Zhao wrote:
>> [background]
>> On the current sienna cichlid mode2 reset, in the slow job hang cases,
>> since the page table context is reverted to completely stop the gpu,
>> a page fault interrupt will be generated.
>>
>> Since the irqs are enabled during the recovery stage, if this interrupt
>> is still being processed during the ih resume step (incrementing the ih
>> ring rptr) while ih resume sets rptr and wptr to 0, this may cause
> 
> AFAIK, only GFX/SDMA are affected by mode-2. IH is not suspended before mode-2. Why do you resume IH after mode-2 when it is not suspended? Is it a special case for virtualization?
> 
> Thanks,
> Lijo
> 
>> rptr to become greater than wptr. Such a case is not handled in ih
>> processing, and it causes rptr to keep increasing until it reaches the
>> max, which makes the fence fallback situation happen.
>>
>> [how]
>> Move the irq enable to after ih is resumed and before the ib test,
>> and adjust the position of the irq enable on the other reset paths
>> accordingly.
>>
>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c  | 8 ++++----
>>    drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c | 1 +
>>    2 files changed, 5 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index c0cfae52f12b..0b658225e9ef 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4625,8 +4625,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>    		amdgpu_fence_driver_force_completion(ring);
>>    	}
>>    
>> -	amdgpu_fence_driver_isr_toggle(adev, false);
>> -
>>    	if (job && job->vm)
>>    		drm_sched_increase_karma(&job->base);
>>    
>> @@ -4758,6 +4756,10 @@ int amdgpu_do_asic_reset(struct list_head *device_list_handle,
>>    		test_bit(AMDGPU_NEED_FULL_RESET, &reset_context->flags);
>>    	skip_hw_reset = test_bit(AMDGPU_SKIP_HW_RESET, 
>> &reset_context->flags);
>>    
>> +	list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +		amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +	}
>> +
>>    	/*
>>    	 * ASIC reset has to be done on all XGMI hive nodes ASAP
>>    	 * to allow proper links negotiation in FW (within 1 sec)
>> @@ -5031,8 +5033,6 @@ static void amdgpu_device_recheck_guilty_jobs(
>>    			/* Clear this failed job from fence array */
>>    			amdgpu_fence_driver_clear_job_fences(ring);
>>    
>> -			amdgpu_fence_driver_isr_toggle(adev, false);
>> -
>>    			/* Since the job won't signal and we go for
>>    			 * another resubmit drop this parent pointer
>>    			 */
>> diff --git a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> index 7aa570c1ce4a..953036482d1f 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/sienna_cichlid.c
>> @@ -240,6 +240,7 @@ sienna_cichlid_mode2_restore_hwcontext(struct amdgpu_reset_control *reset_ctl,
>>    	* Add this ASIC as tracked as reset was already
>>    	* complete successfully.
>>    	*/
>> +	amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>    	amdgpu_register_gpu_instance(tmp_adev);
>>    
>>    	/* Resume RAS */
>>


* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15 10:09     ` Zhao, Victor
@ 2022-09-15 11:29       ` Christian König
  2022-09-15 18:29       ` Andrey Grodzovsky
  1 sibling, 0 replies; 21+ messages in thread
From: Christian König @ 2022-09-15 11:29 UTC (permalink / raw)
  To: Zhao, Victor, amd-gfx, Grodzovsky, Andrey; +Cc: Deng, Emily

Hi Victor,

the advanced tdr mode is not a feature we want to enable in production,
nor keep around for much longer.

So would that issue happen without this as well? If not, it is rather
questionable whether we should look into fixing this in the first place.

Regards,
Christian.

On 15.09.22 at 12:09, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Christian,
>
> The test sequence is executing a compute engine hang while running a lot of containers submitting gfx jobs. We have advanced tdr mode and mode2 reset enabled on the driver.
> When a compute hang job timeout happens, the 2 jobs on the gfx pending list may be signaled after drm_sched_stop, so they will not be removed from the pending list but will have DMA_FENCE_FLAG_SIGNALED_BIT set.
> At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and removed from the pending list.
> At the resubmit step, the second job (with the signaled bit) will be resubmitted. Since it still has the signaled bit, drm_sched_job_done will be called directly. This decreases hw_rq_count, which allows more jobs to be emitted, but does not clean the fence_drv rcu ptr.
> This results in an overflow in the fence_drv. Since we use num_fences_mask in amdgpu_fence_process, when overflow happens, the signaling of some jobs will be skipped, which results in an infinite wait for the fence_drv rcu ptr.
>
> So closing the irq before sched_stop avoids jobs signaling after drm_sched_stop, and signaling jobs one by one in fence_process instead of using a mask handles the overflow situation.
>
> Another fix could be to skip submitting jobs which have already signaled during the resubmit stage, which may look cleaner.
>
> Please help give some advice.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Thursday, September 15, 2022 2:32 PM
> To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Deng, Emily <Emily.Deng@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
>
> On 15.09.22 at 06:02, Zhao, Victor wrote:
>> [AMD Official Use Only - General]
>>
>> Ping.
>>
>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>
>> We found some reset-related issues during stress tests on this sequence. Please help give some comments.
>>
>>
>> Thanks,
>> Victor
>>
>>
>>
>> -----Original Message-----
>> From: Victor Zhao <Victor.Zhao@amd.com>
>> Sent: Wednesday, September 14, 2022 6:10 PM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>
>> [background]
>> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. The signaled bit will not be cleared.
>>
>> At the resubmit stage after recovery, a job whose hw fence has the signaled bit set will call job done directly instead of going through fence processing.
>> This decreases hw_rq_count but leaves the rcu fence pointer uncleared.
>>
>> The fence driver slots then overflow, and some jobs may be skipped, leaving their rcu pointers uncleared, which causes an infinite wait for the slot when the next fence is emitted.
>>
>> This infinite wait causes a job timeout on the emitting job, and the driver gets stuck at its sched stop step because kthread_park cannot complete.
>>
>> [how]
>> 1. move amdgpu_fence_driver_isr_toggle earlier to disable the interrupt
>> before drm sched stop 2. handle all fences in fence processing to avoid
>> skips when overflow happens
>>
>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 943c9e750575..c0cfae52f12b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>    		amdgpu_virt_fini_data_exchange(adev);
>>    	}
>>    
>> -	amdgpu_fence_driver_isr_toggle(adev, true);
>> -
>>    	/* block all schedulers and reset given job's ring */
>>    	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    		struct amdgpu_ring *ring = adev->rings[i];
>> @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		      amdgpu_device_ip_need_full_reset(tmp_adev))
>>    			amdgpu_ras_suspend(tmp_adev);
>>    
>> +		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>> +
>>    		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    			struct amdgpu_ring *ring = tmp_adev->rings[i];
>>    
>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		atomic_inc(&tmp_adev->gpu_reset_counter);
>>    	}
>>    
>> -	if (need_emergency_restart)
>> +	if (need_emergency_restart) {
>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +		}
>>    		goto skip_sched_resume;
>> +	}
>>    
>>    	/*
>>    	 * Must check guilty signal here since after this point all old
>> @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>    		job_signaled = true;
>>    		dev_info(adev->dev, "Guilty job already signaled, skipping HW
>> reset");
>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +		}
>>    		goto skip_hw_reset;
>>    	}
>>    
>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		if (r && r == -EAGAIN) {
>>    			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>    			adev->asic_reset_res = 0;
>> +			amdgpu_fence_driver_isr_toggle(adev, true);
>>    			goto retry;
>>    		}
>>    	}
>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>    	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>    	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>    
>> +	amdgpu_fence_driver_isr_toggle(adev, true);
>> +
>>    	adev->no_hw_access = true;
>>    	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>    	adev->no_hw_access = false;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 8adeb7469f1e..65a877e1a7fc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>    	if (unlikely(seq == last_seq))
>>    		return false;
>>    
>> -	last_seq &= drv->num_fences_mask;
>> -	seq &= drv->num_fences_mask;
>> -
>>    	do {
>>    		struct dma_fence *fence, **ptr;
>>    
>>    		++last_seq;
>> -		last_seq &= drv->num_fences_mask;
>> -		ptr = &drv->fences[last_seq];
>> +		ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>    
>>    		/* There is always exactly one thread signaling this fence slot */
>>    		fence = rcu_dereference_protected(*ptr, 1);
> Those changes here don't seem to make sense. Please explain further why they are necessary.
>
> Christian.
>
>> --
>> 2.25.1



* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15 10:09     ` Zhao, Victor
  2022-09-15 11:29       ` Christian König
@ 2022-09-15 18:29       ` Andrey Grodzovsky
  2022-09-15 18:31         ` Andrey Grodzovsky
  2022-09-15 19:26         ` Christian König
  1 sibling, 2 replies; 21+ messages in thread
From: Andrey Grodzovsky @ 2022-09-15 18:29 UTC (permalink / raw)
  To: Zhao, Victor, Koenig, Christian, amd-gfx; +Cc: Deng, Emily


On 2022-09-15 06:09, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Christian,
>
> The test sequence is executing a compute engine hang while running a lot of containers submitting gfx jobs. We have advanced tdr mode and mode2 reset enabled on the driver.
> When a compute hang job timeout happens, the 2 jobs on the gfx pending list may be signaled after drm_sched_stop, so they will not be removed from the pending list but will have DMA_FENCE_FLAG_SIGNALED_BIT set.
> At the amdgpu_device_recheck_guilty_jobs step, the first job will be rerun and removed from the pending list.
> At the resubmit step, the second job (with the signaled bit) will be resubmitted. Since it still has the signaled bit, drm_sched_job_done will be called directly. This decreases hw_rq_count, which allows more jobs to be emitted, but does not clean the fence_drv rcu ptr.
> This results in an overflow in the fence_drv. Since we use num_fences_mask in amdgpu_fence_process, when overflow happens, the signaling of some jobs will be skipped, which results in an infinite wait for the fence_drv rcu ptr.
>
> So closing the irq before sched_stop avoids jobs signaling after drm_sched_stop, and signaling jobs one by one in fence_process instead of using a mask handles the overflow situation.
>
> Another fix could be to skip submitting jobs which have already signaled during the resubmit stage, which may look cleaner.
>
> Please help give some advice.


How about the code below instead? The real problem is that we reuse a
dma fence twice, which is not in line with the dma fence design, so maybe
this can help?


diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 8adeb7469f1e..033f0ae16784 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, struct dma_fence **f, struct amd
         if (job && job->job_run_counter) {
                 /* reinit seq for resubmitted jobs */
                 fence->seqno = seq;
+
+               /* For resubmitted jobs clear the signaled bit */
+               clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
+
                 /* TO be inline with external fence creation and other drivers */
                 dma_fence_get(fence);
         } else {


Andrey


>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Thursday, September 15, 2022 2:32 PM
> To: Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Cc: Deng, Emily <Emily.Deng@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
>
> On 15.09.22 at 06:02, Zhao, Victor wrote:
>> [AMD Official Use Only - General]
>>
>> Ping.
>>
>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>
>> We found some reset-related issues during stress tests on this sequence. Please help give some comments.
>>
>>
>> Thanks,
>> Victor
>>
>>
>>
>> -----Original Message-----
>> From: Victor Zhao <Victor.Zhao@amd.com>
>> Sent: Wednesday, September 14, 2022 6:10 PM
>> To: amd-gfx@lists.freedesktop.org
>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>
>> [background]
>> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs from another ring (e.g. gfx) may continue signaling during the drm_sched_stop stage. The signaled bit will not be cleared.
>>
>> At the resubmit stage after recovery, a job whose hw fence has the signaled bit set will call job done directly instead of going through fence processing.
>> This decreases hw_rq_count but leaves the rcu fence pointer uncleared.
>>
>> The fence driver slots then overflow, and some jobs may be skipped, leaving their rcu pointers uncleared, which causes an infinite wait for the slot when the next fence is emitted.
>>
>> This infinite wait causes a job timeout on the emitting job, and the driver gets stuck at its sched stop step because kthread_park cannot complete.
>>
>> [how]
>> 1. move amdgpu_fence_driver_isr_toggle earlier to disable the interrupt
>> before drm sched stop 2. handle all fences in fence processing to avoid
>> skips when overflow happens
>>
>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |  6 +-----
>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 943c9e750575..c0cfae52f12b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct amdgpu_device *adev,
>>    		amdgpu_virt_fini_data_exchange(adev);
>>    	}
>>    
>> -	amdgpu_fence_driver_isr_toggle(adev, true);
>> -
>>    	/* block all schedulers and reset given job's ring */
>>    	for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    		struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		      amdgpu_device_ip_need_full_reset(tmp_adev))
>>    			amdgpu_ras_suspend(tmp_adev);
>>    
>> +		amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>> +
>>    		for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>    			struct amdgpu_ring *ring = tmp_adev->rings[i];
>>    
>> @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		atomic_inc(&tmp_adev->gpu_reset_counter);
>>    	}
>>    
>> -	if (need_emergency_restart)
>> +	if (need_emergency_restart) {
>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +		}
>>    		goto skip_sched_resume;
>> +	}
>>    
>>    	/*
>>    	 * Must check guilty signal here since after this point all old @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    	if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>    		job_signaled = true;
>>    		dev_info(adev->dev, "Guilty job already signaled, skipping HW
>> reset");
>> +		list_for_each_entry (tmp_adev, device_list_handle, reset_list) {
>> +			amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>> +		}
>>    		goto skip_hw_reset;
>>    	}
>>    
>> @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>    		if (r && r == -EAGAIN) {
>>    			set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>    			adev->asic_reset_res = 0;
>> +			amdgpu_fence_driver_isr_toggle(adev, true);
>>    			goto retry;
>>    		}
>>    	}
>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>    	set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>    	set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>    
>> +	amdgpu_fence_driver_isr_toggle(adev, true);
>> +
>>    	adev->no_hw_access = true;
>>    	r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>    	adev->no_hw_access = false;
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 8adeb7469f1e..65a877e1a7fc 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring *ring)
>>    	if (unlikely(seq == last_seq))
>>    		return false;
>>    
>> -	last_seq &= drv->num_fences_mask;
>> -	seq &= drv->num_fences_mask;
>> -
>>    	do {
>>    		struct dma_fence *fence, **ptr;
>>    
>>    		++last_seq;
>> -		last_seq &= drv->num_fences_mask;
>> -		ptr = &drv->fences[last_seq];
>> +		ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>    
>>    		/* There is always exactly one thread signaling this fence slot */
>>    		fence = rcu_dereference_protected(*ptr, 1);
> Those changes here don't seem to make sense. Please explain further why that is necessary.
>
> Christian.
>
>> --
>> 2.25.1

^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15 18:29       ` Andrey Grodzovsky
@ 2022-09-15 18:31         ` Andrey Grodzovsky
  2022-09-15 19:26         ` Christian König
  1 sibling, 0 replies; 21+ messages in thread
From: Andrey Grodzovsky @ 2022-09-15 18:31 UTC (permalink / raw)
  To: Zhao, Victor, Koenig, Christian, amd-gfx; +Cc: Deng, Emily

Had a typo - see below

On 2022-09-15 14:29, Andrey Grodzovsky wrote:
>
> On 2022-09-15 06:09, Zhao, Victor wrote:
>> [AMD Official Use Only - General]
>>
>> Hi Christian,
>>
>> The test sequence is executing a compute engine hang while running a 
>> lot of containers submitting gfx jobs. We have advanced tdr mode and 
>> mode2 reset enabled on driver.
>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>> pending list may be signaled after drm_sched_stop. So they will not be
>> removed from pending list but have the DMA_FENCE_FLAG_SIGNALED_BIT set.
>> At the amdgpu_device_recheck_guilty_jobs step, the first job will be
>> rerun and removed from pending list.
>> At the resubmit step, the second job (with signaled bit) will be
>> resubmitted. Since it still has the signaled bit, drm_sched_job_done will
>> be called directly. This decreases the hw_rq_count, which allows more
>> jobs to be emitted but does not clean the fence_drv rcu ptr.
>> This results in an overflow in the fence_drv. Since we will use
>> num_fences_mask in amdgpu_fence_process, when overflow happens, the
>> signal of some job will be skipped, which results in an infinite wait
>> for the fence_drv rcu ptr.
>>
>> So closing the irq before sched_stop avoids signaling jobs after
>> drm_sched_stop. And signaling jobs one by one in fence_process instead
>> of using a mask handles the overflow situation.
>>
>> Another fix could be to skip submitting jobs which have already signaled
>> during the resubmit stage, which may look cleaner.
>>
>> Please help give some advice.
>
>
> How about the code below instead? The real problem is that we reuse
> a dma fence twice, which is not in line with the dma fence design, so maybe
> this can help?
>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 8adeb7469f1e..033f0ae16784 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
> struct dma_fence **f, struct amd
>         if (job && job->job_run_counter) {
>                 /* reinit seq for resubmitted jobs */
>                 fence->seqno = seq;
> +
> +               /* For resubmitted job clear the signaled bit */
> +               clear_bit(DMA_FENCE_FLAG_SIGNALED_BIT, &fence->flags);
> +
>                 /* TO be inline with external fence creation and other 
> drivers */
>                 dma_fence_get(fence);
>         } else {
>
>
> Andrey
>
>
>>
>>
>> Thanks,
>> Victor
>>
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, September 15, 2022 2:32 PM
>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>> <Andrey.Grodzovsky@amd.com>
>> Cc: Deng, Emily <Emily.Deng@amd.com>
>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>
>>
>>
>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>> [AMD Official Use Only - General]
>>>
>>> Ping.
>>>
>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>
>>> We found some reset related issues during stress test on the 
>>> sequence. Please help give some comments.
>>>
>>>
>>> Thanks,
>>> Victor
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>
>>> [background]
>>> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs 
>>> from another ring (e.g. gfx) may continue signaling during 
>>> drm_sched_stop stage. The signal bit will not be cleared.
>>>
>>> At the resubmit stage after recovery, the job with the hw fence
>>> signaled bit set will call job done directly instead of going through
>>> fence process.
>>> This makes the hw_rq_count decrease but the rcu fence pointer is not
>>> cleared yet.
>>>
>>> Then overflow happens in the fence driver slots and some jobs may be
>>> skipped, leaving the rcu pointer not cleared, which makes an infinite
>>> wait for the slot on the next fence emitted.
>>>
>>> This infinite wait causes a job timeout on the emitting job. And the
>>> driver will be stuck at its sched stop step because kthread_park
>>> cannot be done.
>>>
>>> [how]
>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>> before drm sched stop
>>> 2. handle all fences in fence process to avoid skip when overflow
>>> happens
>>>
>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---  
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 +-----
>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 943c9e750575..c0cfae52f12b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>> amdgpu_device *adev,
>>>            amdgpu_virt_fini_data_exchange(adev);
>>>        }
>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>> -
>>>        /* block all schedulers and reset given job's ring */
>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>                  amdgpu_device_ip_need_full_reset(tmp_adev))
>>>                amdgpu_ras_suspend(tmp_adev);
>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>> +
>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>            atomic_inc(&tmp_adev->gpu_reset_counter);
>>>        }
>>>    -    if (need_emergency_restart)
>>> +    if (need_emergency_restart) {
>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>> reset_list) {
>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>> +        }
>>>            goto skip_sched_resume;
>>> +    }
>>>           /*
>>>         * Must check guilty signal here since after this point all 
>>> old @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>            job_signaled = true;
>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>> skipping HW
>>> reset");
>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>> reset_list) {
>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>> +        }
>>>            goto skip_hw_reset;
>>>        }
>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>            if (r && r == -EAGAIN) {
>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>>                adev->asic_reset_res = 0;
>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>                goto retry;
>>>            }
>>>        }
>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct 
>>> pci_dev *pdev)
>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>> +
>>>        adev->no_hw_access = true;
>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>        adev->no_hw_access = false;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring 
>>> *ring)
>>>        if (unlikely(seq == last_seq))
>>>            return false;
>>>    -    last_seq &= drv->num_fences_mask;
>>> -    seq &= drv->num_fences_mask;
>>> -
>>>        do {
>>>            struct dma_fence *fence, **ptr;
>>>               ++last_seq;
>>> -        last_seq &= drv->num_fences_mask;
>>> -        ptr = &drv->fences[last_seq];
>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>               /* There is always exactly one thread signaling this 
>>> fence slot */
>>>            fence = rcu_dereference_protected(*ptr, 1);
>> Those changes here don't seem to make sense. Please explain further 
>> why that is necessary.
>>
>> Christian.
>>
>>> -- 
>>> 2.25.1

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15 18:29       ` Andrey Grodzovsky
  2022-09-15 18:31         ` Andrey Grodzovsky
@ 2022-09-15 19:26         ` Christian König
  2022-09-15 20:37           ` Andrey Grodzovsky
  1 sibling, 1 reply; 21+ messages in thread
From: Christian König @ 2022-09-15 19:26 UTC (permalink / raw)
  To: Andrey Grodzovsky, Zhao, Victor, amd-gfx; +Cc: Deng, Emily

Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>
> On 2022-09-15 06:09, Zhao, Victor wrote:
>> [AMD Official Use Only - General]
>>
>> Hi Christian,
>>
>> The test sequence is executing a compute engine hang while running a 
>> lot of containers submitting gfx jobs. We have advanced tdr mode and 
>> mode2 reset enabled on driver.
>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>> pending list may be signaled after drm_sched_stop. So they will not be
>> removed from pending list but have the DMA_FENCE_FLAG_SIGNALED_BIT set.
>> At the amdgpu_device_recheck_guilty_jobs step, the first job will be
>> rerun and removed from pending list.
>> At the resubmit step, the second job (with signaled bit) will be
>> resubmitted. Since it still has the signaled bit, drm_sched_job_done will
>> be called directly. This decreases the hw_rq_count, which allows more
>> jobs to be emitted but does not clean the fence_drv rcu ptr.
>> This results in an overflow in the fence_drv. Since we will use
>> num_fences_mask in amdgpu_fence_process, when overflow happens, the
>> signal of some job will be skipped, which results in an infinite wait
>> for the fence_drv rcu ptr.
>>
>> So closing the irq before sched_stop avoids signaling jobs after
>> drm_sched_stop. And signaling jobs one by one in fence_process instead
>> of using a mask handles the overflow situation.
>>
>> Another fix could be to skip submitting jobs which have already signaled
>> during the resubmit stage, which may look cleaner.
>>
>> Please help give some advice.
>
>
> How about the code below instead? The real problem is that we reuse
> a dma fence twice, which is not in line with the dma fence design, so maybe
> this can help?
>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 8adeb7469f1e..033f0ae16784 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
> struct dma_fence **f, struct amd
>         if (job && job->job_run_counter) {
>                 /* reinit seq for resubmitted jobs */
>                 fence->seqno = seq;
> +
> +               /* For resubmitted job clear the signaled bit */
> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
> +

Upstream will pretty much kill you for that.

Re-setting a fence from a signaled to an unsignaled state is a massive 
no-go.

Christian.
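
A toy model of why (a simplified, single-threaded userspace stand-in for dma_fence, not the kernel code): once a waiter or callback has observed the signaled bit it is gone, so flipping the bit back means the second completion is never delivered to anyone:

#include <stdio.h>
#include <stdbool.h>

struct toy_fence {
        bool signaled;
        void (*cb)(void);       /* runs exactly once, on the first signal */
};

static void toy_fence_signal(struct toy_fence *f)
{
        if (f->signaled)
                return;         /* already signaled: nothing to do */
        f->signaled = true;
        if (f->cb) {
                f->cb();        /* waiters proceed based on this */
                f->cb = NULL;   /* callback list is consumed */
        }
}

static void release_waiter(void)
{
        puts("waiter released");
}

int main(void)
{
        struct toy_fence f = { .signaled = false, .cb = release_waiter };

        toy_fence_signal(&f);   /* first run completes: waiter released */

        f.signaled = false;     /* the proposed clear_bit() on a signaled fence */
        toy_fence_signal(&f);   /* resubmitted run completes: nobody is told */
        return 0;
}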

>
>                 /* TO be inline with external fence creation and other 
> drivers */
>                 dma_fence_get(fence);
>         } else {
>
>
> Andrey
>
>
>>
>>
>> Thanks,
>> Victor
>>
>>
>>
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Thursday, September 15, 2022 2:32 PM
>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>> <Andrey.Grodzovsky@amd.com>
>> Cc: Deng, Emily <Emily.Deng@amd.com>
>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>
>>
>>
>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>> [AMD Official Use Only - General]
>>>
>>> Ping.
>>>
>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>
>>> We found some reset related issues during stress test on the 
>>> sequence. Please help give some comments.
>>>
>>>
>>> Thanks,
>>> Victor
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>> To: amd-gfx@lists.freedesktop.org
>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>
>>> [background]
>>> For a gpu recovery caused by a hang on one ring (e.g. compute), jobs 
>>> from another ring (e.g. gfx) may continue signaling during 
>>> drm_sched_stop stage. The signal bit will not be cleared.
>>>
>>> At the resubmit stage after recovery, the job with the hw fence
>>> signaled bit set will call job done directly instead of going through
>>> fence process.
>>> This makes the hw_rq_count decrease but the rcu fence pointer is not
>>> cleared yet.
>>>
>>> Then overflow happens in the fence driver slots and some jobs may be
>>> skipped, leaving the rcu pointer not cleared, which makes an infinite
>>> wait for the slot on the next fence emitted.
>>>
>>> This infinite wait causes a job timeout on the emitting job. And the
>>> driver will be stuck at its sched stop step because kthread_park
>>> cannot be done.
>>>
>>> [how]
>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>> before drm sched stop
>>> 2. handle all fences in fence process to avoid skip when overflow
>>> happens
>>>
>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 +++++++++++++---  
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 +-----
>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 943c9e750575..c0cfae52f12b 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>> amdgpu_device *adev,
>>>            amdgpu_virt_fini_data_exchange(adev);
>>>        }
>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>> -
>>>        /* block all schedulers and reset given job's ring */
>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>                  amdgpu_device_ip_need_full_reset(tmp_adev))
>>>                amdgpu_ras_suspend(tmp_adev);
>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>> +
>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>            atomic_inc(&tmp_adev->gpu_reset_counter);
>>>        }
>>>    -    if (need_emergency_restart)
>>> +    if (need_emergency_restart) {
>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>> reset_list) {
>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>> +        }
>>>            goto skip_sched_resume;
>>> +    }
>>>           /*
>>>         * Must check guilty signal here since after this point all 
>>> old @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>            job_signaled = true;
>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>> skipping HW
>>> reset");
>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>> reset_list) {
>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>> +        }
>>>            goto skip_hw_reset;
>>>        }
>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>> amdgpu_device *adev,
>>>            if (r && r == -EAGAIN) {
>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context->flags);
>>>                adev->asic_reset_res = 0;
>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>                goto retry;
>>>            }
>>>        }
>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct 
>>> pci_dev *pdev)
>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>> +
>>>        adev->no_hw_access = true;
>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>        adev->no_hw_access = false;
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring 
>>> *ring)
>>>        if (unlikely(seq == last_seq))
>>>            return false;
>>>    -    last_seq &= drv->num_fences_mask;
>>> -    seq &= drv->num_fences_mask;
>>> -
>>>        do {
>>>            struct dma_fence *fence, **ptr;
>>>               ++last_seq;
>>> -        last_seq &= drv->num_fences_mask;
>>> -        ptr = &drv->fences[last_seq];
>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>               /* There is always exactly one thread signaling this 
>>> fence slot */
>>>            fence = rcu_dereference_protected(*ptr, 1);
>>> Those changes here don't seem to make sense. Please explain 
>> why that is necessary.
>>
>> Christian.
>>
>>> -- 
>>> 2.25.1


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15 19:26         ` Christian König
@ 2022-09-15 20:37           ` Andrey Grodzovsky
  2022-09-16  5:18             ` Christian König
  0 siblings, 1 reply; 21+ messages in thread
From: Andrey Grodzovsky @ 2022-09-15 20:37 UTC (permalink / raw)
  To: Christian König, Zhao, Victor, amd-gfx; +Cc: Deng, Emily


On 2022-09-15 15:26, Christian König wrote:
> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>
>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>> [AMD Official Use Only - General]
>>>
>>> Hi Christian,
>>>
>>> The test sequence is executing a compute engine hang while running a 
>>> lot of containers submitting gfx jobs. We have advanced tdr mode and 
>>> mode2 reset enabled on driver.
>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>> pending list may be signaled after drm_sched_stop. So they will not
>>> be removed from pending list but have the
>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will be
>>> rerun and removed from pending list.
>>> At the resubmit step, the second job (with signaled bit) will be
>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done
>>> will be called directly. This decreases the hw_rq_count, which allows
>>> more jobs to be emitted but does not clean the fence_drv rcu ptr.
>>> This results in an overflow in the fence_drv. Since we will use
>>> num_fences_mask in amdgpu_fence_process, when overflow happens, the
>>> signal of some job will be skipped, which results in an infinite wait
>>> for the fence_drv rcu ptr.
>>>
>>> So closing the irq before sched_stop avoids signaling jobs after
>>> drm_sched_stop. And signaling jobs one by one in fence_process instead
>>> of using a mask handles the overflow situation.
>>>
>>> Another fix could be to skip submitting jobs which have already
>>> signaled during the resubmit stage, which may look cleaner.
>>>
>>> Please help give some advice.
>>
>>
>> How about the code below instead? The real problem is that we
>> reuse a dma fence twice, which is not in line with the dma fence design,
>> so maybe this can help?
>>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> index 8adeb7469f1e..033f0ae16784 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
>> struct dma_fence **f, struct amd
>>         if (job && job->job_run_counter) {
>>                 /* reinit seq for resubmitted jobs */
>>                 fence->seqno = seq;
>> +
>> +               /* For resubmitted job clear the signaled bit */
>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
>> +
>
> Upstream will pretty much kill you for that.
>
> Re-setting a fence from a signaled to an unsignaled state is a massive 
> no-go.
>
> Christian.


Is it worse than doing fence->seqno = seq; ? This is already a huge
hack, no?

Andrey


>
>>
>>                 /* TO be inline with external fence creation and 
>> other drivers */
>>                 dma_fence_get(fence);
>>         } else {
>>
>>
>> Andrey
>>
>>
>>>
>>>
>>> Thanks,
>>> Victor
>>>
>>>
>>>
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Thursday, September 15, 2022 2:32 PM
>>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>>> <Andrey.Grodzovsky@amd.com>
>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>
>>>
>>>
>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Ping.
>>>>
>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>
>>>> We found some reset related issues during stress test on the 
>>>> sequence. Please help give some comments.
>>>>
>>>>
>>>> Thanks,
>>>> Victor
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>> To: amd-gfx@lists.freedesktop.org
>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>
>>>> [background]
>>>> For a gpu recovery caused by a hang on one ring (e.g. compute), 
>>>> jobs from another ring (e.g. gfx) may continue signaling during 
>>>> drm_sched_stop stage. The signal bit will not be cleared.
>>>>
>>>> At the resubmit stage after recovery, the job with the hw fence
>>>> signaled bit set will call job done directly instead of going
>>>> through fence process.
>>>> This makes the hw_rq_count decrease but the rcu fence pointer is not
>>>> cleared yet.
>>>>
>>>> Then overflow happens in the fence driver slots and some jobs may
>>>> be skipped, leaving the rcu pointer not cleared, which makes an
>>>> infinite wait for the slot on the next fence emitted.
>>>>
>>>> This infinite wait causes a job timeout on the emitting job. And the
>>>> driver will be stuck at its sched stop step because kthread_park
>>>> cannot be done.
>>>>
>>>> [how]
>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>>> before drm sched stop
>>>> 2. handle all fences in fence process to avoid skip when overflow
>>>> happens
>>>>
>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 
>>>> +++++++++++++---  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 
>>>> +-----
>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 943c9e750575..c0cfae52f12b 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>>> amdgpu_device *adev,
>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>        }
>>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> -
>>>>        /* block all schedulers and reset given job's ring */
>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>                  amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>                amdgpu_ras_suspend(tmp_adev);
>>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>> +
>>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>> amdgpu_device *adev,
>>>>            atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>        }
>>>>    -    if (need_emergency_restart)
>>>> +    if (need_emergency_restart) {
>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>> reset_list) {
>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>> +        }
>>>>            goto skip_sched_resume;
>>>> +    }
>>>>           /*
>>>>         * Must check guilty signal here since after this point all 
>>>> old @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct 
>>>> amdgpu_device *adev,
>>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>            job_signaled = true;
>>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>>> skipping HW
>>>> reset");
>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>> reset_list) {
>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>> +        }
>>>>            goto skip_hw_reset;
>>>>        }
>>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>> amdgpu_device *adev,
>>>>            if (r && r == -EAGAIN) {
>>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>> &reset_context->flags);
>>>>                adev->asic_reset_res = 0;
>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>                goto retry;
>>>>            }
>>>>        }
>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t amdgpu_pci_slot_reset(struct 
>>>> pci_dev *pdev)
>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>> +
>>>>        adev->no_hw_access = true;
>>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>        adev->no_hw_access = false;
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring 
>>>> *ring)
>>>>        if (unlikely(seq == last_seq))
>>>>            return false;
>>>>    -    last_seq &= drv->num_fences_mask;
>>>> -    seq &= drv->num_fences_mask;
>>>> -
>>>>        do {
>>>>            struct dma_fence *fence, **ptr;
>>>>               ++last_seq;
>>>> -        last_seq &= drv->num_fences_mask;
>>>> -        ptr = &drv->fences[last_seq];
>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>               /* There is always exactly one thread signaling this 
>>>> fence slot */
>>>>            fence = rcu_dereference_protected(*ptr, 1);
>>>> Those changes here don't seem to make sense. Please explain 
>>> further why that is necessary.
>>>
>>> Christian.
>>>
>>>> -- 
>>>> 2.25.1
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-15 20:37           ` Andrey Grodzovsky
@ 2022-09-16  5:18             ` Christian König
  2022-09-16 13:50               ` Andrey Grodzovsky
  0 siblings, 1 reply; 21+ messages in thread
From: Christian König @ 2022-09-16  5:18 UTC (permalink / raw)
  To: Andrey Grodzovsky, Zhao, Victor, amd-gfx; +Cc: Deng, Emily

Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>
> On 2022-09-15 15:26, Christian König wrote:
>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>> [AMD Official Use Only - General]
>>>>
>>>> Hi Christian,
>>>>
>>>> The test sequence is executing a compute engine hang while running 
>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode 
>>>> and mode2 reset enabled on driver.
>>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>>> pending list may be signaled after drm_sched_stop. So they will not
>>>> be removed from pending list but have the
>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>> be rerun and removed from pending list.
>>>> At the resubmit step, the second job (with signaled bit) will be
>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done
>>>> will be called directly. This decreases the hw_rq_count, which allows
>>>> more jobs to be emitted but does not clean the fence_drv rcu ptr.
>>>> This results in an overflow in the fence_drv. Since we will use
>>>> num_fences_mask in amdgpu_fence_process, when overflow happens, the
>>>> signal of some job will be skipped, which results in an infinite wait
>>>> for the fence_drv rcu ptr.
>>>>
>>>> So closing the irq before sched_stop avoids signaling jobs after
>>>> drm_sched_stop. And signaling jobs one by one in fence_process instead
>>>> of using a mask handles the overflow situation.
>>>>
>>>> Another fix could be to skip submitting jobs which have already
>>>> signaled during the resubmit stage, which may look cleaner.
>>>>
>>>> Please help give some advice.
>>>
>>>
>>> How about the code below instead? The real problem is that we
>>> reuse a dma fence twice, which is not in line with the dma fence design,
>>> so maybe this can help?
>>>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> index 8adeb7469f1e..033f0ae16784 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring *ring, 
>>> struct dma_fence **f, struct amd
>>>         if (job && job->job_run_counter) {
>>>                 /* reinit seq for resubmitted jobs */
>>>                 fence->seqno = seq;
>>> +
>>> +               /* For resubmitted job clear the signaled bit */
>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, &fence->flags);
>>> +
>>
>> Upstream will pretty much kill you for that.
>>
>> Re-setting a fence from a signaled to an unsignaled state is a 
>> massive no-go.
>>
>> Christian.
>
>
>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>> hack, no?

No, it's equally bad. I don't think we can do either.

Christian.
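
For comparison, a minimal sketch of the alternative Victor mentioned earlier - skipping jobs whose hardware fence already signaled instead of re-emitting them at resubmit time. This is a userspace model with made-up structures, not the real scheduler API:

#include <stdio.h>
#include <stdbool.h>

struct toy_job {
        const char *name;
        bool hw_fence_signaled; /* stands in for dma_fence_is_signaled() */
};

static void resubmit_all(struct toy_job *jobs, int n)
{
        for (int i = 0; i < n; i++) {
                if (jobs[i].hw_fence_signaled) {
                        /* completed before the reset: retire it as-is and
                         * never touch its fence or the slot ring again */
                        printf("%s: retire without re-emitting\n",
                               jobs[i].name);
                        continue;
                }
                printf("%s: re-emit to the ring\n", jobs[i].name);
        }
}

int main(void)
{
        struct toy_job jobs[] = {
                { "job0", true },       /* signaled after drm_sched_stop */
                { "job1", false },
        };

        resubmit_all(jobs, 2);
        return 0;
}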

>
> Andrey
>
>
>>
>>>
>>>                 /* TO be inline with external fence creation and 
>>> other drivers */
>>>                 dma_fence_get(fence);
>>>         } else {
>>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>
>>>> Thanks,
>>>> Victor
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>>>> <Andrey.Grodzovsky@amd.com>
>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>
>>>>
>>>>
>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> Ping.
>>>>>
>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>
>>>>> We found some reset related issues during stress test on the 
>>>>> sequence. Please help give some comments.
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Victor
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>> To: amd-gfx@lists.freedesktop.org
>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>
>>>>> [background]
>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute), 
>>>>> jobs from another ring (e.g. gfx) may continue signaling during 
>>>>> drm_sched_stop stage. The signal bit will not be cleared.
>>>>>
>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>> signaled bit set will call job done directly instead of going
>>>>> through fence process.
>>>>> This makes the hw_rq_count decrease but the rcu fence pointer is not
>>>>> cleared yet.
>>>>>
>>>>> Then overflow happens in the fence driver slots and some jobs may
>>>>> be skipped, leaving the rcu pointer not cleared, which makes an
>>>>> infinite wait for the slot on the next fence emitted.
>>>>>
>>>>> This infinite wait causes a job timeout on the emitting job. And the
>>>>> driver will be stuck at its sched stop step because kthread_park
>>>>> cannot be done.
>>>>>
>>>>> [how]
>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>>>> before drm sched stop
>>>>> 2. handle all fences in fence process to avoid skip when overflow
>>>>> happens
>>>>>
>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 
>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 
>>>>> +-----
>>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>> amdgpu_device *adev,
>>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>>        }
>>>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>> -
>>>>>        /* block all schedulers and reset given job's ring */
>>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>                amdgpu_ras_suspend(tmp_adev);
>>>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>> +
>>>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>> atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>        }
>>>>>    -    if (need_emergency_restart)
>>>>> +    if (need_emergency_restart) {
>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>> reset_list) {
>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>> +        }
>>>>>            goto skip_sched_resume;
>>>>> +    }
>>>>>           /*
>>>>>         * Must check guilty signal here since after this point all 
>>>>> old @@ -5240,6 +5244,9 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>            job_signaled = true;
>>>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>>>> skipping HW
>>>>> reset");
>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>> reset_list) {
>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>> +        }
>>>>>            goto skip_hw_reset;
>>>>>        }
>>>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>>> amdgpu_device *adev,
>>>>>            if (r && r == -EAGAIN) {
>>>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>>> &reset_context->flags);
>>>>>                adev->asic_reset_res = 0;
>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>                goto retry;
>>>>>            }
>>>>>        }
>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t 
>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>> +
>>>>>        adev->no_hw_access = true;
>>>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>        adev->no_hw_access = false;
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct amdgpu_ring 
>>>>> *ring)
>>>>>        if (unlikely(seq == last_seq))
>>>>>            return false;
>>>>>    -    last_seq &= drv->num_fences_mask;
>>>>> -    seq &= drv->num_fences_mask;
>>>>> -
>>>>>        do {
>>>>>            struct dma_fence *fence, **ptr;
>>>>>               ++last_seq;
>>>>> -        last_seq &= drv->num_fences_mask;
>>>>> -        ptr = &drv->fences[last_seq];
>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>               /* There is always exactly one thread signaling this 
>>>>> fence slot */
>>>>>            fence = rcu_dereference_protected(*ptr, 1);
>>>>> Those changes here don't seem to make sense. Please explain 
>>>> further why that is necessary.
>>>>
>>>> Christian.
>>>>
>>>>> -- 
>>>>> 2.25.1
>>


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-16  5:18             ` Christian König
@ 2022-09-16 13:50               ` Andrey Grodzovsky
  2022-09-16 13:51                 ` Christian König
  2022-09-17  3:31                 ` Zhao, Victor
  0 siblings, 2 replies; 21+ messages in thread
From: Andrey Grodzovsky @ 2022-09-16 13:50 UTC (permalink / raw)
  To: Christian König, Zhao, Victor, amd-gfx; +Cc: Deng, Emily


On 2022-09-16 01:18, Christian König wrote:
> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>
>> On 2022-09-15 15:26, Christian König wrote:
>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> The test sequence is executing a compute engine hang while running 
>>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode 
>>>>> and mode2 reset enabled on driver.
>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>>>> pending list may be signaled after drm_sched_stop. So they will not
>>>>> be removed from pending list but have the
>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>> be rerun and removed from pending list.
>>>>> At the resubmit step, the second job (with signaled bit) will be
>>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done
>>>>> will be called directly. This decreases the hw_rq_count, which
>>>>> allows more jobs to be emitted but does not clean the fence_drv rcu ptr.
>>>>> This results in an overflow in the fence_drv. Since we will use
>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens,
>>>>> the signal of some job will be skipped, which results in an infinite
>>>>> wait for the fence_drv rcu ptr.
>>>>>
>>>>> So closing the irq before sched_stop avoids signaling jobs after
>>>>> drm_sched_stop. And signaling jobs one by one in fence_process instead
>>>>> of using a mask handles the overflow situation.
>>>>>
>>>>> Another fix could be to skip submitting jobs which have already
>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>
>>>>> Please help give some advice.
>>>>
>>>>
>>>> How about the code below instead? The real problem is that we
>>>> reuse a dma fence twice, which is not in line with the dma fence design,
>>>> so maybe this can help?
>>>>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>> *ring, struct dma_fence **f, struct amd
>>>>         if (job && job->job_run_counter) {
>>>>                 /* reinit seq for resubmitted jobs */
>>>>                 fence->seqno = seq;
>>>> +
>>>> +               /* For resubmitted job clear the signaled bit */
>>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, 
>>>> &fence->flags);
>>>> +
>>>
>>> Upstream will pretty much kill you for that.
>>>
>>> Re-setting a fence from a signaled to an unsignaled state is a 
>>> massive no-go.
>>>
>>> Christian.
>>
>>
>>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>>> hack, no?
>
> No, it's equally bad. I don't think we can do either.
>
> Christian.


And all those ugly hacks are there because we reuse a dma_fence (hw_fence
embedded into the job), and correct me if I am wrong,
but I don't think a dma_fence is ever supposed to be reused.

So maybe, like Victor suggested, we should move closing and flushing the irq
before sched_stop - this in my opinion should solve the issue. But Victor -
why do you then still need the change in amdgpu_fence_process? You will not
have the overflow situation, because by moving irq_disable before the stop,
any job that signaled will be removed from the scheduler pending list
anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence
numbers' and could reintroduce that bug.
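
(A sketch of that risk, with a made-up register value and a toy 8-slot ring, not the driver code: the masked walk is bounded by the ring size even if the seqno readback is garbage, while a per-seqno walk would spin for billions of iterations.)

#include <stdio.h>
#include <stdint.h>

#define NUM_FENCES 8u
#define MASK (NUM_FENCES - 1u)

/* iterations of the bounded, masked walk: at most NUM_FENCES */
static uint32_t masked_iterations(uint32_t last_seq, uint32_t seq)
{
        uint32_t n = 0;

        last_seq &= MASK;
        seq &= MASK;
        do {
                ++last_seq;
                last_seq &= MASK;
                n++;
        } while (last_seq != seq);
        return n;
}

int main(void)
{
        uint32_t last_seq = 100;
        uint32_t seq = 0xdeadbeef;      /* garbage seqno readback after a hang */

        printf("masked walk:    %u iterations\n",
               (unsigned)masked_iterations(last_seq, seq));
        printf("per-seqno walk: %u iterations\n",
               (unsigned)(seq - last_seq));
        return 0;
}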

Andrey


>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>>                 /* TO be inline with external fence creation and 
>>>> other drivers */
>>>>                 dma_fence_get(fence);
>>>>         } else {
>>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Victor
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>>>>> <Andrey.Grodzovsky@amd.com>
>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>
>>>>>
>>>>>
>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Ping.
>>>>>>
>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>
>>>>>> We found some reset related issues during stress test on the 
>>>>>> sequence. Please help give some comments.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>
>>>>>> [background]
>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute), 
>>>>>> jobs from another ring (e.g. gfx) may continue signaling during 
>>>>>> drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>
>>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>>> signaled bit set will call job done directly instead of going
>>>>>> through fence process.
>>>>>> This makes the hw_rq_count decrease but the rcu fence pointer is not
>>>>>> cleared yet.
>>>>>>
>>>>>> Then overflow happens in the fence driver slots and some jobs may
>>>>>> be skipped, leaving the rcu pointer not cleared, which makes an
>>>>>> infinite wait for the slot on the next fence emitted.
>>>>>>
>>>>>> This infinite wait causes a job timeout on the emitting job. And the
>>>>>> driver will be stuck at its sched stop step because kthread_park
>>>>>> cannot be done.
>>>>>>
>>>>>> [how]
>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>>>>> before drm sched stop
>>>>>> 2. handle all fences in fence process to avoid skip when overflow
>>>>>> happens
>>>>>>
>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 
>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 
>>>>>> +-----
>>>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>>> amdgpu_device *adev,
>>>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>>>        }
>>>>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>> -
>>>>>>        /* block all schedulers and reset given job's ring */
>>>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>                amdgpu_ras_suspend(tmp_adev);
>>>>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>> +
>>>>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>>>> amdgpu_device *adev,
>>>>>> atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>        }
>>>>>>    -    if (need_emergency_restart)
>>>>>> +    if (need_emergency_restart) {
>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>>> reset_list) {
>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>> +        }
>>>>>>            goto skip_sched_resume;
>>>>>> +    }
>>>>>>           /*
>>>>>>         * Must check guilty signal here since after this point 
>>>>>> all old @@ -5240,6 +5244,9 @@ int 
>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>            job_signaled = true;
>>>>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>>>>> skipping HW
>>>>>> reset");
>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>>> reset_list) {
>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>> +        }
>>>>>>            goto skip_hw_reset;
>>>>>>        }
>>>>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>>>> amdgpu_device *adev,
>>>>>>            if (r && r == -EAGAIN) {
>>>>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>>>> &reset_context->flags);
>>>>>>                adev->asic_reset_res = 0;
>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>                goto retry;
>>>>>>            }
>>>>>>        }
>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t 
>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>> +
>>>>>>        adev->no_hw_access = true;
>>>>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>        adev->no_hw_access = false;
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct 
>>>>>> amdgpu_ring *ring)
>>>>>>        if (unlikely(seq == last_seq))
>>>>>>            return false;
>>>>>>    -    last_seq &= drv->num_fences_mask;
>>>>>> -    seq &= drv->num_fences_mask;
>>>>>> -
>>>>>>        do {
>>>>>>            struct dma_fence *fence, **ptr;
>>>>>>               ++last_seq;
>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>               /* There is always exactly one thread signaling 
>>>>>> this fence slot */
>>>>>>            fence = rcu_dereference_protected(*ptr, 1);
>>>>> Those changes here don't seem to make sense. Please explain 
>>>>> further why that is necessary.
>>>>>
>>>>> Christian.
>>>>>
>>>>>> -- 
>>>>>> 2.25.1
>>>
>

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-16 13:50               ` Andrey Grodzovsky
@ 2022-09-16 13:51                 ` Christian König
  2022-09-17  3:31                 ` Zhao, Victor
  1 sibling, 0 replies; 21+ messages in thread
From: Christian König @ 2022-09-16 13:51 UTC (permalink / raw)
  To: Andrey Grodzovsky, Zhao, Victor, amd-gfx; +Cc: Deng, Emily



Am 16.09.22 um 15:50 schrieb Andrey Grodzovsky:
>
> On 2022-09-16 01:18, Christian König wrote:
>> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>>
>>> On 2022-09-15 15:26, Christian König wrote:
>>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>>
>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> The test sequence is executing a compute engine hang while 
>>>>>> running a lot of containers submitting gfx jobs. We have advanced 
>>>>>> tdr mode and mode2 reset enabled on driver.
>>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx 
>>>>>> pending list may be signaled after drm_sched_stop. So they will
>>>>>> not be removed from pending list but have the
>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>>> be rerun and removed from pending list.
>>>>>> At the resubmit step, the second job (with signaled bit) will be
>>>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done
>>>>>> will be called directly. This decreases the hw_rq_count, which
>>>>>> allows more jobs to be emitted but does not clean the fence_drv rcu ptr.
>>>>>> This results in an overflow in the fence_drv. Since we will use
>>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens,
>>>>>> the signal of some job will be skipped, which results in an
>>>>>> infinite wait for the fence_drv rcu ptr.
>>>>>>
>>>>>> So closing the irq before sched_stop avoids signaling jobs after
>>>>>> drm_sched_stop. And signaling jobs one by one in fence_process
>>>>>> instead of using a mask handles the overflow situation.
>>>>>>
>>>>>> Another fix could be to skip submitting jobs which have already
>>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>>
>>>>>> Please help give some advice.
>>>>>
>>>>>
>>>>> How about the code below instead? The real problem is that we
>>>>> reuse a dma fence twice, which is not in line with the dma fence
>>>>> design, so maybe this can help?
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>> *ring, struct dma_fence **f, struct amd
>>>>>         if (job && job->job_run_counter) {
>>>>>                 /* reinit seq for resubmitted jobs */
>>>>>                 fence->seqno = seq;
>>>>> +
>>>>> +               /* For resubmitted job clear the signaled bit */
>>>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT, 
>>>>> &fence->flags);
>>>>> +
>>>>
>>>> Upstream will pretty much kill you for that.
>>>>
>>>> Re-setting a fence from a signaled to an unsignaled state is a 
>>>> massive no-go.
>>>>
>>>> Christian.
>>>
>>>
>>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>>> hack, no?
>>
>> No, it's equally bad. I don't think we can do either.
>>
>> Christian.
>
>
> And all those ugly hacks are there because we reuse a dma_fence
> (hw_fence embedded into the job) and, correct me if I am wrong,
> I don't think a dma_fence is ever supposed to be reused.

Exactly that, yes.

Christian.

>
> So maybe, like Victor suggested, we should move the irq close and
> flush before sched_stop - this in my opinion should solve the issue,
> but Victor - why do you then still need the change in
> amdgpu_fence_process? You will not have the overflow situation,
> because by moving irq_disable before the stop, any job that signaled
> will be removed from the scheduler pending list anyway. Also note
> that this change reverts 'drm/amdgpu: sanitize fence numbers' and
> could reintroduce that bug.
>
> Andrey
>
>
>>
>>>
>>> Andrey
>>>
>>>
>>>>
>>>>>
>>>>>                 /* To be in line with external fence creation and 
>>>>> other drivers */
>>>>>                 dma_fence_get(fence);
>>>>>         } else {
>>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>>>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>>>>>> <Andrey.Grodzovsky@amd.com>
>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>>> [AMD Official Use Only - General]
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>>
>>>>>>> We found some reset related issues during stress test on the 
>>>>>>> sequence. Please help give some comments.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Victor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>>
>>>>>>> [background]
>>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>>> the drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>>
>>>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>>>> signaled bit set will call job done directly instead of going
>>>>>>> through fence process.
>>>>>>> This makes the hw_rq_count decrease, but the rcu fence pointer is
>>>>>>> not cleared yet.
>>>>>>>
>>>>>>> Then overflow happens in the fence driver slots and some jobs
>>>>>>> may be skipped, leaving the rcu pointer not cleared, which makes
>>>>>>> an infinite wait for the slot on the next fence emitted.
>>>>>>>
>>>>>>> This infinite wait causes a job timeout on the emitting job. And
>>>>>>> the driver will be stuck at its sched stop step because
>>>>>>> kthread_park cannot be done.
>>>>>>>
>>>>>>> [how]
>>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>>>>>> before drm sched stop 2. handle all fences in fence process to 
>>>>>>> avoid
>>>>>>> skips when overflow happens
>>>>>>>
>>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16 
>>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6 
>>>>>>> +-----
>>>>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>>>>        }
>>>>>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> -
>>>>>>>        /* block all schedulers and reset given job's ring */
>>>>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6 
>>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>>>>>>> *adev,
>>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>>                amdgpu_ras_suspend(tmp_adev);
>>>>>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>>> +
>>>>>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>> atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>>        }
>>>>>>>    -    if (need_emergency_restart)
>>>>>>> +    if (need_emergency_restart) {
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>            goto skip_sched_resume;
>>>>>>> +    }
>>>>>>>           /*
>>>>>>>         * Must check guilty signal here since after this point 
>>>>>>> all old @@ -5240,6 +5244,9 @@ int 
>>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>            job_signaled = true;
>>>>>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>>>>>> skipping HW
>>>>>>> reset");
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle, 
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>            goto skip_hw_reset;
>>>>>>>        }
>>>>>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>            if (r && r == -EAGAIN) {
>>>>>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>>>>> &reset_context->flags);
>>>>>>>                adev->asic_reset_res = 0;
>>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>                goto retry;
>>>>>>>            }
>>>>>>>        }
>>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t 
>>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> +
>>>>>>>        adev->no_hw_access = true;
>>>>>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>>        adev->no_hw_access = false;
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct 
>>>>>>> amdgpu_ring *ring)
>>>>>>>        if (unlikely(seq == last_seq))
>>>>>>>            return false;
>>>>>>>    -    last_seq &= drv->num_fences_mask;
>>>>>>> -    seq &= drv->num_fences_mask;
>>>>>>> -
>>>>>>>        do {
>>>>>>>            struct dma_fence *fence, **ptr;
>>>>>>>               ++last_seq;
>>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>>               /* There is always exactly one thread signaling 
>>>>>>> this fence slot */
>>>>>>>            fence = rcu_dereference_protected(*ptr, 1);
>>>>>> Those changes here don't seem to make sense. Please explain 
>>>>>> further why that is necessary.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> -- 
>>>>>>> 2.25.1
>>>>
>>



* RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-16 13:50               ` Andrey Grodzovsky
  2022-09-16 13:51                 ` Christian König
@ 2022-09-17  3:31                 ` Zhao, Victor
  2022-09-19 15:04                   ` Andrey Grodzovsky
  1 sibling, 1 reply; 21+ messages in thread
From: Zhao, Victor @ 2022-09-17  3:31 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian, amd-gfx; +Cc: Deng, Emily

[AMD Official Use Only - General]

Hi Andrey,

Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process is just meant to make sure the driver can correct itself from an overflow situation. I didn't know about the previous issue there.

Do you know if the issue still exists? Or is it on VCE only?


Thanks,
Victor
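
To illustrate the overflow discussed above: a minimal, stand-alone
sketch (user-space C with a hypothetical NUM_FENCES value, not the
driver code) of how pre-masking both sequence numbers makes the walk
skip slots once more than NUM_FENCES fences complete at once, while
an unmasked walk handles every fence:

        #include <stdio.h>
        #include <stdint.h>

        #define NUM_FENCES      8u              /* hypothetical ring size */
        #define NUM_FENCES_MASK (NUM_FENCES - 1u)

        static void process(uint32_t last_seq, uint32_t seq, int premask)
        {
                unsigned int handled = 0;

                if (premask) {                  /* the old, masked walk */
                        last_seq &= NUM_FENCES_MASK;
                        seq &= NUM_FENCES_MASK;
                }
                do {
                        ++last_seq;
                        if (premask)
                                last_seq &= NUM_FENCES_MASK;
                        handled++;              /* one slot would be signaled here */
                } while (last_seq != seq);
                printf("handled %u fences\n", handled);
        }

        int main(void)
        {
                /* 10 fences complete at once on an 8-slot ring: overflow */
                process(4, 14, 1);              /* masked walk:   "handled 2 fences"  */
                process(4, 14, 0);              /* unmasked walk: "handled 10 fences" */
                return 0;
        }

The masked walk visits only (14 - 4) mod 8 = 2 slots and leaves the
other occupants, including their rcu pointers, untouched.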



-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Friday, September 16, 2022 9:50 PM
To: Koenig, Christian <Christian.Koenig@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow


On 2022-09-16 01:18, Christian König wrote:
> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>
>> On 2022-09-15 15:26, Christian König wrote:
>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>
>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>> [AMD Official Use Only - General]
>>>>>
>>>>> Hi Christian,
>>>>>
>>>>> The test sequence is executing a compute engine hang while running
>>>>> a lot of containers submitting gfx jobs. We have advanced tdr mode
>>>>> and mode2 reset enabled on the driver.
>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx
>>>>> pending list may be signaled after drm_sched_stop. So they will not
>>>>> be removed from the pending list but will have the
>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>> be rerun and removed from the pending list.
>>>>> At the resubmit step, the second job (with signaled bit) will be
>>>>> resubmitted. Since it still has the signaled bit, drm_sched_job_done
>>>>> will be called directly. This decreases the hw_rq_count, which
>>>>> allows more jobs to be emitted, but does not clean the fence_drv
>>>>> rcu ptr.
>>>>> This results in an overflow in the fence_drv. Since we use
>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens, the
>>>>> signal of some jobs will be skipped, which results in an infinite
>>>>> wait for the fence_drv rcu ptr.
>>>>>
>>>>> So closing the irq before sched_stop could avoid signaling jobs after
>>>>> drm_sched_stop. And signaling jobs one by one in fence_process,
>>>>> instead of using a mask, handles the overflow situation.
>>>>>
>>>>> Another fix could be to skip submitting jobs which have already
>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>
>>>>> Please help give some advice.
>>>>
>>>>
>>>> How about the code below instead? The real problem is that we
>>>> reuse a dma fence twice, which is not according to dma fence design,
>>>> so maybe this can help?
>>>>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>> *ring, struct dma_fence **f, struct amd
>>>>         if (job && job->job_run_counter) {
>>>>                 /* reinit seq for resubmitted jobs */
>>>>                 fence->seqno = seq;
>>>> +
>>>> +               /* For resubmitted job clear the signaled bit */
>>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>> &fence->flags);
>>>> +
>>>
>>> Upstream will pretty much kill you for that.
>>>
>>> Re-setting a fence from a signaled to an unsignaled state is a 
>>> massive no-go.
>>>
>>> Christian.
>>
>>
>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>> hack, no?
>
> No, it's as equally bad. I don't think we can do either.
>
> Christian.


And all those ugly hacks are there because we reuse a dma_fence (hw_fence embedded into the job) and, correct me if I am wrong, I don't think a dma_fence is ever supposed to be reused.

So maybe, like Victor suggested, we should move the irq close and flush before sched_stop - this in my opinion should solve the issue, but Victor - why do you then still need the change in amdgpu_fence_process? You will not have the overflow situation, because by moving irq_disable before the stop, any job that signaled will be removed from the scheduler pending list anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence numbers' and could reintroduce that bug.

Andrey


>
>>
>> Andrey
>>
>>
>>>
>>>>
>>>>                 /* To be in line with external fence creation and 
>>>> other drivers */
>>>>                 dma_fence_get(fence);
>>>>         } else {
>>>>
>>>>
>>>> Andrey
>>>>
>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Victor
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>>>>> <Andrey.Grodzovsky@amd.com>
>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by 
>>>>> overflow
>>>>>
>>>>>
>>>>>
>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Ping.
>>>>>>
>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>
>>>>>> We found some reset related issues during stress test on the 
>>>>>> sequence. Please help give some comments.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey 
>>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>
>>>>>> [background]
>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>> the drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>
>>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>>> signaled bit set will call job done directly instead of going
>>>>>> through fence process.
>>>>>> This makes the hw_rq_count decrease, but the rcu fence pointer is
>>>>>> not cleared yet.
>>>>>>
>>>>>> Then overflow happens in the fence driver slots and some jobs may
>>>>>> be skipped, leaving the rcu pointer not cleared, which makes an
>>>>>> infinite wait for the slot on the next fence emitted.
>>>>>>
>>>>>> This infinite wait causes a job timeout on the emitting job. And
>>>>>> the driver will be stuck at its sched stop step because
>>>>>> kthread_park cannot be done.
>>>>>>
>>>>>> [how]
>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt 
>>>>>> before drm sched stop 2. handle all fences in fence process to 
>>>>>> avoid skips when overflow happens
>>>>>>
>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16
>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6
>>>>>> +-----
>>>>>>    2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>> amdgpu_device *adev,
>>>>>>            amdgpu_virt_fini_data_exchange(adev);
>>>>>>        }
>>>>>>    -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>> -
>>>>>>        /* block all schedulers and reset given job's ring */
>>>>>>        for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>            struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6
>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>>>>>> +*adev,
>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>                amdgpu_ras_suspend(tmp_adev);
>>>>>>    +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>> +
>>>>>>            for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>                struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>    @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>>>> amdgpu_device *adev, atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>        }
>>>>>>    -    if (need_emergency_restart)
>>>>>> +    if (need_emergency_restart) {
>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>> reset_list) {
>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>> +        }
>>>>>>            goto skip_sched_resume;
>>>>>> +    }
>>>>>>           /*
>>>>>>         * Must check guilty signal here since after this point 
>>>>>> all old @@ -5240,6 +5244,9 @@ int 
>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>        if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>            job_signaled = true;
>>>>>>            dev_info(adev->dev, "Guilty job already signaled, 
>>>>>> skipping HW reset");
>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>> reset_list) {
>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>> +        }
>>>>>>            goto skip_hw_reset;
>>>>>>        }
>>>>>>    @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>>>> amdgpu_device *adev,
>>>>>>            if (r && r == -EAGAIN) {
>>>>>>                set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>>>> &reset_context->flags);
>>>>>>                adev->asic_reset_res = 0;
>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>                goto retry;
>>>>>>            }
>>>>>>        }
>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t 
>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>        set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>        set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>    +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>> +
>>>>>>        adev->no_hw_access = true;
>>>>>>        r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>        adev->no_hw_access = false; diff --git 
>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct 
>>>>>> amdgpu_ring *ring)
>>>>>>        if (unlikely(seq == last_seq))
>>>>>>            return false;
>>>>>>    -    last_seq &= drv->num_fences_mask;
>>>>>> -    seq &= drv->num_fences_mask;
>>>>>> -
>>>>>>        do {
>>>>>>            struct dma_fence *fence, **ptr;
>>>>>>               ++last_seq;
>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>               /* There is always exactly one thread signaling 
>>>>>> this fence slot */
>>>>>>            fence = rcu_dereference_protected(*ptr, 1);
>>>>> Those changes here don't seem to make sense. Please explain 
>>>>> further why that is necessary.
>>>>>
>>>>> Christian.
>>>>>
>>>>>> --
>>>>>> 2.25.1
>>>
>


* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-17  3:31                 ` Zhao, Victor
@ 2022-09-19 15:04                   ` Andrey Grodzovsky
  2022-09-20 10:22                     ` Zhao, Victor
  0 siblings, 1 reply; 21+ messages in thread
From: Andrey Grodzovsky @ 2022-09-19 15:04 UTC (permalink / raw)
  To: Zhao, Victor, Koenig, Christian, amd-gfx; +Cc: Deng, Emily

I don't know if the issue still exists, but it's worth checking with
Christian, who wrote this patch.

Andrey


On 2022-09-16 23:31, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process is just meant to make sure the driver can correct itself from an overflow situation. I didn't know about the previous issue there.
>
> Do you know if the issue still exists? Or is it on VCE only?
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Friday, September 16, 2022 9:50 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
> On 2022-09-16 01:18, Christian König wrote:
>> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>> On 2022-09-15 15:26, Christian König wrote:
>>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> The test sequence is executing a compute engine hang while
>>>>>> running a lot of containers submitting gfx jobs. We have advanced
>>>>>> tdr mode and mode2 reset enabled on the driver.
>>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx
>>>>>> pending list may be signaled after drm_sched_stop. So they will
>>>>>> not be removed from the pending list but will have the
>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>>> be rerun and removed from the pending list.
>>>>>> At the resubmit step, the second job (with signaled bit) will be
>>>>>> resubmitted. Since it still has the signaled bit,
>>>>>> drm_sched_job_done will be called directly. This decreases the
>>>>>> hw_rq_count, which allows more jobs to be emitted, but does not
>>>>>> clean the fence_drv rcu ptr.
>>>>>> This results in an overflow in the fence_drv. Since we use
>>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens,
>>>>>> the signal of some jobs will be skipped, which results in an
>>>>>> infinite wait for the fence_drv rcu ptr.
>>>>>>
>>>>>> So closing the irq before sched_stop could avoid signaling jobs
>>>>>> after drm_sched_stop. And signaling jobs one by one in
>>>>>> fence_process, instead of using a mask, handles the overflow
>>>>>> situation.
>>>>>>
>>>>>> Another fix could be to skip submitting jobs which have already
>>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>>
>>>>>> Please help give some advice.
>>>>>
>>>>> How about the code below instead? The real problem is that we
>>>>> reuse a dma fence twice, which is not according to dma fence design,
>>>>> so maybe this can help?
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring
>>>>> *ring, struct dma_fence **f, struct amd
>>>>>          if (job && job->job_run_counter) {
>>>>>                  /* reinit seq for resubmitted jobs */
>>>>>                  fence->seqno = seq;
>>>>> +
>>>>> +               /* For resubmitted job clear the signaled bit */
>>>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>>> &fence->flags);
>>>>> +
>>>> Upstream will pretty much kill you for that.
>>>>
>>>> Re-setting a fence from a signaled to an unsignaled state is a
>>>> massive no-go.
>>>>
>>>> Christian.
>>>
>>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>>> hack, no?
>> No, it's as equally bad. I don't think we can do either.
>>
>> Christian.
>
> And all those ugly hacks are there because we reuse a dma_fence (hw_fence embedded into the job) and, correct me if I am wrong, I don't think a dma_fence is ever supposed to be reused.
>
> So maybe, like Victor suggested, we should move the irq close and flush before sched_stop - this in my opinion should solve the issue, but Victor - why do you then still need the change in amdgpu_fence_process? You will not have the overflow situation, because by moving irq_disable before the stop, any job that signaled will be removed from the scheduler pending list anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence numbers' and could reintroduce that bug.
>
> Andrey
>
>
>>> Andrey
>>>
>>>
>>>>>                  /* To be in line with external fence creation and
>>>>> other drivers */
>>>>>                  dma_fence_get(fence);
>>>>>          } else {
>>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>>> To: Zhao, Victor <Victor.Zhao@amd.com>;
>>>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
>>>>>> <Andrey.Grodzovsky@amd.com>
>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by
>>>>>> overflow
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>>> [AMD Official Use Only - General]
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>>
>>>>>>> We found some reset related issues during stress test on the
>>>>>>> sequence. Please help give some comments.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Victor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>>
>>>>>>> [background]
>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>> the drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>
>>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>>> signaled bit set will call job done directly instead of going
>>>>>> through fence process.
>>>>>> This makes the hw_rq_count decrease, but the rcu fence pointer is
>>>>>> not cleared yet.
>>>>>>
>>>>>> Then overflow happens in the fence driver slots and some jobs may
>>>>>> be skipped, leaving the rcu pointer not cleared, which makes an
>>>>>> infinite wait for the slot on the next fence emitted.
>>>>>>
>>>>>> This infinite wait causes a job timeout on the emitting job. And
>>>>>> the driver will be stuck at its sched stop step because
>>>>>> kthread_park cannot be done.
>>>>>>>
>>>>>>> [how]
>>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close interrupt
>>>>>>> before drm sched stop 2. handle all fences in fence process to
>>>>>> avoid skips when overflow happens
>>>>>>>
>>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16
>>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6
>>>>>>> +-----
>>>>>>>     2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>             amdgpu_virt_fini_data_exchange(adev);
>>>>>>>         }
>>>>>>>     -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> -
>>>>>>>         /* block all schedulers and reset given job's ring */
>>>>>>>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>             struct amdgpu_ring *ring = adev->rings[i]; @@ -5214,6
>>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>>>>>>> +*adev,
>>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>>                 amdgpu_ras_suspend(tmp_adev);
>>>>>>>     +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>>> +
>>>>>>>             for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>                 struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>>     @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct
>>>>>>> amdgpu_device *adev, atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>>         }
>>>>>>>     -    if (need_emergency_restart)
>>>>>>> +    if (need_emergency_restart) {
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>             goto skip_sched_resume;
>>>>>>> +    }
>>>>>>>            /*
>>>>>>>          * Must check guilty signal here since after this point
>>>>>>> all old @@ -5240,6 +5244,9 @@ int
>>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>         if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>             job_signaled = true;
>>>>>>>             dev_info(adev->dev, "Guilty job already signaled,
>>>>>>> skipping HW reset");
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>             goto skip_hw_reset;
>>>>>>>         }
>>>>>>>     @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>             if (r && r == -EAGAIN) {
>>>>>>>                 set_bit(AMDGPU_SKIP_MODE2_RESET,
>>>>>>> &reset_context->flags);
>>>>>>>                 adev->asic_reset_res = 0;
>>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>                 goto retry;
>>>>>>>             }
>>>>>>>         }
>>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t
>>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>>         set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>>         set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>>     +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> +
>>>>>>>         adev->no_hw_access = true;
>>>>>>>         r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>>         adev->no_hw_access = false; diff --git
>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct
>>>>>>> amdgpu_ring *ring)
>>>>>>>         if (unlikely(seq == last_seq))
>>>>>>>             return false;
>>>>>>>     -    last_seq &= drv->num_fences_mask;
>>>>>>> -    seq &= drv->num_fences_mask;
>>>>>>> -
>>>>>>>         do {
>>>>>>>             struct dma_fence *fence, **ptr;
>>>>>>>                ++last_seq;
>>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>>                /* There is always exactly one thread signaling
>>>>>>> this fence slot */
>>>>>>>             fence = rcu_dereference_protected(*ptr, 1);
>>>>>> Those changes here don't seem to make sense. Please explain
>>>>>> further why that is necessary.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> --
>>>>>>> 2.25.1


* RE: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-19 15:04                   ` Andrey Grodzovsky
@ 2022-09-20 10:22                     ` Zhao, Victor
  2022-09-20 12:22                       ` Christian König
  0 siblings, 1 reply; 21+ messages in thread
From: Zhao, Victor @ 2022-09-20 10:22 UTC (permalink / raw)
  To: Grodzovsky, Andrey, Koenig, Christian, amd-gfx; +Cc: Deng, Emily

[AMD Official Use Only - General]

Hi @Koenig, Christian,

About the change 'drm/amdgpu: sanitize fence numbers', do you know if this VCE issue still exists? Can we change fence process back?

Hi @Grodzovsky, Andrey,

So it looks like closing the irq is possibly the most appropriate fix for this issue for now? Please help review this part.


Thanks,
Victor
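
For reference, the ordering being converged on here, as a rough
pseudocode sketch (hand-written for illustration; for_each_ring is a
made-up helper and error handling is omitted):

        amdgpu_fence_driver_isr_toggle(adev, true);   /* 1. mask fence irqs   */
        for_each_ring(ring)
                drm_sched_stop(&ring->sched, job);    /* 2. park schedulers   */
        /* ... hw reset and job resubmission ... */
        amdgpu_fence_driver_isr_toggle(adev, false);  /* 3. unmask fence irqs */
        for_each_ring(ring)
                drm_sched_start(&ring->sched, true);  /* 4. resume schedulers */

With the fence interrupt masked before step 2, no job can signal in
the window between the stop and the resubmit, so nothing is left on
the pending list with DMA_FENCE_FLAG_SIGNALED_BIT already set.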



-----Original Message-----
From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com> 
Sent: Monday, September 19, 2022 11:04 PM
To: Zhao, Victor <Victor.Zhao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Deng, Emily <Emily.Deng@amd.com>
Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow

I don't know if the issue still exists, but it's worth checking with Christian, who wrote this patch.

Andrey


On 2022-09-16 23:31, Zhao, Victor wrote:
> [AMD Official Use Only - General]
>
> Hi Andrey,
>
> Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process is just meant to make sure the driver can correct itself from an overflow situation. I didn't know about the previous issue there.
>
> Do you know if the issue still exists? Or is it on VCE only?
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Friday, September 16, 2022 9:50 PM
> To: Koenig, Christian <Christian.Koenig@amd.com>; Zhao, Victor 
> <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
>
> On 2022-09-16 01:18, Christian König wrote:
>> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>> On 2022-09-15 15:26, Christian König wrote:
>>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>> [AMD Official Use Only - General]
>>>>>>
>>>>>> Hi Christian,
>>>>>>
>>>>>> The test sequence is executing a compute engine hang while
>>>>>> running a lot of containers submitting gfx jobs. We have advanced
>>>>>> tdr mode and mode2 reset enabled on the driver.
>>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx
>>>>>> pending list may be signaled after drm_sched_stop. So they will
>>>>>> not be removed from the pending list but will have the
>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>>> be rerun and removed from the pending list.
>>>>>> At the resubmit step, the second job (with signaled bit) will be
>>>>>> resubmitted. Since it still has the signaled bit,
>>>>>> drm_sched_job_done will be called directly. This decreases the
>>>>>> hw_rq_count, which allows more jobs to be emitted, but does not
>>>>>> clean the fence_drv rcu ptr.
>>>>>> This results in an overflow in the fence_drv. Since we use
>>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens,
>>>>>> the signal of some jobs will be skipped, which results in an
>>>>>> infinite wait for the fence_drv rcu ptr.
>>>>>>
>>>>>> So closing the irq before sched_stop could avoid signaling jobs
>>>>>> after drm_sched_stop. And signaling jobs one by one in
>>>>>> fence_process, instead of using a mask, handles the overflow
>>>>>> situation.
>>>>>>
>>>>>> Another fix could be to skip submitting jobs which have already
>>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>>
>>>>>> Please help give some advice.
>>>>>
>>>>> How about the code below instead? The real problem is that we
>>>>> reuse a dma fence twice, which is not according to dma fence
>>>>> design, so maybe this can help?
>>>>>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring 
>>>>> *ring, struct dma_fence **f, struct amd
>>>>>          if (job && job->job_run_counter) {
>>>>>                  /* reinit seq for resubmitted jobs */
>>>>>                  fence->seqno = seq;
>>>>> +
>>>>> +               /* For resubmitted job clear the signaled bit */
>>>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>>> &fence->flags);
>>>>> +
>>>> Upstream will pretty much kill you for that.
>>>>
>>>> Re-setting a fence from a signaled to an unsignaled state is a 
>>>> massive no-go.
>>>>
>>>> Christian.
>>>
>>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>>> hack, no?
>> No, it's as equally bad. I don't think we can do either.
>>
>> Christian.
>
> And all those ugly hacks are there because we reuse a dma_fence (hw_fence embedded into the job) and, correct me if I am wrong, I don't think a dma_fence is ever supposed to be reused.
>
> So maybe, like Victor suggested, we should move the irq close and flush before sched_stop - this in my opinion should solve the issue, but Victor - why do you then still need the change in amdgpu_fence_process? You will not have the overflow situation, because by moving irq_disable before the stop, any job that signaled will be removed from the scheduler pending list anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence numbers' and could reintroduce that bug.
>
> Andrey
>
>
>>> Andrey
>>>
>>>
>>>>>                  /* To be in line with external fence creation and 
>>>>> other drivers */
>>>>>                  dma_fence_get(fence);
>>>>>          } else {
>>>>>
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>> Victor
>>>>>>
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>>> To: Zhao, Victor <Victor.Zhao@amd.com>; 
>>>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey 
>>>>>> <Andrey.Grodzovsky@amd.com>
>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by 
>>>>>> overflow
>>>>>>
>>>>>>
>>>>>>
>>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>>> [AMD Official Use Only - General]
>>>>>>>
>>>>>>> Ping.
>>>>>>>
>>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>>
>>>>>>> We found some reset related issues during stress test on the 
>>>>>>> sequence. Please help give some comments.
>>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Victor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey 
>>>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>>
>>>>>>> [background]
>>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>>> the drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>>
>>>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>>>> signaled bit set will call job done directly instead of going
>>>>>>> through fence process.
>>>>>>> This makes the hw_rq_count decrease, but the rcu fence pointer is
>>>>>>> not cleared yet.
>>>>>>>
>>>>>>> Then overflow happens in the fence driver slots and some jobs
>>>>>>> may be skipped, leaving the rcu pointer not cleared, which makes
>>>>>>> an infinite wait for the slot on the next fence emitted.
>>>>>>>
>>>>>>> This infinite wait causes a job timeout on the emitting job. And
>>>>>>> the driver will be stuck at its sched stop step because
>>>>>>> kthread_park cannot be done.
>>>>>>>
>>>>>>> [how]
>>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close 
>>>>>>> interrupt before drm sched stop 2. handle all fences in fence 
>>>>>>> process to avoid skips when overflow happens
>>>>>>>
>>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16
>>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6
>>>>>>> +-----
>>>>>>>     2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>> amdgpu_device *adev,
>>>>>>>             amdgpu_virt_fini_data_exchange(adev);
>>>>>>>         }
>>>>>>>     -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> -
>>>>>>>         /* block all schedulers and reset given job's ring */
>>>>>>>         for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>             struct amdgpu_ring *ring = adev->rings[i]; @@ 
>>>>>>> -5214,6
>>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device 
>>>>>>> +*adev,
>>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>>                 amdgpu_ras_suspend(tmp_adev);
>>>>>>>     +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>>> +
>>>>>>>             for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>                 struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>>     @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev, atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>>         }
>>>>>>>     -    if (need_emergency_restart)
>>>>>>> +    if (need_emergency_restart) {
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>             goto skip_sched_resume;
>>>>>>> +    }
>>>>>>>            /*
>>>>>>>          * Must check guilty signal here since after this point 
>>>>>>> all old @@ -5240,6 +5244,9 @@ int 
>>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>         if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>             job_signaled = true;
>>>>>>>             dev_info(adev->dev, "Guilty job already signaled, 
>>>>>>> skipping HW reset");
>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>>> reset_list) {
>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>> +        }
>>>>>>>             goto skip_hw_reset;
>>>>>>>         }
>>>>>>>     @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct 
>>>>>>> amdgpu_device *adev,
>>>>>>>             if (r && r == -EAGAIN) {
>>>>>>>                 set_bit(AMDGPU_SKIP_MODE2_RESET, 
>>>>>>> &reset_context->flags);
>>>>>>>                 adev->asic_reset_res = 0;
>>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>                 goto retry;
>>>>>>>             }
>>>>>>>         }
>>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t 
>>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>>         set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>>         set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>>     +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>> +
>>>>>>>         adev->no_hw_access = true;
>>>>>>>         r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>>         adev->no_hw_access = false; diff --git 
>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct 
>>>>>>> amdgpu_ring *ring)
>>>>>>>         if (unlikely(seq == last_seq))
>>>>>>>             return false;
>>>>>>>     -    last_seq &= drv->num_fences_mask;
>>>>>>> -    seq &= drv->num_fences_mask;
>>>>>>> -
>>>>>>>         do {
>>>>>>>             struct dma_fence *fence, **ptr;
>>>>>>>                ++last_seq;
>>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>>                /* There is always exactly one thread signaling 
>>>>>>> this fence slot */
>>>>>>>             fence = rcu_dereference_protected(*ptr, 1);
>>>>>> Those changes here don't seem to make sense. Please explain 
>>>>>> further why that is necessary.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>>> --
>>>>>>> 2.25.1


* Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
  2022-09-20 10:22                     ` Zhao, Victor
@ 2022-09-20 12:22                       ` Christian König
  0 siblings, 0 replies; 21+ messages in thread
From: Christian König @ 2022-09-20 12:22 UTC (permalink / raw)
  To: Zhao, Victor, Grodzovsky, Andrey, Koenig, Christian, amd-gfx; +Cc: Deng, Emily

Hi Victor,

Yes, that behavior is mandatory. We can't rely on all engines always 
sending valid fence numbers.

Regards,
Christian.
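
To make the trade-off concrete: a minimal stand-alone sketch (with a
hypothetical ring size; not the driver code) of why the sanitized,
masked walk stays bounded even when an engine writes back a garbage
sequence number:

        #include <stdint.h>

        #define NUM_FENCES      8u
        #define NUM_FENCES_MASK (NUM_FENCES - 1u)

        static unsigned int walk_sanitized(uint32_t last_seq, uint32_t seq)
        {
                unsigned int steps = 0;

                last_seq &= NUM_FENCES_MASK;    /* the 'sanitize' part */
                seq &= NUM_FENCES_MASK;
                while (last_seq != seq) {
                        last_seq = (last_seq + 1) & NUM_FENCES_MASK;
                        steps++;                /* never exceeds NUM_FENCES - 1 */
                }
                return steps;
        }

        /*
         * walk_sanitized(4, 0xdeadbeef) returns 3; an unmasked walk would
         * try to signal roughly 3.7 billion nonexistent fences instead.
         */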

Am 20.09.22 um 12:22 schrieb Zhao, Victor:
> [AMD Official Use Only - General]
>
> Hi @Koenig, Christian,
>
> About the change 'drm/amdgpu: sanitize fence numbers', do you know if this VCE issue still exists? Can we change fence process back?
>
> Hi @Grodzovsky, Andrey,
>
> So it looks like closing the irq is possibly the most appropriate fix for this issue for now? Please help review this part.
>
>
> Thanks,
> Victor
>
>
>
> -----Original Message-----
> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
> Sent: Monday, September 19, 2022 11:04 PM
> To: Zhao, Victor <Victor.Zhao@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; amd-gfx@lists.freedesktop.org
> Cc: Deng, Emily <Emily.Deng@amd.com>
> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>
> I don't know if the issue still exists, but it's worth checking with Christian, who wrote this patch.
>
> Andrey
>
>
> On 2022-09-16 23:31, Zhao, Victor wrote:
>> [AMD Official Use Only - General]
>>
>> Hi Andrey,
>>
>> Yes, moving the irq disable can fix the issue. The change in amdgpu_fence_process is just meant to make sure the driver can correct itself from an overflow situation. I didn't know about the previous issue there.
>>
>> Do you know if the issue still exists? Or is it on VCE only?
>>
>>
>> Thanks,
>> Victor
>>
>>
>>
>> -----Original Message-----
>> From: Grodzovsky, Andrey <Andrey.Grodzovsky@amd.com>
>> Sent: Friday, September 16, 2022 9:50 PM
>> To: Koenig, Christian <Christian.Koenig@amd.com>; Zhao, Victor
>> <Victor.Zhao@amd.com>; amd-gfx@lists.freedesktop.org
>> Cc: Deng, Emily <Emily.Deng@amd.com>
>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>
>>
>> On 2022-09-16 01:18, Christian König wrote:
>>> Am 15.09.22 um 22:37 schrieb Andrey Grodzovsky:
>>>> On 2022-09-15 15:26, Christian König wrote:
>>>>> Am 15.09.22 um 20:29 schrieb Andrey Grodzovsky:
>>>>>> On 2022-09-15 06:09, Zhao, Victor wrote:
>>>>>>> [AMD Official Use Only - General]
>>>>>>>
>>>>>>> Hi Christian,
>>>>>>>
>>>>>>> The test sequence is executing a compute engine hang while
>>>>>>> running a lot of containers submitting gfx jobs. We have advanced
>>>>>>> tdr mode and mode2 reset enabled on the driver.
>>>>>>> When a compute hang job timeout happens, the 2 jobs on the gfx
>>>>>>> pending list may be signaled after drm_sched_stop. So they will
>>>>>>> not be removed from the pending list but will have the
>>>>>>> DMA_FENCE_FLAG_SIGNALED_BIT set.
>>>>>>> At the amdgpu_device_recheck_guilty_jobs step, the first job will
>>>>>>> be rerun and removed from the pending list.
>>>>>>> At the resubmit step, the second job (with signaled bit) will be
>>>>>>> resubmitted. Since it still has the signaled bit,
>>>>>>> drm_sched_job_done will be called directly. This decreases the
>>>>>>> hw_rq_count, which allows more jobs to be emitted, but does not
>>>>>>> clean the fence_drv rcu ptr.
>>>>>>> This results in an overflow in the fence_drv. Since we use
>>>>>>> num_fences_mask in amdgpu_fence_process, when overflow happens,
>>>>>>> the signal of some jobs will be skipped, which results in an
>>>>>>> infinite wait for the fence_drv rcu ptr.
>>>>>>>
>>>>>>> So closing the irq before sched_stop could avoid signaling jobs
>>>>>>> after drm_sched_stop. And signaling jobs one by one in
>>>>>>> fence_process, instead of using a mask, handles the overflow
>>>>>>> situation.
>>>>>>>
>>>>>>> Another fix could be to skip submitting jobs which have already
>>>>>>> signaled during the resubmit stage, which may look cleaner.
>>>>>>>
>>>>>>> Please help give some advice.
>>>>>> How about the code below instead? The real problem is that we
>>>>>> reuse a dma fence twice, which is not according to dma fence
>>>>>> design, so maybe this can help?
>>>>>>
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> index 8adeb7469f1e..033f0ae16784 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>> @@ -164,6 +164,10 @@ int amdgpu_fence_emit(struct amdgpu_ring
>>>>>> *ring, struct dma_fence **f, struct amd
>>>>>>           if (job && job->job_run_counter) {
>>>>>>                   /* reinit seq for resubmitted jobs */
>>>>>>                   fence->seqno = seq;
>>>>>> +
>>>>>> +               /* For resubmitted job clear the signaled bit */
>>>>>> +               clear_bit(DMA_FENCE_FLAG_TIMESTAMP_BIT,
>>>>>> &fence->flags);
>>>>>> +
>>>>> Upstream will pretty much kill you for that.
>>>>>
>>>>> Re-setting a fence from a signaled to an unsignaled state is a
>>>>> massive no-go.
>>>>>
>>>>> Christian.
>>>> Is it worse than doing fence->seqno = seq; ? This is already a huge
>>>> hack, no?
>>> No, it's as equally bad. I don't think we can do either.
>>>
>>> Christian.
>> And all those ugly hacks are there because we reuse a dma_fence (hw_fence embedded into the job) and, correct me if I am wrong, I don't think a dma_fence is ever supposed to be reused.
>>
>> So maybe, like Victor suggested, we should move the irq close and flush before sched_stop - this in my opinion should solve the issue, but Victor - why do you then still need the change in amdgpu_fence_process? You will not have the overflow situation, because by moving irq_disable before the stop, any job that signaled will be removed from the scheduler pending list anyway. Also note that this change reverts 'drm/amdgpu: sanitize fence numbers' and could reintroduce that bug.
>>
>> Andrey
>>
>>
>>>> Andrey
>>>>
>>>>
>>>>>>                   /* To be in line with external fence creation and
>>>>>> other drivers */
>>>>>>                   dma_fence_get(fence);
>>>>>>           } else {
>>>>>>
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>>>> Thanks,
>>>>>>> Victor
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>>>>>> Sent: Thursday, September 15, 2022 2:32 PM
>>>>>>> To: Zhao, Victor <Victor.Zhao@amd.com>;
>>>>>>> amd-gfx@lists.freedesktop.org; Grodzovsky, Andrey
>>>>>>> <Andrey.Grodzovsky@amd.com>
>>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>
>>>>>>> Subject: Re: [PATCH 1/2] drm/amdgpu: fix deadlock caused by
>>>>>>> overflow
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Am 15.09.22 um 06:02 schrieb Zhao, Victor:
>>>>>>>> [AMD Official Use Only - General]
>>>>>>>>
>>>>>>>> Ping.
>>>>>>>>
>>>>>>>> Hi @Koenig, Christian and @Grodzovsky, Andrey,
>>>>>>>>
>>>>>>>> We found some reset related issues during stress test on the
>>>>>>>> sequence. Please help give some comments.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Victor
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>>> Sent: Wednesday, September 14, 2022 6:10 PM
>>>>>>>> To: amd-gfx@lists.freedesktop.org
>>>>>>>> Cc: Deng, Emily <Emily.Deng@amd.com>; Grodzovsky, Andrey
>>>>>>>> <Andrey.Grodzovsky@amd.com>; Zhao, Victor <Victor.Zhao@amd.com>
>>>>>>>> Subject: [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow
>>>>>>>>
>>>>>>>> [background]
>>>>>>>> For a gpu recovery caused by a hang on one ring (e.g. compute),
>>>>>>>> jobs from another ring (e.g. gfx) may continue signaling during
>>>>>>>> the drm_sched_stop stage. The signal bit will not be cleared.
>>>>>>>>
>>>>>>>> At the resubmit stage after recovery, the job with the hw fence
>>>>>>>> signaled bit set will call job done directly instead of going
>>>>>>>> through fence process.
>>>>>>>> This makes the hw_rq_count decrease, but the rcu fence pointer
>>>>>>>> is not cleared yet.
>>>>>>>>
>>>>>>>> Then overflow happens in the fence driver slots and some jobs
>>>>>>>> may be skipped, leaving the rcu pointer not cleared, which makes
>>>>>>>> an infinite wait for the slot on the next fence emitted.
>>>>>>>>
>>>>>>>> This infinite wait causes a job timeout on the emitting job. And
>>>>>>>> the driver will be stuck at its sched stop step because
>>>>>>>> kthread_park cannot be done.
>>>>>>>>
>>>>>>>> [how]
>>>>>>>> 1. move amdgpu_fence_driver_isr_toggle earlier to close
>>>>>>>> interrupt before drm sched stop 2. handle all fences in fence
>>>>>>>> process to avoid skips when overflow happens
>>>>>>>>
>>>>>>>> Signed-off-by: Victor Zhao <Victor.Zhao@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 16
>>>>>>>> +++++++++++++--- drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  6
>>>>>>>> +-----
>>>>>>>>      2 files changed, 14 insertions(+), 8 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 943c9e750575..c0cfae52f12b 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -4610,8 +4610,6 @@ int amdgpu_device_pre_asic_reset(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>              amdgpu_virt_fini_data_exchange(adev);
>>>>>>>>          }
>>>>>>>>      -    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>> -
>>>>>>>>          /* block all schedulers and reset given job's ring */
>>>>>>>>          for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>>              struct amdgpu_ring *ring = adev->rings[i]; @@
>>>>>>>> -5214,6
>>>>>>>> +5212,8 @@ int amdgpu_device_gpu_recover(struct amdgpu_device
>>>>>>>> +*adev,
>>>>>>>> amdgpu_device_ip_need_full_reset(tmp_adev))
>>>>>>>>                  amdgpu_ras_suspend(tmp_adev);
>>>>>>>>      +        amdgpu_fence_driver_isr_toggle(tmp_adev, true);
>>>>>>>> +
>>>>>>>>              for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
>>>>>>>>                  struct amdgpu_ring *ring = tmp_adev->rings[i];
>>>>>>>>      @@ -5228,8 +5228,12 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev, atomic_inc(&tmp_adev->gpu_reset_counter);
>>>>>>>>          }
>>>>>>>>      -    if (need_emergency_restart)
>>>>>>>> +    if (need_emergency_restart) {
>>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>>>> reset_list) {
>>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>>> +        }
>>>>>>>>              goto skip_sched_resume;
>>>>>>>> +    }
>>>>>>>>             /*
>>>>>>>>           * Must check guilty signal here since after this point
>>>>>>>> all old @@ -5240,6 +5244,9 @@ int
>>>>>>>> amdgpu_device_gpu_recover(struct amdgpu_device *adev,
>>>>>>>>          if (job && dma_fence_is_signaled(&job->hw_fence)) {
>>>>>>>>              job_signaled = true;
>>>>>>>>              dev_info(adev->dev, "Guilty job already signaled,
>>>>>>>> skipping HW reset");
>>>>>>>> +        list_for_each_entry (tmp_adev, device_list_handle,
>>>>>>>> reset_list) {
>>>>>>>> +            amdgpu_fence_driver_isr_toggle(tmp_adev, false);
>>>>>>>> +        }
>>>>>>>>              goto skip_hw_reset;
>>>>>>>>          }
>>>>>>>>      @@ -5276,6 +5283,7 @@ int amdgpu_device_gpu_recover(struct
>>>>>>>> amdgpu_device *adev,
>>>>>>>>              if (r && r == -EAGAIN) {
>>>>>>>>                  set_bit(AMDGPU_SKIP_MODE2_RESET,
>>>>>>>> &reset_context->flags);
>>>>>>>>                  adev->asic_reset_res = 0;
>>>>>>>> +            amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>>                  goto retry;
>>>>>>>>              }
>>>>>>>>          }
>>>>>>>> @@ -5711,6 +5719,8 @@ pci_ers_result_t
>>>>>>>> amdgpu_pci_slot_reset(struct pci_dev *pdev)
>>>>>>>>          set_bit(AMDGPU_SKIP_HW_RESET, &reset_context.flags);
>>>>>>>>          set_bit(AMDGPU_SKIP_MODE2_RESET, &reset_context.flags);
>>>>>>>>      +    amdgpu_fence_driver_isr_toggle(adev, true);
>>>>>>>> +
>>>>>>>>          adev->no_hw_access = true;
>>>>>>>>          r = amdgpu_device_pre_asic_reset(adev, &reset_context);
>>>>>>>>          adev->no_hw_access = false; diff --git
>>>>>>>> a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> index 8adeb7469f1e..65a877e1a7fc 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
>>>>>>>> @@ -287,15 +287,11 @@ bool amdgpu_fence_process(struct
>>>>>>>> amdgpu_ring *ring)
>>>>>>>>          if (unlikely(seq == last_seq))
>>>>>>>>              return false;
>>>>>>>>      -    last_seq &= drv->num_fences_mask;
>>>>>>>> -    seq &= drv->num_fences_mask;
>>>>>>>> -
>>>>>>>>          do {
>>>>>>>>              struct dma_fence *fence, **ptr;
>>>>>>>>                 ++last_seq;
>>>>>>>> -        last_seq &= drv->num_fences_mask;
>>>>>>>> -        ptr = &drv->fences[last_seq];
>>>>>>>> +        ptr = &drv->fences[last_seq & drv->num_fences_mask];
>>>>>>>>                 /* There is always exactly one thread signaling
>>>>>>>> this fence slot */
>>>>>>>>              fence = rcu_dereference_protected(*ptr, 1);
>>>>>>> Those changes here don't seem to make sense. Please explain
>>>>>>> further why that is necessary.
>>>>>>>
>>>>>>> Christian.
>>>>>>>
>>>>>>>> --
>>>>>>>> 2.25.1



end of thread

Thread overview: 21+ messages
2022-09-14 10:10 [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Victor Zhao
2022-09-14 10:10 ` [PATCH 2/2] drm/amdgpu: move enable irq later to avoid race with ih resume Victor Zhao
2022-09-15  5:57   ` Lazar, Lijo
2022-09-15  6:38     ` Zhao, Victor
2022-09-15  7:59       ` Lazar, Lijo
2022-09-15 10:11         ` Zhao, Victor
2022-09-15  4:02 ` [PATCH 1/2] drm/amdgpu: fix deadlock caused by overflow Zhao, Victor
2022-09-15  6:31   ` Christian König
2022-09-15 10:09     ` Zhao, Victor
2022-09-15 11:29       ` Christian König
2022-09-15 18:29       ` Andrey Grodzovsky
2022-09-15 18:31         ` Andrey Grodzovsky
2022-09-15 19:26         ` Christian König
2022-09-15 20:37           ` Andrey Grodzovsky
2022-09-16  5:18             ` Christian König
2022-09-16 13:50               ` Andrey Grodzovsky
2022-09-16 13:51                 ` Christian König
2022-09-17  3:31                 ` Zhao, Victor
2022-09-19 15:04                   ` Andrey Grodzovsky
2022-09-20 10:22                     ` Zhao, Victor
2022-09-20 12:22                       ` Christian König
