All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs
@ 2023-03-09  2:27 YuBiao Wang
  2023-03-10 16:56 ` Luben Tuikov
  2023-03-16  3:41 ` Luben Tuikov
  0 siblings, 2 replies; 5+ messages in thread
From: YuBiao Wang @ 2023-03-09  2:27 UTC (permalink / raw)
  To: amd-gfx
  Cc: YuBiao Wang, Feifei Xu, horace.chen, Kevin Wang, Tuikov Luben,
	Deucher Alexander, Evan Quan, Christian König, Monk Liu,
	Hawking Zhang

v2: Add comments to clarify in the code.

[Why]
For engines not supporting soft reset, i.e. VCN, there will be a failed
ib test before mode 1 reset during asic reset. The fences in this case
are never signaled and next time when we try to free the sa_bo, kernel
will hang.

[How]
During pre_asic_reset, driver will clear job fences and afterwards the
fences' refcount will be reduced to 1. For drm_sched_jobs it will be
released in job_free_cb, and for non-sched jobs like ib_test, it's meant
to be released in sa_bo_free but only when the fences are signaled. So
we have to force signal the non_sched bad job's fence during
pre_asic_reset or the clear is not complete.

Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index faff4a3f96e6..ad7c5b70c35a 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
 {
 	int i;
 	struct dma_fence *old, **ptr;
+	struct amdgpu_job *job;
 
 	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
 		ptr = &ring->fence_drv.fences[i];
@@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
 		if (old && old->ops == &amdgpu_job_fence_ops) {
 			RCU_INIT_POINTER(*ptr, NULL);
 			dma_fence_put(old);
+			/* For non-sched bad job, i.e. failed ib test, we need to force
+			 * signal it right here or we won't be able to track them in fence drv
+			 * and they will remain unsignaled during sa_bo free.
+			 */
+			job = container_of(old, struct amdgpu_job, hw_fence);
+			if (!job->base.s_fence && !dma_fence_is_signaled(old))
+				dma_fence_signal(old);
 		}
 	}
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs
  2023-03-09  2:27 [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs YuBiao Wang
@ 2023-03-10 16:56 ` Luben Tuikov
  2023-03-14  6:33   ` Wang, YuBiao
  2023-03-16  3:41 ` Luben Tuikov
  1 sibling, 1 reply; 5+ messages in thread
From: Luben Tuikov @ 2023-03-10 16:56 UTC (permalink / raw)
  To: YuBiao Wang, amd-gfx
  Cc: Feifei Xu, horace.chen, Kevin Wang, Deucher Alexander, Evan Quan,
	Christian König, Monk Liu, Hawking Zhang

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which worries me
is this check here, !job->base.s_fence, which is used here to qualify that we
can signal the fence (and of course that the fence is not yet signalled.) We need
to audit this check to make sure that it is not overloaded to mean other things. I'll
take a look.

>  		}
>  	}
>  }

-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs
  2023-03-10 16:56 ` Luben Tuikov
@ 2023-03-14  6:33   ` Wang, YuBiao
  2023-03-16  3:59     ` Liu, Monk
  0 siblings, 1 reply; 5+ messages in thread
From: Wang, YuBiao @ 2023-03-14  6:33 UTC (permalink / raw)
  To: Tuikov, Luben
  Cc: Wang, Yang(Kevin),
	Xu, Feifei, Chen, Horace, amd-gfx, Deucher, Alexander, Quan,
	Evan, Koenig, Christian, Liu,  Monk, Zhang, Hawking

Hi Luben,

I'd have to ping you because we've got a P1 ticket currently on this issue. Would you please give a vague time when would you confirm whether this patch is safe? Thank you a lot for helping double check this.

Regards & Thanks,
Yubiao 

-----Original Message-----
From: Tuikov, Luben <Luben.Tuikov@amd.com> 
Sent: Saturday, March 11, 2023 12:56 AM
To: Wang, YuBiao <YuBiao.Wang@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan <Evan.Quan@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Yang(Kevin) <KevinYang.Wang@amd.com>
Subject: Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a 
> failed ib test before mode 1 reset during asic reset. The fences in 
> this case are never signaled and next time when we try to free the 
> sa_bo, kernel will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the 
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be 
> released in job_free_cb, and for non-sched jobs like ib_test, it's 
> meant to be released in sa_bo_free but only when the fences are 
> signaled. So we have to force signal the non_sched bad job's fence 
> during pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct 
> amdgpu_ring *ring)  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which worries me is this check here, !job->base.s_fence, which is used here to qualify that we can signal the fence (and of course that the fence is not yet signalled.) We need to audit this check to make sure that it is not overloaded to mean other things. I'll take a look.

>  		}
>  	}
>  }

--
Regards,
Luben


^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs
  2023-03-09  2:27 [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs YuBiao Wang
  2023-03-10 16:56 ` Luben Tuikov
@ 2023-03-16  3:41 ` Luben Tuikov
  1 sibling, 0 replies; 5+ messages in thread
From: Luben Tuikov @ 2023-03-16  3:41 UTC (permalink / raw)
  To: YuBiao Wang, amd-gfx
  Cc: Feifei Xu, horace.chen, Kevin Wang, Deucher Alexander, Evan Quan,
	Christian König, Monk Liu, Hawking Zhang

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a failed
> ib test before mode 1 reset during asic reset. The fences in this case
> are never signaled and next time when we try to free the sa_bo, kernel
> will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be
> released in job_free_cb, and for non-sched jobs like ib_test, it's meant
> to be released in sa_bo_free but only when the fences are signaled. So
> we have to force signal the non_sched bad job's fence during
> pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Hi YuBiao,

Thanks for adding the clarifying comments and sending a v2 of this patch.

Perhaps move the chunk you're adding, to sit before, rather than after,
the statements of the if-conditional. Also move the "job" variable
declaration to be inside the if-conditional, since it is not used
by anything outside it. Something like this, (note a few small fixes to the comment),
	if (old && old->ops == &amdgpu_job_fence_ops) {
		struct amdgpu_job *job;

		/* For non-scheduler bad job, i.e. failed IB test, we need to signal
		 * it right here or we won't be able to track them in fence_drv
		 * and they will remain unsignaled during sa_bo free.
		 */
		job = container_of(old, struct amdgpu_job, hw_fence);
		if (!job->base.s_fence && !dma_fence_is_signaled(old))
			dma_fence_signal(old);
		RCU_INIT_POINTER(*ptr, NULL);
		dma_fence_put(old);
	}
Then, give it a test.
With that change, and upon successful test results, this patch is,
Acked-by: Luben Tuikov <luben.tuikov@amd.com>
-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs
  2023-03-14  6:33   ` Wang, YuBiao
@ 2023-03-16  3:59     ` Liu, Monk
  0 siblings, 0 replies; 5+ messages in thread
From: Liu, Monk @ 2023-03-16  3:59 UTC (permalink / raw)
  To: Wang, YuBiao, Tuikov, Luben
  Cc: Wang, Yang(Kevin),
	Xu, Feifei, Chen, Horace, amd-gfx, Deucher, Alexander, Quan,
	Evan, Koenig, Christian, Zhang, Hawking

[AMD Official Use Only - General]

Hi Luben

Please let us know if you don't have bandwidth to review and we can require other people to review, please be understand that we are in an urgent situation and need this change to go to mainline this week.

Thanks 
-------------------------------------------------------------------
Monk Liu | Cloud GPU & Virtualization Software | AMD
-------------------------------------------------------------------

-----Original Message-----
From: Wang, YuBiao <YuBiao.Wang@amd.com> 
Sent: 2023年3月14日 14:34
To: Tuikov, Luben <Luben.Tuikov@amd.com>
Cc: Quan, Evan <Evan.Quan@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Yang(Kevin) <KevinYang.Wang@amd.com>; amd-gfx@lists.freedesktop.org
Subject: RE: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

Hi Luben,

I'd have to ping you because we've got a P1 ticket currently on this issue. Would you please give a vague time when would you confirm whether this patch is safe? Thank you a lot for helping double check this.

Regards & Thanks,
Yubiao 

-----Original Message-----
From: Tuikov, Luben <Luben.Tuikov@amd.com>
Sent: Saturday, March 11, 2023 12:56 AM
To: Wang, YuBiao <YuBiao.Wang@amd.com>; amd-gfx@lists.freedesktop.org
Cc: Quan, Evan <Evan.Quan@amd.com>; Chen, Horace <Horace.Chen@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Liu, Monk <Monk.Liu@amd.com>; Xu, Feifei <Feifei.Xu@amd.com>; Wang, Yang(Kevin) <KevinYang.Wang@amd.com>
Subject: Re: [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs

On 2023-03-08 21:27, YuBiao Wang wrote:
> v2: Add comments to clarify in the code.
> 
> [Why]
> For engines not supporting soft reset, i.e. VCN, there will be a 
> failed ib test before mode 1 reset during asic reset. The fences in 
> this case are never signaled and next time when we try to free the 
> sa_bo, kernel will hang.
> 
> [How]
> During pre_asic_reset, driver will clear job fences and afterwards the 
> fences' refcount will be reduced to 1. For drm_sched_jobs it will be 
> released in job_free_cb, and for non-sched jobs like ib_test, it's 
> meant to be released in sa_bo_free but only when the fences are 
> signaled. So we have to force signal the non_sched bad job's fence 
> during pre_asic_reset or the clear is not complete.
> 
> Signed-off-by: YuBiao Wang <YuBiao.Wang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index faff4a3f96e6..ad7c5b70c35a 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -673,6 +673,7 @@ void amdgpu_fence_driver_clear_job_fences(struct
> amdgpu_ring *ring)  {
>  	int i;
>  	struct dma_fence *old, **ptr;
> +	struct amdgpu_job *job;
>  
>  	for (i = 0; i <= ring->fence_drv.num_fences_mask; i++) {
>  		ptr = &ring->fence_drv.fences[i];
> @@ -680,6 +681,13 @@ void amdgpu_fence_driver_clear_job_fences(struct amdgpu_ring *ring)
>  		if (old && old->ops == &amdgpu_job_fence_ops) {
>  			RCU_INIT_POINTER(*ptr, NULL);
>  			dma_fence_put(old);
> +			/* For non-sched bad job, i.e. failed ib test, we need to force
> +			 * signal it right here or we won't be able to track them in fence drv
> +			 * and they will remain unsignaled during sa_bo free.
> +			 */
> +			job = container_of(old, struct amdgpu_job, hw_fence);
> +			if (!job->base.s_fence && !dma_fence_is_signaled(old))
> +				dma_fence_signal(old);

Conceptually, I don't mind this patch for what it does. The only thing which worries me is this check here, !job->base.s_fence, which is used here to qualify that we can signal the fence (and of course that the fence is not yet signalled.) We need to audit this check to make sure that it is not overloaded to mean other things. I'll take a look.

>  		}
>  	}
>  }

--
Regards,
Luben

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2023-03-16  3:59 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-09  2:27 [PATCH v2] drm/amdgpu: Force signal hw_fences that are embedded in non-sched jobs YuBiao Wang
2023-03-10 16:56 ` Luben Tuikov
2023-03-14  6:33   ` Wang, YuBiao
2023-03-16  3:59     ` Liu, Monk
2023-03-16  3:41 ` Luben Tuikov

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.