All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
@ 2021-10-22  3:33 Jingwen Chen
  2021-10-22  8:14 ` JingWen Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Jingwen Chen @ 2021-10-22  3:33 UTC (permalink / raw)
  To: amd-gfx; +Cc: monk.liu, horace.chen, Jingwen Chen

[Why]
In advance tdr mode, the real bad job will be resubmitted twice, while
in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
is put one more time than other jobs.

[How]
Adding dma_fence_get before resbumit job in
amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs

Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..975f069f6fe8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
 
 		/* clear job's guilty and depend the folowing step to decide the real one */
 		drm_sched_reset_karma(s_job);
+		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
+		 * to make sure fence is balanced */
+		dma_fence_get(s_job->s_fence->parent);
 		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
 
 		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
@@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
 
 		/* got the hw fence, signal finished fence */
 		atomic_dec(ring->sched.score);
+		dma_fence_put(s_job->s_fence->parent);
 		dma_fence_get(&s_job->s_fence->finished);
 		dma_fence_signal(&s_job->s_fence->finished);
 		dma_fence_put(&s_job->s_fence->finished);
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-22  3:33 [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow Jingwen Chen
@ 2021-10-22  8:14 ` JingWen Chen
  2021-10-22 20:41   ` Andrey Grodzovsky
  0 siblings, 1 reply; 9+ messages in thread
From: JingWen Chen @ 2021-10-22  8:14 UTC (permalink / raw)
  To: Jingwen Chen, amd-gfx
  Cc: monk.liu, horace.chen, christian.koenig, Andrey.Grodzovsky

ping

On 2021/10/22 AM11:33, Jingwen Chen wrote:
> [Why]
> In advance tdr mode, the real bad job will be resubmitted twice, while
> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
> is put one more time than other jobs.
>
> [How]
> Adding dma_fence_get before resbumit job in
> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 41ce86244144..975f069f6fe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>  
>  		/* clear job's guilty and depend the folowing step to decide the real one */
>  		drm_sched_reset_karma(s_job);
> +		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
> +		 * to make sure fence is balanced */
> +		dma_fence_get(s_job->s_fence->parent);
>  		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>  
>  		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>  
>  		/* got the hw fence, signal finished fence */
>  		atomic_dec(ring->sched.score);
> +		dma_fence_put(s_job->s_fence->parent);
>  		dma_fence_get(&s_job->s_fence->finished);
>  		dma_fence_signal(&s_job->s_fence->finished);
>  		dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-22  8:14 ` JingWen Chen
@ 2021-10-22 20:41   ` Andrey Grodzovsky
  2021-10-25  2:56     ` JingWen Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-10-22 20:41 UTC (permalink / raw)
  To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig


What do you mean by underflow in this case ? You mean use after free 
because of extra dma_fence_put() ?

On 2021-10-22 4:14 a.m., JingWen Chen wrote:
> ping
>
> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>> [Why]
>> In advance tdr mode, the real bad job will be resubmitted twice, while
>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>> is put one more time than other jobs.
>>
>> [How]
>> Adding dma_fence_get before resbumit job in
>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 41ce86244144..975f069f6fe8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>   
>>   		/* clear job's guilty and depend the folowing step to decide the real one */
>>   		drm_sched_reset_karma(s_job);
>> +		/* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>> +		 * to make sure fence is balanced */


But that put in drm_sched_resubmit_jobs_ext is for the previous parent 
fence.
fence = sched->ops->run_job(s_job); returns a new HW fence and the put 
drops the refcount on the old one.

Andrey


>> +		dma_fence_get(s_job->s_fence->parent);
>>   		drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>   
>>   		ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>   
>>   		/* got the hw fence, signal finished fence */
>>   		atomic_dec(ring->sched.score);
>> +		dma_fence_put(s_job->s_fence->parent);
>>   		dma_fence_get(&s_job->s_fence->finished);
>>   		dma_fence_signal(&s_job->s_fence->finished);
>>   		dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-22 20:41   ` Andrey Grodzovsky
@ 2021-10-25  2:56     ` JingWen Chen
  2021-10-25 15:18       ` Andrey Grodzovsky
  0 siblings, 1 reply; 9+ messages in thread
From: JingWen Chen @ 2021-10-25  2:56 UTC (permalink / raw)
  To: Andrey Grodzovsky, jingwen.chen2, amd-gfx
  Cc: monk.liu, horace.chen, christian.koenig


On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>
> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
yes
>
> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>> ping
>>
>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>> [Why]
>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>> is put one more time than other jobs.
>>>
>>> [How]
>>> Adding dma_fence_get before resbumit job in
>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>
>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>> ---
>>>   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>   1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 41ce86244144..975f069f6fe8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>             /* clear job's guilty and depend the folowing step to decide the real one */
>>>           drm_sched_reset_karma(s_job);
>>> +        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>> +         * to make sure fence is balanced */
>
>
> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>
> Andrey
>
>
Hi Andrey,

If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.

>>> +        dma_fence_get(s_job->s_fence->parent);
>>>           drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>             ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>             /* got the hw fence, signal finished fence */
>>>           atomic_dec(ring->sched.score);
>>> +        dma_fence_put(s_job->s_fence->parent);
>>>           dma_fence_get(&s_job->s_fence->finished);
>>>           dma_fence_signal(&s_job->s_fence->finished);
>>>           dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-25  2:56     ` JingWen Chen
@ 2021-10-25 15:18       ` Andrey Grodzovsky
  2021-10-26  2:57         ` JingWen Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-10-25 15:18 UTC (permalink / raw)
  To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig


On 2021-10-24 10:56 p.m., JingWen Chen wrote:
> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
> yes


Then maybe update the description  because 'underflow' is very confusing


>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>> ping
>>>
>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>> [Why]
>>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>>> is put one more time than other jobs.
>>>>
>>>> [How]
>>>> Adding dma_fence_get before resbumit job in
>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>>
>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>    1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 41ce86244144..975f069f6fe8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>              /* clear job's guilty and depend the folowing step to decide the real one */
>>>>            drm_sched_reset_karma(s_job);
>>>> +        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>> +         * to make sure fence is balanced */
>>
>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.


Right, I forgot that... What about removing line 
https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 
?
What if you make dma_get_fence unconditional instead ?

Andrey


>
>>>> +        dma_fence_get(s_job->s_fence->parent);
>>>>            drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>              ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>              /* got the hw fence, signal finished fence */
>>>>            atomic_dec(ring->sched.score);
>>>> +        dma_fence_put(s_job->s_fence->parent);
>>>>            dma_fence_get(&s_job->s_fence->finished);
>>>>            dma_fence_signal(&s_job->s_fence->finished);
>>>>            dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-25 15:18       ` Andrey Grodzovsky
@ 2021-10-26  2:57         ` JingWen Chen
  2021-10-27 19:43           ` Andrey Grodzovsky
  0 siblings, 1 reply; 9+ messages in thread
From: JingWen Chen @ 2021-10-26  2:57 UTC (permalink / raw)
  To: Andrey Grodzovsky, jingwen.chen2, amd-gfx
  Cc: monk.liu, horace.chen, christian.koenig


On 2021/10/25 下午11:18, Andrey Grodzovsky wrote:
>
> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
>> yes
>
>
> Then maybe update the description  because 'underflow' is very confusing
>
will do
>
>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>> ping
>>>>
>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>> [Why]
>>>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>>>> is put one more time than other jobs.
>>>>>
>>>>> [How]
>>>>> Adding dma_fence_get before resbumit job in
>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>>>
>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>    1 file changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>              /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>            drm_sched_reset_karma(s_job);
>>>>> +        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>> +         * to make sure fence is balanced */
>>>
>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>
>>> Andrey
>>>
>>>
>> Hi Andrey,
>>
>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.
>
>
> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
> What if you make dma_get_fence unconditional instead ?
>
> Andrey
>
>
Hi Andrey,

I have tried this and this will cause normal jobs cannot be free(lacks a dma_fence_put). I have figured out all the get/put

for sched_jobs and only the bad job lacks a dma_fence_get, other jobs are just fine.

>>
>>>>> +        dma_fence_get(s_job->s_fence->parent);
>>>>>            drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>              ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>              /* got the hw fence, signal finished fence */
>>>>>            atomic_dec(ring->sched.score);
>>>>> +        dma_fence_put(s_job->s_fence->parent);
>>>>>            dma_fence_get(&s_job->s_fence->finished);
>>>>>            dma_fence_signal(&s_job->s_fence->finished);
>>>>>            dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-26  2:57         ` JingWen Chen
@ 2021-10-27 19:43           ` Andrey Grodzovsky
  2021-10-28  2:43             ` JingWen Chen
  0 siblings, 1 reply; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-10-27 19:43 UTC (permalink / raw)
  To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig


On 2021-10-25 10:57 p.m., JingWen Chen wrote:
> On 2021/10/25 下午11:18, Andrey Grodzovsky wrote:
>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>>>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
>>> yes
>>
>> Then maybe update the description  because 'underflow' is very confusing
>>
> will do
>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>> ping
>>>>>
>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>> [Why]
>>>>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>>>>> is put one more time than other jobs.
>>>>>>
>>>>>> [How]
>>>>>> Adding dma_fence_get before resbumit job in
>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>>>>
>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>> ---
>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>>     1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>               /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>>             drm_sched_reset_karma(s_job);
>>>>>> +        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>> +         * to make sure fence is balanced */
>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>
>>>> Andrey
>>>>
>>>>
>>> Hi Andrey,
>>>
>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.
>>
>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>> What if you make dma_get_fence unconditional instead ?
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> I have tried this and this will cause normal jobs cannot be free(lacks a dma_fence_put).


I can't see it  - can you point me where in that case you get unbalanced 
refcount ? As far as I see for a a normal job
being ran in amdgpu_device_recheck_guilty_jobs the refcount on hw_fence 
is  -

drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decrease by 1
drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get increase by 1

In total refcount didn't change until now

Next,  dma_fence_wait_timeout completed successfully because the job is 
normal and then you delete that job from pending list and call the
free_job cb which drops remaining refcounts on the hw_fence.

I am probably missing some  dma_fence_get since you checked it on a 
device but I wonder where is my mistake ?

Andrey



> I have figured out all the get/put
>
> for sched_jobs and only the bad job lacks a dma_fence_get, other jobs are just fine.
>
>>>>>> +        dma_fence_get(s_job->s_fence->parent);
>>>>>>             drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>               ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>               /* got the hw fence, signal finished fence */
>>>>>>             atomic_dec(ring->sched.score);
>>>>>> +        dma_fence_put(s_job->s_fence->parent);
>>>>>>             dma_fence_get(&s_job->s_fence->finished);
>>>>>>             dma_fence_signal(&s_job->s_fence->finished);
>>>>>>             dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-27 19:43           ` Andrey Grodzovsky
@ 2021-10-28  2:43             ` JingWen Chen
  2021-10-28 14:46               ` Andrey Grodzovsky
  0 siblings, 1 reply; 9+ messages in thread
From: JingWen Chen @ 2021-10-28  2:43 UTC (permalink / raw)
  To: Andrey Grodzovsky, jingwen.chen2, amd-gfx
  Cc: monk.liu, horace.chen, christian.koenig


On 2021/10/28 上午3:43, Andrey Grodzovsky wrote:
>
> On 2021-10-25 10:57 p.m., JingWen Chen wrote:
>> On 2021/10/25 下午11:18, Andrey Grodzovsky wrote:
>>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>>> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>>>>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
>>>> yes
>>>
>>> Then maybe update the description  because 'underflow' is very confusing
>>>
>> will do
>>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>>> ping
>>>>>>
>>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>>> [Why]
>>>>>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>>>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>>>>>> is put one more time than other jobs.
>>>>>>>
>>>>>>> [How]
>>>>>>> Adding dma_fence_get before resbumit job in
>>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>>>>>
>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>> ---
>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>>>     1 file changed, 4 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>>               /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>>>             drm_sched_reset_karma(s_job);
>>>>>>> +        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>>> +         * to make sure fence is balanced */
>>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>> Hi Andrey,
>>>>
>>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.
>>>
>>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>>> What if you make dma_get_fence unconditional instead ?
>>>
>>> Andrey
>>>
>>>
>> Hi Andrey,
>>
>> I have tried this and this will cause normal jobs cannot be free(lacks a dma_fence_put).
>
>
> I can't see it  - can you point me where in that case you get unbalanced refcount ? As far as I see for a a normal job
> being ran in amdgpu_device_recheck_guilty_jobs the refcount on hw_fence is  -
>
> drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decrease by 1
> drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get increase by 1
>
> In total refcount didn't change until now
>
> Next,  dma_fence_wait_timeout completed successfully because the job is normal and then you delete that job from pending list and call the
> free_job cb which drops remaining refcounts on the hw_fence.
>
> I am probably missing some  dma_fence_get since you checked it on a device but I wonder where is my mistake ?
>
> Andrey
>
>
Hi Andrey,

The thing is the put/get is balanced right now for normal jobs in TDR. Changing this dma_fence_get to unconditional simply adds 1 dma_fence_get but there's no corresponding dma_fence_put for normal jobs.

And if this can be helpful, I try to find all dma_fence_get/put for a normal job in advance TDR based on the latest drm-next.

amdgpu_fence_emit -> dma_fence_init    ref_count = 1​
amdgpu_fence_emit -> add into rcu    ref_count = 2​
amdgpu_job_run->get after ib_schedule    ref_count = 3​
drm_sched_main-> add fence callback get    ref_count = 4​
drm_sched_main-> add fence callback put    ref_count = 3​
drm_sched_resubmit_jobs_ext    ref_count = 2
amdgpu_fence_emit -> add into rcu    ref_count = 3​
amdgpu_fence_process-> put after signal    ref_count = 2​
drm_sched_main->free_job    ref_count = 1​
drm_sched_fence_release_finished    ref_count = 0​

If we do unconditional get, this sequence will turn into:

amdgpu_fence_emit -> dma_fence_init    ref_count = 1​
amdgpu_fence_emit -> add into rcu    ref_count = 2​
amdgpu_job_run->get after ib_schedule    ref_count = 3​
drm_sched_main-> add fence callback get    ref_count = 4​
drm_sched_main-> add fence callback put    ref_count = 3​
drm_sched_resubmit_jobs_ext    ref_count = 2
amdgpu_fence_emit -> add into rcu    ref_count = 3​
+  amdgpu_job_run->get after ib_schedule    ref_count = 4
amdgpu_fence_process-> put after signal    ref_count = 3
drm_sched_main->free_job    ref_count = 2
drm_sched_fence_release_finished    ref_count = 1
>
>> I have figured out all the get/put
>>
>> for sched_jobs and only the bad job lacks a dma_fence_get, other jobs are just fine.
>>
>>>>>>> +        dma_fence_get(s_job->s_fence->parent);
>>>>>>>             drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>>               ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>>               /* got the hw fence, signal finished fence */
>>>>>>>             atomic_dec(ring->sched.score);
>>>>>>> +        dma_fence_put(s_job->s_fence->parent);
>>>>>>>             dma_fence_get(&s_job->s_fence->finished);
>>>>>>>             dma_fence_signal(&s_job->s_fence->finished);
>>>>>>>             dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
  2021-10-28  2:43             ` JingWen Chen
@ 2021-10-28 14:46               ` Andrey Grodzovsky
  0 siblings, 0 replies; 9+ messages in thread
From: Andrey Grodzovsky @ 2021-10-28 14:46 UTC (permalink / raw)
  To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig


On 2021-10-27 10:43 p.m., JingWen Chen wrote:
> On 2021/10/28 上午3:43, Andrey Grodzovsky wrote:
>> On 2021-10-25 10:57 p.m., JingWen Chen wrote:
>>> On 2021/10/25 下午11:18, Andrey Grodzovsky wrote:
>>>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>>>> On 2021/10/23 上午4:41, Andrey Grodzovsky wrote:
>>>>>> What do you mean by underflow in this case ? You mean use after free because of extra dma_fence_put() ?
>>>>> yes
>>>> Then maybe update the description  because 'underflow' is very confusing
>>>>
>>> will do
>>>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>>>> ping
>>>>>>>
>>>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>>>> [Why]
>>>>>>>> In advance tdr mode, the real bad job will be resubmitted twice, while
>>>>>>>> in drm_sched_resubmit_jobs_ext, there's a dma_fence_put, so the bad job
>>>>>>>> is put one more time than other jobs.
>>>>>>>>
>>>>>>>> [How]
>>>>>>>> Adding dma_fence_get before resbumit job in
>>>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs
>>>>>>>>
>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>> ---
>>>>>>>>      drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>>>>      1 file changed, 4 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>>>                /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>>>>              drm_sched_reset_karma(s_job);
>>>>>>>> +        /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>>>> +         * to make sure fence is balanced */
>>>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there will be not fence replacement in amdgpu_job_run.
>>>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>>>> What if you make dma_get_fence unconditional instead ?
>>>>
>>>> Andrey
>>>>
>>>>
>>> Hi Andrey,
>>>
>>> I have tried this and this will cause normal jobs cannot be free(lacks a dma_fence_put).
>>
>> I can't see it  - can you point me where in that case you get unbalanced refcount ? As far as I see for a a normal job
>> being ran in amdgpu_device_recheck_guilty_jobs the refcount on hw_fence is  -
>>
>> drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decrease by 1
>> drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get increase by 1
>>
>> In total refcount didn't change until now
>>
>> Next,  dma_fence_wait_timeout completed successfully because the job is normal and then you delete that job from pending list and call the
>> free_job cb which drops remaining refcounts on the hw_fence.
>>
>> I am probably missing some  dma_fence_get since you checked it on a device but I wonder where is my mistake ?
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> The thing is the put/get is balanced right now for normal jobs in TDR. Changing this dma_fence_get to unconditional simply adds 1 dma_fence_get but there's no corresponding dma_fence_put for normal jobs.
>
> And if this can be helpful, I try to find all dma_fence_get/put for a normal job in advance TDR based on the latest drm-next.
>
> amdgpu_fence_emit -> dma_fence_init    ref_count = 1​
> amdgpu_fence_emit -> add into rcu    ref_count = 2​
> amdgpu_job_run->get after ib_schedule    ref_count = 3​
> drm_sched_main-> add fence callback get    ref_count = 4​
> drm_sched_main-> add fence callback put    ref_count = 3​
> drm_sched_resubmit_jobs_ext    ref_count = 2
> amdgpu_fence_emit -> add into rcu    ref_count = 3​


Now I see that that put in drm_sched_resubmit_jobs_ext is for dropping 
the refcount for the previous
'amdgpu_fence_emit -> add into rcu' get... This is all very convoluted 
and confusing. Probably requires some
rework to make the code more clear but for now we need the bug fixed so 
with the title changed
the patch is Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>

Andrey



> amdgpu_fence_process-> put after signal    ref_count = 2​
> drm_sched_main->free_job    ref_count = 1​
> drm_sched_fence_release_finished    ref_count = 0​
>
> If we do unconditional get, this sequence will turn into:
>
> amdgpu_fence_emit -> dma_fence_init    ref_count = 1​
> amdgpu_fence_emit -> add into rcu    ref_count = 2​
> amdgpu_job_run->get after ib_schedule    ref_count = 3​
> drm_sched_main-> add fence callback get    ref_count = 4​
> drm_sched_main-> add fence callback put    ref_count = 3​
> drm_sched_resubmit_jobs_ext    ref_count = 2
> amdgpu_fence_emit -> add into rcu    ref_count = 3​
> +  amdgpu_job_run->get after ib_schedule    ref_count = 4
> amdgpu_fence_process-> put after signal    ref_count = 3
> drm_sched_main->free_job    ref_count = 2
> drm_sched_fence_release_finished    ref_count = 1
>>> I have figured out all the get/put
>>>
>>> for sched_jobs and only the bad job lacks a dma_fence_get, other jobs are just fine.
>>>
>>>>>>>> +        dma_fence_get(s_job->s_fence->parent);
>>>>>>>>              drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>>>                ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>>>                /* got the hw fence, signal finished fence */
>>>>>>>>              atomic_dec(ring->sched.score);
>>>>>>>> +        dma_fence_put(s_job->s_fence->parent);
>>>>>>>>              dma_fence_get(&s_job->s_fence->finished);
>>>>>>>>              dma_fence_signal(&s_job->s_fence->finished);
>>>>>>>>              dma_fence_put(&s_job->s_fence->finished);

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2021-10-28 14:46 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-10-22  3:33 [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow Jingwen Chen
2021-10-22  8:14 ` JingWen Chen
2021-10-22 20:41   ` Andrey Grodzovsky
2021-10-25  2:56     ` JingWen Chen
2021-10-25 15:18       ` Andrey Grodzovsky
2021-10-26  2:57         ` JingWen Chen
2021-10-27 19:43           ` Andrey Grodzovsky
2021-10-28  2:43             ` JingWen Chen
2021-10-28 14:46               ` Andrey Grodzovsky

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.