* [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: Jingwen Chen @ 2021-10-22 3:33 UTC
To: amd-gfx; +Cc: monk.liu, horace.chen, Jingwen Chen
[Why]
In advanced TDR mode, the real bad job will be resubmitted twice, and since
drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
is put one more time than other jobs.
[How]
Add a dma_fence_get before resubmitting the job in
amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index 41ce86244144..975f069f6fe8 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
/* clear job's guilty and depend the folowing step to decide the real one */
drm_sched_reset_karma(s_job);
+ /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
+ * to make sure fence is balanced */
+ dma_fence_get(s_job->s_fence->parent);
drm_sched_resubmit_jobs_ext(&ring->sched, 1);
ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
@@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
/* got the hw fence, signal finished fence */
atomic_dec(ring->sched.score);
+ dma_fence_put(s_job->s_fence->parent);
dma_fence_get(&s_job->s_fence->finished);
dma_fence_signal(&s_job->s_fence->finished);
dma_fence_put(&s_job->s_fence->finished);
--
2.30.2
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: JingWen Chen @ 2021-10-22 8:14 UTC
To: Jingwen Chen, amd-gfx
Cc: monk.liu, horace.chen, christian.koenig, Andrey.Grodzovsky
ping
On 2021/10/22 AM11:33, Jingwen Chen wrote:
> [Why]
> In advanced TDR mode, the real bad job will be resubmitted twice, and since
> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
> is put one more time than other jobs.
>
> [How]
> Add a dma_fence_get before resubmitting the job in
> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>
> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
> ---
> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index 41ce86244144..975f069f6fe8 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>
> /* clear job's guilty and depend the folowing step to decide the real one */
> drm_sched_reset_karma(s_job);
> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
> + * to make sure fence is balanced */
> + dma_fence_get(s_job->s_fence->parent);
> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>
> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>
> /* got the hw fence, signal finished fence */
> atomic_dec(ring->sched.score);
> + dma_fence_put(s_job->s_fence->parent);
> dma_fence_get(&s_job->s_fence->finished);
> dma_fence_signal(&s_job->s_fence->finished);
> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: Andrey Grodzovsky @ 2021-10-22 20:41 UTC
To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig
What do you mean by underflow in this case? Do you mean use-after-free
because of an extra dma_fence_put()?
On 2021-10-22 4:14 a.m., JingWen Chen wrote:
> ping
>
> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>> [Why]
>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>> is put one more time than other jobs.
>>
>> [How]
>> Add a dma_fence_get before resubmitting the job in
>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>
>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>> ---
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>> 1 file changed, 4 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> index 41ce86244144..975f069f6fe8 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>
>> /* clear job's guilty and depend the folowing step to decide the real one */
>> drm_sched_reset_karma(s_job);
>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>> + * to make sure fence is balanced */
But that put in drm_sched_resubmit_jobs_ext is for the previous parent
fence: fence = sched->ops->run_job(s_job) returns a new HW fence, and the
put drops the refcount on the old one.
Andrey
>> + dma_fence_get(s_job->s_fence->parent);
>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>
>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>
>> /* got the hw fence, signal finished fence */
>> atomic_dec(ring->sched.score);
>> + dma_fence_put(s_job->s_fence->parent);
>> dma_fence_get(&s_job->s_fence->finished);
>> dma_fence_signal(&s_job->s_fence->finished);
>> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: JingWen Chen @ 2021-10-25 2:56 UTC
To: Andrey Grodzovsky, jingwen.chen2, amd-gfx
Cc: monk.liu, horace.chen, christian.koenig
On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
>
> What do you mean by underflow in this case? Do you mean use-after-free because of an extra dma_fence_put()?
yes
>
> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>> ping
>>
>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>> [Why]
>>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>>> is put one more time than other jobs.
>>>
>>> [How]
>>> Add a dma_fence_get before resubmitting the job in
>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>>
>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>> ---
>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>> 1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> index 41ce86244144..975f069f6fe8 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>> drm_sched_reset_karma(s_job);
>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>> + * to make sure fence is balanced */
>
>
> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>
> Andrey
>
>
Hi Andrey,
If I remember correctly, after we embedded the hw_fence into amdgpu_job, there is no fence replacement in amdgpu_job_run.
>>> + dma_fence_get(s_job->s_fence->parent);
>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>> /* got the hw fence, signal finished fence */
>>> atomic_dec(ring->sched.score);
>>> + dma_fence_put(s_job->s_fence->parent);
>>> dma_fence_get(&s_job->s_fence->finished);
>>> dma_fence_signal(&s_job->s_fence->finished);
>>> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: Andrey Grodzovsky @ 2021-10-25 15:18 UTC
To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig
On 2021-10-24 10:56 p.m., JingWen Chen wrote:
> On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
>> What do you mean by underflow in this case? Do you mean use-after-free because of an extra dma_fence_put()?
> yes
Then maybe update the description because 'underflow' is very confusing
>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>> ping
>>>
>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>> [Why]
>>>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>>>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>>>> is put one more time than other jobs.
>>>>
>>>> [How]
>>>> Add a dma_fence_get before resubmitting the job in
>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>>>
>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>> ---
>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>> 1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> index 41ce86244144..975f069f6fe8 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>>> drm_sched_reset_karma(s_job);
>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>> + * to make sure fence is balanced */
>>
>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there is no fence replacement in amdgpu_job_run.
Right, I forgot that... What about removing line
https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
What if you make the dma_fence_get unconditional instead?
Andrey
>
>>>> + dma_fence_get(s_job->s_fence->parent);
>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>> /* got the hw fence, signal finished fence */
>>>> atomic_dec(ring->sched.score);
>>>> + dma_fence_put(s_job->s_fence->parent);
>>>> dma_fence_get(&s_job->s_fence->finished);
>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: JingWen Chen @ 2021-10-26 2:57 UTC
To: Andrey Grodzovsky, jingwen.chen2, amd-gfx
Cc: monk.liu, horace.chen, christian.koenig
On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
>
> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>> On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
>>> What do you mean by underflow in this case? Do you mean use-after-free because of an extra dma_fence_put()?
>> yes
>
>
> Then maybe update the description because 'underflow' is very confusing
>
will do
>
>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>> ping
>>>>
>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>> [Why]
>>>>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>>>>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>>>>> is put one more time than other jobs.
>>>>>
>>>>> [How]
>>>>> Add a dma_fence_get before resubmitting the job in
>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>>>>
>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>> ---
>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>> 1 file changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>>>> drm_sched_reset_karma(s_job);
>>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>> + * to make sure fence is balanced */
>>>
>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>
>>> Andrey
>>>
>>>
>> Hi Andrey,
>>
>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there is no fence replacement in amdgpu_job_run.
>
>
> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
> What if you make the dma_fence_get unconditional instead?
>
> Andrey
>
>
Hi Andrey,
I have tried this, and it causes normal jobs to never be freed (they lack a dma_fence_put). I have traced all the get/put calls
for sched_jobs, and only the bad job lacks a dma_fence_get; the other jobs are just fine.
>>
>>>>> + dma_fence_get(s_job->s_fence->parent);
>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>> /* got the hw fence, signal finished fence */
>>>>> atomic_dec(ring->sched.score);
>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: Andrey Grodzovsky @ 2021-10-27 19:43 UTC
To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig
On 2021-10-25 10:57 p.m., JingWen Chen wrote:
> On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>> On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
>>>> What do you mean by underflow in this case? Do you mean use-after-free because of an extra dma_fence_put()?
>>> yes
>>
>> Then maybe update the description because 'underflow' is very confusing
>>
> will do
>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>> ping
>>>>>
>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>> [Why]
>>>>>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>>>>>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>>>>>> is put one more time than other jobs.
>>>>>>
>>>>>> [How]
>>>>>> Add a dma_fence_get before resubmitting the job in
>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>>>>>
>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>> ---
>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>> 1 file changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>> drm_sched_reset_karma(s_job);
>>>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>> + * to make sure fence is balanced */
>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>
>>>> Andrey
>>>>
>>>>
>>> Hi Andrey,
>>>
>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there is no fence replacement in amdgpu_job_run.
>>
>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>> What if you make the dma_fence_get unconditional instead?
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> I have tried this, and it causes normal jobs to never be freed (they lack a dma_fence_put).
I can't see it - can you point me to where in that case you get an unbalanced
refcount? As far as I can see, for a normal job
being run in amdgpu_device_recheck_guilty_jobs, the refcount on the hw_fence
is:
drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decreases by 1
drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get -> refcount increases by 1
In total the refcount hasn't changed up to this point.
Next, dma_fence_wait_timeout completes successfully because the job is
normal, and then you delete that job from the pending list and call the
free_job cb, which drops the remaining refcounts on the hw_fence.
I am probably missing some dma_fence_get since you checked this on a
device, but I wonder where my mistake is?
Andrey
> I have traced all the get/put calls
>
> for sched_jobs, and only the bad job lacks a dma_fence_get; the other jobs are just fine.
>
>>>>>> + dma_fence_get(s_job->s_fence->parent);
>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>> /* got the hw fence, signal finished fence */
>>>>>> atomic_dec(ring->sched.score);
>>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>>> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: JingWen Chen @ 2021-10-28 2:43 UTC
To: Andrey Grodzovsky, jingwen.chen2, amd-gfx
Cc: monk.liu, horace.chen, christian.koenig
On 2021/10/28 3:43 AM, Andrey Grodzovsky wrote:
>
> On 2021-10-25 10:57 p.m., JingWen Chen wrote:
>> On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
>>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>>> On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
>>>>> What do you mean by underflow in this case? Do you mean use-after-free because of an extra dma_fence_put()?
>>>> yes
>>>
>>> Then maybe update the description because 'underflow' is very confusing
>>>
>> will do
>>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>>> ping
>>>>>>
>>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>>> [Why]
>>>>>>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>>>>>>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>>>>>>> is put one more time than other jobs.
>>>>>>>
>>>>>>> [How]
>>>>>>> Add a dma_fence_get before resubmitting the job in
>>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>>>>>>
>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>> ---
>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>>> 1 file changed, 4 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>>> drm_sched_reset_karma(s_job);
>>>>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>>> + * to make sure fence is balanced */
>>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>>
>>>>> Andrey
>>>>>
>>>>>
>>>> Hi Andrey,
>>>>
>>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there is no fence replacement in amdgpu_job_run.
>>>
>>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>>> What if you make the dma_fence_get unconditional instead?
>>>
>>> Andrey
>>>
>>>
>> Hi Andrey,
>>
>> I have tried this, and it causes normal jobs to never be freed (they lack a dma_fence_put).
>
>
> I can't see it - can you point me to where in that case you get an unbalanced refcount? As far as I can see, for a normal job
> being run in amdgpu_device_recheck_guilty_jobs, the refcount on the hw_fence is:
>
> drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decreases by 1
> drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get -> refcount increases by 1
>
> In total the refcount hasn't changed up to this point.
>
> Next, dma_fence_wait_timeout completes successfully because the job is normal, and then you delete that job from the pending list and call the
> free_job cb, which drops the remaining refcounts on the hw_fence.
>
> I am probably missing some dma_fence_get since you checked this on a device, but I wonder where my mistake is?
>
> Andrey
>
>
Hi Andrey,
The thing is that the put/get is balanced right now for normal jobs in TDR. Making this dma_fence_get unconditional simply adds one dma_fence_get, but there's no corresponding dma_fence_put for normal jobs.
And in case it helps, I have traced all the dma_fence_get/put calls for a normal job in advanced TDR, based on the latest drm-next:
amdgpu_fence_emit -> dma_fence_init ref_count = 1
amdgpu_fence_emit -> add into rcu ref_count = 2
amdgpu_job_run->get after ib_schedule ref_count = 3
drm_sched_main-> add fence callback get ref_count = 4
drm_sched_main-> add fence callback put ref_count = 3
drm_sched_resubmit_jobs_ext ref_count = 2
amdgpu_fence_emit -> add into rcu ref_count = 3
amdgpu_fence_process-> put after signal ref_count = 2
drm_sched_main->free_job ref_count = 1
drm_sched_fence_release_finished ref_count = 0
If we do an unconditional get, this sequence turns into:
amdgpu_fence_emit -> dma_fence_init ref_count = 1
amdgpu_fence_emit -> add into rcu ref_count = 2
amdgpu_job_run->get after ib_schedule ref_count = 3
drm_sched_main-> add fence callback get ref_count = 4
drm_sched_main-> add fence callback put ref_count = 3
drm_sched_resubmit_jobs_ext ref_count = 2
amdgpu_fence_emit -> add into rcu ref_count = 3
+ amdgpu_job_run->get after ib_schedule ref_count = 4
amdgpu_fence_process-> put after signal ref_count = 3
drm_sched_main->free_job ref_count = 2
drm_sched_fence_release_finished ref_count = 1
>
>> I have traced all the get/put calls
>>
>> for sched_jobs, and only the bad job lacks a dma_fence_get; the other jobs are just fine.
>>
>>>>>>> + dma_fence_get(s_job->s_fence->parent);
>>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>> /* got the hw fence, signal finished fence */
>>>>>>> atomic_dec(ring->sched.score);
>>>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>>>> dma_fence_put(&s_job->s_fence->finished);
* Re: [PATCH] drm/amd/amdgpu: fix potential bad job hw_fence underflow
From: Andrey Grodzovsky @ 2021-10-28 14:46 UTC
To: jingwen.chen2, amd-gfx; +Cc: monk.liu, horace.chen, christian.koenig
On 2021-10-27 10:43 p.m., JingWen Chen wrote:
> On 2021/10/28 3:43 AM, Andrey Grodzovsky wrote:
>> On 2021-10-25 10:57 p.m., JingWen Chen wrote:
>>> On 2021/10/25 11:18 PM, Andrey Grodzovsky wrote:
>>>> On 2021-10-24 10:56 p.m., JingWen Chen wrote:
>>>>> On 2021/10/23 4:41 AM, Andrey Grodzovsky wrote:
>>>>>> What do you mean by underflow in this case? Do you mean use-after-free because of an extra dma_fence_put()?
>>>>> yes
>>>> Then maybe update the description because 'underflow' is very confusing
>>>>
>>> will do
>>>>>> On 2021-10-22 4:14 a.m., JingWen Chen wrote:
>>>>>>> ping
>>>>>>>
>>>>>>> On 2021/10/22 AM11:33, Jingwen Chen wrote:
>>>>>>>> [Why]
>>>>>>>> In advanced TDR mode, the real bad job will be resubmitted twice, and since
>>>>>>>> drm_sched_resubmit_jobs_ext contains a dma_fence_put, the bad job
>>>>>>>> is put one more time than other jobs.
>>>>>>>>
>>>>>>>> [How]
>>>>>>>> Add a dma_fence_get before resubmitting the job in
>>>>>>>> amdgpu_device_recheck_guilty_jobs and put the fence for normal jobs.
>>>>>>>>
>>>>>>>> Signed-off-by: Jingwen Chen <Jingwen.Chen2@amd.com>
>>>>>>>> ---
>>>>>>>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 4 ++++
>>>>>>>> 1 file changed, 4 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> index 41ce86244144..975f069f6fe8 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
>>>>>>>> @@ -4841,6 +4841,9 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>>> /* clear job's guilty and depend the folowing step to decide the real one */
>>>>>>>> drm_sched_reset_karma(s_job);
>>>>>>>> + /* for the real bad job, it will be resubmitted twice, adding a dma_fence_get
>>>>>>>> + * to make sure fence is balanced */
>>>>>> But that put in drm_sched_resubmit_jobs_ext is for the previous parent fence.
>>>>>> fence = sched->ops->run_job(s_job); returns a new HW fence and the put drops the refcount on the old one.
>>>>>>
>>>>>> Andrey
>>>>>>
>>>>>>
>>>>> Hi Andrey,
>>>>>
>>>>> If I remember correctly, after we embedded the hw_fence into amdgpu_job, there is no fence replacement in amdgpu_job_run.
>>>> Right, I forgot that... What about removing line https://elixir.bootlin.com/linux/v5.15-rc6/source/drivers/gpu/drm/amd/amdgpu/amdgpu_job.c#L265 ?
>>>> What if you make the dma_fence_get unconditional instead?
>>>>
>>>> Andrey
>>>>
>>>>
>>> Hi Andrey,
>>>
>>> I have tried this, and it causes normal jobs to never be freed (they lack a dma_fence_put).
>>
>> I can't see it - can you point me to where in that case you get an unbalanced refcount? As far as I can see, for a normal job
>> being run in amdgpu_device_recheck_guilty_jobs, the refcount on the hw_fence is:
>>
>> drm_sched_resubmit_jobs_ext->dma_fence_put -> refcount decreases by 1
>> drm_sched_resubmit_jobs_ext->amdgpu_job_run->dma_fence_get -> refcount increases by 1
>>
>> In total the refcount hasn't changed up to this point.
>>
>> Next, dma_fence_wait_timeout completes successfully because the job is normal, and then you delete that job from the pending list and call the
>> free_job cb, which drops the remaining refcounts on the hw_fence.
>>
>> I am probably missing some dma_fence_get since you checked this on a device, but I wonder where my mistake is?
>>
>> Andrey
>>
>>
> Hi Andrey,
>
> The thing is that the put/get is balanced right now for normal jobs in TDR. Making this dma_fence_get unconditional simply adds one dma_fence_get, but there's no corresponding dma_fence_put for normal jobs.
>
> And in case it helps, I have traced all the dma_fence_get/put calls for a normal job in advanced TDR, based on the latest drm-next:
>
> amdgpu_fence_emit -> dma_fence_init ref_count = 1
> amdgpu_fence_emit -> add into rcu ref_count = 2
> amdgpu_job_run->get after ib_schedule ref_count = 3
> drm_sched_main-> add fence callback get ref_count = 4
> drm_sched_main-> add fence callback put ref_count = 3
> drm_sched_resubmit_jobs_ext ref_count = 2
> amdgpu_fence_emit -> add into rcu ref_count = 3
Now I see that the put in drm_sched_resubmit_jobs_ext drops the refcount
taken by the previous
'amdgpu_fence_emit -> add into rcu' get... This is all very convoluted
and confusing. It probably requires some
rework to make the code clearer, but for now we need the bug fixed, so
with the title changed
the patch is Reviewed-by: Andrey Grodzovsky <andrey.grodzovsky@amd.com>
Andrey
> amdgpu_fence_process-> put after signal ref_count = 2
> drm_sched_main->free_job ref_count = 1
> drm_sched_fence_release_finished ref_count = 0
>
> If we do an unconditional get, this sequence turns into:
>
> amdgpu_fence_emit -> dma_fence_init ref_count = 1
> amdgpu_fence_emit -> add into rcu ref_count = 2
> amdgpu_job_run->get after ib_schedule ref_count = 3
> drm_sched_main-> add fence callback get ref_count = 4
> drm_sched_main-> add fence callback put ref_count = 3
> drm_sched_resubmit_jobs_ext ref_count = 2
> amdgpu_fence_emit -> add into rcu ref_count = 3
> + amdgpu_job_run->get after ib_schedule ref_count = 4
> amdgpu_fence_process-> put after signal ref_count = 3
> drm_sched_main->free_job ref_count = 2
> drm_sched_fence_release_finished ref_count = 1
>>> I have traced all the get/put calls
>>>
>>> for sched_jobs, and only the bad job lacks a dma_fence_get; the other jobs are just fine.
>>>
>>>>>>>> + dma_fence_get(s_job->s_fence->parent);
>>>>>>>> drm_sched_resubmit_jobs_ext(&ring->sched, 1);
>>>>>>>> ret = dma_fence_wait_timeout(s_job->s_fence->parent, false, ring->sched.timeout);
>>>>>>>> @@ -4876,6 +4879,7 @@ static void amdgpu_device_recheck_guilty_jobs(
>>>>>>>> /* got the hw fence, signal finished fence */
>>>>>>>> atomic_dec(ring->sched.score);
>>>>>>>> + dma_fence_put(s_job->s_fence->parent);
>>>>>>>> dma_fence_get(&s_job->s_fence->finished);
>>>>>>>> dma_fence_signal(&s_job->s_fence->finished);
>>>>>>>> dma_fence_put(&s_job->s_fence->finished);