All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: Fix a race of IB test
@ 2021-09-11  1:55 xinhui pan
  2021-09-11  7:45 ` Christian König
  0 siblings, 1 reply; 6+ messages in thread
From: xinhui pan @ 2021-09-11  1:55 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, christian.koenig, xinhui pan

Direct IB submission should be exclusive. So use write lock.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 19323b4cce7b..acbe02928791 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 	}
 
 	/* Avoid accidently unparking the sched thread during GPU reset */
-	r = down_read_killable(&adev->reset_sem);
+	r = down_write_killable(&adev->reset_sem);
 	if (r)
 		return r;
 
+	/* Avoid concurrently IB test but not cancel it as I don't know whether we
+	 * would add more code in the delayed init work.
+	 */
+	flush_delayed_work(&adev->delayed_init_work);
+
 	/* hold on the scheduler */
 	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
 		struct amdgpu_ring *ring = adev->rings[i];
@@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 		kthread_unpark(ring->sched.thread);
 	}
 
-	up_read(&adev->reset_sem);
+	up_write(&adev->reset_sem);
 
 	pm_runtime_mark_last_busy(dev->dev);
 	pm_runtime_put_autosuspend(dev->dev);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH] drm/amdgpu: Fix a race of IB test
  2021-09-11  1:55 [PATCH] drm/amdgpu: Fix a race of IB test xinhui pan
@ 2021-09-11  7:45 ` Christian König
  2021-09-11 10:18   ` 回复: " Pan, Xinhui
  0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2021-09-11  7:45 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig



Am 11.09.21 um 03:55 schrieb xinhui pan:
> Direct IB submission should be exclusive. So use write lock.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 19323b4cce7b..acbe02928791 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>   	}
>   
>   	/* Avoid accidently unparking the sched thread during GPU reset */
> -	r = down_read_killable(&adev->reset_sem);
> +	r = down_write_killable(&adev->reset_sem);
>   	if (r)
>   		return r;
>   
> +	/* Avoid concurrently IB test but not cancel it as I don't know whether we
> +	 * would add more code in the delayed init work.
> +	 */
> +	flush_delayed_work(&adev->delayed_init_work);
> +

That won't work. It's at least theoretical possible that the delayed 
init work waits for the reset_sem which we are holding here.

Very unlikely to happen, but lockdep might be able to point that out 
with a nice backtrace in the logs.

On the other hand delayed init work and direct IB test through this 
interface should work at the same time, so I would just drop it.

Christian.

>   	/* hold on the scheduler */
>   	for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>   		struct amdgpu_ring *ring = adev->rings[i];
> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>   		kthread_unpark(ring->sched.thread);
>   	}
>   
> -	up_read(&adev->reset_sem);
> +	up_write(&adev->reset_sem);
>   
>   	pm_runtime_mark_last_busy(dev->dev);
>   	pm_runtime_put_autosuspend(dev->dev);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* 回复: [PATCH] drm/amdgpu: Fix a race of IB test
  2021-09-11  7:45 ` Christian König
@ 2021-09-11 10:18   ` Pan, Xinhui
  2021-09-13  6:19     ` Christian König
  0 siblings, 1 reply; 6+ messages in thread
From: Pan, Xinhui @ 2021-09-11 10:18 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Deucher, Alexander, Koenig, Christian

[AMD Official Use Only]

For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing.
But I am not aware why delayed init work would try to lock reset_sem.

delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method.
But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed.

I will drop this flush in V2.
________________________________________
发件人: Christian König <ckoenig.leichtzumerken@gmail.com>
发送时间: 2021年9月11日 15:45
收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org
抄送: Deucher, Alexander; Koenig, Christian
主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test



Am 11.09.21 um 03:55 schrieb xinhui pan:
> Direct IB submission should be exclusive. So use write lock.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++--
>   1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 19323b4cce7b..acbe02928791 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>       }
>
>       /* Avoid accidently unparking the sched thread during GPU reset */
> -     r = down_read_killable(&adev->reset_sem);
> +     r = down_write_killable(&adev->reset_sem);
>       if (r)
>               return r;
>
> +     /* Avoid concurrently IB test but not cancel it as I don't know whether we
> +      * would add more code in the delayed init work.
> +      */
> +     flush_delayed_work(&adev->delayed_init_work);
> +

That won't work. It's at least theoretical possible that the delayed
init work waits for the reset_sem which we are holding here.

Very unlikely to happen, but lockdep might be able to point that out
with a nice backtrace in the logs.

On the other hand delayed init work and direct IB test through this
interface should work at the same time, so I would just drop it.

Christian.

>       /* hold on the scheduler */
>       for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>               struct amdgpu_ring *ring = adev->rings[i];
> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>               kthread_unpark(ring->sched.thread);
>       }
>
> -     up_read(&adev->reset_sem);
> +     up_write(&adev->reset_sem);
>
>       pm_runtime_mark_last_busy(dev->dev);
>       pm_runtime_put_autosuspend(dev->dev);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test
  2021-09-11 10:18   ` 回复: " Pan, Xinhui
@ 2021-09-13  6:19     ` Christian König
  2021-09-13  6:55       ` 回复: " Pan, Xinhui
  0 siblings, 1 reply; 6+ messages in thread
From: Christian König @ 2021-09-13  6:19 UTC (permalink / raw)
  To: Pan, Xinhui, Christian König, amd-gfx; +Cc: Deucher, Alexander

Well is the delayed init work using direct submission or submission 
through the scheduler?

If the later we have the down_write of the reset semaphore pulled in 
through the scheduler dependency.

Anyway just having the sync before taking the lock should work.

Christian.

Am 11.09.21 um 12:18 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing.
> But I am not aware why delayed init work would try to lock reset_sem.
>
> delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method.
> But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed.
>
> I will drop this flush in V2.
> ________________________________________
> 发件人: Christian König <ckoenig.leichtzumerken@gmail.com>
> 发送时间: 2021年9月11日 15:45
> 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org
> 抄送: Deucher, Alexander; Koenig, Christian
> 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test
>
>
>
> Am 11.09.21 um 03:55 schrieb xinhui pan:
>> Direct IB submission should be exclusive. So use write lock.
>>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++--
>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 19323b4cce7b..acbe02928791 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>        }
>>
>>        /* Avoid accidently unparking the sched thread during GPU reset */
>> -     r = down_read_killable(&adev->reset_sem);
>> +     r = down_write_killable(&adev->reset_sem);
>>        if (r)
>>                return r;
>>
>> +     /* Avoid concurrently IB test but not cancel it as I don't know whether we
>> +      * would add more code in the delayed init work.
>> +      */
>> +     flush_delayed_work(&adev->delayed_init_work);
>> +
> That won't work. It's at least theoretical possible that the delayed
> init work waits for the reset_sem which we are holding here.
>
> Very unlikely to happen, but lockdep might be able to point that out
> with a nice backtrace in the logs.
>
> On the other hand delayed init work and direct IB test through this
> interface should work at the same time, so I would just drop it.
>
> Christian.
>
>>        /* hold on the scheduler */
>>        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>                struct amdgpu_ring *ring = adev->rings[i];
>> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>                kthread_unpark(ring->sched.thread);
>>        }
>>
>> -     up_read(&adev->reset_sem);
>> +     up_write(&adev->reset_sem);
>>
>>        pm_runtime_mark_last_busy(dev->dev);
>>        pm_runtime_put_autosuspend(dev->dev);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* 回复: 回复: [PATCH] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:19     ` Christian König
@ 2021-09-13  6:55       ` Pan, Xinhui
  2021-09-13  6:58         ` Christian König
  0 siblings, 1 reply; 6+ messages in thread
From: Pan, Xinhui @ 2021-09-13  6:55 UTC (permalink / raw)
  To: Koenig, Christian, Christian König, amd-gfx; +Cc: Deucher, Alexander

[AMD Official Use Only]

These IB tests are all using direct IB submission including the delayed init work.
________________________________________
发件人: Koenig, Christian <Christian.Koenig@amd.com>
发送时间: 2021年9月13日 14:19
收件人: Pan, Xinhui; Christian König; amd-gfx@lists.freedesktop.org
抄送: Deucher, Alexander
主题: Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test

Well is the delayed init work using direct submission or submission
through the scheduler?

If the later we have the down_write of the reset semaphore pulled in
through the scheduler dependency.

Anyway just having the sync before taking the lock should work.

Christian.

Am 11.09.21 um 12:18 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing.
> But I am not aware why delayed init work would try to lock reset_sem.
>
> delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method.
> But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed.
>
> I will drop this flush in V2.
> ________________________________________
> 发件人: Christian König <ckoenig.leichtzumerken@gmail.com>
> 发送时间: 2021年9月11日 15:45
> 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org
> 抄送: Deucher, Alexander; Koenig, Christian
> 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test
>
>
>
> Am 11.09.21 um 03:55 schrieb xinhui pan:
>> Direct IB submission should be exclusive. So use write lock.
>>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++--
>>    1 file changed, 7 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 19323b4cce7b..acbe02928791 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>        }
>>
>>        /* Avoid accidently unparking the sched thread during GPU reset */
>> -     r = down_read_killable(&adev->reset_sem);
>> +     r = down_write_killable(&adev->reset_sem);
>>        if (r)
>>                return r;
>>
>> +     /* Avoid concurrently IB test but not cancel it as I don't know whether we
>> +      * would add more code in the delayed init work.
>> +      */
>> +     flush_delayed_work(&adev->delayed_init_work);
>> +
> That won't work. It's at least theoretical possible that the delayed
> init work waits for the reset_sem which we are holding here.
>
> Very unlikely to happen, but lockdep might be able to point that out
> with a nice backtrace in the logs.
>
> On the other hand delayed init work and direct IB test through this
> interface should work at the same time, so I would just drop it.
>
> Christian.
>
>>        /* hold on the scheduler */
>>        for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>                struct amdgpu_ring *ring = adev->rings[i];
>> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>                kthread_unpark(ring->sched.thread);
>>        }
>>
>> -     up_read(&adev->reset_sem);
>> +     up_write(&adev->reset_sem);
>>
>>        pm_runtime_mark_last_busy(dev->dev);
>>        pm_runtime_put_autosuspend(dev->dev);


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: 回复: 回复: [PATCH] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:55       ` 回复: " Pan, Xinhui
@ 2021-09-13  6:58         ` Christian König
  0 siblings, 0 replies; 6+ messages in thread
From: Christian König @ 2021-09-13  6:58 UTC (permalink / raw)
  To: Pan, Xinhui, Koenig, Christian, amd-gfx; +Cc: Deucher, Alexander

Yeah, because we avoid need to allocate an entity otherwise.

Ok, all that comes swapped back into my head once more.

As far as I can see that should work, but I would ask Andrey as well 
since he now takes care of GPU reset.

Christian.

Am 13.09.21 um 08:55 schrieb Pan, Xinhui:
> [AMD Official Use Only]
>
> These IB tests are all using direct IB submission including the delayed init work.
> ________________________________________
> 发件人: Koenig, Christian <Christian.Koenig@amd.com>
> 发送时间: 2021年9月13日 14:19
> 收件人: Pan, Xinhui; Christian König; amd-gfx@lists.freedesktop.org
> 抄送: Deucher, Alexander
> 主题: Re: 回复: [PATCH] drm/amdgpu: Fix a race of IB test
>
> Well is the delayed init work using direct submission or submission
> through the scheduler?
>
> If the later we have the down_write of the reset semaphore pulled in
> through the scheduler dependency.
>
> Anyway just having the sync before taking the lock should work.
>
> Christian.
>
> Am 11.09.21 um 12:18 schrieb Pan, Xinhui:
>> [AMD Official Use Only]
>>
>> For the possible deadlock, we can just move flush_delayed_work above down_write. not a big thing.
>> But I am not aware why delayed init work would try to lock reset_sem.
>>
>> delayed init work is enqueued when device resume. It calls amdgpu_ib_ring_tests directly. We need one sync method.
>> But I see device resume itself woud flush it. So there is no race between them as userspace is still freezed.
>>
>> I will drop this flush in V2.
>> ________________________________________
>> 发件人: Christian König <ckoenig.leichtzumerken@gmail.com>
>> 发送时间: 2021年9月11日 15:45
>> 收件人: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>> 抄送: Deucher, Alexander; Koenig, Christian
>> 主题: Re: [PATCH] drm/amdgpu: Fix a race of IB test
>>
>>
>>
>> Am 11.09.21 um 03:55 schrieb xinhui pan:
>>> Direct IB submission should be exclusive. So use write lock.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> ---
>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 9 +++++++--
>>>     1 file changed, 7 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> index 19323b4cce7b..acbe02928791 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> @@ -1358,10 +1358,15 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>         }
>>>
>>>         /* Avoid accidently unparking the sched thread during GPU reset */
>>> -     r = down_read_killable(&adev->reset_sem);
>>> +     r = down_write_killable(&adev->reset_sem);
>>>         if (r)
>>>                 return r;
>>>
>>> +     /* Avoid concurrently IB test but not cancel it as I don't know whether we
>>> +      * would add more code in the delayed init work.
>>> +      */
>>> +     flush_delayed_work(&adev->delayed_init_work);
>>> +
>> That won't work. It's at least theoretical possible that the delayed
>> init work waits for the reset_sem which we are holding here.
>>
>> Very unlikely to happen, but lockdep might be able to point that out
>> with a nice backtrace in the logs.
>>
>> On the other hand delayed init work and direct IB test through this
>> interface should work at the same time, so I would just drop it.
>>
>> Christian.
>>
>>>         /* hold on the scheduler */
>>>         for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
>>>                 struct amdgpu_ring *ring = adev->rings[i];
>>> @@ -1387,7 +1392,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>                 kthread_unpark(ring->sched.thread);
>>>         }
>>>
>>> -     up_read(&adev->reset_sem);
>>> +     up_write(&adev->reset_sem);
>>>
>>>         pm_runtime_mark_last_busy(dev->dev);
>>>         pm_runtime_put_autosuspend(dev->dev);


^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2021-09-13  6:58 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-09-11  1:55 [PATCH] drm/amdgpu: Fix a race of IB test xinhui pan
2021-09-11  7:45 ` Christian König
2021-09-11 10:18   ` 回复: " Pan, Xinhui
2021-09-13  6:19     ` Christian König
2021-09-13  6:55       ` 回复: " Pan, Xinhui
2021-09-13  6:58         ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.