* [PATCH v2] drm/amdgpu: Fix a race of IB test
@ 2021-09-12 23:48 xinhui pan
  2021-09-13  4:00 ` Lazar, Lijo
  2021-09-13 14:41 ` Andrey Grodzovsky
  0 siblings, 2 replies; 13+ messages in thread
From: xinhui pan @ 2021-09-12 23:48 UTC (permalink / raw)
  To: amd-gfx; +Cc: alexander.deucher, christian.koenig, xinhui pan

Direct IB submission should be exclusive. So use write lock.

Signed-off-by: xinhui pan <xinhui.pan@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
index 19323b4cce7b..be5d12ed3db1 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
@@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 	}
 
 	/* Avoid accidently unparking the sched thread during GPU reset */
-	r = down_read_killable(&adev->reset_sem);
+	r = down_write_killable(&adev->reset_sem);
 	if (r)
 		return r;
 
@@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
 		kthread_unpark(ring->sched.thread);
 	}
 
-	up_read(&adev->reset_sem);
+	up_write(&adev->reset_sem);
 
 	pm_runtime_mark_last_busy(dev->dev);
 	pm_runtime_put_autosuspend(dev->dev);
-- 
2.25.1
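
For readers following the thread, the difference between the two sides
of the lock can be illustrated with a minimal userspace sketch. It is a
single-threaded toy model, not the kernel rwsem, and every toy_* name is
hypothetical and exists only for this illustration. It shows that two
callers can both hold the read side at once, while the write side admits
only one of them.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of reader/writer lock state. NOT the kernel rwsem;
 * all toy_* names are assumptions made up for this sketch. */
struct toy_rwsem {
	int readers;	/* number of current read-side holders */
	bool writer;	/* true while the write side is held */
};

/* Read side: succeeds unless a writer holds the lock. */
bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

void toy_up_read(struct toy_rwsem *sem)
{
	assert(sem->readers > 0);
	sem->readers--;
}

/* Write side: succeeds only when nobody holds the lock at all. */
bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

void toy_up_write(struct toy_rwsem *sem)
{
	assert(sem->writer);
	sem->writer = false;
}

/* Let two simulated IB-test callers try to enter at the same time and
 * return how many of them got in. */
int toy_concurrent_ib_tests(bool use_write_side)
{
	struct toy_rwsem sem = { 0, false };
	bool got[2];
	int inside = 0;

	for (int caller = 0; caller < 2; caller++) {
		got[caller] = use_write_side ? toy_down_write_trylock(&sem)
					     : toy_down_read_trylock(&sem);
		if (got[caller])
			inside++;
	}

	/* Release whatever was acquired so the model ends unlocked. */
	for (int caller = 0; caller < 2; caller++) {
		if (!got[caller])
			continue;
		if (use_write_side)
			toy_up_write(&sem);
		else
			toy_up_read(&sem);
	}
	return inside;
}
```

Calling toy_concurrent_ib_tests(false) models the code before this
patch, and toy_concurrent_ib_tests(true) models it with the write side
taken.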



* Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-12 23:48 [PATCH v2] drm/amdgpu: Fix a race of IB test xinhui pan
@ 2021-09-13  4:00 ` Lazar, Lijo
  2021-09-13  4:42   ` Reply: " Pan, Xinhui
  2021-09-13 14:41 ` Andrey Grodzovsky
  1 sibling, 1 reply; 13+ messages in thread
From: Lazar, Lijo @ 2021-09-13  4:00 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig



On 9/13/2021 5:18 AM, xinhui pan wrote:
> Direct IB submission should be exclusive. So use write lock.
> 
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 19323b4cce7b..be5d12ed3db1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>   	}
>   
>   	/* Avoid accidently unparking the sched thread during GPU reset */
> -	r = down_read_killable(&adev->reset_sem);
> +	r = down_write_killable(&adev->reset_sem);

There are many ioctls and debugfs calls which take this lock and, as you
know, the purpose is to keep them out while there is a reset. The purpose
is *not* to fix any concurrency issues those calls themselves have, and
fixing those concurrency issues this way is just lazy and not
acceptable.

This will take away any fairness given to the writer in this rw lock,
and that writer is supposed to be the reset thread.

Thanks,
Lijo

>   	if (r)
>   		return r;
>   
> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>   		kthread_unpark(ring->sched.thread);
>   	}
>   
> -	up_read(&adev->reset_sem);
> +	up_write(&adev->reset_sem);
>   
>   	pm_runtime_mark_last_busy(dev->dev);
>   	pm_runtime_put_autosuspend(dev->dev);
> 


* Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  4:00 ` Lazar, Lijo
@ 2021-09-13  4:42   ` Pan, Xinhui
  2021-09-13  6:22     ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Pan, Xinhui @ 2021-09-13  4:42 UTC (permalink / raw)
  To: Lazar, Lijo, amd-gfx; +Cc: Deucher, Alexander, Koenig, Christian

[AMD Official Use Only]

Yep, that is a lazy way to fix it.

I am thinking of adding an amdgpu_ring.direct_access_mutex and taking it
before we issue test_ib on each ring.
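
A rough userspace sketch of that per-ring idea, under stated
assumptions: the direct_access_locked flag stands in for the proposed
direct_access_mutex, all toy_* names are hypothetical, and a real mutex
would sleep instead of skipping a busy ring. The trylock form is only
there to keep the model single-threaded. It shows that a ring already in
use by another direct submitter does not affect the other rings.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a ring with its own direct-access lock.
 * Nothing here is the real amdgpu_ring layout. */
struct toy_ring {
	bool direct_access_locked;	/* models the proposed per-ring mutex */
	int ib_tests_run;		/* how many IB tests ran on this ring */
};

bool toy_ring_trylock(struct toy_ring *ring)
{
	if (ring->direct_access_locked)
		return false;
	ring->direct_access_locked = true;
	return true;
}

void toy_ring_unlock(struct toy_ring *ring)
{
	assert(ring->direct_access_locked);
	ring->direct_access_locked = false;
}

/* Run the simulated IB test on every ring whose lock is free and
 * return the number of rings that were tested. */
int toy_test_all_rings(struct toy_ring *rings, int num_rings)
{
	int tested = 0;

	for (int i = 0; i < num_rings; i++) {
		if (!toy_ring_trylock(&rings[i]))
			continue;	/* another direct submitter owns it */
		rings[i].ib_tests_run++;
		tested++;
		toy_ring_unlock(&rings[i]);
	}
	return tested;
}
```

The trade-off against the write-side approach is one more lock per ring
that every direct submitter has to remember to take.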
________________________________________
From: Lazar, Lijo <Lijo.Lazar@amd.com>
Sent: September 13, 2021 12:00
To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander; Koenig, Christian
Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test



On 9/13/2021 5:18 AM, xinhui pan wrote:
> Direct IB submission should be exclusive. So use write lock.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 19323b4cce7b..be5d12ed3db1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>       }
>
>       /* Avoid accidently unparking the sched thread during GPU reset */
> -     r = down_read_killable(&adev->reset_sem);
> +     r = down_write_killable(&adev->reset_sem);

There are many ioctls and debugfs calls which takes this lock and as you
know the purpose is to avoid them while there is a reset. The purpose is
*not to* fix any concurrency issues those calls themselves have
otherwise and fixing those concurrency issues this way is just lazy and
not acceptable.

This will take away any fairness given to the writer in this rw lock and
that is supposed to be the reset thread.

Thanks,
Lijo

>       if (r)
>               return r;
>
> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>               kthread_unpark(ring->sched.thread);
>       }
>
> -     up_read(&adev->reset_sem);
> +     up_write(&adev->reset_sem);
>
>       pm_runtime_mark_last_busy(dev->dev);
>       pm_runtime_put_autosuspend(dev->dev);
>


* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  4:42   ` Reply: " Pan, Xinhui
@ 2021-09-13  6:22     ` Christian König
  2021-09-13  6:25       ` Lazar, Lijo
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-09-13  6:22 UTC (permalink / raw)
  To: Pan, Xinhui, Lazar, Lijo, amd-gfx; +Cc: Deucher, Alexander

NAK, this is not the lazy way to fix it at all.

The reset semaphore protects the scheduler and ring objects from 
concurrent modification, so taking the write side of it is perfectly 
valid here.

Christian.

On 13.09.21 at 06:42, Pan, Xinhui wrote:
> [AMD Official Use Only]
>
> yep, that is a lazy way to fix it.
>
> I am thinking of adding one amdgpu_ring.direct_access_mutex before we issue test_ib on each ring.
> ________________________________________
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: September 13, 2021 12:00
> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander; Koenig, Christian
> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>
>
>
> On 9/13/2021 5:18 AM, xinhui pan wrote:
>> Direct IB submission should be exclusive. So use write lock.
>>
>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>> ---
>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> index 19323b4cce7b..be5d12ed3db1 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>        }
>>
>>        /* Avoid accidently unparking the sched thread during GPU reset */
>> -     r = down_read_killable(&adev->reset_sem);
>> +     r = down_write_killable(&adev->reset_sem);
> There are many ioctls and debugfs calls which takes this lock and as you
> know the purpose is to avoid them while there is a reset. The purpose is
> *not to* fix any concurrency issues those calls themselves have
> otherwise and fixing those concurrency issues this way is just lazy and
> not acceptable.
>
> This will take away any fairness given to the writer in this rw lock and
> that is supposed to be the reset thread.
>
> Thanks,
> Lijo
>
>>        if (r)
>>                return r;
>>
>> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>                kthread_unpark(ring->sched.thread);
>>        }
>>
>> -     up_read(&adev->reset_sem);
>> +     up_write(&adev->reset_sem);
>>
>>        pm_runtime_mark_last_busy(dev->dev);
>>        pm_runtime_put_autosuspend(dev->dev);
>>



* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:22     ` Christian König
@ 2021-09-13  6:25       ` Lazar, Lijo
  2021-09-13  6:37         ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Lazar, Lijo @ 2021-09-13  6:25 UTC (permalink / raw)
  To: Christian König, Pan, Xinhui, amd-gfx; +Cc: Deucher, Alexander

This is a debugfs interface, and adding another contending writer from
debugfs on top of an actual reset is a lazy fix. This shouldn't be
executed in the first place and should not take precedence over any
reset.

Thanks,
Lijo


On 9/13/2021 11:52 AM, Christian König wrote:
> NAK, this is not the lazy way to fix it at all.
> 
> The reset semaphore protects the scheduler and ring objects from 
> concurrent modification, so taking the write side of it is perfectly 
> valid here.
> 
> Christian.
> 
> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>> [AMD Official Use Only]
>>
>> yep, that is a lazy way to fix it.
>>
>> I am thinking of adding one amdgpu_ring.direct_access_mutex before we 
>> issue test_ib on each ring.
>> ________________________________________
>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>> Sent: September 13, 2021 12:00
>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>> Cc: Deucher, Alexander; Koenig, Christian
>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>
>>
>>
>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>> Direct IB submission should be exclusive. So use write lock.
>>>
>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>> ---
>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> index 19323b4cce7b..be5d12ed3db1 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>> seq_file *m, void *unused)
>>>        }
>>>
>>>        /* Avoid accidently unparking the sched thread during GPU 
>>> reset */
>>> -     r = down_read_killable(&adev->reset_sem);
>>> +     r = down_write_killable(&adev->reset_sem);
>> There are many ioctls and debugfs calls which takes this lock and as you
>> know the purpose is to avoid them while there is a reset. The purpose is
>> *not to* fix any concurrency issues those calls themselves have
>> otherwise and fixing those concurrency issues this way is just lazy and
>> not acceptable.
>>
>> This will take away any fairness given to the writer in this rw lock and
>> that is supposed to be the reset thread.
>>
>> Thanks,
>> Lijo
>>
>>>        if (r)
>>>                return r;
>>>
>>> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>> seq_file *m, void *unused)
>>>                kthread_unpark(ring->sched.thread);
>>>        }
>>>
>>> -     up_read(&adev->reset_sem);
>>> +     up_write(&adev->reset_sem);
>>>
>>>        pm_runtime_mark_last_busy(dev->dev);
>>>        pm_runtime_put_autosuspend(dev->dev);
>>>
> 


* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:25       ` Lazar, Lijo
@ 2021-09-13  6:37         ` Christian König
  2021-09-13  6:43           ` Lazar, Lijo
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-09-13  6:37 UTC (permalink / raw)
  To: Lazar, Lijo, Christian König, Pan, Xinhui, amd-gfx
  Cc: Deucher, Alexander

That's complete nonsense.

The debugfs interface emulates parts of the reset procedure for testing 
and we absolutely need to take the same locks as the reset to avoid 
corruption of the involved objects.

Regards,
Christian.

On 13.09.21 at 08:25, Lazar, Lijo wrote:
> This is a debugfs interface and adding another writer contention in 
> debugfs over an actual reset is lazy fix. This shouldn't be executed 
> in the first place and should not take precedence over any reset.
>
> Thanks,
> Lijo
>
>
> On 9/13/2021 11:52 AM, Christian König wrote:
>> NAK, this is not the lazy way to fix it at all.
>>
>> The reset semaphore protects the scheduler and ring objects from 
>> concurrent modification, so taking the write side of it is perfectly 
>> valid here.
>>
>> Christian.
>>
>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>> [AMD Official Use Only]
>>>
>>> yep, that is a lazy way to fix it.
>>>
>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before 
>>> we issue test_ib on each ring.
>>> ________________________________________
>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>> Sent: September 13, 2021 12:00
>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>> Cc: Deucher, Alexander; Koenig, Christian
>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>
>>>
>>>
>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>> Direct IB submission should be exclusive. So use write lock.
>>>>
>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>>> seq_file *m, void *unused)
>>>>        }
>>>>
>>>>        /* Avoid accidently unparking the sched thread during GPU 
>>>> reset */
>>>> -     r = down_read_killable(&adev->reset_sem);
>>>> +     r = down_write_killable(&adev->reset_sem);
>>> There are many ioctls and debugfs calls which takes this lock and as 
>>> you
>>> know the purpose is to avoid them while there is a reset. The 
>>> purpose is
>>> *not to* fix any concurrency issues those calls themselves have
>>> otherwise and fixing those concurrency issues this way is just lazy and
>>> not acceptable.
>>>
>>> This will take away any fairness given to the writer in this rw lock 
>>> and
>>> that is supposed to be the reset thread.
>>>
>>> Thanks,
>>> Lijo
>>>
>>>>        if (r)
>>>>                return r;
>>>>
>>>> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>>> seq_file *m, void *unused)
>>>>                kthread_unpark(ring->sched.thread);
>>>>        }
>>>>
>>>> -     up_read(&adev->reset_sem);
>>>> +     up_write(&adev->reset_sem);
>>>>
>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>
>>



* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:37         ` Christian König
@ 2021-09-13  6:43           ` Lazar, Lijo
  2021-09-13  6:51             ` Christian König
  0 siblings, 1 reply; 13+ messages in thread
From: Lazar, Lijo @ 2021-09-13  6:43 UTC (permalink / raw)
  To: Christian König, Christian König, Pan, Xinhui, amd-gfx
  Cc: Deucher, Alexander

There are other interfaces to emulate the exact reset process, or at
least this is not the one we are using for doing any sort of reset
through debugfs.

In any case, the expectation is that the reset thread takes the write
side of the lock, and that is already done somewhere else.

The reset semaphore is supposed to protect the device from concurrent
access (any sort of resource usage is thus protected by default). By
that reasoning the same logic could be applied to any other call, and
that is not a reasonable ask.

Thanks,
Lijo

On 9/13/2021 12:07 PM, Christian König wrote:
> That's complete nonsense.
> 
> The debugfs interface emulates parts of the reset procedure for testing 
> and we absolutely need to take the same locks as the reset to avoid 
> corruption of the involved objects.
> 
> Regards,
> Christian.
> 
> Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
>> This is a debugfs interface and adding another writer contention in 
>> debugfs over an actual reset is lazy fix. This shouldn't be executed 
>> in the first place and should not take precedence over any reset.
>>
>> Thanks,
>> Lijo
>>
>>
>> On 9/13/2021 11:52 AM, Christian König wrote:
>>> NAK, this is not the lazy way to fix it at all.
>>>
>>> The reset semaphore protects the scheduler and ring objects from 
>>> concurrent modification, so taking the write side of it is perfectly 
>>> valid here.
>>>
>>> Christian.
>>>
>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>>> [AMD Official Use Only]
>>>>
>>>> yep, that is a lazy way to fix it.
>>>>
>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before 
>>>> we issue test_ib on each ring.
>>>> ________________________________________
>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>> Sent: September 13, 2021 12:00
>>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>>> Cc: Deucher, Alexander; Koenig, Christian
>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>
>>>>
>>>>
>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>
>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>> ---
>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>>>> seq_file *m, void *unused)
>>>>>        }
>>>>>
>>>>>        /* Avoid accidently unparking the sched thread during GPU 
>>>>> reset */
>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>> There are many ioctls and debugfs calls which takes this lock and as 
>>>> you
>>>> know the purpose is to avoid them while there is a reset. The 
>>>> purpose is
>>>> *not to* fix any concurrency issues those calls themselves have
>>>> otherwise and fixing those concurrency issues this way is just lazy and
>>>> not acceptable.
>>>>
>>>> This will take away any fairness given to the writer in this rw lock 
>>>> and
>>>> that is supposed to be the reset thread.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>>        if (r)
>>>>>                return r;
>>>>>
>>>>> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct 
>>>>> seq_file *m, void *unused)
>>>>>                kthread_unpark(ring->sched.thread);
>>>>>        }
>>>>>
>>>>> -     up_read(&adev->reset_sem);
>>>>> +     up_write(&adev->reset_sem);
>>>>>
>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>
>>>
> 


* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:43           ` Lazar, Lijo
@ 2021-09-13  6:51             ` Christian König
  2021-09-13  7:15               ` Lazar, Lijo
  0 siblings, 1 reply; 13+ messages in thread
From: Christian König @ 2021-09-13  6:51 UTC (permalink / raw)
  To: Lazar, Lijo, Christian König, Pan, Xinhui, amd-gfx
  Cc: Deucher, Alexander

Keep in mind that we don't try to avoid contention here. The goal is
rather to have as few locks as possible to avoid the extra overhead in
the hot path.

Contention is completely irrelevant for debug and device reset, since
those are rarely occurring events and performance doesn't matter for
them.

It is perfectly reasonable to take the write side of the reset lock as
necessary when we need to make sure that we don't have concurrent device
access.

Regards,
Christian.

On 13.09.21 at 08:43, Lazar, Lijo wrote:
> There are other interfaces to emulate the exact reset process, or 
> atleast this is not the one we are using for doing any sort of reset 
> through debugfs.
>
> In any case, the expectation is reset thread takes the write side of 
> the lock and it's already done somewhere else.
>
> Reset semaphore is supposed to protect the device from concurrent 
> access (any sort of resource usage is thus protected by default). Then 
> the same logic can be applied for any other call and that is not a 
> reasonable ask.
>
> Thanks,
> Lijo
>
> On 9/13/2021 12:07 PM, Christian König wrote:
>> That's complete nonsense.
>>
>> The debugfs interface emulates parts of the reset procedure for 
>> testing and we absolutely need to take the same locks as the reset to 
>> avoid corruption of the involved objects.
>>
>> Regards,
>> Christian.
>>
>> Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
>>> This is a debugfs interface and adding another writer contention in 
>>> debugfs over an actual reset is lazy fix. This shouldn't be executed 
>>> in the first place and should not take precedence over any reset.
>>>
>>> Thanks,
>>> Lijo
>>>
>>>
>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>> NAK, this is not the lazy way to fix it at all.
>>>>
>>>> The reset semaphore protects the scheduler and ring objects from 
>>>> concurrent modification, so taking the write side of it is 
>>>> perfectly valid here.
>>>>
>>>> Christian.
>>>>
>>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>>>> [AMD Official Use Only]
>>>>>
>>>>> yep, that is a lazy way to fix it.
>>>>>
>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before 
>>>>> we issue test_ib on each ring.
>>>>> ________________________________________
>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>> Sent: September 13, 2021 12:00
>>>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>>>> Cc: Deucher, Alexander; Koenig, Christian
>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>
>>>>>
>>>>>
>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>
>>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>>> ---
>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>> @@ -1358,7 +1358,7 @@ static int 
>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>        }
>>>>>>
>>>>>>        /* Avoid accidently unparking the sched thread during GPU 
>>>>>> reset */
>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>> There are many ioctls and debugfs calls which takes this lock and 
>>>>> as you
>>>>> know the purpose is to avoid them while there is a reset. The 
>>>>> purpose is
>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>> otherwise and fixing those concurrency issues this way is just 
>>>>> lazy and
>>>>> not acceptable.
>>>>>
>>>>> This will take away any fairness given to the writer in this rw 
>>>>> lock and
>>>>> that is supposed to be the reset thread.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>>        if (r)
>>>>>>                return r;
>>>>>>
>>>>>> @@ -1387,7 +1387,7 @@ static int 
>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>                kthread_unpark(ring->sched.thread);
>>>>>>        }
>>>>>>
>>>>>> -     up_read(&adev->reset_sem);
>>>>>> +     up_write(&adev->reset_sem);
>>>>>>
>>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>>
>>>>
>>



* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  6:51             ` Christian König
@ 2021-09-13  7:15               ` Lazar, Lijo
  2021-09-13  7:22                 ` Christian König
  2021-09-13  7:23                 ` Reply: " Pan, Xinhui
  0 siblings, 2 replies; 13+ messages in thread
From: Lazar, Lijo @ 2021-09-13  7:15 UTC (permalink / raw)
  To: Christian König, Christian König, Pan, Xinhui, amd-gfx
  Cc: Deucher, Alexander



On 9/13/2021 12:21 PM, Christian König wrote:
> Keep in mind that we don't try to avoid contention here. The goal is 
> rather to have as few locks as possible to avoid the extra overhead in 
> the hot path.
> 
> Contention is completely irrelevant for the debug and device reset since 
> that are rarely occurring events and performance doesn't matter for them.
> 
> It is perfectly reasonable to take the write side of the reset lock as 
> necessary when we need to make sure that we don't have concurrent device 
> access.

The original code has down_read, which gave the impression that there is
some protection to avoid access during reset. Basically, I would like to
avoid this becoming a precedent for this sort of usage in any debugfs
call. The reset semaphore is supposed to be a 'protect all' thing and
provides a shortcut.

BTW, a question about a hypothetical case: what happens if the test
itself causes a hang and we need to trigger a reset? Will there be a
chance for the lock to be released (or will a submit call hang
indefinitely) so that the actual reset can be executed?

Thanks,
Lijo

> 
> Regards,
> Christian.
> 
> Am 13.09.21 um 08:43 schrieb Lazar, Lijo:
>> There are other interfaces to emulate the exact reset process, or 
>> atleast this is not the one we are using for doing any sort of reset 
>> through debugfs.
>>
>> In any case, the expectation is reset thread takes the write side of 
>> the lock and it's already done somewhere else.
>>
>> Reset semaphore is supposed to protect the device from concurrent 
>> access (any sort of resource usage is thus protected by default). Then 
>> the same logic can be applied for any other call and that is not a 
>> reasonable ask.
>>
>> Thanks,
>> Lijo
>>
>> On 9/13/2021 12:07 PM, Christian König wrote:
>>> That's complete nonsense.
>>>
>>> The debugfs interface emulates parts of the reset procedure for 
>>> testing and we absolutely need to take the same locks as the reset to 
>>> avoid corruption of the involved objects.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 13.09.21 um 08:25 schrieb Lazar, Lijo:
>>>> This is a debugfs interface and adding another writer contention in 
>>>> debugfs over an actual reset is lazy fix. This shouldn't be executed 
>>>> in the first place and should not take precedence over any reset.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>
>>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>>> NAK, this is not the lazy way to fix it at all.
>>>>>
>>>>> The reset semaphore protects the scheduler and ring objects from 
>>>>> concurrent modification, so taking the write side of it is 
>>>>> perfectly valid here.
>>>>>
>>>>> Christian.
>>>>>
>>>>> Am 13.09.21 um 06:42 schrieb Pan, Xinhui:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> yep, that is a lazy way to fix it.
>>>>>>
>>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before 
>>>>>> we issue test_ib on each ring.
>>>>>> ________________________________________
>>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>>> Sent: September 13, 2021 12:00
>>>>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>>>>> Cc: Deucher, Alexander; Koenig, Christian
>>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>>
>>>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> @@ -1358,7 +1358,7 @@ static int 
>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>        }
>>>>>>>
>>>>>>>        /* Avoid accidently unparking the sched thread during GPU 
>>>>>>> reset */
>>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>>> There are many ioctls and debugfs calls which takes this lock and 
>>>>>> as you
>>>>>> know the purpose is to avoid them while there is a reset. The 
>>>>>> purpose is
>>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>>> otherwise and fixing those concurrency issues this way is just 
>>>>>> lazy and
>>>>>> not acceptable.
>>>>>>
>>>>>> This will take away any fairness given to the writer in this rw 
>>>>>> lock and
>>>>>> that is supposed to be the reset thread.
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>        if (r)
>>>>>>>                return r;
>>>>>>>
>>>>>>> @@ -1387,7 +1387,7 @@ static int 
>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>                kthread_unpark(ring->sched.thread);
>>>>>>>        }
>>>>>>>
>>>>>>> -     up_read(&adev->reset_sem);
>>>>>>> +     up_write(&adev->reset_sem);
>>>>>>>
>>>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>>>
>>>>>
>>>
> 


* Re: Reply: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  7:15               ` Lazar, Lijo
@ 2021-09-13  7:22                 ` Christian König
  2021-09-13  7:23                 ` Reply: " Pan, Xinhui
  1 sibling, 0 replies; 13+ messages in thread
From: Christian König @ 2021-09-13  7:22 UTC (permalink / raw)
  To: Lazar, Lijo, Christian König, Pan, Xinhui, amd-gfx
  Cc: Deucher, Alexander

On 13.09.21 at 09:15, Lazar, Lijo wrote:
> On 9/13/2021 12:21 PM, Christian König wrote:
>> Keep in mind that we don't try to avoid contention here. The goal is 
>> rather to have as few locks as possible to avoid the extra overhead 
>> in the hot path.
>>
>> Contention is completely irrelevant for the debug and device reset 
>> since that are rarely occurring events and performance doesn't matter 
>> for them.
>>
>> It is perfectly reasonable to take the write side of the reset lock 
>> as necessary when we need to make sure that we don't have concurrent 
>> device access.
>
> The original code has down_read which gave the impression that there 
> is some protection to avoid access during reset. Basically would like 
> to avoid this as a precedence for this sort of usage for any debugfs 
> call. Reset semaphore is supposed to be a 'protect all' thing and 
> provides a shortcut.

Yeah, that's indeed a very valid fear. We have had to reject that
approach for multiple IOCTL, sysfs and debugfs accesses countless times
now.

But in the case here it is indeed the right thing to do; the only
alternative would be to allocate an entity and use that for pushing the
IBs through the scheduler.

>
> BTW, question about a hypothetical case - what happens if the test 
> itself causes a hang and need to trigger a reset? Will there be chance 
> for the lock to be released (whether a submit call will hang 
> indefinitely) for the actual reset to be executed?

Not sure if we added some timeout, but essentially it should hang 
forever, yes.

Regards,
Christian.

>
> Thanks,
> Lijo
>
>>
>> Regards,
>> Christian.
>>
>> On 13.09.21 at 08:43, Lazar, Lijo wrote:
>>> There are other interfaces to emulate the exact reset process, or 
>>> atleast this is not the one we are using for doing any sort of reset 
>>> through debugfs.
>>>
>>> In any case, the expectation is reset thread takes the write side of 
>>> the lock and it's already done somewhere else.
>>>
>>> Reset semaphore is supposed to protect the device from concurrent 
>>> access (any sort of resource usage is thus protected by default). 
>>> Then the same logic can be applied for any other call and that is 
>>> not a reasonable ask.
>>>
>>> Thanks,
>>> Lijo
>>>
>>> On 9/13/2021 12:07 PM, Christian König wrote:
>>>> That's complete nonsense.
>>>>
>>>> The debugfs interface emulates parts of the reset procedure for 
>>>> testing and we absolutely need to take the same locks as the reset 
>>>> to avoid corruption of the involved objects.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 13.09.21 at 08:25, Lazar, Lijo wrote:
>>>>> This is a debugfs interface and adding another writer contention 
>>>>> in debugfs over an actual reset is lazy fix. This shouldn't be 
>>>>> executed in the first place and should not take precedence over 
>>>>> any reset.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>
>>>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>>>> NAK, this is not the lazy way to fix it at all.
>>>>>>
>>>>>> The reset semaphore protects the scheduler and ring objects from 
>>>>>> concurrent modification, so taking the write side of it is 
>>>>>> perfectly valid here.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> On 13.09.21 at 06:42, Pan, Xinhui wrote:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> yep, that is a lazy way to fix it.
>>>>>>>
>>>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex 
>>>>>>> before we issue test_ib on each ring.
>>>>>>> ________________________________________
>>>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>>>> Sent: September 13, 2021 12:00
>>>>>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>>>>>> Cc: Deucher, Alexander; Koenig, Christian
>>>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>>>
>>>>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>>>>> ---
>>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c 
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> @@ -1358,7 +1358,7 @@ static int 
>>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>>        }
>>>>>>>>
>>>>>>>>        /* Avoid accidently unparking the sched thread during 
>>>>>>>> GPU reset */
>>>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>>>> There are many ioctls and debugfs calls which takes this lock 
>>>>>>> and as you
>>>>>>> know the purpose is to avoid them while there is a reset. The 
>>>>>>> purpose is
>>>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>>>> otherwise and fixing those concurrency issues this way is just 
>>>>>>> lazy and
>>>>>>> not acceptable.
>>>>>>>
>>>>>>> This will take away any fairness given to the writer in this rw 
>>>>>>> lock and
>>>>>>> that is supposed to be the reset thread.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>        if (r)
>>>>>>>>                return r;
>>>>>>>>
>>>>>>>> @@ -1387,7 +1387,7 @@ static int 
>>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>> kthread_unpark(ring->sched.thread);
>>>>>>>>        }
>>>>>>>>
>>>>>>>> -     up_read(&adev->reset_sem);
>>>>>>>> +     up_write(&adev->reset_sem);
>>>>>>>>
>>>>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>>>>
>>>>>>
>>>>
>>


* Re: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  7:15               ` Lazar, Lijo
  2021-09-13  7:22                 ` Christian König
@ 2021-09-13  7:23                 ` Pan, Xinhui
  2021-09-13 13:50                   ` Lazar, Lijo
  1 sibling, 1 reply; 13+ messages in thread
From: Pan, Xinhui @ 2021-09-13  7:23 UTC (permalink / raw)
  To: Lazar, Lijo, Christian König, Koenig, Christian, amd-gfx
  Cc: Deucher, Alexander

[AMD Official Use Only]

Of course the IB test can hang the GPU.
But it waits on the fence with a specific timeout, and it does not depend on the GPU scheduler.
So the IB test is guaranteed to return.
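
A minimal userspace sketch of that argument, assuming the wait really is bounded as described: a sem_t stands in for the fence and a pthread_rwlock_t for adev->reset_sem. The names wait_fence_timeout, run_ib_test and reset_lock_is_free are hypothetical and do not correspond to amdgpu functions. Because the wait has a deadline, the caller always gets back to the unlock, even when the "fence" never signals.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical stand-in for adev->reset_sem. */
static pthread_rwlock_t reset_lock = PTHREAD_RWLOCK_INITIALIZER;

/* Wait at most timeout_ms for the fence; 0 if signaled, negative errno otherwise. */
static int wait_fence_timeout(sem_t *fence, long timeout_ms)
{
	struct timespec deadline;

	/* sem_timedwait() takes an absolute CLOCK_REALTIME deadline. */
	clock_gettime(CLOCK_REALTIME, &deadline);
	deadline.tv_sec += timeout_ms / 1000;
	deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
	if (deadline.tv_nsec >= 1000000000L) {
		deadline.tv_sec++;
		deadline.tv_nsec -= 1000000000L;
	}

	while (sem_timedwait(fence, &deadline) != 0) {
		if (errno == EINTR)
			continue;	/* interrupted: retry with the same deadline */
		return -errno;		/* typically -ETIMEDOUT */
	}
	return 0;
}

/* Take the lock exclusively, do the bounded wait, and always release the lock. */
static int run_ib_test(sem_t *fence, long timeout_ms)
{
	int r, rc;

	rc = pthread_rwlock_wrlock(&reset_lock);
	assert(rc == 0);
	(void)rc;

	r = wait_fence_timeout(fence, timeout_ms);

	pthread_rwlock_unlock(&reset_lock);
	return r;
}

/* True if nobody holds the lock, i.e. a reset could take the write side now. */
static bool reset_lock_is_free(void)
{
	if (pthread_rwlock_trywrlock(&reset_lock) != 0)
		return false;
	pthread_rwlock_unlock(&reset_lock);
	return true;
}
```

With a fence that is never posted, run_ib_test() returns -ETIMEDOUT once the deadline passes and reset_lock_is_free() is true afterwards, so a later reset is not blocked by the test. If the wait had no timeout, the write lock would stay held for as long as the hang lasts.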

________________________________________
From: Lazar, Lijo <Lijo.Lazar@amd.com>
Sent: September 13, 2021 15:15
To: Christian König; Koenig, Christian; Pan, Xinhui; amd-gfx@lists.freedesktop.org
Cc: Deucher, Alexander
Subject: Re: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test



On 9/13/2021 12:21 PM, Christian König wrote:
> Keep in mind that we don't try to avoid contention here. The goal is
> rather to have as few locks as possible to avoid the extra overhead in
> the hot path.
>
> Contention is completely irrelevant for the debug and device reset since
> that are rarely occurring events and performance doesn't matter for them.
>
> It is perfectly reasonable to take the write side of the reset lock as
> necessary when we need to make sure that we don't have concurrent device
> access.

The original code has down_read which gave the impression that there is
some protection to avoid access during reset. Basically would like to
avoid this as a precedence for this sort of usage for any debugfs call.
Reset semaphore is supposed to be a 'protect all' thing and provides a
shortcut.

BTW, question about a hypothetical case - what happens if the test
itself causes a hang and need to trigger a reset? Will there be chance
for the lock to be released (whether a submit call will hang
indefinitely) for the actual reset to be executed?

Thanks,
Lijo

>
> Regards,
> Christian.
>
> On 13.09.21 at 08:43, Lazar, Lijo wrote:
>> There are other interfaces to emulate the exact reset process, or
>> atleast this is not the one we are using for doing any sort of reset
>> through debugfs.
>>
>> In any case, the expectation is reset thread takes the write side of
>> the lock and it's already done somewhere else.
>>
>> Reset semaphore is supposed to protect the device from concurrent
>> access (any sort of resource usage is thus protected by default). Then
>> the same logic can be applied for any other call and that is not a
>> reasonable ask.
>>
>> Thanks,
>> Lijo
>>
>> On 9/13/2021 12:07 PM, Christian König wrote:
>>> That's complete nonsense.
>>>
>>> The debugfs interface emulates parts of the reset procedure for
>>> testing and we absolutely need to take the same locks as the reset to
>>> avoid corruption of the involved objects.
>>>
>>> Regards,
>>> Christian.
>>>
>>> On 13.09.21 at 08:25, Lazar, Lijo wrote:
>>>> This is a debugfs interface and adding another writer contention in
>>>> debugfs over an actual reset is lazy fix. This shouldn't be executed
>>>> in the first place and should not take precedence over any reset.
>>>>
>>>> Thanks,
>>>> Lijo
>>>>
>>>>
>>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>>> NAK, this is not the lazy way to fix it at all.
>>>>>
>>>>> The reset semaphore protects the scheduler and ring objects from
>>>>> concurrent modification, so taking the write side of it is
>>>>> perfectly valid here.
>>>>>
>>>>> Christian.
>>>>>
>>>>> On 13.09.21 at 06:42, Pan, Xinhui wrote:
>>>>>> [AMD Official Use Only]
>>>>>>
>>>>>> yep, that is a lazy way to fix it.
>>>>>>
>>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before
>>>>>> we issue test_ib on each ring.
>>>>>> ________________________________________
>>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>>> Sent: September 13, 2021 12:00
>>>>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>>>>> Cc: Deucher, Alexander; Koenig, Christian
>>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>>
>>>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>>>> ---
>>>>>>>    drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>>    1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>> @@ -1358,7 +1358,7 @@ static int
>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>        }
>>>>>>>
>>>>>>>        /* Avoid accidently unparking the sched thread during GPU
>>>>>>> reset */
>>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>>> There are many ioctls and debugfs calls which takes this lock and
>>>>>> as you
>>>>>> know the purpose is to avoid them while there is a reset. The
>>>>>> purpose is
>>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>>> otherwise and fixing those concurrency issues this way is just
>>>>>> lazy and
>>>>>> not acceptable.
>>>>>>
>>>>>> This will take away any fairness given to the writer in this rw
>>>>>> lock and
>>>>>> that is supposed to be the reset thread.
>>>>>>
>>>>>> Thanks,
>>>>>> Lijo
>>>>>>
>>>>>>>        if (r)
>>>>>>>                return r;
>>>>>>>
>>>>>>> @@ -1387,7 +1387,7 @@ static int
>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>                kthread_unpark(ring->sched.thread);
>>>>>>>        }
>>>>>>>
>>>>>>> -     up_read(&adev->reset_sem);
>>>>>>> +     up_write(&adev->reset_sem);
>>>>>>>
>>>>>>>        pm_runtime_mark_last_busy(dev->dev);
>>>>>>>        pm_runtime_put_autosuspend(dev->dev);
>>>>>>>
>>>>>
>>>
>

* Re: Re: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-13  7:23                 ` Re: " Pan, Xinhui
@ 2021-09-13 13:50                   ` Lazar, Lijo
  0 siblings, 0 replies; 13+ messages in thread
From: Lazar, Lijo @ 2021-09-13 13:50 UTC (permalink / raw)
  To: Pan, Xinhui, Christian König, Koenig, Christian, amd-gfx
  Cc: Deucher, Alexander

Thanks for the clarification, Xinhui.

Based on Christian's explanation, what I understood is - this is an 
exceptional case in debugfs calls and the other goal is to avoid 
maintenance of one more lock just to support this API. I no longer have 
any issues with this approach.

Thanks,
Lijo

On 9/13/2021 12:53 PM, Pan, Xinhui wrote:
> [AMD Official Use Only]
> 
> Of course the IB test can hang the GPU.
> But it waits on the fence with a specific timeout, and it does not depend on the GPU scheduler.
> So the IB test is guaranteed to return.
> 
> ________________________________________
> From: Lazar, Lijo <Lijo.Lazar@amd.com>
> Sent: September 13, 2021 15:15
> To: Christian König; Koenig, Christian; Pan, Xinhui; amd-gfx@lists.freedesktop.org
> Cc: Deucher, Alexander
> Subject: Re: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
> 
> 
> 
> On 9/13/2021 12:21 PM, Christian König wrote:
>> Keep in mind that we don't try to avoid contention here. The goal is
>> rather to have as few locks as possible to avoid the extra overhead in
>> the hot path.
>>
>> Contention is completely irrelevant for the debug and device reset since
>> that are rarely occurring events and performance doesn't matter for them.
>>
>> It is perfectly reasonable to take the write side of the reset lock as
>> necessary when we need to make sure that we don't have concurrent device
>> access.
> 
> The original code has down_read which gave the impression that there is
> some protection to avoid access during reset. Basically would like to
> avoid this as a precedence for this sort of usage for any debugfs call.
> Reset semaphore is supposed to be a 'protect all' thing and provides a
> shortcut.
> 
> BTW, question about a hypothetical case - what happens if the test
> itself causes a hang and need to trigger a reset? Will there be chance
> for the lock to be released (whether a submit call will hang
> indefinitely) for the actual reset to be executed?
> 
> Thanks,
> Lijo
> 
>>
>> Regards,
>> Christian.
>>
>> On 13.09.21 at 08:43, Lazar, Lijo wrote:
>>> There are other interfaces to emulate the exact reset process, or
>>> atleast this is not the one we are using for doing any sort of reset
>>> through debugfs.
>>>
>>> In any case, the expectation is reset thread takes the write side of
>>> the lock and it's already done somewhere else.
>>>
>>> Reset semaphore is supposed to protect the device from concurrent
>>> access (any sort of resource usage is thus protected by default). Then
>>> the same logic can be applied for any other call and that is not a
>>> reasonable ask.
>>>
>>> Thanks,
>>> Lijo
>>>
>>> On 9/13/2021 12:07 PM, Christian König wrote:
>>>> That's complete nonsense.
>>>>
>>>> The debugfs interface emulates parts of the reset procedure for
>>>> testing and we absolutely need to take the same locks as the reset to
>>>> avoid corruption of the involved objects.
>>>>
>>>> Regards,
>>>> Christian.
>>>>
>>>> On 13.09.21 at 08:25, Lazar, Lijo wrote:
>>>>> This is a debugfs interface and adding another writer contention in
>>>>> debugfs over an actual reset is lazy fix. This shouldn't be executed
>>>>> in the first place and should not take precedence over any reset.
>>>>>
>>>>> Thanks,
>>>>> Lijo
>>>>>
>>>>>
>>>>> On 9/13/2021 11:52 AM, Christian König wrote:
>>>>>> NAK, this is not the lazy way to fix it at all.
>>>>>>
>>>>>> The reset semaphore protects the scheduler and ring objects from
>>>>>> concurrent modification, so taking the write side of it is
>>>>>> perfectly valid here.
>>>>>>
>>>>>> Christian.
>>>>>>
>>>>>> On 13.09.21 at 06:42, Pan, Xinhui wrote:
>>>>>>> [AMD Official Use Only]
>>>>>>>
>>>>>>> yep, that is a lazy way to fix it.
>>>>>>>
>>>>>>> I am thinking of adding one amdgpu_ring.direct_access_mutex before
>>>>>>> we issue test_ib on each ring.
>>>>>>> ________________________________________
>>>>>>> From: Lazar, Lijo <Lijo.Lazar@amd.com>
>>>>>>> Sent: September 13, 2021 12:00
>>>>>>> To: Pan, Xinhui; amd-gfx@lists.freedesktop.org
>>>>>>> Cc: Deucher, Alexander; Koenig, Christian
>>>>>>> Subject: Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 9/13/2021 5:18 AM, xinhui pan wrote:
>>>>>>>> Direct IB submission should be exclusive. So use write lock.
>>>>>>>>
>>>>>>>> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
>>>>>>>> ---
>>>>>>>>     drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>>>>>>>>     1 file changed, 2 insertions(+), 2 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> index 19323b4cce7b..be5d12ed3db1 100644
>>>>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
>>>>>>>> @@ -1358,7 +1358,7 @@ static int
>>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>>         }
>>>>>>>>
>>>>>>>>         /* Avoid accidently unparking the sched thread during GPU
>>>>>>>> reset */
>>>>>>>> -     r = down_read_killable(&adev->reset_sem);
>>>>>>>> +     r = down_write_killable(&adev->reset_sem);
>>>>>>> There are many ioctls and debugfs calls which takes this lock and
>>>>>>> as you
>>>>>>> know the purpose is to avoid them while there is a reset. The
>>>>>>> purpose is
>>>>>>> *not to* fix any concurrency issues those calls themselves have
>>>>>>> otherwise and fixing those concurrency issues this way is just
>>>>>>> lazy and
>>>>>>> not acceptable.
>>>>>>>
>>>>>>> This will take away any fairness given to the writer in this rw
>>>>>>> lock and
>>>>>>> that is supposed to be the reset thread.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lijo
>>>>>>>
>>>>>>>>         if (r)
>>>>>>>>                 return r;
>>>>>>>>
>>>>>>>> @@ -1387,7 +1387,7 @@ static int
>>>>>>>> amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>>>>>>>>                 kthread_unpark(ring->sched.thread);
>>>>>>>>         }
>>>>>>>>
>>>>>>>> -     up_read(&adev->reset_sem);
>>>>>>>> +     up_write(&adev->reset_sem);
>>>>>>>>
>>>>>>>>         pm_runtime_mark_last_busy(dev->dev);
>>>>>>>>         pm_runtime_put_autosuspend(dev->dev);
>>>>>>>>
>>>>>>
>>>>
>>

* Re: [PATCH v2] drm/amdgpu: Fix a race of IB test
  2021-09-12 23:48 [PATCH v2] drm/amdgpu: Fix a race of IB test xinhui pan
  2021-09-13  4:00 ` Lazar, Lijo
@ 2021-09-13 14:41 ` Andrey Grodzovsky
  1 sibling, 0 replies; 13+ messages in thread
From: Andrey Grodzovsky @ 2021-09-13 14:41 UTC (permalink / raw)
  To: xinhui pan, amd-gfx; +Cc: alexander.deucher, christian.koenig

Please add a V2 tag to the description explaining the delta from V1.
Other than that, looks good to me.

Andrey

On 2021-09-12 7:48 p.m., xinhui pan wrote:
> Direct IB submission should be exclusive. So use write lock.
>
> Signed-off-by: xinhui pan <xinhui.pan@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> index 19323b4cce7b..be5d12ed3db1 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_debugfs.c
> @@ -1358,7 +1358,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>   	}
>   
>   	/* Avoid accidently unparking the sched thread during GPU reset */
> -	r = down_read_killable(&adev->reset_sem);
> +	r = down_write_killable(&adev->reset_sem);
>   	if (r)
>   		return r;
>   
> @@ -1387,7 +1387,7 @@ static int amdgpu_debugfs_test_ib_show(struct seq_file *m, void *unused)
>   		kthread_unpark(ring->sched.thread);
>   	}
>   
> -	up_read(&adev->reset_sem);
> +	up_write(&adev->reset_sem);
>   
>   	pm_runtime_mark_last_busy(dev->dev);
>   	pm_runtime_put_autosuspend(dev->dev);

Thread overview: 13+ messages
2021-09-12 23:48 [PATCH v2] drm/amdgpu: Fix a race of IB test xinhui pan
2021-09-13  4:00 ` Lazar, Lijo
2021-09-13  4:42   ` Re: " Pan, Xinhui
2021-09-13  6:22     ` Christian König
2021-09-13  6:25       ` Lazar, Lijo
2021-09-13  6:37         ` Christian König
2021-09-13  6:43           ` Lazar, Lijo
2021-09-13  6:51             ` Christian König
2021-09-13  7:15               ` Lazar, Lijo
2021-09-13  7:22                 ` Christian König
2021-09-13  7:23                 ` Re: " Pan, Xinhui
2021-09-13 13:50                   ` Lazar, Lijo
2021-09-13 14:41 ` Andrey Grodzovsky
