All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdgpu: workaround for TLB seq race
@ 2022-11-02 14:58 Christian König
  2022-11-02 16:08 ` Alex Deucher
                   ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Christian König @ 2022-11-02 14:58 UTC (permalink / raw)
  To: amd-gfx; +Cc: Christian König

It can happen that we query the sequence value before the callback
had a chance to run.

Work around that by grabbing the fence lock and releasing it again.
Should be replaced by hw handling soon.

Signed-off-by: Christian König <christian.koenig@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
index 9ecb7f663e19..e51a46c9582b 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
@@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
  */
 static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
 {
+	unsigned long flags;
+	spinlock_t *lock;
+
+	/*
+	 * Work around to stop racing between the fence signaling and handling
+	 * the cb. The lock is static after initially setting it up, just make
+	 * sure that the dma_fence structure isn't freed up.
+	 */
+	rcu_read_lock();
+	lock = vm->last_tlb_flush->lock;
+	rcu_read_unlock();
+
+	spin_lock_irqsave(lock, flags);
+	spin_unlock_irqrestore(lock, flags);
+
 	return atomic64_read(&vm->tlb_seq);
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: workaround for TLB seq race
  2022-11-02 14:58 [PATCH] drm/amdgpu: workaround for TLB seq race Christian König
@ 2022-11-02 16:08 ` Alex Deucher
  2022-11-03 18:03 ` Limonciello, Mario
  2022-11-03 21:18 ` [PATCH] " Philip Yang
  2 siblings, 0 replies; 7+ messages in thread
From: Alex Deucher @ 2022-11-02 16:08 UTC (permalink / raw)
  To: Christian König; +Cc: Christian König, amd-gfx

On Wed, Nov 2, 2022 at 10:58 AM Christian König
<ckoenig.leichtzumerken@gmail.com> wrote:
>
> It can happen that we query the sequence value before the callback
> had a chance to run.
>
> Work around that by grabbing the fence lock and releasing it again.

workaround

> Should be replaced by hw handling soon.
>
> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 9ecb7f663e19..e51a46c9582b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
>   */
>  static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>  {
> +       unsigned long flags;
> +       spinlock_t *lock;
> +
> +       /*
> +        * Work around to stop racing between the fence signaling and handling

Workaround

WIth that fixed up, the patch is:
Acked-by: Alex Deucher <alexander.deucher@amd.com>


> +        * the cb. The lock is static after initially setting it up, just make
> +        * sure that the dma_fence structure isn't freed up.
> +        */
> +       rcu_read_lock();
> +       lock = vm->last_tlb_flush->lock;
> +       rcu_read_unlock();
> +
> +       spin_lock_irqsave(lock, flags);
> +       spin_unlock_irqrestore(lock, flags);
> +
>         return atomic64_read(&vm->tlb_seq);
>  }
>
> --
> 2.34.1
>

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: drm/amdgpu: workaround for TLB seq race
  2022-11-02 14:58 [PATCH] drm/amdgpu: workaround for TLB seq race Christian König
  2022-11-02 16:08 ` Alex Deucher
@ 2022-11-03 18:03 ` Limonciello, Mario
  2022-11-03 21:18 ` [PATCH] " Philip Yang
  2 siblings, 0 replies; 7+ messages in thread
From: Limonciello, Mario @ 2022-11-03 18:03 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Christian König

On 11/2/2022 09:58, Christian König wrote:
> It can happen that we query the sequence value before the callback
> had a chance to run.
> 
> Work around that by grabbing the fence lock and releasing it again.
> Should be replaced by hw handling soon.
> 
> Signed-off-by: Christian König <christian.koenig@amd.com>
> Acked-by: Alex Deucher <alexander.deucher@amd.com>

You probably also want to add these tags:

Fixes: 5255e146c99a6 ("drm/amdgpu: rework TLB flushing")
Link: https://gitlab.freedesktop.org/drm/amd/-/issues/2113

> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 9ecb7f663e19..e51a46c9582b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
>    */
>   static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>   {
> +	unsigned long flags;
> +	spinlock_t *lock;
> +
> +	/*
> +	 * Work around to stop racing between the fence signaling and handling
> +	 * the cb. The lock is static after initially setting it up, just make
> +	 * sure that the dma_fence structure isn't freed up.
> +	 */
> +	rcu_read_lock();
> +	lock = vm->last_tlb_flush->lock;
> +	rcu_read_unlock();
> +
> +	spin_lock_irqsave(lock, flags);
> +	spin_unlock_irqrestore(lock, flags);
> +
>   	return atomic64_read(&vm->tlb_seq);
>   }
>   

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: workaround for TLB seq race
  2022-11-02 14:58 [PATCH] drm/amdgpu: workaround for TLB seq race Christian König
  2022-11-02 16:08 ` Alex Deucher
  2022-11-03 18:03 ` Limonciello, Mario
@ 2022-11-03 21:18 ` Philip Yang
  2022-11-04  7:10   ` Christian König
  2 siblings, 1 reply; 7+ messages in thread
From: Philip Yang @ 2022-11-03 21:18 UTC (permalink / raw)
  To: Christian König, amd-gfx; +Cc: Christian König


On 2022-11-02 10:58, Christian König wrote:
> It can happen that we query the sequence value before the callback
> had a chance to run.
>
> Work around that by grabbing the fence lock and releasing it again.
> Should be replaced by hw handling soon.

kfd_flush_tlb is always called after waiting for map/unmap to GPU fence 
signalled, that means the callback is already executed and the sequence 
is increased if tlb flush is needed, so no such race from KFD.

I am not sure but seems the race does exist for amdgpu to grab vm and 
schedule job.

Acked-by: Philip Yang <Philip.Yang@amd.com>

> Signed-off-by: Christian König <christian.koenig@amd.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
>   1 file changed, 15 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> index 9ecb7f663e19..e51a46c9582b 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
> @@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm *vm, struct seq_file *m);
>    */
>   static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>   {
> +	unsigned long flags;
> +	spinlock_t *lock;
> +
> +	/*
> +	 * Work around to stop racing between the fence signaling and handling
> +	 * the cb. The lock is static after initially setting it up, just make
> +	 * sure that the dma_fence structure isn't freed up.
> +	 */
> +	rcu_read_lock();
> +	lock = vm->last_tlb_flush->lock;
> +	rcu_read_unlock();
> +
> +	spin_lock_irqsave(lock, flags);
> +	spin_unlock_irqrestore(lock, flags);
> +
>   	return atomic64_read(&vm->tlb_seq);
>   }
>   

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: workaround for TLB seq race
  2022-11-03 21:18 ` [PATCH] " Philip Yang
@ 2022-11-04  7:10   ` Christian König
  0 siblings, 0 replies; 7+ messages in thread
From: Christian König @ 2022-11-04  7:10 UTC (permalink / raw)
  To: Philip Yang, amd-gfx; +Cc: Christian König

Am 03.11.22 um 22:18 schrieb Philip Yang:
>
> On 2022-11-02 10:58, Christian König wrote:
>> It can happen that we query the sequence value before the callback
>> had a chance to run.
>>
>> Work around that by grabbing the fence lock and releasing it again.
>> Should be replaced by hw handling soon.
>
> kfd_flush_tlb is always called after waiting for map/unmap to GPU 
> fence signalled, that means the callback is already executed

And exactly that's incorrect.

Waiting for the fence to signal means that the callback has started 
executing, but it doesn't mean that it is finished.

This can then result in one CPU racing with the callback handler and 
because of this you see the wrong TLB seq.

Regards,
Christian.

> and the sequence is increased if tlb flush is needed, so no such race 
> from KFD.
>
> I am not sure but seems the race does exist for amdgpu to grab vm and 
> schedule job.
>
> Acked-by: Philip Yang <Philip.Yang@amd.com>
>
>> Signed-off-by: Christian König <christian.koenig@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h | 15 +++++++++++++++
>>   1 file changed, 15 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> index 9ecb7f663e19..e51a46c9582b 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_vm.h
>> @@ -485,6 +485,21 @@ void amdgpu_debugfs_vm_bo_info(struct amdgpu_vm 
>> *vm, struct seq_file *m);
>>    */
>>   static inline uint64_t amdgpu_vm_tlb_seq(struct amdgpu_vm *vm)
>>   {
>> +    unsigned long flags;
>> +    spinlock_t *lock;
>> +
>> +    /*
>> +     * Work around to stop racing between the fence signaling and 
>> handling
>> +     * the cb. The lock is static after initially setting it up, 
>> just make
>> +     * sure that the dma_fence structure isn't freed up.
>> +     */
>> +    rcu_read_lock();
>> +    lock = vm->last_tlb_flush->lock;
>> +    rcu_read_unlock();
>> +
>> +    spin_lock_irqsave(lock, flags);
>> +    spin_unlock_irqrestore(lock, flags);
>> +
>>       return atomic64_read(&vm->tlb_seq);
>>   }


^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [PATCH] drm/amdgpu: workaround for TLB seq race
@ 2022-11-03 18:19 Stefan Springer
  0 siblings, 0 replies; 7+ messages in thread
From: Stefan Springer @ 2022-11-03 18:19 UTC (permalink / raw)
  To: amd-gfx

[-- Attachment #1: Type: text/plain, Size: 315 bytes --]

Working fine after 26hrs+ of testing, plus another 24hrs of the previous
version of this patch.
Sorry if there are multiple replies, I had to figure out how to properly
reply to a mailing list, such that the reply is picked up by patchwork (1st
time doing this).

Tested-by: Stefan Springer <stefanspr94@gmail.com>

[-- Attachment #2: Type: text/html, Size: 420 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH] drm/amdgpu: workaround for TLB seq race
@ 2022-11-03 16:40 Stefan Springer
  0 siblings, 0 replies; 7+ messages in thread
From: Stefan Springer @ 2022-11-03 16:40 UTC (permalink / raw)
  To: amd-gfx

[-- Attachment #1: Type: text/plain, Size: 84 bytes --]

Looking good after 24+hrs of use
Tested-by: Stefan Springer <stefanspr94@gmail.com>

[-- Attachment #2: Type: text/html, Size: 179 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2022-11-04  8:36 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-02 14:58 [PATCH] drm/amdgpu: workaround for TLB seq race Christian König
2022-11-02 16:08 ` Alex Deucher
2022-11-03 18:03 ` Limonciello, Mario
2022-11-03 21:18 ` [PATCH] " Philip Yang
2022-11-04  7:10   ` Christian König
2022-11-03 16:40 Stefan Springer
2022-11-03 18:19 Stefan Springer

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.