All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
@ 2023-09-11 19:00 Harish Kasiviswanathan
  2023-09-12  0:55 ` Felix Kuehling
                   ` (2 more replies)
  0 siblings, 3 replies; 8+ messages in thread
From: Harish Kasiviswanathan @ 2023-09-11 19:00 UTC (permalink / raw)
  To: amd-gfx; +Cc: Harish Kasiviswanathan

Heavy-weight TLB flush is required after unmap on all GPUs for
correctness and security.

Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
index b315311dfe2a..b9950074aee0 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
@@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
 
 static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
 {
-	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
-	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
+	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
 	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
 	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
 }
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-11 19:00 [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later Harish Kasiviswanathan
@ 2023-09-12  0:55 ` Felix Kuehling
  2023-09-12  2:52 ` Lang Yu
  2023-09-13  8:25 ` Christian König
  2 siblings, 0 replies; 8+ messages in thread
From: Felix Kuehling @ 2023-09-12  0:55 UTC (permalink / raw)
  To: Harish Kasiviswanathan, amd-gfx

On 2023-09-11 15:00, Harish Kasiviswanathan wrote:
> Heavy-weight TLB flush is required after unmap on all GPUs for
> correctness and security.
>
> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>


> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index b315311dfe2a..b9950074aee0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
>   
>   static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
>   {
> -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
> -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
>   	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
>   	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
>   }

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-11 19:00 [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later Harish Kasiviswanathan
  2023-09-12  0:55 ` Felix Kuehling
@ 2023-09-12  2:52 ` Lang Yu
  2023-09-13  0:48   ` Felix Kuehling
  2023-09-13  8:25 ` Christian König
  2 siblings, 1 reply; 8+ messages in thread
From: Lang Yu @ 2023-09-12  2:52 UTC (permalink / raw)
  To: Harish Kasiviswanathan; +Cc: amd-gfx

On 09/11/ , Harish Kasiviswanathan wrote:
> Heavy-weight TLB flush is required after unmap on all GPUs for
> correctness and security.
> 
> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index b315311dfe2a..b9950074aee0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
>  
>  static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
>  {
> -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
> -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
>  	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
>  	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
>  }

1, If TLB_FLUSH_HEAVYWEIGHT is required after unmap on all GPUs
as described in commmit message, why we have this whitelist
instead of a blacklist?

2, kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is also called
in svm_range_unmap_from_gpus(). Why not add this whitelist there?

Regards,
Lang

> -- 
> 2.34.1
> 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-12  2:52 ` Lang Yu
@ 2023-09-13  0:48   ` Felix Kuehling
  2023-09-13 10:23     ` Lang Yu
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Kuehling @ 2023-09-13  0:48 UTC (permalink / raw)
  To: Lang Yu, Harish Kasiviswanathan; +Cc: amd-gfx

[-- Attachment #1: Type: text/plain, Size: 2713 bytes --]

On 2023-09-11 22:52, Lang Yu wrote:
> On 09/11/ , Harish Kasiviswanathan wrote:
>> Heavy-weight TLB flush is required after unmap on all GPUs for
>> correctness and security.
>>
>> Signed-off-by: Harish Kasiviswanathan<Harish.Kasiviswanathan@amd.com>
>> ---
>>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
>>   1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> index b315311dfe2a..b9950074aee0 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>> @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
>>   
>>   static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
>>   {
>> -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
>> -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
>> +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
>>   	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
>>   	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
>>   }
> 1, If TLB_FLUSH_HEAVYWEIGHT is required after unmap on all GPUs
> as described in commmit message, why we have this whitelist
> instead of a blacklist?

That was a bug that this patch is fixing. There were specific GPUs and 
firmware versions where the TLB flush after unmap was causing 
intermittent problems in specific tests. This should have always been a 
blacklist.


>
> 2, kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is also called
> in svm_range_unmap_from_gpus(). Why not add this whitelist there?

There was a patch that used kfd_flush_tlb_after_unmap in the SVM code. 
But you reverted that patch, probably because it caused more problems 
than it solved. SVM really must flush TLBs the way it does because it is 
so tightly integrated with Linux's virtual memory management and because 
with XNACK, memory can be unmapped while GPU work is in progress without 
preemting queues (implicitly flushing TLBs and caches):

commit 515d7cebc2e2d2b4f0a276d26f3b790a83cdfe06
Author: Lang Yu<Lang.Yu@amd.com>
Date:   Wed Apr 20 10:24:31 2022 +0800

     Revert "drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too"
     
     This reverts commit 36bf93216ecbe399c40c5e0486f0f0e3a4afa69e.
     
     It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
     at the moment.
     
     ./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
     
     Signed-off-by: Lang Yu<Lang.Yu@amd.com>
     Reviewed-by: Philip Yang<Philip.Yang@amd.com>
     Signed-off-by: Alex Deucher<alexander.deucher@amd.com>

Regards,
   Felix


>
> Regards,
> Lang
>
>> -- 
>> 2.34.1
>>

[-- Attachment #2: Type: text/html, Size: 4025 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-11 19:00 [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later Harish Kasiviswanathan
  2023-09-12  0:55 ` Felix Kuehling
  2023-09-12  2:52 ` Lang Yu
@ 2023-09-13  8:25 ` Christian König
  2 siblings, 0 replies; 8+ messages in thread
From: Christian König @ 2023-09-13  8:25 UTC (permalink / raw)
  To: Harish Kasiviswanathan, amd-gfx

Am 11.09.23 um 21:00 schrieb Harish Kasiviswanathan:
> Heavy-weight TLB flush is required after unmap on all GPUs for
> correctness and security.
>
> Signed-off-by: Harish Kasiviswanathan <Harish.Kasiviswanathan@amd.com>

Acked-by: Christian König <christian.koenig@amd.com>

> ---
>   drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> index b315311dfe2a..b9950074aee0 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
>   
>   static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
>   {
> -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
> -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
>   	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
>   	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
>   }


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-13  0:48   ` Felix Kuehling
@ 2023-09-13 10:23     ` Lang Yu
  2023-09-13 13:46       ` Felix Kuehling
  0 siblings, 1 reply; 8+ messages in thread
From: Lang Yu @ 2023-09-13 10:23 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Harish Kasiviswanathan, amd-gfx

On 09/12/ , Felix Kuehling wrote:
> On 2023-09-11 22:52, Lang Yu wrote:
> > On 09/11/ , Harish Kasiviswanathan wrote:
> > > Heavy-weight TLB flush is required after unmap on all GPUs for
> > > correctness and security.
> > > 
> > > Signed-off-by: Harish Kasiviswanathan<Harish.Kasiviswanathan@amd.com>
> > > ---
> > >   drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
> > >   1 file changed, 1 insertion(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > index b315311dfe2a..b9950074aee0 100644
> > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
> > >   static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > >   {
> > > -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
> > > -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > > +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
> > >   	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
> > >   	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > >   }
> > 1, If TLB_FLUSH_HEAVYWEIGHT is required after unmap on all GPUs
> > as described in commmit message, why we have this whitelist
> > instead of a blacklist?
> 
> That was a bug that this patch is fixing. There were specific GPUs and
> firmware versions where the TLB flush after unmap was causing intermittent
> problems in specific tests. This should have always been a blacklist.
> 
> 
> > 
> > 2, kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is also called
> > in svm_range_unmap_from_gpus(). Why not add this whitelist there?
> 
> There was a patch that used kfd_flush_tlb_after_unmap in the SVM code. But
> you reverted that patch, probably because it caused more problems than it
> solved. SVM really must flush TLBs the way it does because it is so tightly
> integrated with Linux's virtual memory management and because with XNACK,
> memory can be unmapped while GPU work is in progress without preemting
> queues (implicitly flushing TLBs and caches):
> 
> commit 515d7cebc2e2d2b4f0a276d26f3b790a83cdfe06
> Author: Lang Yu<Lang.Yu@amd.com>
> Date:   Wed Apr 20 10:24:31 2022 +0800
> 
>     Revert "drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too"
>     This reverts commit 36bf93216ecbe399c40c5e0486f0f0e3a4afa69e.
>     It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
>     at the moment.
>     ./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
>     Signed-off-by: Lang Yu<Lang.Yu@amd.com>
>     Reviewed-by: Philip Yang<Philip.Yang@amd.com>
>     Signed-off-by: Alex Deucher<alexander.deucher@amd.com>
> 
> Regards,
>   Felix

Yes, that's because kfd_flush_tlb_after_unmap() return false for Vega10(gfx901).
kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is called unconditionally in SVM
for ASICs > IP_VERSION(9, 0, 0) and works well.

So why not relax the condition to KFD_GC_VERSION(dev) > IP_VERSION(9, 0, 0) ?                                                                           

Regards,
Lang

> 
> > 
> > Regards,
> > Lang
> > 
> > > -- 
> > > 2.34.1
> > > 

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-13 10:23     ` Lang Yu
@ 2023-09-13 13:46       ` Felix Kuehling
  2023-09-14  5:58         ` Lang Yu
  0 siblings, 1 reply; 8+ messages in thread
From: Felix Kuehling @ 2023-09-13 13:46 UTC (permalink / raw)
  To: Lang Yu; +Cc: Harish Kasiviswanathan, amd-gfx

On 2023-09-13 6:23, Lang Yu wrote:
> On 09/12/ , Felix Kuehling wrote:
>> On 2023-09-11 22:52, Lang Yu wrote:
>>> On 09/11/ , Harish Kasiviswanathan wrote:
>>>> Heavy-weight TLB flush is required after unmap on all GPUs for
>>>> correctness and security.
>>>>
>>>> Signed-off-by: Harish Kasiviswanathan<Harish.Kasiviswanathan@amd.com>
>>>> ---
>>>>    drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
>>>>    1 file changed, 1 insertion(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>> index b315311dfe2a..b9950074aee0 100644
>>>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
>>>> @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
>>>>    static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
>>>>    {
>>>> -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
>>>> -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
>>>> +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
>>>>    	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
>>>>    	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
>>>>    }
>>> 1, If TLB_FLUSH_HEAVYWEIGHT is required after unmap on all GPUs
>>> as described in commmit message, why we have this whitelist
>>> instead of a blacklist?
>> That was a bug that this patch is fixing. There were specific GPUs and
>> firmware versions where the TLB flush after unmap was causing intermittent
>> problems in specific tests. This should have always been a blacklist.
>>
>>
>>> 2, kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is also called
>>> in svm_range_unmap_from_gpus(). Why not add this whitelist there?
>> There was a patch that used kfd_flush_tlb_after_unmap in the SVM code. But
>> you reverted that patch, probably because it caused more problems than it
>> solved. SVM really must flush TLBs the way it does because it is so tightly
>> integrated with Linux's virtual memory management and because with XNACK,
>> memory can be unmapped while GPU work is in progress without preemting
>> queues (implicitly flushing TLBs and caches):
>>
>> commit 515d7cebc2e2d2b4f0a276d26f3b790a83cdfe06
>> Author: Lang Yu<Lang.Yu@amd.com>
>> Date:   Wed Apr 20 10:24:31 2022 +0800
>>
>>      Revert "drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too"
>>      This reverts commit 36bf93216ecbe399c40c5e0486f0f0e3a4afa69e.
>>      It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
>>      at the moment.
>>      ./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
>>      Signed-off-by: Lang Yu<Lang.Yu@amd.com>
>>      Reviewed-by: Philip Yang<Philip.Yang@amd.com>
>>      Signed-off-by: Alex Deucher<alexander.deucher@amd.com>
>>
>> Regards,
>>    Felix
> Yes, that's because kfd_flush_tlb_after_unmap() return false for Vega10(gfx901).
> kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is called unconditionally in SVM
> for ASICs > IP_VERSION(9, 0, 0) and works well.
>
> So why not relax the condition to KFD_GC_VERSION(dev) > IP_VERSION(9, 0, 0) ?

That would reintroduce the same problem that this workaround was meant 
to fix. I don't remember all the details of this, as it was years ago. I 
believe it was an intermittent hang or VM fault that was somewhat 
difficult to reproduce and investigate. Maybe Eric remembers the details 
as he was working on this bug back then. However, it was a real issue, 
and we got an SDMA firmware fix for it on GFX IP 9.4.1 as you can see 
from the FW version check in kfd_flush_tlb_after_unmap.

I would not recommend reverting this workaround at the risk of 
reintroducing a known intermittent bug that affects stability.

Regards,
   Felix



>
> Regards,
> Lang
>
>>> Regards,
>>> Lang
>>>
>>>> -- 
>>>> 2.34.1
>>>>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later
  2023-09-13 13:46       ` Felix Kuehling
@ 2023-09-14  5:58         ` Lang Yu
  0 siblings, 0 replies; 8+ messages in thread
From: Lang Yu @ 2023-09-14  5:58 UTC (permalink / raw)
  To: Felix Kuehling; +Cc: Harish Kasiviswanathan, amd-gfx

On 09/13/ , Felix Kuehling wrote:
> On 2023-09-13 6:23, Lang Yu wrote:
> > On 09/12/ , Felix Kuehling wrote:
> > > On 2023-09-11 22:52, Lang Yu wrote:
> > > > On 09/11/ , Harish Kasiviswanathan wrote:
> > > > > Heavy-weight TLB flush is required after unmap on all GPUs for
> > > > > correctness and security.
> > > > > 
> > > > > Signed-off-by: Harish Kasiviswanathan<Harish.Kasiviswanathan@amd.com>
> > > > > ---
> > > > >    drivers/gpu/drm/amd/amdkfd/kfd_priv.h | 3 +--
> > > > >    1 file changed, 1 insertion(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > > > index b315311dfe2a..b9950074aee0 100644
> > > > > --- a/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > > > +++ b/drivers/gpu/drm/amd/amdkfd/kfd_priv.h
> > > > > @@ -1466,8 +1466,7 @@ void kfd_flush_tlb(struct kfd_process_device *pdd, enum TLB_FLUSH_TYPE type);
> > > > >    static inline bool kfd_flush_tlb_after_unmap(struct kfd_dev *dev)
> > > > >    {
> > > > > -	return KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 3) ||
> > > > > -	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 2) ||
> > > > > +	return KFD_GC_VERSION(dev) > IP_VERSION(9, 4, 2) ||
> > > > >    	       (KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 1) && dev->sdma_fw_version >= 18) ||
> > > > >    	       KFD_GC_VERSION(dev) == IP_VERSION(9, 4, 0);
> > > > >    }
> > > > 1, If TLB_FLUSH_HEAVYWEIGHT is required after unmap on all GPUs
> > > > as described in commmit message, why we have this whitelist
> > > > instead of a blacklist?
> > > That was a bug that this patch is fixing. There were specific GPUs and
> > > firmware versions where the TLB flush after unmap was causing intermittent
> > > problems in specific tests. This should have always been a blacklist.
> > > 
> > > 
> > > > 2, kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is also called
> > > > in svm_range_unmap_from_gpus(). Why not add this whitelist there?
> > > There was a patch that used kfd_flush_tlb_after_unmap in the SVM code. But
> > > you reverted that patch, probably because it caused more problems than it
> > > solved. SVM really must flush TLBs the way it does because it is so tightly
> > > integrated with Linux's virtual memory management and because with XNACK,
> > > memory can be unmapped while GPU work is in progress without preemting
> > > queues (implicitly flushing TLBs and caches):
> > > 
> > > commit 515d7cebc2e2d2b4f0a276d26f3b790a83cdfe06
> > > Author: Lang Yu<Lang.Yu@amd.com>
> > > Date:   Wed Apr 20 10:24:31 2022 +0800
> > > 
> > >      Revert "drm/amdkfd: only allow heavy-weight TLB flush on some ASICs for SVM too"
> > >      This reverts commit 36bf93216ecbe399c40c5e0486f0f0e3a4afa69e.
> > >      It causes SVM regressions on Vega10 with XNACK-ON. Just revert it
> > >      at the moment.
> > >      ./kfdtest --gtest_filter=KFDSVMRangeTest.MigratePolicyTest
> > >      Signed-off-by: Lang Yu<Lang.Yu@amd.com>
> > >      Reviewed-by: Philip Yang<Philip.Yang@amd.com>
> > >      Signed-off-by: Alex Deucher<alexander.deucher@amd.com>
> > > 
> > > Regards,
> > >    Felix
> > Yes, that's because kfd_flush_tlb_after_unmap() return false for Vega10(gfx901).
> > kfd_flush_tlb(pdd, TLB_FLUSH_HEAVYWEIGHT) is called unconditionally in SVM
> > for ASICs > IP_VERSION(9, 0, 0) and works well.
> > 
> > So why not relax the condition to KFD_GC_VERSION(dev) > IP_VERSION(9, 0, 0) ?
> 
> That would reintroduce the same problem that this workaround was meant to
> fix. I don't remember all the details of this, as it was years ago. I
> believe it was an intermittent hang or VM fault that was somewhat difficult
> to reproduce and investigate. Maybe Eric remembers the details as he was
> working on this bug back then. However, it was a real issue, and we got an
> SDMA firmware fix for it on GFX IP 9.4.1 as you can see from the FW version
> check in kfd_flush_tlb_after_unmap.
> 
> I would not recommend reverting this workaround at the risk of reintroducing
> a known intermittent bug that affects stability.

Got it. Thank you!

Regards,
Lang

> Regards,
>   Felix
> 
> 
> 
> > 
> > Regards,
> > Lang
> > 
> > > > Regards,
> > > > Lang
> > > > 
> > > > > -- 
> > > > > 2.34.1
> > > > > 

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-09-14  5:59 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-11 19:00 [PATCH] drm/amdkfd: Insert missing TLB flush on GFX10 and later Harish Kasiviswanathan
2023-09-12  0:55 ` Felix Kuehling
2023-09-12  2:52 ` Lang Yu
2023-09-13  0:48   ` Felix Kuehling
2023-09-13 10:23     ` Lang Yu
2023-09-13 13:46       ` Felix Kuehling
2023-09-14  5:58         ` Lang Yu
2023-09-13  8:25 ` Christian König

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.