All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
@ 2021-08-10  4:43 Yifan Zhang
  2021-08-10 15:56 ` Felix Kuehling
  0 siblings, 1 reply; 5+ messages in thread
From: Yifan Zhang @ 2021-08-10  4:43 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.Kuehling, Yifan Zhang

KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.

Note: Google Test filter = KFDSVMRangeTest.*
[==========] Running 18 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 18 tests from KFDSVMRangeTest
[ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
[       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
[ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
[          ] Get default atrributes
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
  Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
Value of: expectedDefaultResults[i]
  Actual: 4294967295
Expected: outputAttributes[i].value
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
Value of: expectedDefaultResults[i]
  Actual: 4
Expected: outputAttributes[i].type
Which is: 2
[          ] Setting/Getting atrributes
[  FAILED  ]

the root cause is that svm work queue has not finished when svm_range_get_attr is called, thus
some garbage svm interval tree data make svm_range_get_attr get wrong result. Flush work queue before
iterate svm interval tree.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
index f811a3a24cd2..192e9401bed5 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
@@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
 	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
 		 start + size - 1, nattr);
 
+	/* flush pending deferred work */
+	flush_work(&p->svms.deferred_list_work);
+
 	mmap_read_lock(mm);
 	r = svm_range_is_valid(p, start, size);
 	mmap_read_unlock(mm);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
  2021-08-10  4:43 [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure Yifan Zhang
@ 2021-08-10 15:56 ` Felix Kuehling
  2021-08-12 15:49   ` Zhang, Yifan
  0 siblings, 1 reply; 5+ messages in thread
From: Felix Kuehling @ 2021-08-10 15:56 UTC (permalink / raw)
  To: Yifan Zhang, amd-gfx

Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>
> Note: Google Test filter = KFDSVMRangeTest.*
> [==========] Running 18 tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 18 tests from KFDSVMRangeTest
> [ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
> [       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
> [ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
> [          ] Get default atrributes
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
> Value of: expectedDefaultResults[i]
>   Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154: Failure
> Value of: expectedDefaultResults[i]
>   Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152: Failure
> Value of: expectedDefaultResults[i]
>   Actual: 4
> Expected: outputAttributes[i].type
> Which is: 2
> [          ] Setting/Getting atrributes
> [  FAILED  ]
>
> the root cause is that svm work queue has not finished when svm_range_get_attr is called, thus
> some garbage svm interval tree data make svm_range_get_attr get wrong result. Flush work queue before
> iterate svm interval tree.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index f811a3a24cd2..192e9401bed5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
>  	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>  		 start + size - 1, nattr);
>  
> +	/* flush pending deferred work */
> +	flush_work(&p->svms.deferred_list_work);
> +

There is still a race condition here. More work can be added to the
deferred_list_work after the flush call.

Work gets added to the deferred_list asynchronously, for example in MMU
notifiers. Trying to synchronize with asynchronous events is inherently
problematic. It appears that the test is making some assumptions about
things happening asynchronously (page faults or MMU notifiers) and
that's probably a problem with the test, not with the driver.

Alternatively, there may be a problem with a set-attribute call that
leaves some operations on the deferred list and results in unexpected
get-attribute results. If that's the problem, we may need to add a
flush-call to the end of the set-attributes function.

Can you provide more details about the exact sequence of set-attribute
and get-attribute calls that is causing the problem?

Regards,
  Felix


>  	mmap_read_lock(mm);
>  	r = svm_range_is_valid(p, start, size);
>  	mmap_read_unlock(mm);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
  2021-08-10 15:56 ` Felix Kuehling
@ 2021-08-12 15:49   ` Zhang, Yifan
  2021-08-12 20:22     ` Felix Kuehling
  0 siblings, 1 reply; 5+ messages in thread
From: Zhang, Yifan @ 2021-08-12 15:49 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx

[AMD Official Use Only]

Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute call. There is time slot when Process vma has been unmapped from CPU side, but prange is not removed in deferred list. If get-attribute is called  in this time slot, then there will be a problem.

This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after KFDSVMRangeTest.BasicSystemMemTest test.

TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
     TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
     TEST_START(TESTPROFILE_RUNALL)
......
     HSAint32 enable = -1;
     EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
     expectedDefaultResults[4] = (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
     sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the last test case since the addr is already munmap for CPU.
     char *pBuf = sysBuffer->As<char *>();
 
     LOG() << "Get default atrributes" << std::endl;
     memcpy(outputAttributes, inputAttributes, nAttributes * sizeof(HSA_SVM_ATTRIBUTE));
     EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the problematic get-attribute. It returns the the prange attributes in the last test case since prange is not removed.
                                     nAttributes, outputAttributes));

sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should return the default attributes. But there are some corner cases, deferred list from last test case is not finished when get attribute is called, so svm range list remains the contents of the last test. In such cases, hsaKmtSVMGetAttr returns the attributes in the last test instead of default values. (The two test cases are in the same process, mmap in HsaSVMRange constructor return the same addr) and makes test case fail.


-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Tuesday, August 10, 2021 11:57 PM
To: Zhang, Yifan <Yifan1.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure

Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>
> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18 
> tests from 1 test case.
> [----------] Global test environment set-up.
> [----------] 18 tests from KFDSVMRangeTest
> [ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
> [       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
> [ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
> [          ] Get default atrributes
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
> : Failure Value of: expectedDefaultResults[i]
>   Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
> : Failure Value of: expectedDefaultResults[i]
>   Actual: 4294967295
> Expected: outputAttributes[i].value
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152
> : Failure Value of: expectedDefaultResults[i]
>   Actual: 4
> Expected: outputAttributes[i].type
> Which is: 2
> [          ] Setting/Getting atrributes
> [  FAILED  ]
>
> the root cause is that svm work queue has not finished when 
> svm_range_get_attr is called, thus some garbage svm interval tree data 
> make svm_range_get_attr get wrong result. Flush work queue before iterate svm interval tree.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> index f811a3a24cd2..192e9401bed5 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
>  	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>  		 start + size - 1, nattr);
>  
> +	/* flush pending deferred work */
> +	flush_work(&p->svms.deferred_list_work);
> +

There is still a race condition here. More work can be added to the deferred_list_work after the flush call.

Work gets added to the deferred_list asynchronously, for example in MMU notifiers. Trying to synchronize with asynchronous events is inherently problematic. It appears that the test is making some assumptions about things happening asynchronously (page faults or MMU notifiers) and that's probably a problem with the test, not with the driver.

Alternatively, there may be a problem with a set-attribute call that leaves some operations on the deferred list and results in unexpected get-attribute results. If that's the problem, we may need to add a flush-call to the end of the set-attributes function.

Can you provide more details about the exact sequence of set-attribute and get-attribute calls that is causing the problem?

Regards,
  Felix


>  	mmap_read_lock(mm);
>  	r = svm_range_is_valid(p, start, size);
>  	mmap_read_unlock(mm);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
  2021-08-12 15:49   ` Zhang, Yifan
@ 2021-08-12 20:22     ` Felix Kuehling
  2021-08-14 10:17       ` Zhang, Yifan
  0 siblings, 1 reply; 5+ messages in thread
From: Felix Kuehling @ 2021-08-12 20:22 UTC (permalink / raw)
  To: Zhang, Yifan, amd-gfx

Am 2021-08-12 um 11:49 a.m. schrieb Zhang, Yifan:
> [AMD Official Use Only]
>
> Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute call. There is time slot when Process vma has been unmapped from CPU side, but prange is not removed in deferred list. If get-attribute is called  in this time slot, then there will be a problem.
>
> This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after KFDSVMRangeTest.BasicSystemMemTest test.
>
> TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
>      TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
>      TEST_START(TESTPROFILE_RUNALL)
> ......
>      HSAint32 enable = -1;
>      EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
>      expectedDefaultResults[4] = (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
>      sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the last test case since the addr is already munmap for CPU.
>      char *pBuf = sysBuffer->As<char *>();
>  
>      LOG() << "Get default atrributes" << std::endl;
>      memcpy(outputAttributes, inputAttributes, nAttributes * sizeof(HSA_SVM_ATTRIBUTE));
>      EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the problematic get-attribute. It returns the the prange attributes in the last test case since prange is not removed.
>                                      nAttributes, outputAttributes));
>
> sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should return the default attributes. But there are some corner cases, deferred list from last test case is not finished when get attribute is called, so svm range list remains the contents of the last test. In such cases, hsaKmtSVMGetAttr returns the attributes in the last test instead of default values. (The two test cases are in the same process, mmap in HsaSVMRange constructor return the same addr) and makes test case fail.

OK. Thanks for investigating further. I think your patch is the the
correct solution after all. I'd just add a carefully worded comment:

/* Flush pending deferred work to avoid racing with deferred actions from
 * previous memory map changes (e.g. munmap). Concurrent memory map changes
 * can still race with get_attr because we don't hold the mmap lock. But that
 * would be a race condition in the application anyway, and undefined
 * behaviour is acceptable in that case.
 */

With that, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

Regards,
  Felix


>
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com> 
> Sent: Tuesday, August 10, 2021 11:57 PM
> To: Zhang, Yifan <Yifan1.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
>
> Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
>> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>>
>> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18 
>> tests from 1 test case.
>> [----------] Global test environment set-up.
>> [----------] 18 tests from KFDSVMRangeTest
>> [ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
>> [       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
>> [ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
>> [          ] Get default atrributes
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:154
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:152
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4
>> Expected: outputAttributes[i].type
>> Which is: 2
>> [          ] Setting/Getting atrributes
>> [  FAILED  ]
>>
>> the root cause is that svm work queue has not finished when 
>> svm_range_get_attr is called, thus some garbage svm interval tree data 
>> make svm_range_get_attr get wrong result. Flush work queue before iterate svm interval tree.
>>
>> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c 
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index f811a3a24cd2..192e9401bed5 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
>>  	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>>  		 start + size - 1, nattr);
>>  
>> +	/* flush pending deferred work */
>> +	flush_work(&p->svms.deferred_list_work);
>> +
> There is still a race condition here. More work can be added to the deferred_list_work after the flush call.
>
> Work gets added to the deferred_list asynchronously, for example in MMU notifiers. Trying to synchronize with asynchronous events is inherently problematic. It appears that the test is making some assumptions about things happening asynchronously (page faults or MMU notifiers) and that's probably a problem with the test, not with the driver.
>
> Alternatively, there may be a problem with a set-attribute call that leaves some operations on the deferred list and results in unexpected get-attribute results. If that's the problem, we may need to add a flush-call to the end of the set-attributes function.
>
> Can you provide more details about the exact sequence of set-attribute and get-attribute calls that is causing the problem?
>
> Regards,
>   Felix
>
>
>>  	mmap_read_lock(mm);
>>  	r = svm_range_is_valid(p, start, size);
>>  	mmap_read_unlock(mm);

^ permalink raw reply	[flat|nested] 5+ messages in thread

* RE: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure
  2021-08-12 20:22     ` Felix Kuehling
@ 2021-08-14 10:17       ` Zhang, Yifan
  0 siblings, 0 replies; 5+ messages in thread
From: Zhang, Yifan @ 2021-08-14 10:17 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx

[Public]

Hi Felix,

Thanks for comments. I just found a similar race condition in KFD test. Sent it in another thread "[PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails"

BRs,
Yifan

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Friday, August 13, 2021 4:23 AM
To: Zhang, Yifan <Yifan1.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure

Am 2021-08-12 um 11:49 a.m. schrieb Zhang, Yifan:
> [AMD Official Use Only]
>
> Yes, it is a sync issue b/w SVMRange unmap deferred list and get-attribute call. There is time slot when Process vma has been unmapped from CPU side, but prange is not removed in deferred list. If get-attribute is called  in this time slot, then there will be a problem.
>
> This issue happens when KFDSVMRangeTest,SetGetAttributesTest runs after KFDSVMRangeTest.BasicSystemMemTest test.
>
> TEST_F(KFDSVMRangeTest, SetGetAttributesTest) {
>      TEST_REQUIRE_ENV_CAPABILITIES(ENVCAPS_64BITLINUX);
>      TEST_START(TESTPROFILE_RUNALL)
> ......
>      HSAint32 enable = -1;
>      EXPECT_SUCCESS(hsaKmtGetXNACKMode(&enable));
>      expectedDefaultResults[4] = (enable)?HSA_SVM_ATTR_ACCESS:HSA_SVM_ATTR_NO_ACCESS;
>      sysBuffer = new HsaSVMRange(BufSize); // It returns the same addr as the last test case since the addr is already munmap for CPU.
>      char *pBuf = sysBuffer->As<char *>();
>  
>      LOG() << "Get default atrributes" << std::endl;
>      memcpy(outputAttributes, inputAttributes, nAttributes * sizeof(HSA_SVM_ATTRIBUTE));
>      EXPECT_SUCCESS(hsaKmtSVMGetAttr(pBuf, BufSize, // This is the problematic get-attribute. It returns the the prange attributes in the last test case since prange is not removed.
>                                      nAttributes, outputAttributes));
>
> sysBufer address is an unregistered SVM range. Thus hsaKmtSVMGetAttr should return the default attributes. But there are some corner cases, deferred list from last test case is not finished when get attribute is called, so svm range list remains the contents of the last test. In such cases, hsaKmtSVMGetAttr returns the attributes in the last test instead of default values. (The two test cases are in the same process, mmap in HsaSVMRange constructor return the same addr) and makes test case fail.

OK. Thanks for investigating further. I think your patch is the the correct solution after all. I'd just add a carefully worded comment:

/* Flush pending deferred work to avoid racing with deferred actions from
 * previous memory map changes (e.g. munmap). Concurrent memory map changes
 * can still race with get_attr because we don't hold the mmap lock. But that
 * would be a race condition in the application anyway, and undefined
 * behaviour is acceptable in that case.
 */

With that, the patch is

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>

Regards,
  Felix


>
>
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Tuesday, August 10, 2021 11:57 PM
> To: Zhang, Yifan <Yifan1.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
> Subject: Re: [PATCH] drm/amdkfd: fix random 
> KFDSVMRangeTest.SetGetAttributesTest test failure
>
> Am 2021-08-10 um 12:43 a.m. schrieb Yifan Zhang:
>> KFDSVMRangeTest.SetGetAttributesTest randomly fails in stress test.
>>
>> Note: Google Test filter = KFDSVMRangeTest.* [==========] Running 18 
>> tests from 1 test case.
>> [----------] Global test environment set-up.
>> [----------] 18 tests from KFDSVMRangeTest
>> [ RUN      ] KFDSVMRangeTest.BasicSystemMemTest
>> [       OK ] KFDSVMRangeTest.BasicSystemMemTest (30 ms)
>> [ RUN      ] KFDSVMRangeTest.SetGetAttributesTest
>> [          ] Get default atrributes
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:15
>> 4
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:15
>> 4
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4294967295
>> Expected: outputAttributes[i].value
>> Which is: 0
>> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDSVMRangeTest.cpp:15
>> 2
>> : Failure Value of: expectedDefaultResults[i]
>>   Actual: 4
>> Expected: outputAttributes[i].type
>> Which is: 2
>> [          ] Setting/Getting atrributes
>> [  FAILED  ]
>>
>> the root cause is that svm work queue has not finished when 
>> svm_range_get_attr is called, thus some garbage svm interval tree 
>> data make svm_range_get_attr get wrong result. Flush work queue before iterate svm interval tree.
>>
>> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
>> ---
>>  drivers/gpu/drm/amd/amdkfd/kfd_svm.c | 3 +++
>>  1 file changed, 3 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> index f811a3a24cd2..192e9401bed5 100644
>> --- a/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_svm.c
>> @@ -3072,6 +3072,9 @@ svm_range_get_attr(struct kfd_process *p, uint64_t start, uint64_t size,
>>  	pr_debug("svms 0x%p [0x%llx 0x%llx] nattr 0x%x\n", &p->svms, start,
>>  		 start + size - 1, nattr);
>>  
>> +	/* flush pending deferred work */
>> +	flush_work(&p->svms.deferred_list_work);
>> +
> There is still a race condition here. More work can be added to the deferred_list_work after the flush call.
>
> Work gets added to the deferred_list asynchronously, for example in MMU notifiers. Trying to synchronize with asynchronous events is inherently problematic. It appears that the test is making some assumptions about things happening asynchronously (page faults or MMU notifiers) and that's probably a problem with the test, not with the driver.
>
> Alternatively, there may be a problem with a set-attribute call that leaves some operations on the deferred list and results in unexpected get-attribute results. If that's the problem, we may need to add a flush-call to the end of the set-attributes function.
>
> Can you provide more details about the exact sequence of set-attribute and get-attribute calls that is causing the problem?
>
> Regards,
>   Felix
>
>
>>  	mmap_read_lock(mm);
>>  	r = svm_range_is_valid(p, start, size);
>>  	mmap_read_unlock(mm);

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2021-08-14 10:18 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-10  4:43 [PATCH] drm/amdkfd: fix random KFDSVMRangeTest.SetGetAttributesTest test failure Yifan Zhang
2021-08-10 15:56 ` Felix Kuehling
2021-08-12 15:49   ` Zhang, Yifan
2021-08-12 20:22     ` Felix Kuehling
2021-08-14 10:17       ` Zhang, Yifan

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.