amd-gfx.lists.freedesktop.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails
@ 2021-08-14 10:12 Yifan Zhang
  2021-08-17  1:15 ` Felix Kuehling
  0 siblings, 1 reply; 3+ messages in thread
From: Yifan Zhang @ 2021-08-14 10:12 UTC (permalink / raw)
  To: amd-gfx; +Cc: Felix.Kuehling, Yifan Zhang

[ RUN      ] KFDSVMRangeTest.PartialUnmapSysMemTest
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: Failure
Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf))
  Actual: 1
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: Failure
Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node))
  Actual: 1
Expected: HSAKMT_STATUS_SUCCESS
Which is: 0
/home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: Failure
Expected: ((void *)__null) != (ptr), actual: NULL vs NULL
Segmentation fault (core dumped)
[          ] Profile: Full Test
[          ] HW capabilities: 0x9

kernel log:

[  102.029150]  ret_from_fork+0x22/0x30
[  102.029158] ---[ end trace 15c34e782714f9a3 ]---
[ 3613.603598] amdgpu: Address: 0x7f7149ccc000 already allocated by SVM
[ 3613.610620] show_signal_msg: 27 callbacks suppressed

These is race with deferred actions from previous memory map
changes (e.g. munmap).Flush pending deffered work to avoid such case.

Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
---
 drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
index 3177c4a0e753..982bf538dc3d 100644
--- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
+++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
@@ -1261,6 +1261,13 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
 		return -EINVAL;
 
 #if IS_ENABLED(CONFIG_HSA_AMD_SVM)
+	/* Flush pending deferred work to avoid racing with deferred actions from
+	 * previous memory map changes (e.g. munmap). Concurrent memory map changes
+	 * can still race with get_attr because we don't hold the mmap lock. But that
+	 * would be a race condition in the application anyway, and undefined
+	 * behaviour is acceptable in that case.
+	 */
+	flush_work(&svms->deferred_list_work);
 	mutex_lock(&svms->lock);
 	if (interval_tree_iter_first(&svms->objects,
 				     args->va_addr >> PAGE_SHIFT,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails
  2021-08-14 10:12 [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails Yifan Zhang
@ 2021-08-17  1:15 ` Felix Kuehling
  2021-08-17  9:13   ` Zhang, Yifan
  0 siblings, 1 reply; 3+ messages in thread
From: Felix Kuehling @ 2021-08-17  1:15 UTC (permalink / raw)
  To: Yifan Zhang, amd-gfx


Am 2021-08-14 um 6:12 a.m. schrieb Yifan Zhang:
> [ RUN      ] KFDSVMRangeTest.PartialUnmapSysMemTest
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: Failure
> Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf))
>   Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: Failure
> Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node))
>   Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: Failure
> Expected: ((void *)__null) != (ptr), actual: NULL vs NULL
> Segmentation fault (core dumped)
> [          ] Profile: Full Test
> [          ] HW capabilities: 0x9
>
> kernel log:
>
> [  102.029150]  ret_from_fork+0x22/0x30
> [  102.029158] ---[ end trace 15c34e782714f9a3 ]---
> [ 3613.603598] amdgpu: Address: 0x7f7149ccc000 already allocated by SVM
> [ 3613.610620] show_signal_msg: 27 callbacks suppressed
>
> These is race with deferred actions from previous memory map
> changes (e.g. munmap).Flush pending deffered work to avoid such case.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 3177c4a0e753..982bf538dc3d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -1261,6 +1261,13 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
>  		return -EINVAL;
>  
>  #if IS_ENABLED(CONFIG_HSA_AMD_SVM)
> +	/* Flush pending deferred work to avoid racing with deferred actions from
> +	 * previous memory map changes (e.g. munmap). Concurrent memory map changes
> +	 * can still race with get_attr because we don't hold the mmap lock. But that

This comment would need to be updated. This is not get_attr. Whether or
not undefined behaviour is acceptable in this case is a different
question from get_attr. In the get_attr case, a race is caused by a
broken application and causes potentially incorrect results being
reported to user mode. Garbage-in ==> garbage-out. No problem.

In the case of this race here, the cause is still a broken application.
But this check is here to catch broken applications and prevent them
from mapping the same virtual address range with two different
allocators. So I'd argue that the race condition is not acceptable here
because it has consequences for VM mappings managed by the kernel mode
driver.


> +	 * would be a race condition in the application anyway, and undefined
> +	 * behaviour is acceptable in that case.
> +	 */
> +	flush_work(&svms->deferred_list_work);
>  	mutex_lock(&svms->lock);

This still leaves a brief race. There is an easy way to fix that: Use
svm_range_list_lock_and_flush_work like this:

	svm_range_list_lock_and_flush_work(svms, current->mm);
	mutex_lock(&svms->lock);
	mmap_write_unlock(current->mm);
	...

Regards,
  Felix


>  	if (interval_tree_iter_first(&svms->objects,
>  				     args->va_addr >> PAGE_SHIFT,

^ permalink raw reply	[flat|nested] 3+ messages in thread

* RE: [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails
  2021-08-17  1:15 ` Felix Kuehling
@ 2021-08-17  9:13   ` Zhang, Yifan
  0 siblings, 0 replies; 3+ messages in thread
From: Zhang, Yifan @ 2021-08-17  9:13 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx

[AMD Official Use Only]

Hi Felix, 

Thanks for comments. I will send Patch V2.

BRs,
Yifan

-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Tuesday, August 17, 2021 9:16 AM
To: Zhang, Yifan <Yifan1.Zhang@amd.com>; amd-gfx@lists.freedesktop.org
Subject: Re: [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails


Am 2021-08-14 um 6:12 a.m. schrieb Yifan Zhang:
> [ RUN      ] KFDSVMRangeTest.PartialUnmapSysMemTest
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:245: 
> Failure Value of: (hsaKmtAllocMemory(m_Node, m_Size, m_Flags, &m_pBuf))
>   Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:248: 
> Failure Value of: (hsaKmtMapMemoryToGPUNodes(m_pBuf, m_Size, __null, mapFlags, 1, &m_Node))
>   Actual: 1
> Expected: HSAKMT_STATUS_SUCCESS
> Which is: 0
> /home/yifan/brahma/libhsakmt/tests/kfdtest/src/KFDTestUtil.cpp:306: 
> Failure
> Expected: ((void *)__null) != (ptr), actual: NULL vs NULL Segmentation 
> fault (core dumped)
> [          ] Profile: Full Test
> [          ] HW capabilities: 0x9
>
> kernel log:
>
> [  102.029150]  ret_from_fork+0x22/0x30 [  102.029158] ---[ end trace 
> 15c34e782714f9a3 ]--- [ 3613.603598] amdgpu: Address: 0x7f7149ccc000 
> already allocated by SVM [ 3613.610620] show_signal_msg: 27 callbacks 
> suppressed
>
> These is race with deferred actions from previous memory map changes 
> (e.g. munmap).Flush pending deffered work to avoid such case.
>
> Signed-off-by: Yifan Zhang <yifan1.zhang@amd.com>
> ---
>  drivers/gpu/drm/amd/amdkfd/kfd_chardev.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c 
> b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> index 3177c4a0e753..982bf538dc3d 100644
> --- a/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> +++ b/drivers/gpu/drm/amd/amdkfd/kfd_chardev.c
> @@ -1261,6 +1261,13 @@ static int kfd_ioctl_alloc_memory_of_gpu(struct file *filep,
>  		return -EINVAL;
>  
>  #if IS_ENABLED(CONFIG_HSA_AMD_SVM)
> +	/* Flush pending deferred work to avoid racing with deferred actions from
> +	 * previous memory map changes (e.g. munmap). Concurrent memory map changes
> +	 * can still race with get_attr because we don't hold the mmap lock. 
> +But that

This comment would need to be updated. This is not get_attr. Whether or not undefined behaviour is acceptable in this case is a different question from get_attr. In the get_attr case, a race is caused by a broken application and causes potentially incorrect results being reported to user mode. Garbage-in ==> garbage-out. No problem.

In the case of this race here, the cause is still a broken application.
But this check is here to catch broken applications and prevent them from mapping the same virtual address range with two different allocators. So I'd argue that the race condition is not acceptable here because it has consequences for VM mappings managed by the kernel mode driver.


> +	 * would be a race condition in the application anyway, and undefined
> +	 * behaviour is acceptable in that case.
> +	 */
> +	flush_work(&svms->deferred_list_work);
>  	mutex_lock(&svms->lock);

This still leaves a brief race. There is an easy way to fix that: Use svm_range_list_lock_and_flush_work like this:

	svm_range_list_lock_and_flush_work(svms, current->mm);
	mutex_lock(&svms->lock);
	mmap_write_unlock(current->mm);
	...

Regards,
  Felix


>  	if (interval_tree_iter_first(&svms->objects,
>  				     args->va_addr >> PAGE_SHIFT,

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2021-08-17  9:13 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-08-14 10:12 [PATCH] drm/amdkfd: fix KFDSVMRangeTest.PartialUnmapSysMemTest fails Yifan Zhang
2021-08-17  1:15 ` Felix Kuehling
2021-08-17  9:13   ` Zhang, Yifan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).