dri-devel.lists.freedesktop.org archive mirror
* [PATCH] drm/amdgpu: Fix recursive locking warning
From: Rajneesh Bhardwaj @ 2022-02-04  3:11 UTC
  To: amd-gfx
  Cc: Alex Deucher, Felix Kuehling, Rajneesh Bhardwaj, dri-devel,
	Christian König

Noticed the warning below while running a PyTorch workload on Vega10
GPUs. Switch to dma_resv_trylock() to avoid conflicts with reservation
locks that are already held.

[  +0.000003] WARNING: possible recursive locking detected
[  +0.000003] 5.13.0-kfd-rajneesh #1030 Not tainted
[  +0.000004] --------------------------------------------
[  +0.000002] python/4822 is trying to acquire lock:
[  +0.000004] ffff932cd9a259f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000203]
              but task is already holding lock:
[  +0.000003] ffff932cbb7181f8 (reservation_ww_class_mutex){+.+.}-{3:3},
at: ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000017]
              other info that might help us debug this:
[  +0.000002]  Possible unsafe locking scenario:

[  +0.000003]        CPU0
[  +0.000002]        ----
[  +0.000002]   lock(reservation_ww_class_mutex);
[  +0.000004]   lock(reservation_ww_class_mutex);
[  +0.000003]
               *** DEADLOCK ***

[  +0.000002]  May be due to missing lock nesting notation

[  +0.000003] 7 locks held by python/4822:
[  +0.000003]  #0: ffff932c4ac028d0 (&process->mutex){+.+.}-{3:3}, at:
kfd_ioctl_map_memory_to_gpu+0x10b/0x320 [amdgpu]
[  +0.000232]  #1: ffff932c55e830a8 (&info->lock#2){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x64/0xf60 [amdgpu]
[  +0.000241]  #2: ffff932cc45b5e68 (&(*mem)->lock){+.+.}-{3:3}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0xdf/0xf60 [amdgpu]
[  +0.000236]  #3: ffffb2b35606fd28
(reservation_ww_class_acquire){+.+.}-{0:0}, at:
amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x232/0xf60 [amdgpu]
[  +0.000235]  #4: ffff932cbb7181f8
(reservation_ww_class_mutex){+.+.}-{3:3}, at:
ttm_eu_reserve_buffers+0x270/0x470 [ttm]
[  +0.000015]  #5: ffffffffc045f700 (*(sspp++)){....}-{0:0}, at:
drm_dev_enter+0x5/0xa0 [drm]
[  +0.000038]  #6: ffff932c52da7078 (&vm->eviction_lock){+.+.}-{3:3},
at: amdgpu_vm_bo_update_mapping+0xd5/0x4f0 [amdgpu]
[  +0.000195]
              stack backtrace:
[  +0.000003] CPU: 11 PID: 4822 Comm: python Not tainted
5.13.0-kfd-rajneesh #1030
[  +0.000005] Hardware name: GIGABYTE MZ01-CE0-00/MZ01-CE0-00, BIOS F02
08/29/2018
[  +0.000003] Call Trace:
[  +0.000003]  dump_stack+0x6d/0x89
[  +0.000010]  __lock_acquire+0xb93/0x1a90
[  +0.000009]  lock_acquire+0x25d/0x2d0
[  +0.000005]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000184]  __ww_mutex_lock.constprop.17+0xca/0x1060
[  +0.000007]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  ? lock_release+0x13f/0x270
[  +0.000005]  ? lock_is_held_type+0xa2/0x110
[  +0.000006]  ? amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000183]  amdgpu_bo_release_notify+0xc4/0x160 [amdgpu]
[  +0.000185]  ttm_bo_release+0x4c6/0x580 [ttm]
[  +0.000010]  amdgpu_bo_unref+0x1a/0x30 [amdgpu]
[  +0.000183]  amdgpu_vm_free_table+0x76/0xa0 [amdgpu]
[  +0.000189]  amdgpu_vm_free_pts+0xb8/0xf0 [amdgpu]
[  +0.000189]  amdgpu_vm_update_ptes+0x411/0x770 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update_mapping+0x324/0x4f0 [amdgpu]
[  +0.000191]  amdgpu_vm_bo_update+0x251/0x610 [amdgpu]
[  +0.000191]  update_gpuvm_pte+0xcc/0x290 [amdgpu]
[  +0.000229]  ? amdgpu_vm_bo_map+0xd7/0x130 [amdgpu]
[  +0.000190]  amdgpu_amdkfd_gpuvm_map_memory_to_gpu+0x912/0xf60
[amdgpu]
[  +0.000234]  kfd_ioctl_map_memory_to_gpu+0x182/0x320 [amdgpu]
[  +0.000218]  kfd_ioctl+0x2b9/0x600 [amdgpu]
[  +0.000216]  ? kfd_ioctl_unmap_memory_from_gpu+0x270/0x270 [amdgpu]
[  +0.000216]  ? lock_release+0x13f/0x270
[  +0.000006]  ? __fget_files+0x107/0x1e0
[  +0.000007]  __x64_sys_ioctl+0x8b/0xd0
[  +0.000007]  do_syscall_64+0x36/0x70
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  +0.000007] RIP: 0033:0x7fbff90a7317
[  +0.000004] Code: b3 66 90 48 8b 05 71 4b 2d 00 64 c7 00 26 00 00 00
48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f
05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 41 4b 2d 00 f7 d8 64 89 01 48
[  +0.000005] RSP: 002b:00007fbe301fe648 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  +0.000006] RAX: ffffffffffffffda RBX: 00007fbcc402d820 RCX:
00007fbff90a7317
[  +0.000003] RDX: 00007fbe301fe690 RSI: 00000000c0184b18 RDI:
0000000000000004
[  +0.000003] RBP: 00007fbe301fe690 R08: 0000000000000000 R09:
00007fbcc402d880
[  +0.000003] R10: 0000000002001000 R11: 0000000000000246 R12:
00000000c0184b18
[  +0.000003] R13: 0000000000000004 R14: 00007fbf689593a0 R15:
00007fbcc402d820

Cc: Christian König <christian.koenig@amd.com>
Cc: Felix Kuehling <Felix.Kuehling@amd.com>
Cc: Alex Deucher <Alexander.Deucher@amd.com>

Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is enabled")
Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_object.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
index 36bb41b027ec..6ccd2be685f5 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c
@@ -1306,7 +1306,8 @@ void amdgpu_bo_release_notify(struct ttm_buffer_object *bo)
 	    !(abo->flags & AMDGPU_GEM_CREATE_VRAM_WIPE_ON_RELEASE))
 		return;
 
-	dma_resv_lock(bo->base.resv, NULL);
+	if (WARN_ON_ONCE(!dma_resv_trylock(bo->base.resv)))
+		return;
 
 	r = amdgpu_fill_buffer(abo, AMDGPU_POISON, bo->base.resv, &fence);
 	if (!WARN_ON(r)) {
-- 
2.17.1
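
To illustrate the change above outside the kernel, here is a minimal
userspace sketch of the trylock-and-bail pattern. It is an analogy, not the
driver code: `struct fake_bo`, `fake_bo_init()`, `fake_bo_release_notify()`
and `POISON_BYTE` are hypothetical names, a pthread mutex stands in for the
reservation lock, and memset() stands in for amdgpu_fill_buffer(). The idea
it tries to show is that a release callback which may run while the caller
already holds the lock should make one non-blocking attempt and skip the
wipe if that fails, rather than sleep on the lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for AMDGPU_POISON. */
#define POISON_BYTE 0x5a

/* Hypothetical buffer object: a lock plus the memory it protects. */
struct fake_bo {
	pthread_mutex_t resv;     /* stands in for bo->base.resv */
	unsigned char data[16];   /* stands in for the VRAM backing store */
};

static void fake_bo_init(struct fake_bo *bo)
{
	assert(bo != NULL);
	pthread_mutex_init(&bo->resv, NULL);
	memset(bo->data, 0, sizeof(bo->data));
}

/*
 * Release callback. Returns true if the buffer was wiped, false if the
 * lock was busy and the wipe was skipped. It never blocks.
 */
static bool fake_bo_release_notify(struct fake_bo *bo)
{
	/* pthread_mutex_trylock() returns 0 on success, EBUSY if locked. */
	if (pthread_mutex_trylock(&bo->resv) != 0)
		return false;

	memset(bo->data, POISON_BYTE, sizeof(bo->data));
	pthread_mutex_unlock(&bo->resv);
	return true;
}
```

With the lock free, the callback wipes the buffer and returns true. With the
lock already held, it returns false and leaves the data untouched. The patch
additionally wraps the failed attempt in WARN_ON_ONCE(), which this sketch
leaves out.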



* Re: [PATCH] drm/amdgpu: Fix recursive locking warning
From: Christian König @ 2022-02-04  7:13 UTC
  To: Rajneesh Bhardwaj, amd-gfx; +Cc: Alex Deucher, Felix Kuehling, dri-devel

On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
> Noticed the below warning while running a pytorch workload on vega10
> GPUs. Change to trylock to avoid conflicts with already held reservation
> locks.
>
> [...]
>
> Cc: Christian König <christian.koenig@amd.com>
> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>
> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
> enabled")
> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>

The Fixes tag is not necessarily correct; I would remove it.

Apart from that, the patch is
Reviewed-by: Christian König <christian.koenig@amd.com>

Thanks,
Christian.




* Re: [PATCH] drm/amdgpu: Fix recursive locking warning
From: Felix Kuehling @ 2022-02-04 16:23 UTC
  To: Christian König, Rajneesh Bhardwaj, amd-gfx; +Cc: Alex Deucher, dri-devel


On 2022-02-04 at 02:13, Christian König wrote:
> On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
>> Noticed the below warning while running a pytorch workload on vega10
>> GPUs. Change to trylock to avoid conflicts with already held reservation
>> locks.
>>
>> [...]
>>
>> Cc: Christian König <christian.koenig@amd.com>
>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>>
>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>> enabled")
>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>
> The fixes tag is not necessarily correct, I would remove that.
>
> But apart from that the patch is Reviewed-by: Christian König 
> <christian.koenig@amd.com>.

I suggested the Fixes tag since it was my patch that introduced the
problem. Without my patch, page table BOs wouldn't be cleared here, and
we wouldn't get that recursive locking warning.

Either way, the patch is also

Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com>




* Re: [PATCH] drm/amdgpu: Fix recursive locking warning
From: Christian König @ 2022-02-04 16:25 UTC
  To: Felix Kuehling, Rajneesh Bhardwaj, amd-gfx; +Cc: Alex Deucher, dri-devel

On 04.02.22 at 17:23, Felix Kuehling wrote:
>
> On 2022-02-04 at 02:13, Christian König wrote:
>> On 04.02.22 at 04:11, Rajneesh Bhardwaj wrote:
>>> Noticed the below warning while running a pytorch workload on vega10
>>> GPUs. Change to trylock to avoid conflicts with already held reservation
>>> locks.
>>>
>>> [...]
>>>
>>> Cc: Christian König <christian.koenig@amd.com>
>>> Cc: Felix Kuehling <Felix.Kuehling@amd.com>
>>> Cc: Alex Deucher <Alexander.Deucher@amd.com>
>>>
>>> Fixes: 627b92ef9d7c ("drm/amdgpu: Wipe all VRAM on free when RAS is
>>> enabled")
>>> Signed-off-by: Rajneesh Bhardwaj <rajneesh.bhardwaj@amd.com>
>>
>> The fixes tag is not necessarily correct, I would remove that.
>>
>> But apart from that the patch is Reviewed-by: Christian König 
>> <christian.koenig@amd.com>.
>
> I suggested the Fixes tag since it was my patch that introduced the 
> problem. Without my patch, page table BOs wouldn't be cleared here, 
> and it wouldn't get that recursive lock warning.

Yeah, but the problem existed before that. E.g. it can happen that we 
drop the last reference during validation as well.

So this is valuable to backport even without your patch.

Regards,
Christian.
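
The lockdep report at the top of the thread lists two different lock
addresses of the same class: ffff932cd9a259f8 being acquired and
ffff932cbb7181f8 already held. A minimal userspace sketch of that
situation follows, assuming the freed object carries its own lock. All
names (`struct sketch_bo`, `sketch_bo_unref()`, `sketch_bo_release()`,
`sketch_update_under_lock()`) are hypothetical, and pthread mutexes stand in
for reservation locks. It is meant only to show that dropping the last
reference runs the release callback in the caller's context, whatever locks
the caller holds at that point, and that a trylock on the object's own lock
cannot block there.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical refcounted object with its own lock. */
struct sketch_bo {
	pthread_mutex_t resv;   /* stands in for the per-BO reservation lock */
	int refcount;
	bool wiped;             /* set if the release path got the lock */
	bool released;          /* set once the release callback has run */
};

static void sketch_bo_init(struct sketch_bo *bo)
{
	pthread_mutex_init(&bo->resv, NULL);
	bo->refcount = 1;
	bo->wiped = false;
	bo->released = false;
}

/* Release callback: one non-blocking attempt at the object's own lock. */
static void sketch_bo_release(struct sketch_bo *bo)
{
	if (pthread_mutex_trylock(&bo->resv) == 0) {
		bo->wiped = true;
		pthread_mutex_unlock(&bo->resv);
	}
	bo->released = true;
}

/* Dropping the last reference calls the release callback synchronously. */
static void sketch_bo_unref(struct sketch_bo *bo)
{
	assert(bo->refcount > 0);
	if (--bo->refcount == 0)
		sketch_bo_release(bo);
}

/*
 * Hypothetical caller: holds the lock of `held` while it drops its
 * reference on `table`, so the release of `table` runs under that lock.
 */
static void sketch_update_under_lock(struct sketch_bo *held,
				     struct sketch_bo *table)
{
	pthread_mutex_lock(&held->resv);
	sketch_bo_unref(table);
	pthread_mutex_unlock(&held->resv);
}
```

In this sketch the two locks are distinct instances, so the trylock on
`table` succeeds even though the lock of `held` is taken. If the release
path instead met a lock that was already taken, it would skip the wipe
rather than wait.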



