* [PATCH] drm/amdgpu: fix a potential circular locking dependency
@ 2020-08-11  9:32 Dennis Li
  2020-08-11 13:57 ` Felix Kuehling
  0 siblings, 1 reply; 15+ messages in thread
From: Dennis Li @ 2020-08-11  9:32 UTC (permalink / raw)
  To: amd-gfx, Alexander.Deucher, felix.kuehling, Hawking.Zhang,
	christian.koenig
  Cc: Dennis Li

[  653.902305] ======================================================
[  653.902928] WARNING: possible circular locking dependency detected
[  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
[  653.904098] ------------------------------------------------------
[  653.904675] amdgpu_test/3975 is trying to acquire lock:
[  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.905953]
               but task is already holding lock:
[  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
[  653.907694]
               which lock already depends on the new lock.

[  653.909423]
               the existing dependency chain (in reverse order) is:
[  653.910594]
               -> #1 (reservation_ww_class_mutex){+.+.}:
[  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
[  653.912350]        ww_mutex_lock+0x73/0x80
[  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
[  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
[  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
[  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
[  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
[  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
[  653.916959]        local_pci_probe+0x47/0xa0
[  653.917570]        work_for_cpu_fn+0x1a/0x30
[  653.918184]        process_one_work+0x29e/0x630
[  653.918803]        worker_thread+0x22b/0x3f0
[  653.919427]        kthread+0x12f/0x150
[  653.920047]        ret_from_fork+0x3a/0x50
[  653.920661]
               -> #0 (&adev->reset_sem){.+.+}:
[  653.921893]        __lock_acquire+0x13ec/0x16e0
[  653.922531]        lock_acquire+0xb8/0x1c0
[  653.923174]        down_read+0x48/0x230
[  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
[  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
[  653.925283]        drm_ioctl+0x389/0x450 [drm]
[  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
[  653.926686]        ksys_ioctl+0x98/0xb0
[  653.927357]        __x64_sys_ioctl+0x1a/0x20
[  653.928030]        do_syscall_64+0x5f/0x250
[  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  653.929373]
               other info that might help us debug this:

[  653.931356]  Possible unsafe locking scenario:

[  653.932647]        CPU0                    CPU1
[  653.933287]        ----                    ----
[  653.933911]   lock(reservation_ww_class_mutex);
[  653.934530]                                lock(&adev->reset_sem);
[  653.935154]                                lock(reservation_ww_class_mutex);
[  653.935766]   lock(&adev->reset_sem);
[  653.936360]
                *** DEADLOCK ***

[  653.938028] 2 locks held by amdgpu_test/3975:
[  653.938574]  #0: ffffb2a862d6bcd0 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu]
[  653.939233]  #1: ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]

Change the locking order of reservation_ww_class_mutex and adev->reset_sem
in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem,
to avoid a potential deadlock.
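
The resulting flow takes the two locks in this order (a simplified
sketch of the code after this patch, not the literal function):

	down_read(&adev->reset_sem);                   /* reset lock first */
	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
	/* ... perform the requested VA operation ... */
	ttm_eu_backoff_reservation(&ticket, &list);    /* drop reservation */
	up_read(&adev->reset_sem);                     /* reset lock last  */

which matches the reset_sem -> reservation_ww_class_mutex order used in
amdgpu_amdkfd_alloc_gtt_mem.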

Signed-off-by: Dennis Li <Dennis.Li@amd.com>

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
index ee1e8fff83b2..fc889c477696 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
@@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		abo = NULL;
 	}
 
+	down_read(&adev->reset_sem);
+
 	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
 
 	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
@@ -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		bo_va = NULL;
 	}
 
-	down_read(&adev->reset_sem);
-
 	switch (args->operation) {
 	case AMDGPU_VA_OP_MAP:
 		va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
@@ -701,12 +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
 		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
 					args->operation);
 
-	up_read(&adev->reset_sem);
-
 error_backoff:
 	ttm_eu_backoff_reservation(&ticket, &list);
 
 error_unref:
+	up_read(&adev->reset_sem);
 	drm_gem_object_put_unlocked(gobj);
 	return r;
 }
-- 
2.17.1

_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-11  9:32 [PATCH] drm/amdgpu: fix a potential circular locking dependency Dennis Li
@ 2020-08-11 13:57 ` Felix Kuehling
  2020-08-11 16:14   ` Christian König
  2020-08-12  1:19   ` Li, Dennis
  0 siblings, 2 replies; 15+ messages in thread
From: Felix Kuehling @ 2020-08-11 13:57 UTC (permalink / raw)
  To: Dennis Li, amd-gfx, Alexander.Deucher, Hawking.Zhang, christian.koenig

On 2020-08-11 5:32 a.m., Dennis Li wrote:
> [  653.902305] ======================================================
> [  653.902928] WARNING: possible circular locking dependency detected
> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
> [  653.904098] ------------------------------------------------------
> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
> [  653.905953]
>                but task is already holding lock:
> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
> [  653.907694]
>                which lock already depends on the new lock.
>
> [  653.909423]
>                the existing dependency chain (in reverse order) is:
> [  653.910594]
>                -> #1 (reservation_ww_class_mutex){+.+.}:
> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
> [  653.912350]        ww_mutex_lock+0x73/0x80
> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
> [  653.916959]        local_pci_probe+0x47/0xa0
> [  653.917570]        work_for_cpu_fn+0x1a/0x30
> [  653.918184]        process_one_work+0x29e/0x630
> [  653.918803]        worker_thread+0x22b/0x3f0
> [  653.919427]        kthread+0x12f/0x150
> [  653.920047]        ret_from_fork+0x3a/0x50
> [  653.920661]
>                -> #0 (&adev->reset_sem){.+.+}:
> [  653.921893]        __lock_acquire+0x13ec/0x16e0
> [  653.922531]        lock_acquire+0xb8/0x1c0
> [  653.923174]        down_read+0x48/0x230
> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
> [  653.926686]        ksys_ioctl+0x98/0xb0
> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
> [  653.928030]        do_syscall_64+0x5f/0x250
> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  653.929373]
>                other info that might help us debug this:
>
> [  653.931356]  Possible unsafe locking scenario:
>
> [  653.932647]        CPU0                    CPU1
> [  653.933287]        ----                    ----
> [  653.933911]   lock(reservation_ww_class_mutex);
> [  653.934530]                                lock(&adev->reset_sem);
> [  653.935154]                                lock(reservation_ww_class_mutex);
> [  653.935766]   lock(&adev->reset_sem);
> [  653.936360]
>                 *** DEADLOCK ***
>
> [  653.938028] 2 locks held by amdgpu_test/3975:
> [  653.938574]  #0: ffffb2a862d6bcd0 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu]
> [  653.939233]  #1: ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>
> Change the locking order of reservation_ww_class_mutex and adev->reset_sem
> in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem,
> to avoid a potential deadlock.

It may be better to fix it the other way around in
amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
reservation. Otherwise you will never be able to take the reset_sem
while any BOs are reserved. That's probably going to cause you other
problems later.
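
Roughly like this (only a sketch of the intended ordering, not the
actual code of amdgpu_amdkfd_alloc_gtt_mem):

	amdgpu_bo_reserve(bo, true);    /* reservation first          */
	down_read(&adev->reset_sem);    /* reset_sem taken inside it  */
	/* ... work that may touch the hardware ... */
	up_read(&adev->reset_sem);
	amdgpu_bo_unreserve(bo);        /* release in reverse order   */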

That makes me wonder, why do you need the reset_sem in
amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
hardware access in that function. Is it for amdgpu_ttm_alloc_gart
updating the GART table through HDP?

Regards,
  Felix


>
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index ee1e8fff83b2..fc889c477696 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  		abo = NULL;
>  	}
>  
> +	down_read(&adev->reset_sem);
> +
>  	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>  
>  	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
> @@ -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  		bo_va = NULL;
>  	}
>  
> -	down_read(&adev->reset_sem);
> -
>  	switch (args->operation) {
>  	case AMDGPU_VA_OP_MAP:
>  		va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
> @@ -701,12 +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>  					args->operation);
>  
> -	up_read(&adev->reset_sem);
> -
>  error_backoff:
>  	ttm_eu_backoff_reservation(&ticket, &list);
>  
>  error_unref:
> +	up_read(&adev->reset_sem);
>  	drm_gem_object_put_unlocked(gobj);
>  	return r;
>  }

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-11 13:57 ` Felix Kuehling
@ 2020-08-11 16:14   ` Christian König
  2020-08-12  1:33     ` Li, Dennis
  2020-08-12  1:19   ` Li, Dennis
  1 sibling, 1 reply; 15+ messages in thread
From: Christian König @ 2020-08-11 16:14 UTC (permalink / raw)
  To: Felix Kuehling, Dennis Li, amd-gfx, Alexander.Deucher, Hawking.Zhang

On 11.08.20 15:57, Felix Kuehling wrote:
> On 2020-08-11 5:32 a.m., Dennis Li wrote:
>> [  653.902305] ======================================================
>> [  653.902928] WARNING: possible circular locking dependency detected
>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>> [  653.904098] ------------------------------------------------------
>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>> [  653.905953]
>>                 but task is already holding lock:
>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>> [  653.907694]
>>                 which lock already depends on the new lock.
>>
>> [  653.909423]
>>                 the existing dependency chain (in reverse order) is:
>> [  653.910594]
>>                 -> #1 (reservation_ww_class_mutex){+.+.}:
>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>> [  653.912350]        ww_mutex_lock+0x73/0x80
>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>> [  653.916959]        local_pci_probe+0x47/0xa0
>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>> [  653.918184]        process_one_work+0x29e/0x630
>> [  653.918803]        worker_thread+0x22b/0x3f0
>> [  653.919427]        kthread+0x12f/0x150
>> [  653.920047]        ret_from_fork+0x3a/0x50
>> [  653.920661]
>>                 -> #0 (&adev->reset_sem){.+.+}:
>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>> [  653.922531]        lock_acquire+0xb8/0x1c0
>> [  653.923174]        down_read+0x48/0x230
>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>> [  653.926686]        ksys_ioctl+0x98/0xb0
>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>> [  653.928030]        do_syscall_64+0x5f/0x250
>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  653.929373]
>>                 other info that might help us debug this:
>>
>> [  653.931356]  Possible unsafe locking scenario:
>>
>> [  653.932647]        CPU0                    CPU1
>> [  653.933287]        ----                    ----
>> [  653.933911]   lock(reservation_ww_class_mutex);
>> [  653.934530]                                lock(&adev->reset_sem);
>> [  653.935154]                                lock(reservation_ww_class_mutex);
>> [  653.935766]   lock(&adev->reset_sem);
>> [  653.936360]
>>                  *** DEADLOCK ***
>>
>> [  653.938028] 2 locks held by amdgpu_test/3975:
>> [  653.938574]  #0: ffffb2a862d6bcd0 (reservation_ww_class_acquire){+.+.}, at: amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu]
>> [  653.939233]  #1: ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>
>> Change the locking order of reservation_ww_class_mutex and adev->reset_sem
>> in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem,
>> to avoid a potential deadlock.
> It may be better to fix it the other way around in
> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
> reservation. Otherwise you will never be able to take the reset_sem
> while any BOs are reserved. That's probably going to cause you other
> problems later.
>
> That makes me wonder, why do you need the reset_sem in
> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
> updating the GART table through HDP?

I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.

We shouldn't have any hardware access here, so taking the reset_sem 
looks like overkill to me.

Christian.

>
> Regards,
>    Felix
>
>
>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index ee1e8fff83b2..fc889c477696 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		abo = NULL;
>>   	}
>>   
>> +	down_read(&adev->reset_sem);
>> +
>>   	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>   
>>   	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
>> @@ -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		bo_va = NULL;
>>   	}
>>   
>> -	down_read(&adev->reset_sem);
>> -
>>   	switch (args->operation) {
>>   	case AMDGPU_VA_OP_MAP:
>>   		va_flags = amdgpu_gem_va_map_flags(adev, args->flags);
>> @@ -701,12 +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>   					args->operation);
>>   
>> -	up_read(&adev->reset_sem);
>> -
>>   error_backoff:
>>   	ttm_eu_backoff_reservation(&ticket, &list);
>>   
>>   error_unref:
>> +	up_read(&adev->reset_sem);
>>   	drm_gem_object_put_unlocked(gobj);
>>   	return r;
>>   }


* RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-11 13:57 ` Felix Kuehling
  2020-08-11 16:14   ` Christian König
@ 2020-08-12  1:19   ` Li, Dennis
  2020-08-12  8:53     ` Christian König
  1 sibling, 1 reply; 15+ messages in thread
From: Li, Dennis @ 2020-08-12  1:19 UTC (permalink / raw)
  To: Kuehling, Felix, amd-gfx, Deucher, Alexander, Zhang, Hawking,
	Koenig, Christian

Hi, Felix,

Re: It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later.
[Dennis Li] Thanks for spotting the potential issue; I will change it in version 2.

Re: That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP?
[Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had considered protecting only amdgpu_ttm_alloc_gart before. But I worry that other functions will access the hardware in the future. Therefore I chose an aggressive approach that takes reset_sem at the beginning of the entry functions of the amdgpu driver.

Best Regards
Dennis Li
-----Original Message-----
From: Kuehling, Felix <Felix.Kuehling@amd.com> 
Sent: Tuesday, August 11, 2020 9:57 PM
To: Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

On 2020-08-11 5:32 a.m., Dennis Li wrote:
> [  653.902305] ======================================================
> [  653.902928] WARNING: possible circular locking dependency detected
> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
> [  653.904098] ------------------------------------------------------
> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: 
> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>                but task is already holding lock:
> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, 
> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>                which lock already depends on the new lock.
>
> [  653.909423]
>                the existing dependency chain (in reverse order) is:
> [  653.910594]
>                -> #1 (reservation_ww_class_mutex){+.+.}:
> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
> [  653.912350]        ww_mutex_lock+0x73/0x80
> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
> [  653.916959]        local_pci_probe+0x47/0xa0
> [  653.917570]        work_for_cpu_fn+0x1a/0x30
> [  653.918184]        process_one_work+0x29e/0x630
> [  653.918803]        worker_thread+0x22b/0x3f0
> [  653.919427]        kthread+0x12f/0x150
> [  653.920047]        ret_from_fork+0x3a/0x50
> [  653.920661]
>                -> #0 (&adev->reset_sem){.+.+}:
> [  653.921893]        __lock_acquire+0x13ec/0x16e0
> [  653.922531]        lock_acquire+0xb8/0x1c0
> [  653.923174]        down_read+0x48/0x230
> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
> [  653.926686]        ksys_ioctl+0x98/0xb0
> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
> [  653.928030]        do_syscall_64+0x5f/0x250
> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
> [  653.929373]
>                other info that might help us debug this:
>
> [  653.931356]  Possible unsafe locking scenario:
>
> [  653.932647]        CPU0                    CPU1
> [  653.933287]        ----                    ----
> [  653.933911]   lock(reservation_ww_class_mutex);
> [  653.934530]                                lock(&adev->reset_sem);
> [  653.935154]                                lock(reservation_ww_class_mutex);
> [  653.935766]   lock(&adev->reset_sem);
> [  653.936360]
>                 *** DEADLOCK ***
>
> [  653.938028] 2 locks held by amdgpu_test/3975:
> [  653.938574]  #0: ffffb2a862d6bcd0 
> (reservation_ww_class_acquire){+.+.}, at: 
> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1: 
> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: 
> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>
> Change the locking order of reservation_ww_class_mutex and adev->reset_sem 
> in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem, 
> to avoid a potential deadlock.

It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later.

That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP?

Regards,
  Felix


>
> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> index ee1e8fff83b2..fc889c477696 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  		abo = NULL;
>  	}
>  
> +	down_read(&adev->reset_sem);
> +
>  	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>  
>  	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); @@ 
> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  		bo_va = NULL;
>  	}
>  
> -	down_read(&adev->reset_sem);
> -
>  	switch (args->operation) {
>  	case AMDGPU_VA_OP_MAP:
>  		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@ -701,12 
> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>  		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>  					args->operation);
>  
> -	up_read(&adev->reset_sem);
> -
>  error_backoff:
>  	ttm_eu_backoff_reservation(&ticket, &list);
>  
>  error_unref:
> +	up_read(&adev->reset_sem);
>  	drm_gem_object_put_unlocked(gobj);
>  	return r;
>  }

* RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-11 16:14   ` Christian König
@ 2020-08-12  1:33     ` Li, Dennis
  2020-08-12  8:56       ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Li, Dennis @ 2020-08-12  1:33 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, amd-gfx, Deucher, Alexander,
	Zhang, Hawking

Hi, Christian,

Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.

[Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access the hardware.

Best Regards
Dennis Li
-----Original Message-----
From: Koenig, Christian <Christian.Koenig@amd.com> 
Sent: Wednesday, August 12, 2020 12:15 AM
To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency

On 11.08.20 15:57, Felix Kuehling wrote:
> On 2020-08-11 5:32 a.m., Dennis Li wrote:
>> [  653.902305] ======================================================
>> [  653.902928] WARNING: possible circular locking dependency detected
>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>> [  653.904098] ------------------------------------------------------
>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at: 
>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>                 but task is already holding lock:
>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, 
>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>                 which lock already depends on the new lock.
>>
>> [  653.909423]
>>                 the existing dependency chain (in reverse order) is:
>> [  653.910594]
>>                 -> #1 (reservation_ww_class_mutex){+.+.}:
>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>> [  653.912350]        ww_mutex_lock+0x73/0x80
>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>> [  653.916959]        local_pci_probe+0x47/0xa0
>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>> [  653.918184]        process_one_work+0x29e/0x630
>> [  653.918803]        worker_thread+0x22b/0x3f0
>> [  653.919427]        kthread+0x12f/0x150
>> [  653.920047]        ret_from_fork+0x3a/0x50
>> [  653.920661]
>>                 -> #0 (&adev->reset_sem){.+.+}:
>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>> [  653.922531]        lock_acquire+0xb8/0x1c0
>> [  653.923174]        down_read+0x48/0x230
>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>> [  653.926686]        ksys_ioctl+0x98/0xb0
>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>> [  653.928030]        do_syscall_64+0x5f/0x250
>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  653.929373]
>>                 other info that might help us debug this:
>>
>> [  653.931356]  Possible unsafe locking scenario:
>>
>> [  653.932647]        CPU0                    CPU1
>> [  653.933287]        ----                    ----
>> [  653.933911]   lock(reservation_ww_class_mutex);
>> [  653.934530]                                lock(&adev->reset_sem);
>> [  653.935154]                                lock(reservation_ww_class_mutex);
>> [  653.935766]   lock(&adev->reset_sem);
>> [  653.936360]
>>                  *** DEADLOCK ***
>>
>> [  653.938028] 2 locks held by amdgpu_test/3975:
>> [  653.938574]  #0: ffffb2a862d6bcd0 
>> (reservation_ww_class_acquire){+.+.}, at: 
>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1: 
>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at: 
>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>
>> Change the locking order of reservation_ww_class_mutex and adev->reset_sem 
>> in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem, 
>> to avoid a potential deadlock.
> It may be better to fix it the other way around in 
> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the 
> reservation. Otherwise you will never be able to take the reset_sem 
> while any BOs are reserved. That's probably going to cause you other 
> problems later.
>
> That makes me wonder, why do you need the reset_sem in 
> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious 
> hardware access in that function. Is it for amdgpu_ttm_alloc_gart 
> updating the GART table through HDP?

I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.

We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.

Christian.

>
> Regards,
>    Felix
>
>
>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c 
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index ee1e8fff83b2..fc889c477696 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		abo = NULL;
>>   	}
>>   
>> +	down_read(&adev->reset_sem);
>> +
>>   	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>   
>>   	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); @@ 
>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		bo_va = NULL;
>>   	}
>>   
>> -	down_read(&adev->reset_sem);
>> -
>>   	switch (args->operation) {
>>   	case AMDGPU_VA_OP_MAP:
>>   		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@ -701,12 
>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>   					args->operation);
>>   
>> -	up_read(&adev->reset_sem);
>> -
>>   error_backoff:
>>   	ttm_eu_backoff_reservation(&ticket, &list);
>>   
>>   error_unref:
>> +	up_read(&adev->reset_sem);
>>   	drm_gem_object_put_unlocked(gobj);
>>   	return r;
>>   }

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  1:19   ` Li, Dennis
@ 2020-08-12  8:53     ` Christian König
  2020-08-12  9:35       ` Li, Dennis
  2020-08-12 15:07       ` Felix Kuehling
  0 siblings, 2 replies; 15+ messages in thread
From: Christian König @ 2020-08-12  8:53 UTC (permalink / raw)
  To: Li, Dennis, Kuehling, Felix, amd-gfx, Deucher, Alexander, Zhang, Hawking

On 12.08.20 03:19, Li, Dennis wrote:
> Hi, Felix,
>
> Re: It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later.
> [Dennis Li] Thanks that you find the potential issue, I will change it in version 2.
>
> Re: That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP?
> [Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had considered protecting only amdgpu_ttm_alloc_gart before.

That access is irrelevant and the lock should be removed or changed into 
a trylock.

See we need the HDP flush only because the hardware could have accessed 
the data before.

But after a GPU reset the HDP is known to be clean, so this doesn't need 
any protection.
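
If the flush really needs to be skipped while a reset is running,
something like this should do (a sketch only, assuming the access in
question is the HDP flush done when amdgpu_ttm_alloc_gart binds the
pages):

	if (down_read_trylock(&adev->reset_sem)) {
		amdgpu_asic_flush_hdp(adev, NULL);
		up_read(&adev->reset_sem);
	}
	/* if the trylock fails a reset is in progress and the HDP
	 * comes back clean anyway, so skipping the flush is harmless */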

>   But I worry that other functions will access the hardware in the future. Therefore I chose an aggressive approach that takes reset_sem at the beginning of the entry functions of the amdgpu driver.

This is not a good idea. We used to have such a global lock before and 
removed it because it caused all kinds of problems.

When was this added? Looks like it slipped under my radar or I wasn't 
awake enough at that moment.

Christian.

>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Tuesday, August 11, 2020 9:57 PM
> To: Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
>
> On 2020-08-11 5:32 a.m., Dennis Li wrote:
>> [  653.902305] ======================================================
>> [  653.902928] WARNING: possible circular locking dependency detected
>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>> [  653.904098] ------------------------------------------------------
>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>                 but task is already holding lock:
>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>                 which lock already depends on the new lock.
>>
>> [  653.909423]
>>                 the existing dependency chain (in reverse order) is:
>> [  653.910594]
>>                 -> #1 (reservation_ww_class_mutex){+.+.}:
>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>> [  653.912350]        ww_mutex_lock+0x73/0x80
>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>> [  653.916959]        local_pci_probe+0x47/0xa0
>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>> [  653.918184]        process_one_work+0x29e/0x630
>> [  653.918803]        worker_thread+0x22b/0x3f0
>> [  653.919427]        kthread+0x12f/0x150
>> [  653.920047]        ret_from_fork+0x3a/0x50
>> [  653.920661]
>>                 -> #0 (&adev->reset_sem){.+.+}:
>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>> [  653.922531]        lock_acquire+0xb8/0x1c0
>> [  653.923174]        down_read+0x48/0x230
>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>> [  653.926686]        ksys_ioctl+0x98/0xb0
>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>> [  653.928030]        do_syscall_64+0x5f/0x250
>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  653.929373]
>>                 other info that might help us debug this:
>>
>> [  653.931356]  Possible unsafe locking scenario:
>>
>> [  653.932647]        CPU0                    CPU1
>> [  653.933287]        ----                    ----
>> [  653.933911]   lock(reservation_ww_class_mutex);
>> [  653.934530]                                lock(&adev->reset_sem);
>> [  653.935154]                                lock(reservation_ww_class_mutex);
>> [  653.935766]   lock(&adev->reset_sem);
>> [  653.936360]
>>                  *** DEADLOCK ***
>>
>> [  653.938028] 2 locks held by amdgpu_test/3975:
>> [  653.938574]  #0: ffffb2a862d6bcd0
>> (reservation_ww_class_acquire){+.+.}, at:
>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>
>> Change the locking order of reservation_ww_class_mutex and adev->reset_sem
>> in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem,
>> to avoid a potential deadlock.
> It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later.
>
> That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP?
>
> Regards,
>    Felix
>
>
>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index ee1e8fff83b2..fc889c477696 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		abo = NULL;
>>   	}
>>   
>> +	down_read(&adev->reset_sem);
>> +
>>   	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>   
>>   	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); @@
>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		bo_va = NULL;
>>   	}
>>   
>> -	down_read(&adev->reset_sem);
>> -
>>   	switch (args->operation) {
>>   	case AMDGPU_VA_OP_MAP:
>>   		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@ -701,12
>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>   					args->operation);
>>   
>> -	up_read(&adev->reset_sem);
>> -
>>   error_backoff:
>>   	ttm_eu_backoff_reservation(&ticket, &list);
>>   
>>   error_unref:
>> +	up_read(&adev->reset_sem);
>>   	drm_gem_object_put_unlocked(gobj);
>>   	return r;
>>   }


* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  1:33     ` Li, Dennis
@ 2020-08-12  8:56       ` Christian König
  2020-08-12  9:23         ` Li, Dennis
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2020-08-12  8:56 UTC (permalink / raw)
  To: Li, Dennis, Kuehling, Felix, amd-gfx, Deucher, Alexander, Zhang, Hawking

On 12.08.20 03:33, Li, Dennis wrote:
> Hi, Christian,
>
> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>
> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access the hardware.

This is complete nonsense. The functions intentionally work through the 
scheduler to avoid accessing the hardware directly for exactly that reason.

The only hardware access we have here is the HDP flush and that can fail 
in the case of a GPU reset without causing problems.

Regards,
Christian.

>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Wednesday, August 12, 2020 12:15 AM
> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
>
> On 11.08.20 15:57, Felix Kuehling wrote:
>> On 2020-08-11 5:32 a.m., Dennis Li wrote:
>>> [  653.902305] ======================================================
>>> [  653.902928] WARNING: possible circular locking dependency detected
>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>> [  653.904098] ------------------------------------------------------
>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>                  but task is already holding lock:
>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>                  which lock already depends on the new lock.
>>>
>>> [  653.909423]
>>>                  the existing dependency chain (in reverse order) is:
>>> [  653.910594]
>>>                  -> #1 (reservation_ww_class_mutex){+.+.}:
>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>> [  653.918184]        process_one_work+0x29e/0x630
>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>> [  653.919427]        kthread+0x12f/0x150
>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>> [  653.920661]
>>>                  -> #0 (&adev->reset_sem){.+.+}:
>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>> [  653.923174]        down_read+0x48/0x230
>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [  653.929373]
>>>                  other info that might help us debug this:
>>>
>>> [  653.931356]  Possible unsafe locking scenario:
>>>
>>> [  653.932647]        CPU0                    CPU1
>>> [  653.933287]        ----                    ----
>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>> [  653.934530]                                lock(&adev->reset_sem);
>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>> [  653.935766]   lock(&adev->reset_sem);
>>> [  653.936360]
>>>                   *** DEADLOCK ***
>>>
>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>> [  653.938574]  #0: ffffb2a862d6bcd0
>>> (reservation_ww_class_acquire){+.+.}, at:
>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>
>>> Change the locking order of reservation_ww_class_mutex and adev->reset_sem
>>> in amdgpu_gem_va_ioctl to match the one in amdgpu_amdkfd_alloc_gtt_mem,
>>> to avoid a potential deadlock.
>> It may be better to fix it the other way around in
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>> reservation. Otherwise you will never be able to take the reset_sem
>> while any BOs are reserved. That's probably going to cause you other
>> problems later.
>>
>> That makes me wonder, why do you need the reset_sem in
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>> updating the GART table through HDP?
> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>
> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>
> Christian.
>
>> Regards,
>>     Felix
>>
>>
>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index ee1e8fff83b2..fc889c477696 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>    		abo = NULL;
>>>    	}
>>>    
>>> +	down_read(&adev->reset_sem);
>>> +
>>>    	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>    
>>>    	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); @@
>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>    		bo_va = NULL;
>>>    	}
>>>    
>>> -	down_read(&adev->reset_sem);
>>> -
>>>    	switch (args->operation) {
>>>    	case AMDGPU_VA_OP_MAP:
>>>    		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@ -701,12
>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>    		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>    					args->operation);
>>>    
>>> -	up_read(&adev->reset_sem);
>>> -
>>>    error_backoff:
>>>    	ttm_eu_backoff_reservation(&ticket, &list);
>>>    
>>>    error_unref:
>>> +	up_read(&adev->reset_sem);
>>>    	drm_gem_object_put_unlocked(gobj);
>>>    	return r;
>>>    }


* RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  8:56       ` Christian König
@ 2020-08-12  9:23         ` Li, Dennis
  2020-08-12  9:43           ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Li, Dennis @ 2020-08-12  9:23 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, amd-gfx, Deucher, Alexander,
	Zhang, Hawking

On 12.08.20 03:33, Li, Dennis wrote:
> Hi, Christian,
>
> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>
> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access the hardware.

This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason.

The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems.

[Dennis Li] For example: amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt

Regards,
Christian.

>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Koenig, Christian <Christian.Koenig@amd.com>
> Sent: Wednesday, August 12, 2020 12:15 AM
> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis 
> <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander 
> <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
> dependency
>
> Am 11.08.20 um 15:57 schrieb Felix Kuehling:
>> Am 2020-08-11 um 5:32 a.m. schrieb Dennis Li:
>>> [  653.902305] 
>>> ======================================================
>>> [  653.902928] WARNING: possible circular locking dependency detected
>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>> [  653.904098] 
>>> ------------------------------------------------------
>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>                  but task is already holding lock:
>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>                  which lock already depends on the new lock.
>>>
>>> [  653.909423]
>>>                  the existing dependency chain (in reverse order) is:
>>> [  653.910594]
>>>                  -> #1 (reservation_ww_class_mutex){+.+.}:
>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>> [  653.918184]        process_one_work+0x29e/0x630
>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>> [  653.919427]        kthread+0x12f/0x150
>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>> [  653.920661]
>>>                  -> #0 (&adev->reset_sem){.+.+}:
>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>> [  653.923174]        down_read+0x48/0x230
>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [  653.929373]
>>>                  other info that might help us debug this:
>>>
>>> [  653.931356]  Possible unsafe locking scenario:
>>>
>>> [  653.932647]        CPU0                    CPU1
>>> [  653.933287]        ----                    ----
>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>> [  653.934530]                                lock(&adev->reset_sem);
>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>> [  653.935766]   lock(&adev->reset_sem);
>>> [  653.936360]
>>>                   *** DEADLOCK ***
>>>
>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>> [  653.938574]  #0: ffffb2a862d6bcd0 
>>> (reservation_ww_class_acquire){+.+.}, at:
>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>
>>> Change the locking order of reservation_ww_class_mutex and adev->reset_sem 
>>> in amdgpu_gem_va_ioctl to match the one in 
>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid a potential deadlock.
>> It may be better to fix it the other way around in 
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the 
>> reservation. Otherwise you will never be able to take the reset_sem 
>> while any BOs are reserved. That's probably going to cause you other 
>> problems later.
>>
>> That makes me wonder, why do you need the reset_sem in 
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious 
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart 
>> updating the GART table through HDP?
> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>
> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>
> Christian.
>
>> Regards,
>>     Felix
>>
>>
>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index ee1e8fff83b2..fc889c477696 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>    		abo = NULL;
>>>    	}
>>>    
>>> +	down_read(&adev->reset_sem);
>>> +
>>>    	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>    
>>>    	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); 
>>> @@
>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>    		bo_va = NULL;
>>>    	}
>>>    
>>> -	down_read(&adev->reset_sem);
>>> -
>>>    	switch (args->operation) {
>>>    	case AMDGPU_VA_OP_MAP:
>>>    		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@ 
>>> -701,12
>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
>>> +*data,
>>>    		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>    					args->operation);
>>>    
>>> -	up_read(&adev->reset_sem);
>>> -
>>>    error_backoff:
>>>    	ttm_eu_backoff_reservation(&ticket, &list);
>>>    
>>>    error_unref:
>>> +	up_read(&adev->reset_sem);
>>>    	drm_gem_object_put_unlocked(gobj);
>>>    	return r;
>>>    }

* RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  8:53     ` Christian König
@ 2020-08-12  9:35       ` Li, Dennis
  2020-08-12 15:07       ` Felix Kuehling
  1 sibling, 0 replies; 15+ messages in thread
From: Li, Dennis @ 2020-08-12  9:35 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, amd-gfx, Deucher, Alexander,
	Zhang, Hawking

On 12.08.20 03:19, Li, Dennis wrote:
> Hi, Felix,
>
> Re: It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later.
> [Dennis Li] Thanks that you find the potential issue, I will change it in version 2.
>
> Re: That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP?
> [Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had considered protecting only amdgpu_ttm_alloc_gart before.

That access is irrelevant and the lock should be removed or changed into a trylock.

See we need the HDP flush only because the hardware could have accessed the data before.

But after a GPU reset the HDP is known to be clean, so this doesn't need any protection.

>   But I worry that other functions will access the hardware in the future. Therefore I chose an aggressive approach that takes reset_sem at the beginning of the entry functions of the amdgpu driver.

This is not a good idea. We used to have such a global lock before and removed it because it caused all kinds of problems.

[Dennis Li] Okay. Since you don't agree with this aggressive approach, I will change it to protect amdgpu_ttm_alloc_gart only.

When was this added? Looks like it slipped under my radar or I wasn't awake enough at that moment.

Christian.

>
> Best Regards
> Dennis Li
> -----Original Message-----
> From: Kuehling, Felix <Felix.Kuehling@amd.com>
> Sent: Tuesday, August 11, 2020 9:57 PM
> To: Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; 
> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking 
> <Hawking.Zhang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
> dependency
>
> On 2020-08-11 5:32 a.m., Dennis Li wrote:
>> [  653.902305] ======================================================
>> [  653.902928] WARNING: possible circular locking dependency detected
>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>> [  653.904098] ------------------------------------------------------
>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>                 but task is already holding lock:
>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>                 which lock already depends on the new lock.
>>
>> [  653.909423]
>>                 the existing dependency chain (in reverse order) is:
>> [  653.910594]
>>                 -> #1 (reservation_ww_class_mutex){+.+.}:
>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>> [  653.912350]        ww_mutex_lock+0x73/0x80
>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>> [  653.916959]        local_pci_probe+0x47/0xa0
>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>> [  653.918184]        process_one_work+0x29e/0x630
>> [  653.918803]        worker_thread+0x22b/0x3f0
>> [  653.919427]        kthread+0x12f/0x150
>> [  653.920047]        ret_from_fork+0x3a/0x50
>> [  653.920661]
>>                 -> #0 (&adev->reset_sem){.+.+}:
>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>> [  653.922531]        lock_acquire+0xb8/0x1c0
>> [  653.923174]        down_read+0x48/0x230
>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>> [  653.926686]        ksys_ioctl+0x98/0xb0
>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>> [  653.928030]        do_syscall_64+0x5f/0x250
>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>> [  653.929373]
>>                 other info that might help us debug this:
>>
>> [  653.931356]  Possible unsafe locking scenario:
>>
>> [  653.932647]        CPU0                    CPU1
>> [  653.933287]        ----                    ----
>> [  653.933911]   lock(reservation_ww_class_mutex);
>> [  653.934530]                                lock(&adev->reset_sem);
>> [  653.935154]                                lock(reservation_ww_class_mutex);
>> [  653.935766]   lock(&adev->reset_sem);
>> [  653.936360]
>>                  *** DEADLOCK ***
>>
>> [  653.938028] 2 locks held by amdgpu_test/3975:
>> [  653.938574]  #0: ffffb2a862d6bcd0
>> (reservation_ww_class_acquire){+.+.}, at:
>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>
>> change the order of reservation_ww_class_mutex and adev->reset_sem in 
>> amdgpu_gem_va_ioctl the same as ones in amdgpu_amdkfd_alloc_gtt_mem, 
>> to avoid potential dead lock.
> It may be better to fix it the other way around in amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the reservation. Otherwise you will never be able to take the reset_sem while any BOs are reserved. That's probably going to cause you other problems later.
>
> That makes me wonder, why do you need the reset_sem in amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious hardware access in that function. Is it for amdgpu_ttm_alloc_gart updating the GART table through HDP?
>
> Regards,
>    Felix
>
>
>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>
>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> index ee1e8fff83b2..fc889c477696 100644
>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		abo = NULL;
>>   	}
>>   
>> +	down_read(&adev->reset_sem);
>> +
>>   	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>   
>>   	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); @@
>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>   		bo_va = NULL;
>>   	}
>>   
>> -	down_read(&adev->reset_sem);
>> -
>>   	switch (args->operation) {
>>   	case AMDGPU_VA_OP_MAP:
>>   		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@ -701,12
>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
>> +*data,
>>   		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>   					args->operation);
>>   
>> -	up_read(&adev->reset_sem);
>> -
>>   error_backoff:
>>   	ttm_eu_backoff_reservation(&ticket, &list);
>>   
>>   error_unref:
>> +	up_read(&adev->reset_sem);
>>   	drm_gem_object_put_unlocked(gobj);
>>   	return r;
>>   }

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  9:23         ` Li, Dennis
@ 2020-08-12  9:43           ` Christian König
  2020-08-12 10:02             ` Li, Dennis
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2020-08-12  9:43 UTC (permalink / raw)
  To: Li, Dennis, Kuehling, Felix, amd-gfx, Deucher, Alexander, Zhang, Hawking

On 12.08.20 at 11:23, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> On 12.08.20 at 03:33, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Christian,
>>
>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>
>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access hardware.
> This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason.
>
> The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems.
>
> [Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt

That is for pre-gfx9 hardware and is only called once, during the initial 
enabling of the feature.

Please remove that locking again, since it is clearly against the driver 
design.

Christian.

>
> Regards,
> Christian.
>
>> Best Regards
>> Dennis Li
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Wednesday, August 12, 2020 12:15 AM
>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis
>> <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, Alexander
>> <Alexander.Deucher@amd.com>; Zhang, Hawking <Hawking.Zhang@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
>> dependency
>>
>> On 11.08.20 at 15:57, Felix Kuehling wrote:
>>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>>> [  653.902305]
>>>> ======================================================
>>>> [  653.902928] WARNING: possible circular locking dependency detected
>>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>>> [  653.904098]
>>>> ------------------------------------------------------
>>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>>                   but task is already holding lock:
>>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>>                   which lock already depends on the new lock.
>>>>
>>>> [  653.909423]
>>>>                   the existing dependency chain (in reverse order) is:
>>>> [  653.910594]
>>>>                   -> #1 (reservation_ww_class_mutex){+.+.}:
>>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>>> [  653.918184]        process_one_work+0x29e/0x630
>>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>>> [  653.919427]        kthread+0x12f/0x150
>>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>>> [  653.920661]
>>>>                   -> #0 (&adev->reset_sem){.+.+}:
>>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>>> [  653.923174]        down_read+0x48/0x230
>>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> [  653.929373]
>>>>                   other info that might help us debug this:
>>>>
>>>> [  653.931356]  Possible unsafe locking scenario:
>>>>
>>>> [  653.932647]        CPU0                    CPU1
>>>> [  653.933287]        ----                    ----
>>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>>> [  653.934530]                                lock(&adev->reset_sem);
>>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>>> [  653.935766]   lock(&adev->reset_sem);
>>>> [  653.936360]
>>>>                    *** DEADLOCK ***
>>>>
>>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>>> [  653.938574]  #0: ffffb2a862d6bcd0
>>>> (reservation_ww_class_acquire){+.+.}, at:
>>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>
>>>> change the order of reservation_ww_class_mutex and adev->reset_sem
>>>> in amdgpu_gem_va_ioctl the same as ones in
>>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock.
>>> It may be better to fix it the other way around in
>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>>> reservation. Otherwise you will never be able to take the reset_sem
>>> while any BOs are reserved. That's probably going to cause you other
>>> problems later.
>>>
>>> That makes me wonder, why do you need the reset_sem in
>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>>> updating the GART table through HDP?
>> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>>
>> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>
>> Christian.
>>
>>> Regards,
>>>      Felix
>>>
>>>
>>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index ee1e8fff83b2..fc889c477696 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>     		abo = NULL;
>>>>     	}
>>>>     
>>>> +	down_read(&adev->reset_sem);
>>>> +
>>>>     	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>>     
>>>>     	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
>>>> @@
>>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>     		bo_va = NULL;
>>>>     	}
>>>>     
>>>> -	down_read(&adev->reset_sem);
>>>> -
>>>>     	switch (args->operation) {
>>>>     	case AMDGPU_VA_OP_MAP:
>>>>     		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@
>>>> -701,12
>>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void
>>>> +*data,
>>>>     		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>>     					args->operation);
>>>>     
>>>> -	up_read(&adev->reset_sem);
>>>> -
>>>>     error_backoff:
>>>>     	ttm_eu_backoff_reservation(&ticket, &list);
>>>>     
>>>>     error_unref:
>>>> +	up_read(&adev->reset_sem);
>>>>     	drm_gem_object_put_unlocked(gobj);
>>>>     	return r;
>>>>     }


* RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  9:43           ` Christian König
@ 2020-08-12 10:02             ` Li, Dennis
  2020-08-12 10:04               ` Christian König
  0 siblings, 1 reply; 15+ messages in thread
From: Li, Dennis @ 2020-08-12 10:02 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, amd-gfx, Deucher, Alexander,
	Zhang, Hawking

[AMD Official Use Only - Internal Distribution Only]

On 12.08.20 at 11:23, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> On 12.08.20 at 03:33, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Christian,
>>
>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>
>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access hardware.
> This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason.
>
> The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems.
>
> [Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> 
> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt

That is for pre-gfx9 hardware and is only called once, during the initial enabling of the feature.

Please remove that locking again, since it is clearly against the driver design.

[Dennis Li] Okay; if you agree, I will change it to protect only amdgpu_gem_va_update_vm in this function.

Christian.

>
> Regards,
> Christian.
>
>> Best Regards
>> Dennis Li
>> -----Original Message-----
>> From: Koenig, Christian <Christian.Koenig@amd.com>
>> Sent: Wednesday, August 12, 2020 12:15 AM
>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis 
>> <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, 
>> Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking 
>> <Hawking.Zhang@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
>> dependency
>>
>> On 11.08.20 at 15:57, Felix Kuehling wrote:
>>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>>> [  653.902305]
>>>> ======================================================
>>>> [  653.902928] WARNING: possible circular locking dependency detected
>>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>>> [  653.904098]
>>>> ------------------------------------------------------
>>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>>                   but task is already holding lock:
>>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>>                   which lock already depends on the new lock.
>>>>
>>>> [  653.909423]
>>>>                   the existing dependency chain (in reverse order) is:
>>>> [  653.910594]
>>>>                   -> #1 (reservation_ww_class_mutex){+.+.}:
>>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>>> [  653.918184]        process_one_work+0x29e/0x630
>>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>>> [  653.919427]        kthread+0x12f/0x150
>>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>>> [  653.920661]
>>>>                   -> #0 (&adev->reset_sem){.+.+}:
>>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>>> [  653.923174]        down_read+0x48/0x230
>>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> [  653.929373]
>>>>                   other info that might help us debug this:
>>>>
>>>> [  653.931356]  Possible unsafe locking scenario:
>>>>
>>>> [  653.932647]        CPU0                    CPU1
>>>> [  653.933287]        ----                    ----
>>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>>> [  653.934530]                                lock(&adev->reset_sem);
>>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>>> [  653.935766]   lock(&adev->reset_sem);
>>>> [  653.936360]
>>>>                    *** DEADLOCK ***
>>>>
>>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>>> [  653.938574]  #0: ffffb2a862d6bcd0 
>>>> (reservation_ww_class_acquire){+.+.}, at:
>>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>
>>>> change the order of reservation_ww_class_mutex and adev->reset_sem 
>>>> in amdgpu_gem_va_ioctl the same as ones in 
>>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock.
>>> It may be better to fix it the other way around in 
>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the 
>>> reservation. Otherwise you will never be able to take the reset_sem 
>>> while any BOs are reserved. That's probably going to cause you other 
>>> problems later.
>>>
>>> That makes me wonder, why do you need the reset_sem in 
>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious 
>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart 
>>> updating the GART table through HDP?
>> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>>
>> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>
>> Christian.
>>
>>> Regards,
>>>      Felix
>>>
>>>
>>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index ee1e8fff83b2..fc889c477696 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>     		abo = NULL;
>>>>     	}
>>>>     
>>>> +	down_read(&adev->reset_sem);
>>>> +
>>>>     	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>>     
>>>>     	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates); 
>>>> @@
>>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>     		bo_va = NULL;
>>>>     	}
>>>>     
>>>> -	down_read(&adev->reset_sem);
>>>> -
>>>>     	switch (args->operation) {
>>>>     	case AMDGPU_VA_OP_MAP:
>>>>     		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@
>>>> -701,12
>>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
>>>> +*data,
>>>>     		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>>     					args->operation);
>>>>     
>>>> -	up_read(&adev->reset_sem);
>>>> -
>>>>     error_backoff:
>>>>     	ttm_eu_backoff_reservation(&ticket, &list);
>>>>     
>>>>     error_unref:
>>>> +	up_read(&adev->reset_sem);
>>>>     	drm_gem_object_put_unlocked(gobj);
>>>>     	return r;
>>>>     }

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12 10:02             ` Li, Dennis
@ 2020-08-12 10:04               ` Christian König
  2020-08-12 11:31                 ` Li, Dennis
  0 siblings, 1 reply; 15+ messages in thread
From: Christian König @ 2020-08-12 10:04 UTC (permalink / raw)
  To: Li, Dennis, Kuehling, Felix, amd-gfx, Deucher, Alexander, Zhang, Hawking

On 12.08.20 at 12:02, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> On 12.08.20 at 11:23, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> On 12.08.20 at 03:33, Li, Dennis wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Christian,
>>>
>>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>>
>>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access hardware.
>> This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason.
>>
>> The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems.
>>
>> [Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get ->
>> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt
> That is for pre-gfx9 hardware and is only called once, during the initial enabling of the feature.
>
> Please remove that locking again, since it is clearly against the driver design.
>
> [Dennis Li] Okay; if you agree, I will change it to protect only amdgpu_gem_va_update_vm in this function.

Better yet, protect only the amdgpu_vm_update_prt_state() function.
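(A minimal sketch of that narrowing, with the function body paraphrased
from amdgpu_vm.c and the reset_sem placement purely hypothetical; the
rwsem is taken outside the prt_lock spinlock, since a rwsem may sleep and
must not be acquired under a spinlock.)

	static void amdgpu_vm_update_prt_state(struct amdgpu_device *adev)
	{
		unsigned long flags;
		bool enable;

		/* Hypothetical placement: guard only the PRT register access. */
		down_read(&adev->reset_sem);
		spin_lock_irqsave(&adev->vm_manager.prt_lock, flags);
		enable = !!atomic_read(&adev->vm_manager.num_prt_users);
		adev->gmc.gmc_funcs->set_prt(adev, enable);
		spin_unlock_irqrestore(&adev->vm_manager.prt_lock, flags);
		up_read(&adev->reset_sem);
	}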

Christian.

>
> Christian.
>
>> Regards,
>> Christian.
>>
>>> Best Regards
>>> Dennis Li
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Wednesday, August 12, 2020 12:15 AM
>>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis
>>> <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher,
>>> Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
>>> <Hawking.Zhang@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
>>> dependency
>>>
>>> On 11.08.20 at 15:57, Felix Kuehling wrote:
>>>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>>>> [  653.902305]
>>>>> ======================================================
>>>>> [  653.902928] WARNING: possible circular locking dependency detected
>>>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>>>> [  653.904098]
>>>>> ------------------------------------------------------
>>>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>>>                    but task is already holding lock:
>>>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>>>                    which lock already depends on the new lock.
>>>>>
>>>>> [  653.909423]
>>>>>                    the existing dependency chain (in reverse order) is:
>>>>> [  653.910594]
>>>>>                    -> #1 (reservation_ww_class_mutex){+.+.}:
>>>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>>>> [  653.918184]        process_one_work+0x29e/0x630
>>>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>>>> [  653.919427]        kthread+0x12f/0x150
>>>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>>>> [  653.920661]
>>>>>                    -> #0 (&adev->reset_sem){.+.+}:
>>>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>>>> [  653.923174]        down_read+0x48/0x230
>>>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>> [  653.929373]
>>>>>                    other info that might help us debug this:
>>>>>
>>>>> [  653.931356]  Possible unsafe locking scenario:
>>>>>
>>>>> [  653.932647]        CPU0                    CPU1
>>>>> [  653.933287]        ----                    ----
>>>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>>>> [  653.934530]                                lock(&adev->reset_sem);
>>>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>>>> [  653.935766]   lock(&adev->reset_sem);
>>>>> [  653.936360]
>>>>>                     *** DEADLOCK ***
>>>>>
>>>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>>>> [  653.938574]  #0: ffffb2a862d6bcd0
>>>>> (reservation_ww_class_acquire){+.+.}, at:
>>>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>>
>>>>> change the order of reservation_ww_class_mutex and adev->reset_sem
>>>>> in amdgpu_gem_va_ioctl the same as ones in
>>>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock.
>>>> It may be better to fix it the other way around in
>>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>>>> reservation. Otherwise you will never be able to take the reset_sem
>>>> while any BOs are reserved. That's probably going to cause you other
>>>> problems later.
>>>>
>>>> That makes me wonder, why do you need the reset_sem in
>>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>>>> updating the GART table through HDP?
>>> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>>>
>>> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>>
>>> Christian.
>>>
>>>> Regards,
>>>>       Felix
>>>>
>>>>
>>>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> index ee1e8fff83b2..fc889c477696 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>      		abo = NULL;
>>>>>      	}
>>>>>      
>>>>> +	down_read(&adev->reset_sem);
>>>>> +
>>>>>      	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>>>      
>>>>>      	r = ttm_eu_reserve_buffers(&ticket, &list, true, &duplicates);
>>>>> @@
>>>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>      		bo_va = NULL;
>>>>>      	}
>>>>>      
>>>>> -	down_read(&adev->reset_sem);
>>>>> -
>>>>>      	switch (args->operation) {
>>>>>      	case AMDGPU_VA_OP_MAP:
>>>>>      		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@
>>>>> -701,12
>>>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void
>>>>> +*data,
>>>>>      		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>>>      					args->operation);
>>>>>      
>>>>> -	up_read(&adev->reset_sem);
>>>>> -
>>>>>      error_backoff:
>>>>>      	ttm_eu_backoff_reservation(&ticket, &list);
>>>>>      
>>>>>      error_unref:
>>>>> +	up_read(&adev->reset_sem);
>>>>>      	drm_gem_object_put_unlocked(gobj);
>>>>>      	return r;
>>>>>      }


* RE: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12 10:04               ` Christian König
@ 2020-08-12 11:31                 ` Li, Dennis
  0 siblings, 0 replies; 15+ messages in thread
From: Li, Dennis @ 2020-08-12 11:31 UTC (permalink / raw)
  To: Koenig, Christian, Kuehling, Felix, amd-gfx, Deucher, Alexander,
	Zhang, Hawking

[AMD Official Use Only - Internal Distribution Only]

On 12.08.20 at 12:02, Li, Dennis wrote:
> [AMD Official Use Only - Internal Distribution Only]
>
> On 12.08.20 at 11:23, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> On 12.08.20 at 03:33, Li, Dennis wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Christian,
>>>
>>> Re: I was wondering the same thing for the amdgpu_gem_va_ioctl() as well. We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>>
>>> [Dennis Li] amdgpu_vm_bo_unmap, amdgpu_vm_bo_clear_mappings, amdgpu_vm_bo_replace_map and amdgpu_gem_va_update_vm all have a chance to access hardware.
>> This is complete nonsense. The functions intentionally work through the scheduler to avoid accessing the hardware directly for exactly that reason.
>>
>> The only hardware access we have here is the HDP flush and that can fail in the case of a GPU reset without causing problems.
>>
>> [Dennis Li]  amdgpu_vm_bo_clear_mappings -> amdgpu_vm_prt_get -> 
>> amdgpu_vm_update_prt_state -> gmc_v8_0_set_prt
> That is for pre-gfx9 hardware and is only called once, during the initial enabling of the feature.
>
> Please remove that locking again, since it is clearly against the driver design.
>
> [Dennis Li] Okay; if you agree, I will change it to protect only amdgpu_gem_va_update_vm in this function.

Better yet, protect only the amdgpu_vm_update_prt_state() function.
[Dennis Li] Got it. Following your suggestion, I will also narrow the scope of reset_sem in the other functions.

Christian.

>
> Christian.
>
>> Regards,
>> Christian.
>>
>>> Best Regards
>>> Dennis Li
>>> -----Original Message-----
>>> From: Koenig, Christian <Christian.Koenig@amd.com>
>>> Sent: Wednesday, August 12, 2020 12:15 AM
>>> To: Kuehling, Felix <Felix.Kuehling@amd.com>; Li, Dennis 
>>> <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org; Deucher, 
>>> Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking 
>>> <Hawking.Zhang@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking 
>>> dependency
>>>
>>> On 11.08.20 at 15:57, Felix Kuehling wrote:
>>>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>>>> [  653.902305]
>>>>> ======================================================
>>>>> [  653.902928] WARNING: possible circular locking dependency detected
>>>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted: G           OE
>>>>> [  653.904098]
>>>>> ------------------------------------------------------
>>>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>>>                    but task is already holding lock:
>>>>> [  653.907087] ffff9744adbee1f8 
>>>>> (reservation_ww_class_mutex){+.+.},
>>>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>>>                    which lock already depends on the new lock.
>>>>>
>>>>> [  653.909423]
>>>>>                    the existing dependency chain (in reverse order) is:
>>>>> [  653.910594]
>>>>>                    -> #1 (reservation_ww_class_mutex){+.+.}:
>>>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>>>> [  653.918184]        process_one_work+0x29e/0x630
>>>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>>>> [  653.919427]        kthread+0x12f/0x150
>>>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>>>> [  653.920661]
>>>>>                    -> #0 (&adev->reset_sem){.+.+}:
>>>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>>>> [  653.923174]        down_read+0x48/0x230
>>>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>>> [  653.929373]
>>>>>                    other info that might help us debug this:
>>>>>
>>>>> [  653.931356]  Possible unsafe locking scenario:
>>>>>
>>>>> [  653.932647]        CPU0                    CPU1
>>>>> [  653.933287]        ----                    ----
>>>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>>>> [  653.934530]                                lock(&adev->reset_sem);
>>>>> [  653.935154]                                lock(reservation_ww_class_mutex);
>>>>> [  653.935766]   lock(&adev->reset_sem);
>>>>> [  653.936360]
>>>>>                     *** DEADLOCK ***
>>>>>
>>>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>>>> [  653.938574]  #0: ffffb2a862d6bcd0 
>>>>> (reservation_ww_class_acquire){+.+.}, at:
>>>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>>
>>>>> change the order of reservation_ww_class_mutex and adev->reset_sem 
>>>>> in amdgpu_gem_va_ioctl the same as ones in 
>>>>> amdgpu_amdkfd_alloc_gtt_mem, to avoid potential dead lock.
>>>> It may be better to fix it the other way around in 
>>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the 
>>>> reservation. Otherwise you will never be able to take the reset_sem 
>>>> while any BOs are reserved. That's probably going to cause you 
>>>> other problems later.
>>>>
>>>> That makes me wonder, why do you need the reset_sem in 
>>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious 
>>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart 
>>>> updating the GART table through HDP?
>>> I was wondering the same thing for the amdgpu_gem_va_ioctl() as well.
>>>
>>> We shouldn't have any hardware access here, so taking the reset_sem looks like overkill to me.
>>>
>>> Christian.
>>>
>>>> Regards,
>>>>       Felix
>>>>
>>>>
>>>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>>>
>>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> index ee1e8fff83b2..fc889c477696 100644
>>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>      		abo = NULL;
>>>>>      	}
>>>>>      
>>>>> +	down_read(&adev->reset_sem);
>>>>> +
>>>>>      	amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>>>      
>>>>>      	r = ttm_eu_reserve_buffers(&ticket, &list, true, 
>>>>> &duplicates); @@
>>>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>>      		bo_va = NULL;
>>>>>      	}
>>>>>      
>>>>> -	down_read(&adev->reset_sem);
>>>>> -
>>>>>      	switch (args->operation) {
>>>>>      	case AMDGPU_VA_OP_MAP:
>>>>>      		va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@
>>>>> -701,12
>>>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void 
>>>>> +*data,
>>>>>      		amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>>>      					args->operation);
>>>>>      
>>>>> -	up_read(&adev->reset_sem);
>>>>> -
>>>>>      error_backoff:
>>>>>      	ttm_eu_backoff_reservation(&ticket, &list);
>>>>>      
>>>>>      error_unref:
>>>>> +	up_read(&adev->reset_sem);
>>>>>      	drm_gem_object_put_unlocked(gobj);
>>>>>      	return r;
>>>>>      }

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12  8:53     ` Christian König
  2020-08-12  9:35       ` Li, Dennis
@ 2020-08-12 15:07       ` Felix Kuehling
  2020-08-12 15:29         ` Christian König
  1 sibling, 1 reply; 15+ messages in thread
From: Felix Kuehling @ 2020-08-12 15:07 UTC (permalink / raw)
  To: Christian König, Li, Dennis, amd-gfx, Deucher, Alexander,
	Zhang, Hawking

On 2020-08-12 at 4:53 a.m., Christian König wrote:
> On 12.08.20 at 03:19, Li, Dennis wrote:
>> [AMD Official Use Only - Internal Distribution Only]
>>
>> Hi, Felix,
>>
>> Re: It may be better to fix it the other way around in
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>> reservation. Otherwise you will never be able to take the reset_sem
>> while any BOs are reserved. That's probably going to cause you other
>> problems later.
>> [Dennis Li] Thanks for spotting the potential issue; I will change
>> it in version 2.
>>
>> Re: That makes me wonder, why do you need the reset_sem in
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>> updating the GART table through HDP?
>> [Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had
>> considered protecting only amdgpu_ttm_alloc_gart before.
>
> That access is irrelevant and the lock should be removed or changed
> into a trylock.
>
> See we need the HDP flush only because the hardware could have
> accessed the data before.
>
> But after a GPU reset the HDP is known to be clean, so this doesn't
> need any protection.
>
>>   But I worry that other functions will access hardware in the future.
>> Therefore I selected an aggressive approach, which locks reset_sem at the
>> beginning of the entry functions of the amdgpu driver.
>
> This is not a good idea. We used to have such a global lock before and
> removed it because it caused all kinds of problems.

In most cases it's a global read-lock, so most of the time it should not
impact concurrency. The only "writer" that blocks all readers is the GPU
reset.
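(For context, the pattern under discussion is roughly the following sketch,
not the exact driver code: the ioctl paths are the readers, and the reset
path is the single writer.)

	/* Sketch: an ioctl path, one of many concurrent readers. */
	down_read(&adev->reset_sem);
	/* ... access the hardware ... */
	up_read(&adev->reset_sem);

	/* Sketch: the GPU reset path, the only writer; it waits for and
	 * then excludes all readers for the duration of the reset.
	 */
	down_write(&adev->reset_sem);
	/* ... stop scheduling, reset the GPU, resume ... */
	up_write(&adev->reset_sem);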


>
> When was this added? Looks like it slipped under my radar or I wasn't
> awake enough at that moment.

The change that added the reset_sem added read-locks in about 70 places.
I'm still concerned that this will be hard to maintain: making sure that
future HW access does the necessary locking. I guess that's the
motivation for doing the locking more coarse-grained. Taking the lock
higher up the call chains means fewer places that take the lock and new
low-level code may not need its own locking in the future because it's
already covered by higher-level callers.

On the other hand, it needs to be low enough in the call chains to avoid
recursive locking or circular dependencies with other locks.

Regards,
  Felix


>
> Christian.
>
>>
>> Best Regards
>> Dennis Li
>> -----Original Message-----
>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>> Sent: Tuesday, August 11, 2020 9:57 PM
>> To: Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org;
>> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
>> <Hawking.Zhang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
>> dependency
>>
>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>> [  653.902305] ======================================================
>>> [  653.902928] WARNING: possible circular locking dependency detected
>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted:
>>> G           OE
>>> [  653.904098] ------------------------------------------------------
>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>                 but task is already holding lock:
>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>                 which lock already depends on the new lock.
>>>
>>> [  653.909423]
>>>                 the existing dependency chain (in reverse order) is:
>>> [  653.910594]
>>>                 -> #1 (reservation_ww_class_mutex){+.+.}:
>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>> [  653.918184]        process_one_work+0x29e/0x630
>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>> [  653.919427]        kthread+0x12f/0x150
>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>> [  653.920661]
>>>                 -> #0 (&adev->reset_sem){.+.+}:
>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>> [  653.923174]        down_read+0x48/0x230
>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>> [  653.929373]
>>>                 other info that might help us debug this:
>>>
>>> [  653.931356]  Possible unsafe locking scenario:
>>>
>>> [  653.932647]        CPU0                    CPU1
>>> [  653.933287]        ----                    ----
>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>> [  653.934530]                                lock(&adev->reset_sem);
>>> [  653.935154]                               
>>> lock(reservation_ww_class_mutex);
>>> [  653.935766]   lock(&adev->reset_sem);
>>> [  653.936360]
>>>                  *** DEADLOCK ***
>>>
>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>> [  653.938574]  #0: ffffb2a862d6bcd0
>>> (reservation_ww_class_acquire){+.+.}, at:
>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>
>>> change the order of reservation_ww_class_mutex and adev->reset_sem in
>>> amdgpu_gem_va_ioctl the same as ones in amdgpu_amdkfd_alloc_gtt_mem,
>>> to avoid potential dead lock.
>> It may be better to fix it the other way around in
>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>> reservation. Otherwise you will never be able to take the reset_sem
>> while any BOs are reserved. That's probably going to cause you other
>> problems later.
>>
>> That makes me wonder, why do you need the reset_sem in
>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>> updating the GART table through HDP?
>>
>> Regards,
>>    Felix
>>
>>
>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>
>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> index ee1e8fff83b2..fc889c477696 100644
>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev,
>>> void *data,
>>>           abo = NULL;
>>>       }
>>>   +    down_read(&adev->reset_sem);
>>> +
>>>       amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>         r = ttm_eu_reserve_buffers(&ticket, &list, true,
>>> &duplicates); @@
>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev,
>>> void *data,
>>>           bo_va = NULL;
>>>       }
>>>   -    down_read(&adev->reset_sem);
>>> -
>>>       switch (args->operation) {
>>>       case AMDGPU_VA_OP_MAP:
>>>           va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@
>>> -701,12
>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>           amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>                       args->operation);
>>>   -    up_read(&adev->reset_sem);
>>> -
>>>   error_backoff:
>>>       ttm_eu_backoff_reservation(&ticket, &list);
>>>     error_unref:
>>> +    up_read(&adev->reset_sem);
>>>       drm_gem_object_put_unlocked(gobj);
>>>       return r;
>>>   }
>

* Re: [PATCH] drm/amdgpu: fix a potential circular locking dependency
  2020-08-12 15:07       ` Felix Kuehling
@ 2020-08-12 15:29         ` Christian König
  0 siblings, 0 replies; 15+ messages in thread
From: Christian König @ 2020-08-12 15:29 UTC (permalink / raw)
  To: Felix Kuehling, Christian König, Li, Dennis, amd-gfx,
	Deucher, Alexander, Zhang, Hawking

On 12.08.20 at 17:07, Felix Kuehling wrote:
> On 2020-08-12 at 4:53 a.m., Christian König wrote:
>> On 12.08.20 at 03:19, Li, Dennis wrote:
>>> [AMD Official Use Only - Internal Distribution Only]
>>>
>>> Hi, Felix,
>>>
>>> Re: It may be better to fix it the other way around in
>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>>> reservation. Otherwise you will never be able to take the reset_sem
>>> while any BOs are reserved. That's probably going to cause you other
>>> problems later.
>>> [Dennis Li] Thanks for spotting the potential issue; I will change
>>> it in version 2.
>>>
>>> Re: That makes me wonder, why do you need the reset_sem in
>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>>> updating the GART table through HDP?
>>> [Dennis Li] Yes, amdgpu_gart_bind will flush the HDP and TLB. I had
>>> considered protecting only amdgpu_ttm_alloc_gart before.
>> That access is irrelevant and the lock should be removed or changed
>> into a trylock.
>>
>> See we need the HDP flush only because the hardware could have
>> accessed the data before.
>>
>> But after a GPU reset the HDP is known to be clean, so this doesn't
>> need any protection.
>>
>>>    But I worry that other functions will access hardware in the future.
>>> Therefore I selected an aggressive approach, which locks reset_sem at the
>>> beginning of the entry functions of the amdgpu driver.
>> This is not a good idea. We used to have such a global lock before and
>> removed it because it caused all kinds of problems.
> In most cases it's a global read-lock, so most of the time it should not
> impact concurrency. The only "writer" that blocks all readers is the GPU
> reset.
>
>
>> When was this added? Looks like it slipped under my radar or I wasn't
>> awake enough at that moment.
> The change that added the reset_sem added read-locks in about 70 places.
> I'm still concerned that this will be hard to maintain: making sure that
> future HW access does the necessary locking. I guess that's the
> motivation for doing the locking more coarse-grained. Taking the lock
> higher up the call chains means fewer places that take the lock and new
> low-level code may not need its own locking in the future because it's
> already covered by higher-level callers.
>
> On the other hand, it needs to be low enough in the call chains to avoid
> recursive locking or circular dependencies with other locks.

Well, the whole approach is a NAK. We already tried this and it didn't 
work at all.

See how much effort it took to remove the global reset rwsem in the 
past.

We have only a minimal set of functions that access the hardware directly 
and can run concurrently with a GPU reset.

Everything else works through ring buffers where the processing is 
stopped before we reset the GPU.

So the whole thing is just a bad idea as far as I can see.

Regards,
Christian.

>
> Regards,
>    Felix
>
>
>> Christian.
>>
>>> Best Regards
>>> Dennis Li
>>> -----Original Message-----
>>> From: Kuehling, Felix <Felix.Kuehling@amd.com>
>>> Sent: Tuesday, August 11, 2020 9:57 PM
>>> To: Li, Dennis <Dennis.Li@amd.com>; amd-gfx@lists.freedesktop.org;
>>> Deucher, Alexander <Alexander.Deucher@amd.com>; Zhang, Hawking
>>> <Hawking.Zhang@amd.com>; Koenig, Christian <Christian.Koenig@amd.com>
>>> Subject: Re: [PATCH] drm/amdgpu: fix a potential circular locking
>>> dependency
>>>
>>> On 2020-08-11 at 5:32 a.m., Dennis Li wrote:
>>>> [  653.902305] ======================================================
>>>> [  653.902928] WARNING: possible circular locking dependency detected
>>>> [  653.903517] 5.6.0-deli-v5.6-2848-g3f3109b0e75f #1 Tainted:
>>>> G           OE
>>>> [  653.904098] ------------------------------------------------------
>>>> [  653.904675] amdgpu_test/3975 is trying to acquire lock:
>>>> [  653.905241] ffff97848f8647a0 (&adev->reset_sem){.+.+}, at:
>>>> amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu] [  653.905953]
>>>>                  but task is already holding lock:
>>>> [  653.907087] ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.},
>>>> at: ttm_eu_reserve_buffers+0x1ae/0x520 [ttm] [  653.907694]
>>>>                  which lock already depends on the new lock.
>>>>
>>>> [  653.909423]
>>>>                  the existing dependency chain (in reverse order) is:
>>>> [  653.910594]
>>>>                  -> #1 (reservation_ww_class_mutex){+.+.}:
>>>> [  653.911759]        __ww_mutex_lock.constprop.15+0xca/0x1120
>>>> [  653.912350]        ww_mutex_lock+0x73/0x80
>>>> [  653.913044]        amdgpu_amdkfd_alloc_gtt_mem+0xde/0x380 [amdgpu]
>>>> [  653.913724]        kgd2kfd_device_init+0x13f/0x5e0 [amdgpu]
>>>> [  653.914388]        amdgpu_amdkfd_device_init+0x155/0x190 [amdgpu]
>>>> [  653.915033]        amdgpu_device_init+0x1303/0x1e10 [amdgpu]
>>>> [  653.915685]        amdgpu_driver_load_kms+0x5c/0x2c0 [amdgpu]
>>>> [  653.916349]        amdgpu_pci_probe+0x11d/0x200 [amdgpu]
>>>> [  653.916959]        local_pci_probe+0x47/0xa0
>>>> [  653.917570]        work_for_cpu_fn+0x1a/0x30
>>>> [  653.918184]        process_one_work+0x29e/0x630
>>>> [  653.918803]        worker_thread+0x22b/0x3f0
>>>> [  653.919427]        kthread+0x12f/0x150
>>>> [  653.920047]        ret_from_fork+0x3a/0x50
>>>> [  653.920661]
>>>>                  -> #0 (&adev->reset_sem){.+.+}:
>>>> [  653.921893]        __lock_acquire+0x13ec/0x16e0
>>>> [  653.922531]        lock_acquire+0xb8/0x1c0
>>>> [  653.923174]        down_read+0x48/0x230
>>>> [  653.923886]        amdgpu_gem_va_ioctl+0x286/0x4f0 [amdgpu]
>>>> [  653.924588]        drm_ioctl_kernel+0xb6/0x100 [drm]
>>>> [  653.925283]        drm_ioctl+0x389/0x450 [drm]
>>>> [  653.926013]        amdgpu_drm_ioctl+0x4f/0x80 [amdgpu]
>>>> [  653.926686]        ksys_ioctl+0x98/0xb0
>>>> [  653.927357]        __x64_sys_ioctl+0x1a/0x20
>>>> [  653.928030]        do_syscall_64+0x5f/0x250
>>>> [  653.928697]        entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>> [  653.929373]
>>>>                  other info that might help us debug this:
>>>>
>>>> [  653.931356]  Possible unsafe locking scenario:
>>>>
>>>> [  653.932647]        CPU0                    CPU1
>>>> [  653.933287]        ----                    ----
>>>> [  653.933911]   lock(reservation_ww_class_mutex);
>>>> [  653.934530]                                lock(&adev->reset_sem);
>>>> [  653.935154]
>>>> lock(reservation_ww_class_mutex);
>>>> [  653.935766]   lock(&adev->reset_sem);
>>>> [  653.936360]
>>>>                   *** DEADLOCK ***
>>>>
>>>> [  653.938028] 2 locks held by amdgpu_test/3975:
>>>> [  653.938574]  #0: ffffb2a862d6bcd0
>>>> (reservation_ww_class_acquire){+.+.}, at:
>>>> amdgpu_gem_va_ioctl+0x39b/0x4f0 [amdgpu] [  653.939233]  #1:
>>>> ffff9744adbee1f8 (reservation_ww_class_mutex){+.+.}, at:
>>>> ttm_eu_reserve_buffers+0x1ae/0x520 [ttm]
>>>>
>>>> change the order of reservation_ww_class_mutex and adev->reset_sem in
>>>> amdgpu_gem_va_ioctl the same as ones in amdgpu_amdkfd_alloc_gtt_mem,
>>>> to avoid potential dead lock.
>>> It may be better to fix it the other way around in
>>> amdgpu_amdkfd_alloc_gtt_mem. Always take the reset_sem inside the
>>> reservation. Otherwise you will never be able to take the reset_sem
>>> while any BOs are reserved. That's probably going to cause you other
>>> problems later.
>>>
>>> That makes me wonder, why do you need the reset_sem in
>>> amdgpu_amdkfd_alloc_gtt_mem in the first place? There is no obvious
>>> hardware access in that function. Is it for amdgpu_ttm_alloc_gart
>>> updating the GART table through HDP?
>>>
>>> Regards,
>>>     Felix
>>>
>>>
>>>> Signed-off-by: Dennis Li <Dennis.Li@amd.com>
>>>>
>>>> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> index ee1e8fff83b2..fc889c477696 100644
>>>> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_gem.c
>>>> @@ -652,6 +652,8 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev,
>>>> void *data,
>>>>            abo = NULL;
>>>>        }
>>>>    +    down_read(&adev->reset_sem);
>>>> +
>>>>        amdgpu_vm_get_pd_bo(&fpriv->vm, &list, &vm_pd);
>>>>          r = ttm_eu_reserve_buffers(&ticket, &list, true,
>>>> &duplicates); @@
>>>> -670,8 +672,6 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev,
>>>> void *data,
>>>>            bo_va = NULL;
>>>>        }
>>>>    -    down_read(&adev->reset_sem);
>>>> -
>>>>        switch (args->operation) {
>>>>        case AMDGPU_VA_OP_MAP:
>>>>            va_flags = amdgpu_gem_va_map_flags(adev, args->flags); @@
>>>> -701,12
>>>> +701,11 @@ int amdgpu_gem_va_ioctl(struct drm_device *dev, void *data,
>>>>            amdgpu_gem_va_update_vm(adev, &fpriv->vm, bo_va,
>>>>                        args->operation);
>>>>    -    up_read(&adev->reset_sem);
>>>> -
>>>>    error_backoff:
>>>>        ttm_eu_backoff_reservation(&ticket, &list);
>>>>      error_unref:
>>>> +    up_read(&adev->reset_sem);
>>>>        drm_gem_object_put_unlocked(gobj);
>>>>        return r;
>>>>    }

Thread overview: 15+ messages
2020-08-11  9:32 [PATCH] drm/amdgpu: fix a potential circular locking dependency Dennis Li
2020-08-11 13:57 ` Felix Kuehling
2020-08-11 16:14   ` Christian König
2020-08-12  1:33     ` Li, Dennis
2020-08-12  8:56       ` Christian König
2020-08-12  9:23         ` Li, Dennis
2020-08-12  9:43           ` Christian König
2020-08-12 10:02             ` Li, Dennis
2020-08-12 10:04               ` Christian König
2020-08-12 11:31                 ` Li, Dennis
2020-08-12  1:19   ` Li, Dennis
2020-08-12  8:53     ` Christian König
2020-08-12  9:35       ` Li, Dennis
2020-08-12 15:07       ` Felix Kuehling
2020-08-12 15:29         ` Christian König
