All of lore.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw()
@ 2022-09-30 21:41 ` Zhang Boyang
  0 siblings, 0 replies; 6+ messages in thread
From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx
  Cc: David C . Rankin, Steven J Abner

Hi,

There are several reports of "Fatal error during GPU init" will cause
NULL-deref in amdgpu_device_fini_sw(). Although the NULL-deref is result
instead of reason, this NULL-deref will confuse user.

https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/

This is probably because "adev" is not fully initialized when
amdgpu_device_init() failed. Thus subsequent amdgpu_device_fini_sw()
will try to release "adev->reset_domain" and cause NULL-deref.

This patch fixes this problem by guarding the code with an "if".
However, I'm new to this module and I didn't fully understand the code,
so please review my code carefully.

Best Regards,
Zhang Boyang



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw()
@ 2022-09-30 21:41 ` Zhang Boyang
  0 siblings, 0 replies; 6+ messages in thread
From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx
  Cc: Steven J Abner, David C . Rankin

Hi,

There are several reports of "Fatal error during GPU init" will cause
NULL-deref in amdgpu_device_fini_sw(). Although the NULL-deref is result
instead of reason, this NULL-deref will confuse user.

https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/

This is probably because "adev" is not fully initialized when
amdgpu_device_init() failed. Thus subsequent amdgpu_device_fini_sw()
will try to release "adev->reset_domain" and cause NULL-deref.

This patch fixes this problem by guarding the code with an "if".
However, I'm new to this module and I didn't fully understand the code,
so please review my code carefully.

Best Regards,
Zhang Boyang



^ permalink raw reply	[flat|nested] 6+ messages in thread

* [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()
  2022-09-30 21:41 ` Zhang Boyang
@ 2022-09-30 21:41   ` Zhang Boyang
  -1 siblings, 0 replies; 6+ messages in thread
From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx
  Cc: David C . Rankin, Steven J Abner, Zhang Boyang

After amdgpu_device_init() failed, adev->reset_domain may be NULL. Thus
subsequent call to amdgpu_device_fini_sw() may result in null-ptr-deref.

This patch fixes the problem by adding a NULL pointer check around the
code of releasing adev->reset_domain in amdgpu_device_fini_sw().

Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be refcounted.")

Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
Link: https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index be7aff2d4a57..204daad06b32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
 	if (adev->mman.discovery_bin)
 		amdgpu_discovery_fini(adev);
 
-	amdgpu_reset_put_reset_domain(adev->reset_domain);
-	adev->reset_domain = NULL;
+	if (adev->reset_domain) {
+		amdgpu_reset_put_reset_domain(adev->reset_domain);
+		adev->reset_domain = NULL;
+	}
 
 	kfree(adev->pci_state);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()
@ 2022-09-30 21:41   ` Zhang Boyang
  0 siblings, 0 replies; 6+ messages in thread
From: Zhang Boyang @ 2022-09-30 21:41 UTC (permalink / raw)
  To: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx
  Cc: Steven J Abner, David C . Rankin, Zhang Boyang

After amdgpu_device_init() failed, adev->reset_domain may be NULL. Thus
subsequent call to amdgpu_device_fini_sw() may result in null-ptr-deref.

This patch fixes the problem by adding a NULL pointer check around the
code of releasing adev->reset_domain in amdgpu_device_fini_sw().

Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be refcounted.")

Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
Link: https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
index be7aff2d4a57..204daad06b32 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
@@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct amdgpu_device *adev)
 	if (adev->mman.discovery_bin)
 		amdgpu_discovery_fini(adev);
 
-	amdgpu_reset_put_reset_domain(adev->reset_domain);
-	adev->reset_domain = NULL;
+	if (adev->reset_domain) {
+		amdgpu_reset_put_reset_domain(adev->reset_domain);
+		adev->reset_domain = NULL;
+	}
 
 	kfree(adev->pci_state);
 
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()
  2022-09-30 21:41   ` Zhang Boyang
@ 2022-09-30 22:38     ` Steven J Abner
  -1 siblings, 0 replies; 6+ messages in thread
From: Steven J Abner @ 2022-09-30 22:38 UTC (permalink / raw)
  To: Zhang Boyang
  Cc: Andrey Grodzovsky, Christian König, linux-kernel, amd-gfx,
	David C . Rankin

I had done more delving into this also, thankfully was not forgotten.
Additional info to solve blackout, was going to contact AMD:
The problem as far as I could trace occurs in amdgpu_psp.c in function 
psp_cmd_submit_buf().
Normal counter 'timeout' seems to use a max of about 400 with 
usleep_range(5, 80).
Under normal operation 'psp->fence_buf' will equal 'index' and drop 
from loop. I could not track what
fills the kernel buffer objects 'virtual' pointer's reference 
(psp->fence_buf). I assume that
it's to a firmware file that on read is returning an error, and getting 
stuck in a loop lock.
The error condition I found occurs when 'psp->fence_buf' does not equal 
'index',
breaking from loop with 'timeout' == 0.
Blackout seemed to be about 80% of reboots, but found that in 
reconfiguration of kernel,
on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of 
reboots. I have now avoided
all blackouts with addition of CONFIG_SATA_PMP=y (at least a week 
free?).
This is pure guess but, maybe the ARM PSP processor is dependent on 
libata procedures, one
of which causes a lock up some of the time.
Steve

On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@gmail.com> 
wrote:
> After amdgpu_device_init() failed, adev->reset_domain may be NULL. 
> Thus
> subsequent call to amdgpu_device_fini_sw() may result in 
> null-ptr-deref.
> 
> This patch fixes the problem by adding a NULL pointer check around the
> code of releasing adev->reset_domain in amdgpu_device_fini_sw().
> 
> Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be 
> refcounted.")
> 
> Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
> Link: 
> https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
> Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index be7aff2d4a57..204daad06b32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct 
> amdgpu_device *adev)
>  	if (adev->mman.discovery_bin)
>  		amdgpu_discovery_fini(adev);
> 
> -	amdgpu_reset_put_reset_domain(adev->reset_domain);
> -	adev->reset_domain = NULL;
> +	if (adev->reset_domain) {
> +		amdgpu_reset_put_reset_domain(adev->reset_domain);
> +		adev->reset_domain = NULL;
> +	}
> 
>  	kfree(adev->pci_state);
> 
> --
> 2.30.2
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()
@ 2022-09-30 22:38     ` Steven J Abner
  0 siblings, 0 replies; 6+ messages in thread
From: Steven J Abner @ 2022-09-30 22:38 UTC (permalink / raw)
  To: Zhang Boyang
  Cc: Andrey Grodzovsky, David C . Rankin, Christian König,
	amd-gfx, linux-kernel

I had done more delving into this also, thankfully was not forgotten.
Additional info to solve blackout, was going to contact AMD:
The problem as far as I could trace occurs in amdgpu_psp.c in function 
psp_cmd_submit_buf().
Normal counter 'timeout' seems to use a max of about 400 with 
usleep_range(5, 80).
Under normal operation 'psp->fence_buf' will equal 'index' and drop 
from loop. I could not track what
fills the kernel buffer objects 'virtual' pointer's reference 
(psp->fence_buf). I assume that
it's to a firmware file that on read is returning an error, and getting 
stuck in a loop lock.
The error condition I found occurs when 'psp->fence_buf' does not equal 
'index',
breaking from loop with 'timeout' == 0.
Blackout seemed to be about 80% of reboots, but found that in 
reconfiguration of kernel,
on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of 
reboots. I have now avoided
all blackouts with addition of CONFIG_SATA_PMP=y (at least a week 
free?).
This is pure guess but, maybe the ARM PSP processor is dependent on 
libata procedures, one
of which causes a lock up some of the time.
Steve

On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@gmail.com> 
wrote:
> After amdgpu_device_init() failed, adev->reset_domain may be NULL. 
> Thus
> subsequent call to amdgpu_device_fini_sw() may result in 
> null-ptr-deref.
> 
> This patch fixes the problem by adding a NULL pointer check around the
> code of releasing adev->reset_domain in amdgpu_device_fini_sw().
> 
> Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be 
> refcounted.")
> 
> Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
> Link: 
> https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
> Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index be7aff2d4a57..204daad06b32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct 
> amdgpu_device *adev)
>  	if (adev->mman.discovery_bin)
>  		amdgpu_discovery_fini(adev);
> 
> -	amdgpu_reset_put_reset_domain(adev->reset_domain);
> -	adev->reset_domain = NULL;
> +	if (adev->reset_domain) {
> +		amdgpu_reset_put_reset_domain(adev->reset_domain);
> +		adev->reset_domain = NULL;
> +	}
> 
>  	kfree(adev->pci_state);
> 
> --
> 2.30.2
> 



^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2022-10-03 13:08 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-30 21:41 [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw() Zhang Boyang
2022-09-30 21:41 ` Zhang Boyang
2022-09-30 21:41 ` [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref " Zhang Boyang
2022-09-30 21:41   ` Zhang Boyang
2022-09-30 22:38   ` Steven J Abner
2022-09-30 22:38     ` Steven J Abner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.