All of lore.kernel.org
 help / color / mirror / Atom feed
From: Steven J Abner <pheonix.sja@att.net>
To: Zhang Boyang <zhangboyang.id@gmail.com>
Cc: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
	"Christian König" <christian.koenig@amd.com>,
	linux-kernel@vger.kernel.org, amd-gfx@lists.freedesktop.org,
	"David C . Rankin" <drankinatty@suddenlinkmail.com>
Subject: Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()
Date: Fri, 30 Sep 2022 18:38:41 -0400	[thread overview]
Message-ID: <H8O1JR.4SM4JR87O02A2@att.net> (raw)
In-Reply-To: <20220930214110.1074226-2-zhangboyang.id@gmail.com>

I had done more delving into this also, thankfully was not forgotten.
Additional info to solve blackout, was going to contact AMD:
The problem as far as I could trace occurs in amdgpu_psp.c in function 
psp_cmd_submit_buf().
Normal counter 'timeout' seems to use a max of about 400 with 
usleep_range(5, 80).
Under normal operation 'psp->fence_buf' will equal 'index' and drop 
from loop. I could not track what
fills the kernel buffer objects 'virtual' pointer's reference 
(psp->fence_buf). I assume that
it's to a firmware file that on read is returning an error, and getting 
stuck in a loop lock.
The error condition I found occurs when 'psp->fence_buf' does not equal 
'index',
breaking from loop with 'timeout' == 0.
Blackout seemed to be about 80% of reboots, but found that in 
reconfiguration of kernel,
on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of 
reboots. I have now avoided
all blackouts with addition of CONFIG_SATA_PMP=y (at least a week 
free?).
This is pure guess but, maybe the ARM PSP processor is dependent on 
libata procedures, one
of which causes a lock up some of the time.
Steve

On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@gmail.com> 
wrote:
> After amdgpu_device_init() failed, adev->reset_domain may be NULL. 
> Thus
> subsequent call to amdgpu_device_fini_sw() may result in 
> null-ptr-deref.
> 
> This patch fixes the problem by adding a NULL pointer check around the
> code of releasing adev->reset_domain in amdgpu_device_fini_sw().
> 
> Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be 
> refcounted.")
> 
> Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
> Link: 
> https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
> Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index be7aff2d4a57..204daad06b32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct 
> amdgpu_device *adev)
>  	if (adev->mman.discovery_bin)
>  		amdgpu_discovery_fini(adev);
> 
> -	amdgpu_reset_put_reset_domain(adev->reset_domain);
> -	adev->reset_domain = NULL;
> +	if (adev->reset_domain) {
> +		amdgpu_reset_put_reset_domain(adev->reset_domain);
> +		adev->reset_domain = NULL;
> +	}
> 
>  	kfree(adev->pci_state);
> 
> --
> 2.30.2
> 



WARNING: multiple messages have this Message-ID (diff)
From: Steven J Abner <pheonix.sja@att.net>
To: Zhang Boyang <zhangboyang.id@gmail.com>
Cc: "Andrey Grodzovsky" <andrey.grodzovsky@amd.com>,
	"David C . Rankin" <drankinatty@suddenlinkmail.com>,
	"Christian König" <christian.koenig@amd.com>,
	amd-gfx@lists.freedesktop.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref in amdgpu_device_fini_sw()
Date: Fri, 30 Sep 2022 18:38:41 -0400	[thread overview]
Message-ID: <H8O1JR.4SM4JR87O02A2@att.net> (raw)
In-Reply-To: <20220930214110.1074226-2-zhangboyang.id@gmail.com>

I had done more delving into this also, thankfully was not forgotten.
Additional info to solve blackout, was going to contact AMD:
The problem as far as I could trace occurs in amdgpu_psp.c in function 
psp_cmd_submit_buf().
Normal counter 'timeout' seems to use a max of about 400 with 
usleep_range(5, 80).
Under normal operation 'psp->fence_buf' will equal 'index' and drop 
from loop. I could not track what
fills the kernel buffer objects 'virtual' pointer's reference 
(psp->fence_buf). I assume that
it's to a firmware file that on read is returning an error, and getting 
stuck in a loop lock.
The error condition I found occurs when 'psp->fence_buf' does not equal 
'index',
breaking from loop with 'timeout' == 0.
Blackout seemed to be about 80% of reboots, but found that in 
reconfiguration of kernel,
on 5.18.19, with CONFIG_ATA_ACPI=y drops blackouts to about 30% of 
reboots. I have now avoided
all blackouts with addition of CONFIG_SATA_PMP=y (at least a week 
free?).
This is pure guess but, maybe the ARM PSP processor is dependent on 
libata procedures, one
of which causes a lock up some of the time.
Steve

On Fri, Sep 30, 2022 at 21:41, Zhang Boyang <zhangboyang.id@gmail.com> 
wrote:
> After amdgpu_device_init() failed, adev->reset_domain may be NULL. 
> Thus
> subsequent call to amdgpu_device_fini_sw() may result in 
> null-ptr-deref.
> 
> This patch fixes the problem by adding a NULL pointer check around the
> code of releasing adev->reset_domain in amdgpu_device_fini_sw().
> 
> Fixes: cfbb6b004744 ("drm/amdgpu: Rework reset domain to be 
> refcounted.")
> 
> Signed-off-by: Zhang Boyang <zhangboyang.id@gmail.com>
> Link: 
> https://lore.kernel.org/lkml/a8bce489-8ccc-aa95-3de6-f854e03ad557@suddenlinkmail.com/
> Link: https://lore.kernel.org/lkml/AT9WHR.3Z1T3VI9A2AQ3@att.net/
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_device.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c 
> b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> index be7aff2d4a57..204daad06b32 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
> @@ -4021,8 +4021,10 @@ void amdgpu_device_fini_sw(struct 
> amdgpu_device *adev)
>  	if (adev->mman.discovery_bin)
>  		amdgpu_discovery_fini(adev);
> 
> -	amdgpu_reset_put_reset_domain(adev->reset_domain);
> -	adev->reset_domain = NULL;
> +	if (adev->reset_domain) {
> +		amdgpu_reset_put_reset_domain(adev->reset_domain);
> +		adev->reset_domain = NULL;
> +	}
> 
>  	kfree(adev->pci_state);
> 
> --
> 2.30.2
> 



  reply	other threads:[~2022-09-30 22:38 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-09-30 21:41 [RFC PATCH 0/1] drm/amdgpu: Fix NULL-deref in amdgpu_device_fini_sw() Zhang Boyang
2022-09-30 21:41 ` Zhang Boyang
2022-09-30 21:41 ` [RFC PATCH 1/1] drm/amdgpu: Fix null-ptr-deref " Zhang Boyang
2022-09-30 21:41   ` Zhang Boyang
2022-09-30 22:38   ` Steven J Abner [this message]
2022-09-30 22:38     ` Steven J Abner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=H8O1JR.4SM4JR87O02A2@att.net \
    --to=pheonix.sja@att.net \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=andrey.grodzovsky@amd.com \
    --cc=christian.koenig@amd.com \
    --cc=drankinatty@suddenlinkmail.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=zhangboyang.id@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.