All of lore.kernel.org
 help / color / mirror / Atom feed
From: Luben Tuikov <luben.tuikov@amd.com>
To: "Guilherme G. Piccoli" <gpiccoli@igalia.com>,
	amd-gfx@lists.freedesktop.org
Cc: kernel@gpiccoli.net, Guchun Chen <guchun.chen@amd.com>,
	Xinhui.Pan@amd.com, dri-devel@lists.freedesktop.org,
	Mario Limonciello <mario.limonciello@amd.com>,
	kernel-dev@igalia.com, alexander.deucher@amd.com,
	christian.koenig@amd.com
Subject: Re: [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Date: Thu, 2 Feb 2023 11:12:38 -0500	[thread overview]
Message-ID: <f7805d0a-6fbd-7389-a2fd-d6343ef4598b@amd.com> (raw)
In-Reply-To: <20230202134856.1382169-1-gpiccoli@igalia.com>

Hi Guilherme,

Thanks for redoing to a v3. This patch is:

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

On 2023-02-02 08:48, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
> 
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
> 
> amdgpu: probe of 0000:04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0000000000000090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  <TASK>
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
> 
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
> 
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
> Christian's suggestion [0], and also removed the no_scheduler check [1].
> 
> [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
> [1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/
> 
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
> Suggested-by: Christian König <christian.koenig@amd.com>
> Cc: Guchun Chen <guchun.chen@amd.com>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..faff4a3f96e6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>  		if (!ring || !ring->fence_drv.initialized)
>  			continue;
>  
> -		if (!ring->no_scheduler)
> +		/*
> +		 * Notice we check for sched.ops since there's some
> +		 * override on the meaning of sched.ready by amdgpu.
> +		 * The natural check would be sched.ready, which is
> +		 * set as drm_sched_init() finishes...
> +		 */
> +		if (ring->sched.ops)
>  			drm_sched_fini(&ring->sched);
>  
>  		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)

-- 
Regards,
Luben


  reply	other threads:[~2023-02-02 16:12 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-02 13:48 [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Guilherme G. Piccoli
2023-02-02 16:12 ` Luben Tuikov [this message]
2023-02-02 16:53   ` Guilherme G. Piccoli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=f7805d0a-6fbd-7389-a2fd-d6343ef4598b@amd.com \
    --to=luben.tuikov@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=christian.koenig@amd.com \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=gpiccoli@igalia.com \
    --cc=guchun.chen@amd.com \
    --cc=kernel-dev@igalia.com \
    --cc=kernel@gpiccoli.net \
    --cc=mario.limonciello@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.