All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Christian König" <christian.koenig@amd.com>
To: "Guilherme G. Piccoli" <gpiccoli@igalia.com>,
	amd-gfx@lists.freedesktop.org
Cc: kernel@gpiccoli.net, Guchun Chen <guchun.chen@amd.com>,
	Xinhui.Pan@amd.com, dri-devel@lists.freedesktop.org,
	Luben Tuikov <luben.tuikov@amd.com>,
	Mario Limonciello <mario.limonciello@amd.com>,
	kernel-dev@igalia.com, alexander.deucher@amd.com
Subject: Re: [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Date: Thu, 2 Feb 2023 12:58:15 +0100	[thread overview]
Message-ID: <cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com> (raw)
In-Reply-To: <20230201164814.1353383-1-gpiccoli@igalia.com>

Am 01.02.23 um 17:48 schrieb Guilherme G. Piccoli:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
>
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
>
> amdgpu: probe of 0000:04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0000000000000090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>   <TASK>
>   amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>   amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>   amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>   devm_drm_dev_init_release+0x49/0x70
>   [...]
>
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
>
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
> Christian's suggestion [0].
>
> [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
>
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
> Suggested-by: Christian König <christian.koenig@amd.com>
> Cc: Guchun Chen <guchun.chen@amd.com>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>   drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
>   1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..3b962cb680a6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>   		if (!ring || !ring->fence_drv.initialized)
>   			continue;
>   
> -		if (!ring->no_scheduler)
> +		/*
> +		 * Notice we check for sched.ops since there's some
> +		 * override on the meaning of sched.ready by amdgpu.
> +		 * The natural check would be sched.ready, which is
> +		 * set as drm_sched_init() finishes...
> +		 */
> +		if (!ring->no_scheduler && ring->sched.ops)
>   			drm_sched_fini(&ring->sched);

I think we should drop the check for no_scheduler here and just call 
drm_sched_fini() when the scheduler instance was initialized before.

Background is that I've found a couple of places where we potentially 
set no_scheduler to false after the scheduler was already initialized.

This is then probably leading to a memory leak or worth.

Regards,
Christian.

>   
>   		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)


  parent reply	other threads:[~2023-02-02 11:58 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-01 16:48 [PATCH v2] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Guilherme G. Piccoli
2023-02-01 17:02 ` Luben Tuikov
2023-02-02  1:34 ` Chen, Guchun
2023-02-02 11:58 ` Christian König [this message]
2023-02-02 13:52   ` Guilherme G. Piccoli

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com \
    --to=christian.koenig@amd.com \
    --cc=Xinhui.Pan@amd.com \
    --cc=alexander.deucher@amd.com \
    --cc=amd-gfx@lists.freedesktop.org \
    --cc=dri-devel@lists.freedesktop.org \
    --cc=gpiccoli@igalia.com \
    --cc=guchun.chen@amd.com \
    --cc=kernel-dev@igalia.com \
    --cc=kernel@gpiccoli.net \
    --cc=luben.tuikov@amd.com \
    --cc=mario.limonciello@amd.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.