From: "Guilherme G. Piccoli" <gpiccoli@igalia.com>
To: amd-gfx@lists.freedesktop.org
Cc: kernel@gpiccoli.net, Guchun Chen <guchun.chen@amd.com>,
Xinhui.Pan@amd.com, dri-devel@lists.freedesktop.org,
"Guilherme G. Piccoli" <gpiccoli@igalia.com>,
Luben Tuikov <luben.tuikov@amd.com>,
Mario Limonciello <mario.limonciello@amd.com>,
kernel-dev@igalia.com, alexander.deucher@amd.com,
christian.koenig@amd.com
Subject: [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
Date: Thu, 2 Feb 2023 10:48:56 -0300 [thread overview]
Message-ID: <20230202134856.1382169-1-gpiccoli@igalia.com> (raw)
Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.
Happens that we faced a driver probe failure in the Steam Deck
recently, and the function drm_sched_fini() was called even without
its counter-part had been previously called, causing the following oops:
amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
<TASK>
amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
devm_drm_dev_init_release+0x49/0x70
[...]
To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.
Notice ideally we'd use sched.ready for that; such field is set as the latest
thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
field - in the above oops for example, it was a GFX ring causing the crash, and
the sched.ready field was set to true in the ring init routine, regardless of
the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
Christian's suggestion [0], and also removed the no_scheduler check [1].
[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
[1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/
Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König <christian.koenig@amd.com>
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---
drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..faff4a3f96e6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
if (!ring || !ring->fence_drv.initialized)
continue;
- if (!ring->no_scheduler)
+ /*
+ * Notice we check for sched.ops since there's some
+ * override on the meaning of sched.ready by amdgpu.
+ * The natural check would be sched.ready, which is
+ * set as drm_sched_init() finishes...
+ */
+ if (ring->sched.ops)
drm_sched_fini(&ring->sched);
for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
--
2.39.0
next reply other threads:[~2023-02-02 13:49 UTC|newest]
Thread overview: 3+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-02 13:48 Guilherme G. Piccoli [this message]
2023-02-02 16:12 ` [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Luben Tuikov
2023-02-02 16:53 ` Guilherme G. Piccoli
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230202134856.1382169-1-gpiccoli@igalia.com \
--to=gpiccoli@igalia.com \
--cc=Xinhui.Pan@amd.com \
--cc=alexander.deucher@amd.com \
--cc=amd-gfx@lists.freedesktop.org \
--cc=christian.koenig@amd.com \
--cc=dri-devel@lists.freedesktop.org \
--cc=guchun.chen@amd.com \
--cc=kernel-dev@igalia.com \
--cc=kernel@gpiccoli.net \
--cc=luben.tuikov@amd.com \
--cc=mario.limonciello@amd.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).