All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
@ 2023-02-02 13:48 Guilherme G. Piccoli
  2023-02-02 16:12 ` Luben Tuikov
  0 siblings, 1 reply; 3+ messages in thread
From: Guilherme G. Piccoli @ 2023-02-02 13:48 UTC (permalink / raw)
  To: amd-gfx
  Cc: kernel, Guchun Chen, Xinhui.Pan, dri-devel, Guilherme G. Piccoli,
	Luben Tuikov, Mario Limonciello, kernel-dev, alexander.deucher,
	christian.koenig

Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
routine - such function is expected to be called only after the
respective init function - drm_sched_init() - was executed successfully.

Happens that we faced a driver probe failure in the Steam Deck
recently, and the function drm_sched_fini() was called even without
its counter-part had been previously called, causing the following oops:

amdgpu: probe of 0000:04:00.0 failed with error -110
BUG: kernel NULL pointer dereference, address: 0000000000000090
PGD 0 P4D 0
Oops: 0002 [#1] PREEMPT SMP NOPTI
CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
[...]
Call Trace:
 <TASK>
 amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
 amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
 amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
 devm_drm_dev_init_release+0x49/0x70
 [...]

To prevent that, check if the drm_sched was properly initialized for a
given ring before calling its fini counter-part.

Notice ideally we'd use sched.ready for that; such field is set as the latest
thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
field - in the above oops for example, it was a GFX ring causing the crash, and
the sched.ready field was set to true in the ring init routine, regardless of
the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
Christian's suggestion [0], and also removed the no_scheduler check [1].

[0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
[1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/

Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
Suggested-by: Christian König <christian.koenig@amd.com>
Cc: Guchun Chen <guchun.chen@amd.com>
Cc: Luben Tuikov <luben.tuikov@amd.com>
Cc: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 00444203220d..faff4a3f96e6 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
 		if (!ring || !ring->fence_drv.initialized)
 			continue;
 
-		if (!ring->no_scheduler)
+		/*
+		 * Notice we check for sched.ops since there's some
+		 * override on the meaning of sched.ready by amdgpu.
+		 * The natural check would be sched.ready, which is
+		 * set as drm_sched_init() finishes...
+		 */
+		if (ring->sched.ops)
 			drm_sched_fini(&ring->sched);
 
 		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)
-- 
2.39.0


^ permalink raw reply related	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
  2023-02-02 13:48 [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Guilherme G. Piccoli
@ 2023-02-02 16:12 ` Luben Tuikov
  2023-02-02 16:53   ` Guilherme G. Piccoli
  0 siblings, 1 reply; 3+ messages in thread
From: Luben Tuikov @ 2023-02-02 16:12 UTC (permalink / raw)
  To: Guilherme G. Piccoli, amd-gfx
  Cc: kernel, Guchun Chen, Xinhui.Pan, dri-devel, Mario Limonciello,
	kernel-dev, alexander.deucher, christian.koenig

Hi Guilherme,

Thanks for redoing to a v3. This patch is:

Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>

Regards,
Luben

On 2023-02-02 08:48, Guilherme G. Piccoli wrote:
> Currently amdgpu calls drm_sched_fini() from the fence driver sw fini
> routine - such function is expected to be called only after the
> respective init function - drm_sched_init() - was executed successfully.
> 
> Happens that we faced a driver probe failure in the Steam Deck
> recently, and the function drm_sched_fini() was called even without
> its counter-part had been previously called, causing the following oops:
> 
> amdgpu: probe of 0000:04:00.0 failed with error -110
> BUG: kernel NULL pointer dereference, address: 0000000000000090
> PGD 0 P4D 0
> Oops: 0002 [#1] PREEMPT SMP NOPTI
> CPU: 0 PID: 609 Comm: systemd-udevd Not tainted 6.2.0-rc3-gpiccoli #338
> Hardware name: Valve Jupiter/Jupiter, BIOS F7A0113 11/04/2022
> RIP: 0010:drm_sched_fini+0x84/0xa0 [gpu_sched]
> [...]
> Call Trace:
>  <TASK>
>  amdgpu_fence_driver_sw_fini+0xc8/0xd0 [amdgpu]
>  amdgpu_device_fini_sw+0x2b/0x3b0 [amdgpu]
>  amdgpu_driver_release_kms+0x16/0x30 [amdgpu]
>  devm_drm_dev_init_release+0x49/0x70
>  [...]
> 
> To prevent that, check if the drm_sched was properly initialized for a
> given ring before calling its fini counter-part.
> 
> Notice ideally we'd use sched.ready for that; such field is set as the latest
> thing on drm_sched_init(). But amdgpu seems to "override" the meaning of such
> field - in the above oops for example, it was a GFX ring causing the crash, and
> the sched.ready field was set to true in the ring init routine, regardless of
> the state of the DRM scheduler. Hence, we ended-up using sched.ops as per
> Christian's suggestion [0], and also removed the no_scheduler check [1].
> 
> [0] https://lore.kernel.org/amd-gfx/984ee981-2906-0eaf-ccec-9f80975cb136@amd.com/
> [1] https://lore.kernel.org/amd-gfx/cd0e2994-f85f-d837-609f-7056d5fb7231@amd.com/
> 
> Fixes: 067f44c8b459 ("drm/amdgpu: avoid over-handle of fence driver fini in s3 test (v2)")
> Suggested-by: Christian König <christian.koenig@amd.com>
> Cc: Guchun Chen <guchun.chen@amd.com>
> Cc: Luben Tuikov <luben.tuikov@amd.com>
> Cc: Mario Limonciello <mario.limonciello@amd.com>
> Signed-off-by: Guilherme G. Piccoli <gpiccoli@igalia.com>
> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 00444203220d..faff4a3f96e6 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -618,7 +618,13 @@ void amdgpu_fence_driver_sw_fini(struct amdgpu_device *adev)
>  		if (!ring || !ring->fence_drv.initialized)
>  			continue;
>  
> -		if (!ring->no_scheduler)
> +		/*
> +		 * Notice we check for sched.ops since there's some
> +		 * override on the meaning of sched.ready by amdgpu.
> +		 * The natural check would be sched.ready, which is
> +		 * set as drm_sched_init() finishes...
> +		 */
> +		if (ring->sched.ops)
>  			drm_sched_fini(&ring->sched);
>  
>  		for (j = 0; j <= ring->fence_drv.num_fences_mask; ++j)

-- 
Regards,
Luben


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini
  2023-02-02 16:12 ` Luben Tuikov
@ 2023-02-02 16:53   ` Guilherme G. Piccoli
  0 siblings, 0 replies; 3+ messages in thread
From: Guilherme G. Piccoli @ 2023-02-02 16:53 UTC (permalink / raw)
  To: Luben Tuikov, amd-gfx
  Cc: kernel, Guchun Chen, Xinhui.Pan, dri-devel, Mario Limonciello,
	kernel-dev, alexander.deucher, christian.koenig

On 02/02/2023 13:12, Luben Tuikov wrote:
> Hi Guilherme,
> 
> Thanks for redoing to a v3. This patch is:
> 
> Reviewed-by: Luben Tuikov <luben.tuikov@amd.com>
> 
> Regards,
> Luben
> 

Thank you for the reviews Luben, much appreciated!

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2023-02-02 16:53 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-02 13:48 [PATCH v3] drm/amdgpu/fence: Fix oops due to non-matching drm_sched init/fini Guilherme G. Piccoli
2023-02-02 16:12 ` Luben Tuikov
2023-02-02 16:53   ` Guilherme G. Piccoli

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.