dri-devel.lists.freedesktop.org archive mirror
* [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery
@ 2024-01-17  3:12 Erico Nunes
  2024-01-17  3:12 ` [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts Erico Nunes
                   ` (6 more replies)
  0 siblings, 7 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

There have been reports from users for a long time about applications
hitting random "*ERROR* lima job timeout" and the application or GPU
becoming unresponsive.

This series addresses a number of related bugs, especially in the pp
timeout recovery path.

Patch 1 fixes a stack trace initially featured in a user report here:
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415

Patch 2 fixes a "pp reset time out" message which was fairly confusing
at first. Debugging with an application which just submits a single
job at a time made it clear that the timeout was actually reported against
the application that runs next, even if that is a well-behaved application.

Patch 3 is one of the most important fixes and stops the random "mmu command
2 timeout" ppmmu timeouts that I hit while running my tests. Initially I
thought they came from some race condition on running/resetting jobs, but
they were actually caused by hard resetting live tasks.
After this change there was never an mmu timeout anymore (even after
dropping the guilty flag in the upcoming patch, which by itself was
actually the easiest reproducer of the ppmmu timeouts).

Patch 4 might be debatable, but hopefully we can agree to go with it
since it was already discussed in Panfrost here:
https://patchwork.freedesktop.org/series/120820/
It is actually not that hard to reproduce both "spurious timeout" and
"unexpectedly high interrupt latency" by creating an irq handler which takes
longer than usual to execute, and paired with the issues in timeout
recovery this could cause an unrelated application to hang.

Patch 5 removes the guilty drm_sched context handling in lima.
Users often report issues with a user-visible effect of "application
flipping the last 2 rendered frames":
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8410
This was ultimately caused by lima setting the guilty flag on a
context, which causes all further rendering jobs from that context to be
dropped.
Without the fixes from patches 1-4 it was not possible to just drop the
guilty flag as that could trigger the mentioned issues and still hang
the GPU by attempting a recovery.
After the fixes above it seems to be reliable and survives the fairly
torturous tests mentioned below.
Other similar GPU drivers like panfrost, v3d don't make use of the
guilty context flag. So I think we can drop it for lima too.

Patch 6 is just some debug message cleanup.


Some of the tests which I ran with this patchset:

- Application which renders normal frames, then frames which /barely/
time out, then frames which legitimately time out, then frames which time
out long enough to trigger the hardware watchdog irq, then goes back to
rendering normal frames. This used to hang the application at first but now
survives the entire process repeated indefinitely.

- High irq latency from an unrelated source while rendering. I put an
mdelay() in the touchscreen driver irq handler (a minimal sketch of this is
shown after this list) and could trigger all of "spurious timeout",
"unexpectedly high interrupt latency" and actual timeouts.
Those are now all reported to dmesg for information and I am no longer
able to cause an unrelated application to hang.

- Game running with lower configured drm_sched timeout (locally set to
200ms) with other applications triggering timeouts in the background.
Different applications trigger resets independently but none of them
cause the GPU to hang or the game to stop working.
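
The irq latency injection mentioned in the second test above amounts to
dropping an artificial delay into an unrelated interrupt handler. A minimal
sketch of the idea (the handler name and delay value are placeholders, not
actual driver code):

#include <linux/delay.h>
#include <linux/interrupt.h>

/* illustration only: inflate the latency of an unrelated interrupt to
 * provoke the "spurious timeout" / "high interrupt latency" paths
 */
static irqreturn_t touchscreen_irq_handler(int irq, void *dev_id)
{
	/* busy-wait longer than the (locally reduced) drm_sched timeout */
	mdelay(250);

	/* ... original handler body would follow here ... */
	return IRQ_HANDLED;
}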


Erico Nunes (6):
  drm/lima: fix devfreq refcount imbalance for job timeouts
  drm/lima: reset async_reset on pp hard reset
  drm/lima: set bus_stop bit before hard reset
  drm/lima: handle spurious timeouts due to high irq latency
  drm/lima: remove guilty drm_sched context handling
  drm/lima: improve some pp debug messages

 drivers/gpu/drm/lima/lima_ctx.c   |  2 +-
 drivers/gpu/drm/lima/lima_ctx.h   |  1 -
 drivers/gpu/drm/lima/lima_pp.c    | 26 +++++++++++++++++---
 drivers/gpu/drm/lima/lima_sched.c | 40 ++++++++++++++++++++++++-------
 drivers/gpu/drm/lima/lima_sched.h |  5 ++--
 5 files changed, 58 insertions(+), 16 deletions(-)

-- 
2.43.0



* [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
@ 2024-01-17  3:12 ` Erico Nunes
  2024-01-17 18:13   ` Vasily Khoruzhick
  2024-01-18  1:36   ` Qiang Yu
  2024-01-17  3:12 ` [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset Erico Nunes
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

In case a task manages to complete but it took just long enough to also
trigger the timeout handler, the current code results in a refcount
imbalance on lima_pm_idle.

While this can be a rare occurrence, when it happens it may fill user
logs with stack traces such as:

[10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
...
[10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
...
[10136.669628] Call trace:
[10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
[10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
[10136.669656]  lima_gp_irq_handler+0xa8/0x120
[10136.669666]  __handle_irq_event_percpu+0x48/0x160
[10136.669679]  handle_irq_event+0x4c/0xc0

The imbalance happens because lima_sched_pipe_task_done() already calls
lima_pm_idle for this case if there was no error.
Check the error flag in the timeout handler to ensure we can never run
into this case.
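
For illustration, the two paths that can both reach lima_pm_idle() for the
same task look roughly like this (simplified sketch, not the exact driver
code):

	/* irq path: the task completed, lima_sched_pipe_task_done() runs */
	if (!pipe->error)
		lima_pm_idle(ldev);	/* idle recorded once here ... */

	/* drm_sched path: the same task also hit the software timeout */
	lima_pm_idle(ldev);		/* ... and, without this patch, again here */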

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_sched.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index c3bf8cda8498..66317296d831 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -427,7 +427,8 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
 	pipe->current_vm = NULL;
 	pipe->current_task = NULL;
 
-	lima_pm_idle(ldev);
+	if (pipe->error)
+		lima_pm_idle(ldev);
 
 	drm_sched_resubmit_jobs(&pipe->base);
 	drm_sched_start(&pipe->base, true);
-- 
2.43.0



* [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
  2024-01-17  3:12 ` [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts Erico Nunes
@ 2024-01-17  3:12 ` Erico Nunes
  2024-01-17 18:17   ` Vasily Khoruzhick
  2024-01-18  1:56   ` Qiang Yu
  2024-01-17  3:12 ` [PATCH v1 3/6] drm/lima: set bus_stop bit before " Erico Nunes
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

Lima pp jobs use an async reset to avoid having to wait for the soft
reset right after a job. The soft reset is done at the end of a job and
a reset_complete flag is expected to be set at the next job.
However, in case the user runs into a job timeout from any application,
a hard reset is issued to the hardware. This hard reset clears the
reset_complete flag, which causes an error message to show up before the
next job.
This is probably harmless for the execution but can be very confusing to
debug, as it blames a reset timeout on the next application to submit a
job.
Reset the async_reset flag when doing the hard reset so that we don't
get that message.
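
For illustration, the flow is roughly the following (simplified sketch; the
function names are from lima_pp.c but the details are abbreviated):

	/* end of job N: queue a soft reset, don't wait for it yet */
	lima_pp_soft_reset_async(ip);		/* ip->data.async_reset = true */

	/* a job timeout in between issues lima_pp_hard_reset(ip), which
	 * clears the hardware reset-complete state but, without this patch,
	 * leaves async_reset set
	 */

	/* start of job N+1: wait for the soft reset queued by job N */
	if (ip->data.async_reset)
		lima_pp_soft_reset_async_wait(ip);	/* times out, prints
							 * "pp reset time out" */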

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_pp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index a5c95bed08c0..a8f8f63b8295 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -191,6 +191,13 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
 	pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0);
 	pp_write(LIMA_PP_INT_CLEAR, LIMA_PP_IRQ_MASK_ALL);
 	pp_write(LIMA_PP_INT_MASK, LIMA_PP_IRQ_MASK_USED);
+
+	/*
+	 * if there was an async soft reset queued,
+	 * don't wait for it in the next job
+	 */
+	ip->data.async_reset = false;
+
 	return 0;
 }
 
-- 
2.43.0



* [PATCH v1 3/6] drm/lima: set bus_stop bit before hard reset
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
  2024-01-17  3:12 ` [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts Erico Nunes
  2024-01-17  3:12 ` [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset Erico Nunes
@ 2024-01-17  3:12 ` Erico Nunes
  2024-01-17 18:23   ` Vasily Khoruzhick
  2024-01-18  2:01   ` Qiang Yu
  2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_pp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index a8f8f63b8295..ac097dd75072 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -168,6 +168,11 @@ static void lima_pp_write_frame(struct lima_ip *ip, u32 *frame, u32 *wb)
 	}
 }
 
+static int lima_pp_bus_stop_poll(struct lima_ip *ip)
+{
+	return !!(pp_read(LIMA_PP_STATUS) & LIMA_PP_STATUS_BUS_STOPPED);
+}
+
 static int lima_pp_hard_reset_poll(struct lima_ip *ip)
 {
 	pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC01A0000);
@@ -181,6 +186,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
 
 	pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC0FFE000);
 	pp_write(LIMA_PP_INT_MASK, 0);
+
+	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
+	ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
+	if (ret) {
+		dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
+		return ret;
+	}
+
 	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
 	ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
 	if (ret) {
-- 
2.43.0



* [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (2 preceding siblings ...)
  2024-01-17  3:12 ` [PATCH v1 3/6] drm/lima: set bus_stop bit before " Erico Nunes
@ 2024-01-17  3:12 ` Erico Nunes
  2024-01-17 18:26   ` Vasily Khoruzhick
                     ` (3 more replies)
  2024-01-17  3:12 ` [PATCH v1 5/6] drm/lima: remove guilty drm_sched context handling Erico Nunes
                   ` (2 subsequent siblings)
  6 siblings, 4 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

There are several unexplained and unreproduced cases of rendering
timeouts with lima, for which one theory is high IRQ latency coming from
somewhere else in the system.
This kind of occurrence may cause applications to trigger unnecessary
resets of the GPU, or even to hang if they hit an issue in
the recovery path.
Panfrost already does some special handling to account for such
"spurious timeouts"; it makes sense to have this in lima too to reduce
the chance that they hit users.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_sched.c | 32 ++++++++++++++++++++++++++-----
 drivers/gpu/drm/lima/lima_sched.h |  2 ++
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 66317296d831..9449b81bcd5b 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR MIT
 /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
 
+#include <linux/hardirq.h>
 #include <linux/iosys-map.h>
 #include <linux/kthread.h>
 #include <linux/slab.h>
@@ -223,10 +224,7 @@ static struct dma_fence *lima_sched_run_job(struct drm_sched_job *job)
 
 	task->fence = &fence->base;
 
-	/* for caller usage of the fence, otherwise irq handler
-	 * may consume the fence before caller use it
-	 */
-	dma_fence_get(task->fence);
+	task->done_fence = dma_fence_get(task->fence);
 
 	pipe->current_task = task;
 
@@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
 	struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
 	struct lima_sched_task *task = to_lima_task(job);
 	struct lima_device *ldev = pipe->ldev;
+	struct lima_ip *ip = pipe->processor[0];
+
+	/*
+	 * If the GPU managed to complete this jobs fence, the timeout is
+	 * spurious. Bail out.
+	 */
+	if (dma_fence_is_signaled(task->done_fence)) {
+		DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
+		return DRM_GPU_SCHED_STAT_NOMINAL;
+	}
+
+	/*
+	 * Lima IRQ handler may take a long time to process an interrupt
+	 * if there is another IRQ handler hogging the processing.
+	 * In order to catch such cases and not report spurious Lima job
+	 * timeouts, synchronize the IRQ handler and re-check the fence
+	 * status.
+	 */
+	synchronize_irq(ip->irq);
+
+	if (dma_fence_is_signaled(task->done_fence)) {
+		DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
+		return DRM_GPU_SCHED_STAT_NOMINAL;
+	}
 
 	if (!pipe->error)
-		DRM_ERROR("lima job timeout\n");
+		DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
 
 	drm_sched_stop(&pipe->base, &task->base);
 
diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
index 6a11764d87b3..34050facb110 100644
--- a/drivers/gpu/drm/lima/lima_sched.h
+++ b/drivers/gpu/drm/lima/lima_sched.h
@@ -29,6 +29,8 @@ struct lima_sched_task {
 	bool recoverable;
 	struct lima_bo *heap;
 
+	struct dma_fence *done_fence;
+
 	/* pipe fence */
 	struct dma_fence *fence;
 };
-- 
2.43.0



* [PATCH v1 5/6] drm/lima: remove guilty drm_sched context handling
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (3 preceding siblings ...)
  2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
@ 2024-01-17  3:12 ` Erico Nunes
  2024-01-17 18:28   ` Vasily Khoruzhick
  2024-01-17  3:12 ` [PATCH v1 6/6] drm/lima: improve some pp debug messages Erico Nunes
  2024-01-17  7:22 ` [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Christian König
  6 siblings, 1 reply; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

Marking the context as guilty currently only causes the application which
hits a single timeout problem to stop its rendering context entirely.
All jobs submitted later are dropped from the guilty context.

Lima runs on fairly underpowered hardware by modern standards and it is
not entirely unreasonable that a rendering job may time out occasionally
due to high system load or an overly demanding application stack. In this
case it would generally be preferred to report the error but try to keep
the application going.

Other similar embedded GPU drivers don't make use of the guilty context
flag. Now that there are reliability improvements to the lima timeout
recovery handling, drop the guilty context handling to let the application
keep running in this case.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_ctx.c   | 2 +-
 drivers/gpu/drm/lima/lima_ctx.h   | 1 -
 drivers/gpu/drm/lima/lima_sched.c | 5 ++---
 drivers/gpu/drm/lima/lima_sched.h | 3 +--
 4 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
index 8389f2d7d021..0e668fc1e0f9 100644
--- a/drivers/gpu/drm/lima/lima_ctx.c
+++ b/drivers/gpu/drm/lima/lima_ctx.c
@@ -19,7 +19,7 @@ int lima_ctx_create(struct lima_device *dev, struct lima_ctx_mgr *mgr, u32 *id)
 	kref_init(&ctx->refcnt);
 
 	for (i = 0; i < lima_pipe_num; i++) {
-		err = lima_sched_context_init(dev->pipe + i, ctx->context + i, &ctx->guilty);
+		err = lima_sched_context_init(dev->pipe + i, ctx->context + i);
 		if (err)
 			goto err_out0;
 	}
diff --git a/drivers/gpu/drm/lima/lima_ctx.h b/drivers/gpu/drm/lima/lima_ctx.h
index 74e2be09090f..5b1063ce968b 100644
--- a/drivers/gpu/drm/lima/lima_ctx.h
+++ b/drivers/gpu/drm/lima/lima_ctx.h
@@ -13,7 +13,6 @@ struct lima_ctx {
 	struct kref refcnt;
 	struct lima_device *dev;
 	struct lima_sched_context context[lima_pipe_num];
-	atomic_t guilty;
 
 	/* debug info */
 	char pname[TASK_COMM_LEN];
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 9449b81bcd5b..496c79713fe8 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -154,13 +154,12 @@ void lima_sched_task_fini(struct lima_sched_task *task)
 }
 
 int lima_sched_context_init(struct lima_sched_pipe *pipe,
-			    struct lima_sched_context *context,
-			    atomic_t *guilty)
+			    struct lima_sched_context *context)
 {
 	struct drm_gpu_scheduler *sched = &pipe->base;
 
 	return drm_sched_entity_init(&context->base, DRM_SCHED_PRIORITY_NORMAL,
-				     &sched, 1, guilty);
+				     &sched, 1, NULL);
 }
 
 void lima_sched_context_fini(struct lima_sched_pipe *pipe,
diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
index 34050facb110..677e908b53f8 100644
--- a/drivers/gpu/drm/lima/lima_sched.h
+++ b/drivers/gpu/drm/lima/lima_sched.h
@@ -93,8 +93,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
 void lima_sched_task_fini(struct lima_sched_task *task);
 
 int lima_sched_context_init(struct lima_sched_pipe *pipe,
-			    struct lima_sched_context *context,
-			    atomic_t *guilty);
+			    struct lima_sched_context *context);
 void lima_sched_context_fini(struct lima_sched_pipe *pipe,
 			     struct lima_sched_context *context);
 struct dma_fence *lima_sched_context_queue_task(struct lima_sched_task *task);
-- 
2.43.0



* [PATCH v1 6/6] drm/lima: improve some pp debug messages
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (4 preceding siblings ...)
  2024-01-17  3:12 ` [PATCH v1 5/6] drm/lima: remove guilty drm_sched context handling Erico Nunes
@ 2024-01-17  3:12 ` Erico Nunes
  2024-01-17 18:29   ` Vasily Khoruzhick
  2024-01-17  7:22 ` [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Christian König
  6 siblings, 1 reply; 30+ messages in thread
From: Erico Nunes @ 2024-01-17  3:12 UTC (permalink / raw)
  To: Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, christian.koenig,
	anarsoul, Daniel Vetter, David Airlie, Sumit Semwal, Erico Nunes

Make the messages more consistent by showing the pp name.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_pp.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index ac097dd75072..d9e178d6645d 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -197,7 +197,7 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
 	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
 	ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
 	if (ret) {
-		dev_err(dev->dev, "pp hard reset timeout\n");
+		dev_err(dev->dev, "pp %s hard reset timeout\n", lima_ip_name(ip));
 		return ret;
 	}
 
@@ -423,8 +423,8 @@ static void lima_pp_task_error(struct lima_sched_pipe *pipe)
 	for (i = 0; i < pipe->num_processor; i++) {
 		struct lima_ip *ip = pipe->processor[i];
 
-		dev_err(ip->dev->dev, "pp task error %d int_state=%x status=%x\n",
-			i, pp_read(LIMA_PP_INT_STATUS), pp_read(LIMA_PP_STATUS));
+		dev_err(ip->dev->dev, "pp %s task error int_state=%x status=%x\n",
+			lima_ip_name(ip), pp_read(LIMA_PP_INT_STATUS), pp_read(LIMA_PP_STATUS));
 
 		lima_pp_hard_reset(ip);
 	}
-- 
2.43.0



* Re: [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery
  2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (5 preceding siblings ...)
  2024-01-17  3:12 ` [PATCH v1 6/6] drm/lima: improve some pp debug messages Erico Nunes
@ 2024-01-17  7:22 ` Christian König
  6 siblings, 0 replies; 30+ messages in thread
From: Christian König @ 2024-01-17  7:22 UTC (permalink / raw)
  To: Erico Nunes, Qiang Yu, dri-devel, lima
  Cc: Thomas Zimmermann, linux-kernel, Maxime Ripard, anarsoul,
	Daniel Vetter, David Airlie, Sumit Semwal

Am 17.01.24 um 04:12 schrieb Erico Nunes:
> There have been reports from users for a long time about applications
> hitting random "*ERROR* lima job timeout" and the application or GPU
> becoming unresponsive.
>
> This series addresses a number of related bugs, especially in the pp
> timeout recovery path.
>
> Patch 1 fixes a stack trace initially featured in a user report here:
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415
>
> Patch 2 fixes a "pp reset time out" message which was fairly confusing
> at first. Debugging with an application which just submits one single
> job at a time made it clear that the timeout actually was reported in
> the application that runs next, even if that is a good application.
>
> Patch 3 is one of the most important fixes and stops random "mmu command
> 2 timeout" ppmmu timeouts that I hit while running my tests. Initially I
> thought it came from some race condition on running/resetting jobs, but
> it was actually due to hard resetting live tasks.
> After setting this there was never a mmu timeout anymore (even after
> dropping the guilty flag in the upcoming patch, which by itself was
> actually the easiest reproducer of the ppmmu timeouts).
>
> Patch 4 might be debatable, but hopefully we can agree to go with it
> since it was already discussed in Panfrost here:
> https://patchwork.freedesktop.org/series/120820/
> It is actually not that hard to reproduce both "spurious timeout" and
> "unexpectedly high interrupt latency" by creating an irq which takes a
> longer than usual to execute, and paired with the issues in timeout
> recovery this could cause an unrelated application to hang.
>
> Patch 5 removes the guilty drm_sched from context handling in lima.
> Often users report issues with a user-visible effect of "application
> flipping the last 2 rendered frames":
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/8410
> This was ultimately caused because lima sets the guilty flag to a
> context which causes all further rendering jobs from that context to be
> dropped.
> Without the fixes from patches 1-4 it was not possible to just drop the
> guilty flag as that could trigger the mentioned issues and still hang
> the GPU by attempting a recovery.
> After the fixes above it seems to be reliable and survives some fairly
> torturing tests mentioned below.
> Other similar GPU drivers like panfrost, v3d don't make use of the
> guilty context flag. So I think we can drop it for lima too.

Oh, yes please! Thanks so much for that.

I'm working on removing the guilty handling in amdgpu as well and 
together this will most likely allow us to completely remove the guilty 
handling from the scheduler.

In general drivers should use the dma_fence error handling instead. That
one is well defined and way more reliable.
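
For illustration, fence-error based reporting amounts to something like the
following sketch (generic dma_fence usage, not lima's current code):

	/* record the failure on the job's hardware fence before signaling
	 * it, so waiters see the error instead of a per-context guilty flag
	 */
	dma_fence_set_error(task->fence, -ETIMEDOUT);
	dma_fence_signal(task->fence);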

Feel free to add my acked-by to patch 5.

Thanks,
Christian.

>
> Patch 6 is just some debug message cleanup.
>
>
> Some of the tests which I ran with this patchset:
>
> - Application which renders normal frames, then frames which /barely/
> timeout, then frames which legitimely timeout, then frames which timeout
> enough to trigger the hardware watchdog irq, then goes back to render
> normal frames. This used to hang the application at first but now
> survives the entire process repeated indefinitely.
>
> - High irq latency from an unrelated source while rendering. I put a
> mdelay() in the touchscreen driver irq and could cause all "spurious
> timeout", "unexpectedly high interrupt latency" and actual timeouts.
> Those are now all reported to dmesg for information and I am no longer
> able to cause an unrelated application to hang.
>
> - Game running with lower configured drm_sched timeout (locally set to
> 200ms) with other applications triggering timeouts in the background.
> Different applications trigger resets independently but none of them
> cause the GPU to hang or the game to stop working.
>
>
> Erico Nunes (6):
>    drm/lima: fix devfreq refcount imbalance for job timeouts
>    drm/lima: reset async_reset on pp hard reset
>    drm/lima: set bus_stop bit before hard reset
>    drm/lima: handle spurious timeouts due to high irq latency
>    drm/lima: remove guilty drm_sched context handling
>    drm/lima: improve some pp debug messages
>
>   drivers/gpu/drm/lima/lima_ctx.c   |  2 +-
>   drivers/gpu/drm/lima/lima_ctx.h   |  1 -
>   drivers/gpu/drm/lima/lima_pp.c    | 26 +++++++++++++++++---
>   drivers/gpu/drm/lima/lima_sched.c | 40 ++++++++++++++++++++++++-------
>   drivers/gpu/drm/lima/lima_sched.h |  5 ++--
>   5 files changed, 58 insertions(+), 16 deletions(-)
>



* Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-17  3:12 ` [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts Erico Nunes
@ 2024-01-17 18:13   ` Vasily Khoruzhick
  2024-01-18  1:36   ` Qiang Yu
  1 sibling, 0 replies; 30+ messages in thread
From: Vasily Khoruzhick @ 2024-01-17 18:13 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, dri-devel, linux-kernel, Maxime Ripard,
	christian.koenig, Qiang Yu, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Tue, Jan 16, 2024 at 7:12 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> In case a task manages to complete but it took just long enough to also
> trigger the timeout handler, the current code results in a refcount
> imbalance on lima_pm_idle.
>
> While this can be a rare occurrence, when it happens it may fill user
> logs with stack traces such as:
>
> [10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669628] Call trace:
> [10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
> [10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
> [10136.669656]  lima_gp_irq_handler+0xa8/0x120
> [10136.669666]  __handle_irq_event_percpu+0x48/0x160
> [10136.669679]  handle_irq_event+0x4c/0xc0
>
> The imbalance happens because lima_sched_pipe_task_done() already calls
> lima_pm_idle for this case if there was no error.
> Check the error flag in the timeout handler to ensure we can never run
> into this case.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>

> ---
>  drivers/gpu/drm/lima/lima_sched.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index c3bf8cda8498..66317296d831 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -427,7 +427,8 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         pipe->current_vm = NULL;
>         pipe->current_task = NULL;
>
> -       lima_pm_idle(ldev);
> +       if (pipe->error)
> +               lima_pm_idle(ldev);
>
>         drm_sched_resubmit_jobs(&pipe->base);
>         drm_sched_start(&pipe->base, true);
> --
> 2.43.0
>


* Re: [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset
  2024-01-17  3:12 ` [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset Erico Nunes
@ 2024-01-17 18:17   ` Vasily Khoruzhick
  2024-01-18  1:56   ` Qiang Yu
  1 sibling, 0 replies; 30+ messages in thread
From: Vasily Khoruzhick @ 2024-01-17 18:17 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, dri-devel, linux-kernel, Maxime Ripard,
	christian.koenig, Qiang Yu, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Tue, Jan 16, 2024 at 7:12 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> Lima pp jobs use an async reset to avoid having to wait for the soft
> reset right after a job. The soft reset is done at the end of a job and
> a reset_complete flag is expected to be set at the next job.
> However, in case the user runs into a job timeout from any application,
> a hard reset is issued to the hardware. This hard reset clears the
> reset_complete flag, which causes an error message to show up before the
> next job.
> This is probably harmless for the execution but can be very confusing to
> debug, as it blames a reset timeout on the next application to submit a
> job.
> Reset the async_reset flag when doing the hard reset so that we don't
> get that message.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>

> ---
>  drivers/gpu/drm/lima/lima_pp.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
> index a5c95bed08c0..a8f8f63b8295 100644
> --- a/drivers/gpu/drm/lima/lima_pp.c
> +++ b/drivers/gpu/drm/lima/lima_pp.c
> @@ -191,6 +191,13 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
>         pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0);
>         pp_write(LIMA_PP_INT_CLEAR, LIMA_PP_IRQ_MASK_ALL);
>         pp_write(LIMA_PP_INT_MASK, LIMA_PP_IRQ_MASK_USED);
> +
> +       /*
> +        * if there was an async soft reset queued,
> +        * don't wait for it in the next job
> +        */
> +       ip->data.async_reset = false;
> +
>         return 0;
>  }
>
> --
> 2.43.0
>


* Re: [PATCH v1 3/6] drm/lima: set bus_stop bit before hard reset
  2024-01-17  3:12 ` [PATCH v1 3/6] drm/lima: set bus_stop bit before " Erico Nunes
@ 2024-01-17 18:23   ` Vasily Khoruzhick
  2024-01-18  2:01   ` Qiang Yu
  1 sibling, 0 replies; 30+ messages in thread
From: Vasily Khoruzhick @ 2024-01-17 18:23 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, dri-devel, linux-kernel, Maxime Ripard,
	christian.koenig, Qiang Yu, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Tue, Jan 16, 2024 at 7:12 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> This is required for reliable hard resets. Otherwise, doing a hard reset
> while a task is still running (such as a task which is being stopped by
> the drm_sched timeout handler) may result in random mmu write timeouts
> or lockups which cause the entire gpu to hang.

It looks like the Mali driver is doing the same, so it totally makes sense.

> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>

> ---
>  drivers/gpu/drm/lima/lima_pp.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
> index a8f8f63b8295..ac097dd75072 100644
> --- a/drivers/gpu/drm/lima/lima_pp.c
> +++ b/drivers/gpu/drm/lima/lima_pp.c
> @@ -168,6 +168,11 @@ static void lima_pp_write_frame(struct lima_ip *ip, u32 *frame, u32 *wb)
>         }
>  }
>
> +static int lima_pp_bus_stop_poll(struct lima_ip *ip)
> +{
> +       return !!(pp_read(LIMA_PP_STATUS) & LIMA_PP_STATUS_BUS_STOPPED);
> +}
> +
>  static int lima_pp_hard_reset_poll(struct lima_ip *ip)
>  {
>         pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC01A0000);
> @@ -181,6 +186,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
>
>         pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC0FFE000);
>         pp_write(LIMA_PP_INT_MASK, 0);
> +
> +       pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
> +       ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
> +       if (ret) {
> +               dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
> +               return ret;
> +       }
> +
>         pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
>         ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
>         if (ret) {
> --
> 2.43.0
>


* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
@ 2024-01-17 18:26   ` Vasily Khoruzhick
  2024-01-18  2:46   ` Qiang Yu
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 30+ messages in thread
From: Vasily Khoruzhick @ 2024-01-17 18:26 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, dri-devel, linux-kernel, Maxime Ripard,
	christian.koenig, Qiang Yu, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Tue, Jan 16, 2024 at 7:12 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> There are several unexplained and unreproduced cases of rendering
> timeouts with lima, for which one theory is high IRQ latency coming from
> somewhere else in the system.
> This kind of occurrence may cause applications to trigger unnecessary
> resets of the GPU or even applications to hang if it hits an issue in
> the recovery path.
> Panfrost already does some special handling to account for such
> "spurious timeouts", it makes sense to have this in lima too to reduce
> the chance that it hit users.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>

> ---
>  drivers/gpu/drm/lima/lima_sched.c | 32 ++++++++++++++++++++++++++-----
>  drivers/gpu/drm/lima/lima_sched.h |  2 ++
>  2 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 66317296d831..9449b81bcd5b 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0 OR MIT
>  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
>
> +#include <linux/hardirq.h>
>  #include <linux/iosys-map.h>
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
> @@ -223,10 +224,7 @@ static struct dma_fence *lima_sched_run_job(struct drm_sched_job *job)
>
>         task->fence = &fence->base;
>
> -       /* for caller usage of the fence, otherwise irq handler
> -        * may consume the fence before caller use it
> -        */
> -       dma_fence_get(task->fence);
> +       task->done_fence = dma_fence_get(task->fence);
>
>         pipe->current_task = task;
>
> @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
>         struct lima_sched_task *task = to_lima_task(job);
>         struct lima_device *ldev = pipe->ldev;
> +       struct lima_ip *ip = pipe->processor[0];
> +
> +       /*
> +        * If the GPU managed to complete this jobs fence, the timeout is
> +        * spurious. Bail out.
> +        */
> +       if (dma_fence_is_signaled(task->done_fence)) {
> +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
> +
> +       /*
> +        * Lima IRQ handler may take a long time to process an interrupt
> +        * if there is another IRQ handler hogging the processing.
> +        * In order to catch such cases and not report spurious Lima job
> +        * timeouts, synchronize the IRQ handler and re-check the fence
> +        * status.
> +        */
> +       synchronize_irq(ip->irq);
> +
> +       if (dma_fence_is_signaled(task->done_fence)) {
> +               DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
>
>         if (!pipe->error)
> -               DRM_ERROR("lima job timeout\n");
> +               DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
>
>         drm_sched_stop(&pipe->base, &task->base);
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
> index 6a11764d87b3..34050facb110 100644
> --- a/drivers/gpu/drm/lima/lima_sched.h
> +++ b/drivers/gpu/drm/lima/lima_sched.h
> @@ -29,6 +29,8 @@ struct lima_sched_task {
>         bool recoverable;
>         struct lima_bo *heap;
>
> +       struct dma_fence *done_fence;
> +
>         /* pipe fence */
>         struct dma_fence *fence;
>  };
> --
> 2.43.0
>


* Re: [PATCH v1 5/6] drm/lima: remove guilty drm_sched context handling
  2024-01-17  3:12 ` [PATCH v1 5/6] drm/lima: remove guilty drm_sched context handling Erico Nunes
@ 2024-01-17 18:28   ` Vasily Khoruzhick
  0 siblings, 0 replies; 30+ messages in thread
From: Vasily Khoruzhick @ 2024-01-17 18:28 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, dri-devel, linux-kernel, Maxime Ripard,
	christian.koenig, Qiang Yu, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Tue, Jan 16, 2024 at 7:12 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> Marking the context as guilty currently only makes the application which
> hits a single timeout problem to stop its rendering context entirely.
> All jobs submitted later are dropped from the guilty context.
>
> Lima runs on fairly underpowered hardware for modern standards and it is
> not entirely unreasonable that a rendering job may time out occasionally
> due to high system load or too demanding application stack. In this case
> it would be generally preferred to report the error but try to keep the
> application going.
>
> Other similar embedded GPU drivers don't make use of the guilty context
> flag. Now that there are reliability improvements to the lima timeout
> recovery handling, drop the guilty contexts to let the application keep
> running in this case.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>

> ---
>  drivers/gpu/drm/lima/lima_ctx.c   | 2 +-
>  drivers/gpu/drm/lima/lima_ctx.h   | 1 -
>  drivers/gpu/drm/lima/lima_sched.c | 5 ++---
>  drivers/gpu/drm/lima/lima_sched.h | 3 +--
>  4 files changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
> index 8389f2d7d021..0e668fc1e0f9 100644
> --- a/drivers/gpu/drm/lima/lima_ctx.c
> +++ b/drivers/gpu/drm/lima/lima_ctx.c
> @@ -19,7 +19,7 @@ int lima_ctx_create(struct lima_device *dev, struct lima_ctx_mgr *mgr, u32 *id)
>         kref_init(&ctx->refcnt);
>
>         for (i = 0; i < lima_pipe_num; i++) {
> -               err = lima_sched_context_init(dev->pipe + i, ctx->context + i, &ctx->guilty);
> +               err = lima_sched_context_init(dev->pipe + i, ctx->context + i);
>                 if (err)
>                         goto err_out0;
>         }
> diff --git a/drivers/gpu/drm/lima/lima_ctx.h b/drivers/gpu/drm/lima/lima_ctx.h
> index 74e2be09090f..5b1063ce968b 100644
> --- a/drivers/gpu/drm/lima/lima_ctx.h
> +++ b/drivers/gpu/drm/lima/lima_ctx.h
> @@ -13,7 +13,6 @@ struct lima_ctx {
>         struct kref refcnt;
>         struct lima_device *dev;
>         struct lima_sched_context context[lima_pipe_num];
> -       atomic_t guilty;
>
>         /* debug info */
>         char pname[TASK_COMM_LEN];
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 9449b81bcd5b..496c79713fe8 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -154,13 +154,12 @@ void lima_sched_task_fini(struct lima_sched_task *task)
>  }
>
>  int lima_sched_context_init(struct lima_sched_pipe *pipe,
> -                           struct lima_sched_context *context,
> -                           atomic_t *guilty)
> +                           struct lima_sched_context *context)
>  {
>         struct drm_gpu_scheduler *sched = &pipe->base;
>
>         return drm_sched_entity_init(&context->base, DRM_SCHED_PRIORITY_NORMAL,
> -                                    &sched, 1, guilty);
> +                                    &sched, 1, NULL);
>  }
>
>  void lima_sched_context_fini(struct lima_sched_pipe *pipe,
> diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
> index 34050facb110..677e908b53f8 100644
> --- a/drivers/gpu/drm/lima/lima_sched.h
> +++ b/drivers/gpu/drm/lima/lima_sched.h
> @@ -93,8 +93,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
>  void lima_sched_task_fini(struct lima_sched_task *task);
>
>  int lima_sched_context_init(struct lima_sched_pipe *pipe,
> -                           struct lima_sched_context *context,
> -                           atomic_t *guilty);
> +                           struct lima_sched_context *context);
>  void lima_sched_context_fini(struct lima_sched_pipe *pipe,
>                              struct lima_sched_context *context);
>  struct dma_fence *lima_sched_context_queue_task(struct lima_sched_task *task);
> --
> 2.43.0
>


* Re: [PATCH v1 6/6] drm/lima: improve some pp debug messages
  2024-01-17  3:12 ` [PATCH v1 6/6] drm/lima: improve some pp debug messages Erico Nunes
@ 2024-01-17 18:29   ` Vasily Khoruzhick
  0 siblings, 0 replies; 30+ messages in thread
From: Vasily Khoruzhick @ 2024-01-17 18:29 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, dri-devel, linux-kernel, Maxime Ripard,
	christian.koenig, Qiang Yu, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Tue, Jan 16, 2024 at 7:12 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> Make the messages more consistent by showing the pp name.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>

> ---
>  drivers/gpu/drm/lima/lima_pp.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
> index ac097dd75072..d9e178d6645d 100644
> --- a/drivers/gpu/drm/lima/lima_pp.c
> +++ b/drivers/gpu/drm/lima/lima_pp.c
> @@ -197,7 +197,7 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
>         pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
>         ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
>         if (ret) {
> -               dev_err(dev->dev, "pp hard reset timeout\n");
> +               dev_err(dev->dev, "pp %s hard reset timeout\n", lima_ip_name(ip));
>                 return ret;
>         }
>
> @@ -423,8 +423,8 @@ static void lima_pp_task_error(struct lima_sched_pipe *pipe)
>         for (i = 0; i < pipe->num_processor; i++) {
>                 struct lima_ip *ip = pipe->processor[i];
>
> -               dev_err(ip->dev->dev, "pp task error %d int_state=%x status=%x\n",
> -                       i, pp_read(LIMA_PP_INT_STATUS), pp_read(LIMA_PP_STATUS));
> +               dev_err(ip->dev->dev, "pp %s task error int_state=%x status=%x\n",
> +                       lima_ip_name(ip), pp_read(LIMA_PP_INT_STATUS), pp_read(LIMA_PP_STATUS));
>
>                 lima_pp_hard_reset(ip);
>         }
> --
> 2.43.0
>


* Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-17  3:12 ` [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts Erico Nunes
  2024-01-17 18:13   ` Vasily Khoruzhick
@ 2024-01-18  1:36   ` Qiang Yu
  2024-01-18 11:14     ` Erico Nunes
  1 sibling, 1 reply; 30+ messages in thread
From: Qiang Yu @ 2024-01-18  1:36 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

So this is caused by the same job triggering both done and timeout handling?
I think a better way to solve this is to make sure only one handler
(done or timeout) processes the job instead of just making lima_pm_idle()
unique.
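
One possible shape of that, as a rough sketch with a hypothetical per-task
"handled" flag (not existing lima code):

	/* whichever path wins the flag owns the teardown, including the
	 * single lima_pm_idle() call
	 */
	if (atomic_cmpxchg(&task->handled, 0, 1))
		return;	/* the other path already processed this task */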

Regards,
Qiang

On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> In case a task manages to complete but it took just long enough to also
> trigger the timeout handler, the current code results in a refcount
> imbalance on lima_pm_idle.
>
> While this can be a rare occurrence, when it happens it may fill user
> logs with stack traces such as:
>
> [10136.669170] WARNING: CPU: 0 PID: 0 at drivers/gpu/drm/lima/lima_devfreq.c:205 lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669459] pc : lima_devfreq_record_idle+0xa0/0xb0
> ...
> [10136.669628] Call trace:
> [10136.669634]  lima_devfreq_record_idle+0xa0/0xb0
> [10136.669646]  lima_sched_pipe_task_done+0x5c/0xb0
> [10136.669656]  lima_gp_irq_handler+0xa8/0x120
> [10136.669666]  __handle_irq_event_percpu+0x48/0x160
> [10136.669679]  handle_irq_event+0x4c/0xc0
>
> The imbalance happens because lima_sched_pipe_task_done() already calls
> lima_pm_idle for this case if there was no error.
> Check the error flag in the timeout handler to ensure we can never run
> into this case.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_sched.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index c3bf8cda8498..66317296d831 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -427,7 +427,8 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         pipe->current_vm = NULL;
>         pipe->current_task = NULL;
>
> -       lima_pm_idle(ldev);
> +       if (pipe->error)
> +               lima_pm_idle(ldev);
>
>         drm_sched_resubmit_jobs(&pipe->base);
>         drm_sched_start(&pipe->base, true);
> --
> 2.43.0
>


* Re: [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset
  2024-01-17  3:12 ` [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset Erico Nunes
  2024-01-17 18:17   ` Vasily Khoruzhick
@ 2024-01-18  1:56   ` Qiang Yu
  1 sibling, 0 replies; 30+ messages in thread
From: Qiang Yu @ 2024-01-18  1:56 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

GP should also need this.

Regards,
Qiang

On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> Lima pp jobs use an async reset to avoid having to wait for the soft
> reset right after a job. The soft reset is done at the end of a job and
> a reset_complete flag is expected to be set at the next job.
> However, in case the user runs into a job timeout from any application,
> a hard reset is issued to the hardware. This hard reset clears the
> reset_complete flag, which causes an error message to show up before the
> next job.
> This is probably harmless for the execution but can be very confusing to
> debug, as it blames a reset timeout on the next application to submit a
> job.
> Reset the async_reset flag when doing the hard reset so that we don't
> get that message.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_pp.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
> index a5c95bed08c0..a8f8f63b8295 100644
> --- a/drivers/gpu/drm/lima/lima_pp.c
> +++ b/drivers/gpu/drm/lima/lima_pp.c
> @@ -191,6 +191,13 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
>         pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0);
>         pp_write(LIMA_PP_INT_CLEAR, LIMA_PP_IRQ_MASK_ALL);
>         pp_write(LIMA_PP_INT_MASK, LIMA_PP_IRQ_MASK_USED);
> +
> +       /*
> +        * if there was an async soft reset queued,
> +        * don't wait for it in the next job
> +        */
> +       ip->data.async_reset = false;
> +
>         return 0;
>  }
>
> --
> 2.43.0
>


* Re: [PATCH v1 3/6] drm/lima: set bus_stop bit before hard reset
  2024-01-17  3:12 ` [PATCH v1 3/6] drm/lima: set bus_stop bit before " Erico Nunes
  2024-01-17 18:23   ` Vasily Khoruzhick
@ 2024-01-18  2:01   ` Qiang Yu
  2024-01-18 11:43     ` Erico Nunes
  1 sibling, 1 reply; 30+ messages in thread
From: Qiang Yu @ 2024-01-18  2:01 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

Do we need same for GP?

Regards,
Qiang

On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> This is required for reliable hard resets. Otherwise, doing a hard reset
> while a task is still running (such as a task which is being stopped by
> the drm_sched timeout handler) may result in random mmu write timeouts
> or lockups which cause the entire gpu to hang.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_pp.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
>
> diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
> index a8f8f63b8295..ac097dd75072 100644
> --- a/drivers/gpu/drm/lima/lima_pp.c
> +++ b/drivers/gpu/drm/lima/lima_pp.c
> @@ -168,6 +168,11 @@ static void lima_pp_write_frame(struct lima_ip *ip, u32 *frame, u32 *wb)
>         }
>  }
>
> +static int lima_pp_bus_stop_poll(struct lima_ip *ip)
> +{
> +       return !!(pp_read(LIMA_PP_STATUS) & LIMA_PP_STATUS_BUS_STOPPED);
> +}
> +
>  static int lima_pp_hard_reset_poll(struct lima_ip *ip)
>  {
>         pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC01A0000);
> @@ -181,6 +186,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
>
>         pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC0FFE000);
>         pp_write(LIMA_PP_INT_MASK, 0);
> +
> +       pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
> +       ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
> +       if (ret) {
> +               dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
> +               return ret;
> +       }
> +
>         pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
>         ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
>         if (ret) {
> --
> 2.43.0
>


* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
  2024-01-17 18:26   ` Vasily Khoruzhick
@ 2024-01-18  2:46   ` Qiang Yu
  2024-01-18 11:38     ` Erico Nunes
  2024-01-19  1:43   ` Qiang Yu
  2024-01-21  9:56   ` Hillf Danton
  3 siblings, 1 reply; 30+ messages in thread
From: Qiang Yu @ 2024-01-18  2:46 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> There are several unexplained and unreproduced cases of rendering
> timeouts with lima, for which one theory is high IRQ latency coming from
> somewhere else in the system.
> This kind of occurrence may cause applications to trigger unnecessary
> resets of the GPU or even applications to hang if it hits an issue in
> the recovery path.
> Panfrost already does some special handling to account for such
> "spurious timeouts", it makes sense to have this in lima too to reduce
> the chance that it hit users.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_sched.c | 32 ++++++++++++++++++++++++++-----
>  drivers/gpu/drm/lima/lima_sched.h |  2 ++
>  2 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 66317296d831..9449b81bcd5b 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0 OR MIT
>  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
>
> +#include <linux/hardirq.h>
>  #include <linux/iosys-map.h>
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
> @@ -223,10 +224,7 @@ static struct dma_fence *lima_sched_run_job(struct drm_sched_job *job)
>
>         task->fence = &fence->base;
>
> -       /* for caller usage of the fence, otherwise irq handler
> -        * may consume the fence before caller use it
> -        */
> -       dma_fence_get(task->fence);
> +       task->done_fence = dma_fence_get(task->fence);
>
>         pipe->current_task = task;
>
> @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
>         struct lima_sched_task *task = to_lima_task(job);
>         struct lima_device *ldev = pipe->ldev;
> +       struct lima_ip *ip = pipe->processor[0];
> +
> +       /*
> +        * If the GPU managed to complete this jobs fence, the timeout is
> +        * spurious. Bail out.
> +        */
> +       if (dma_fence_is_signaled(task->done_fence)) {
> +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
> +
> +       /*
> +        * Lima IRQ handler may take a long time to process an interrupt
> +        * if there is another IRQ handler hogging the processing.
> +        * In order to catch such cases and not report spurious Lima job
> +        * timeouts, synchronize the IRQ handler and re-check the fence
> +        * status.
> +        */
> +       synchronize_irq(ip->irq);
> +
> +       if (dma_fence_is_signaled(task->done_fence)) {
> +               DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
>
>         if (!pipe->error)
> -               DRM_ERROR("lima job timeout\n");
> +               DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
>
>         drm_sched_stop(&pipe->base, &task->base);
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
> index 6a11764d87b3..34050facb110 100644
> --- a/drivers/gpu/drm/lima/lima_sched.h
> +++ b/drivers/gpu/drm/lima/lima_sched.h
> @@ -29,6 +29,8 @@ struct lima_sched_task {
>         bool recoverable;
>         struct lima_bo *heap;
>
> +       struct dma_fence *done_fence;
This is same as the following fence, do we really need a duplicated one?

> +
>         /* pipe fence */
>         struct dma_fence *fence;
>  };
> --
> 2.43.0
>


* Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-18  1:36   ` Qiang Yu
@ 2024-01-18 11:14     ` Erico Nunes
  2024-01-19  1:50       ` Qiang Yu
  0 siblings, 1 reply; 30+ messages in thread
From: Erico Nunes @ 2024-01-18 11:14 UTC (permalink / raw)
  To: Qiang Yu
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Thu, Jan 18, 2024 at 2:36 AM Qiang Yu <yuq825@gmail.com> wrote:
>
> So this is caused by same job trigger both done and timeout handling?
> I think a better way to solve this is to make sure only one handler
> (done or timeout) process the job instead of just making lima_pm_idle()
> unique.

It's not very clear to me how best to ensure that, with the drm_sched
software timeout and the irq potentially happening at the same time.
I think patch 4 in this series describes and covers the most common
case in which this would be hit, so maybe this patch could now be
dropped in favour of just that one.
But since this was a bit hard to reproduce and I'm not sure the issue
is entirely covered by that, I decided to keep this small change, as
it prevented all the stack trace reproducers I was able to come up
with.

Erico

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-18  2:46   ` Qiang Yu
@ 2024-01-18 11:38     ` Erico Nunes
  0 siblings, 0 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-18 11:38 UTC (permalink / raw)
  To: Qiang Yu
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Thu, Jan 18, 2024 at 3:46 AM Qiang Yu <yuq825@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
> > diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
> > index 6a11764d87b3..34050facb110 100644
> > --- a/drivers/gpu/drm/lima/lima_sched.h
> > +++ b/drivers/gpu/drm/lima/lima_sched.h
> > @@ -29,6 +29,8 @@ struct lima_sched_task {
> >         bool recoverable;
> >         struct lima_bo *heap;
> >
> > +       struct dma_fence *done_fence;
> This is same as the following fence, do we really need a duplicated one?

Checking again now, I think we can reuse the existing one.
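
Roughly, just reusing task->fence in the timeout handler (sketch only,
based on the hunk quoted above, not the final patch):

	if (dma_fence_is_signaled(task->fence)) {
		DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
		return DRM_GPU_SCHED_STAT_NOMINAL;
	}

lima_sched_run_job() would then keep the existing
dma_fence_get(task->fence) (and its comment) as-is, and the done_fence
field in struct lima_sched_task goes away.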

* Re: [PATCH v1 3/6] drm/lima: set bus_stop bit before hard reset
  2024-01-18  2:01   ` Qiang Yu
@ 2024-01-18 11:43     ` Erico Nunes
  0 siblings, 0 replies; 30+ messages in thread
From: Erico Nunes @ 2024-01-18 11:43 UTC (permalink / raw)
  To: Qiang Yu
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Thu, Jan 18, 2024 at 3:01 AM Qiang Yu <yuq825@gmail.com> wrote:
>
> Do we need same for GP?

I don't have an issue reproducer for gp so far, but the hardware does
have the same bit and the mali driver does it for both gp and pp, so I
think we can also add it to gp.
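
Something along these lines for gp then (sketch only; the register/bit
and helper names below are assumed to mirror the existing pp ones and
are untested):

static int lima_gp_bus_stop_poll(struct lima_ip *ip)
{
	return !!(gp_read(LIMA_GP_STATUS) & LIMA_GP_STATUS_BUS_STOPPED);
}

	/* in lima_gp_hard_reset(), before forcing the reset: ask the gp
	 * to stop its bus and wait for it, like the pp change does
	 */
	gp_write(LIMA_GP_CMD, LIMA_GP_CMD_STOP_BUS);
	ret = lima_poll_timeout(ip, lima_gp_bus_stop_poll, 10, 100);
	if (ret) {
		dev_err(dev->dev, "%s bus stop timeout\n", lima_ip_name(ip));
		return ret;
	}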

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
  2024-01-17 18:26   ` Vasily Khoruzhick
  2024-01-18  2:46   ` Qiang Yu
@ 2024-01-19  1:43   ` Qiang Yu
  2024-01-21  3:04     ` Qiang Yu
  2024-01-21  9:56   ` Hillf Danton
  3 siblings, 1 reply; 30+ messages in thread
From: Qiang Yu @ 2024-01-19  1:43 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> There are several unexplained and unreproduced cases of rendering
> timeouts with lima, for which one theory is high IRQ latency coming from
> somewhere else in the system.
> This kind of occurrence may cause applications to trigger unnecessary
> resets of the GPU or even applications to hang if it hits an issue in
> the recovery path.
> Panfrost already does some special handling to account for such
> "spurious timeouts", it makes sense to have this in lima too to reduce
> the chance that it hit users.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_sched.c | 32 ++++++++++++++++++++++++++-----
>  drivers/gpu/drm/lima/lima_sched.h |  2 ++
>  2 files changed, 29 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index 66317296d831..9449b81bcd5b 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0 OR MIT
>  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
>
> +#include <linux/hardirq.h>
>  #include <linux/iosys-map.h>
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
> @@ -223,10 +224,7 @@ static struct dma_fence *lima_sched_run_job(struct drm_sched_job *job)
>
>         task->fence = &fence->base;
>
> -       /* for caller usage of the fence, otherwise irq handler
> -        * may consume the fence before caller use it
> -        */
> -       dma_fence_get(task->fence);
> +       task->done_fence = dma_fence_get(task->fence);
>
>         pipe->current_task = task;
>
> @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
>         struct lima_sched_task *task = to_lima_task(job);
>         struct lima_device *ldev = pipe->ldev;
> +       struct lima_ip *ip = pipe->processor[0];
> +
> +       /*
> +        * If the GPU managed to complete this jobs fence, the timeout is
> +        * spurious. Bail out.
> +        */
> +       if (dma_fence_is_signaled(task->done_fence)) {
> +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
> +
You may just remove this check and keep only the check after the irq sync.

> +       /*
> +        * Lima IRQ handler may take a long time to process an interrupt
> +        * if there is another IRQ handler hogging the processing.
> +        * In order to catch such cases and not report spurious Lima job
> +        * timeouts, synchronize the IRQ handler and re-check the fence
> +        * status.
> +        */
> +       synchronize_irq(ip->irq);
This should be done after drm_sched_stop() to prevent the drm scheduler
from running more jobs. And we need to disable the GP/PP interrupts so
that a still-running job cannot trigger a fresh INT.

PP may have more than one IRQ, so we need to wait on all of them.

> +
> +       if (dma_fence_is_signaled(task->done_fence)) {
> +               DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
>
>         if (!pipe->error)
> -               DRM_ERROR("lima job timeout\n");
> +               DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
>
>         drm_sched_stop(&pipe->base, &task->base);
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
> index 6a11764d87b3..34050facb110 100644
> --- a/drivers/gpu/drm/lima/lima_sched.h
> +++ b/drivers/gpu/drm/lima/lima_sched.h
> @@ -29,6 +29,8 @@ struct lima_sched_task {
>         bool recoverable;
>         struct lima_bo *heap;
>
> +       struct dma_fence *done_fence;
> +
>         /* pipe fence */
>         struct dma_fence *fence;
>  };
> --
> 2.43.0
>

* Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-18 11:14     ` Erico Nunes
@ 2024-01-19  1:50       ` Qiang Yu
  2024-01-23 23:19         ` Erico Nunes
  0 siblings, 1 reply; 30+ messages in thread
From: Qiang Yu @ 2024-01-19  1:50 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Thu, Jan 18, 2024 at 7:14 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 2:36 AM Qiang Yu <yuq825@gmail.com> wrote:
> >
> > So this is caused by same job trigger both done and timeout handling?
> > I think a better way to solve this is to make sure only one handler
> > (done or timeout) process the job instead of just making lima_pm_idle()
> > unique.
>
> It's not very clear to me how to best ensure that, with the drm_sched
> software timeout and the irq happening potentially at the same time.
This could be done by stopping the scheduler from running more jobs and
disabling the GP/PP interrupts. Then, after syncing the irq, no new irq
can get in while we are handling the timeout.

> I think patch 4 in this series describes and covers the most common
> case that this would be hit. So maybe now this patch could be dropped
> in favour of just that one.
Yes.

> But since this was a bit hard to reproduce and I'm not sure the issue
> is entirely covered by that, I just decided to keep this small change
> as it prevented all the stack trace reproducers I was able to come up
> with.
>
> Erico

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-19  1:43   ` Qiang Yu
@ 2024-01-21  3:04     ` Qiang Yu
  0 siblings, 0 replies; 30+ messages in thread
From: Qiang Yu @ 2024-01-21  3:04 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Fri, Jan 19, 2024 at 9:43 AM Qiang Yu <yuq825@gmail.com> wrote:
>
> On Wed, Jan 17, 2024 at 11:12 AM Erico Nunes <nunes.erico@gmail.com> wrote:
> >
> > There are several unexplained and unreproduced cases of rendering
> > timeouts with lima, for which one theory is high IRQ latency coming from
> > somewhere else in the system.
> > This kind of occurrence may cause applications to trigger unnecessary
> > resets of the GPU or even applications to hang if it hits an issue in
> > the recovery path.
> > Panfrost already does some special handling to account for such
> > "spurious timeouts", it makes sense to have this in lima too to reduce
> > the chance that it hit users.
> >
> > Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> > ---
> >  drivers/gpu/drm/lima/lima_sched.c | 32 ++++++++++++++++++++++++++-----
> >  drivers/gpu/drm/lima/lima_sched.h |  2 ++
> >  2 files changed, 29 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> > index 66317296d831..9449b81bcd5b 100644
> > --- a/drivers/gpu/drm/lima/lima_sched.c
> > +++ b/drivers/gpu/drm/lima/lima_sched.c
> > @@ -1,6 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0 OR MIT
> >  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
> >
> > +#include <linux/hardirq.h>
> >  #include <linux/iosys-map.h>
> >  #include <linux/kthread.h>
> >  #include <linux/slab.h>
> > @@ -223,10 +224,7 @@ static struct dma_fence *lima_sched_run_job(struct drm_sched_job *job)
> >
> >         task->fence = &fence->base;
> >
> > -       /* for caller usage of the fence, otherwise irq handler
> > -        * may consume the fence before caller use it
> > -        */
> > -       dma_fence_get(task->fence);
> > +       task->done_fence = dma_fence_get(task->fence);
> >
> >         pipe->current_task = task;
> >
> > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> >         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> >         struct lima_sched_task *task = to_lima_task(job);
> >         struct lima_device *ldev = pipe->ldev;
> > +       struct lima_ip *ip = pipe->processor[0];
> > +
> > +       /*
> > +        * If the GPU managed to complete this jobs fence, the timeout is
> > +        * spurious. Bail out.
> > +        */
> > +       if (dma_fence_is_signaled(task->done_fence)) {
> > +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > +               return DRM_GPU_SCHED_STAT_NOMINAL;
> > +       }
> > +
> You may just remove this check and left the check after sync irq.
>
On further thought, this is only about handling spurious timeouts more
efficiently, not about fully reliable timeout handling. So this check
should be OK.

> > +       /*
> > +        * Lima IRQ handler may take a long time to process an interrupt
> > +        * if there is another IRQ handler hogging the processing.
> > +        * In order to catch such cases and not report spurious Lima job
> > +        * timeouts, synchronize the IRQ handler and re-check the fence
> > +        * status.
> > +        */
> > +       synchronize_irq(ip->irq);
> This should be done after drm_sched_stop() to prevent drm scheduler
> run more jobs. And we need to disable interrupt of GP/PP for no more
> running job triggered fresh INT.
This is OK too. We just need to solve reliable timeout handling after
drm_sched_stop() in another patch.

>
> PP may have more than one IRQ, so we need to wait on all of them.
>
> > +
> > +       if (dma_fence_is_signaled(task->done_fence)) {
> > +               DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> > +               return DRM_GPU_SCHED_STAT_NOMINAL;
> > +       }
> >
> >         if (!pipe->error)
> > -               DRM_ERROR("lima job timeout\n");
> > +               DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
> >
> >         drm_sched_stop(&pipe->base, &task->base);
> >
> > diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
> > index 6a11764d87b3..34050facb110 100644
> > --- a/drivers/gpu/drm/lima/lima_sched.h
> > +++ b/drivers/gpu/drm/lima/lima_sched.h
> > @@ -29,6 +29,8 @@ struct lima_sched_task {
> >         bool recoverable;
> >         struct lima_bo *heap;
> >
> > +       struct dma_fence *done_fence;
> > +
> >         /* pipe fence */
> >         struct dma_fence *fence;
> >  };
> > --
> > 2.43.0
> >

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
                     ` (2 preceding siblings ...)
  2024-01-19  1:43   ` Qiang Yu
@ 2024-01-21  9:56   ` Hillf Danton
  2024-01-21 11:20     ` Qiang Yu
  3 siblings, 1 reply; 30+ messages in thread
From: Hillf Danton @ 2024-01-21  9:56 UTC (permalink / raw)
  To: Erico Nunes
  Cc: lima, linux-kernel, dri-devel, Qiang Yu, Daniel Vetter, David Airlie

On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@gmail.com>
>  
> @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>  	struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
>  	struct lima_sched_task *task = to_lima_task(job);
>  	struct lima_device *ldev = pipe->ldev;
> +	struct lima_ip *ip = pipe->processor[0];
> +
> +	/*
> +	 * If the GPU managed to complete this jobs fence, the timeout is
> +	 * spurious. Bail out.
> +	 */
> +	if (dma_fence_is_signaled(task->done_fence)) {
> +		DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> +		return DRM_GPU_SCHED_STAT_NOMINAL;
> +	}

Given the 500ms in lima_sched_pipe_init(), no timeout is spurious by
definition, and stop selling band-aids like this when you have options
like locating the actual reasons behind the timeout.
> +
> +	/*
> +	 * Lima IRQ handler may take a long time to process an interrupt
> +	 * if there is another IRQ handler hogging the processing.
> +	 * In order to catch such cases and not report spurious Lima job
> +	 * timeouts, synchronize the IRQ handler and re-check the fence
> +	 * status.
> +	 */
> +	synchronize_irq(ip->irq);
> +
> +	if (dma_fence_is_signaled(task->done_fence)) {
> +		DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> +		return DRM_GPU_SCHED_STAT_NOMINAL;
> +	}
>  
>  	if (!pipe->error)
> -		DRM_ERROR("lima job timeout\n");
> +		DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
>  
>  	drm_sched_stop(&pipe->base, &task->base);
>  

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-21  9:56   ` Hillf Danton
@ 2024-01-21 11:20     ` Qiang Yu
  2024-01-21 15:11       ` Erico Nunes
  0 siblings, 1 reply; 30+ messages in thread
From: Qiang Yu @ 2024-01-21 11:20 UTC (permalink / raw)
  To: Hillf Danton
  Cc: lima, linux-kernel, dri-devel, Daniel Vetter, David Airlie, Erico Nunes

On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@sina.com> wrote:
>
> On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@gmail.com>
> >
> > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> >       struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> >       struct lima_sched_task *task = to_lima_task(job);
> >       struct lima_device *ldev = pipe->ldev;
> > +     struct lima_ip *ip = pipe->processor[0];
> > +
> > +     /*
> > +      * If the GPU managed to complete this jobs fence, the timeout is
> > +      * spurious. Bail out.
> > +      */
> > +     if (dma_fence_is_signaled(task->done_fence)) {
> > +             DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > +     }
>
> Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define,
> and stop selling bandaid like this because you have options like locating
> the reasons behind timeout.

This change does look like it is aimed at 2 FPS apps. Maybe 500ms is too
short for weak Mali4x0 GPUs (2 FPS apps appear more likely there).
AMD/NV GPUs use a 10s timeout. So increasing the timeout seems to be an
equivalent and better way?

> > +
> > +     /*
> > +      * Lima IRQ handler may take a long time to process an interrupt
> > +      * if there is another IRQ handler hogging the processing.
> > +      * In order to catch such cases and not report spurious Lima job
> > +      * timeouts, synchronize the IRQ handler and re-check the fence
> > +      * status.
> > +      */
> > +     synchronize_irq(ip->irq);
> > +
> > +     if (dma_fence_is_signaled(task->done_fence)) {
> > +             DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > +     }
> >
> >       if (!pipe->error)
> > -             DRM_ERROR("lima job timeout\n");
> > +             DRM_ERROR("%s lima job timeout\n", lima_ip_name(ip));
> >
> >       drm_sched_stop(&pipe->base, &task->base);
> >

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-21 11:20     ` Qiang Yu
@ 2024-01-21 15:11       ` Erico Nunes
  2024-01-23  1:18         ` Qiang Yu
  0 siblings, 1 reply; 30+ messages in thread
From: Erico Nunes @ 2024-01-21 15:11 UTC (permalink / raw)
  To: Qiang Yu
  Cc: Hillf Danton, lima, linux-kernel, dri-devel, Daniel Vetter, David Airlie

On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825@gmail.com> wrote:
>
> On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@sina.com> wrote:
> >
> > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@gmail.com>
> > >
> > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> > >       struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> > >       struct lima_sched_task *task = to_lima_task(job);
> > >       struct lima_device *ldev = pipe->ldev;
> > > +     struct lima_ip *ip = pipe->processor[0];
> > > +
> > > +     /*
> > > +      * If the GPU managed to complete this jobs fence, the timeout is
> > > +      * spurious. Bail out.
> > > +      */
> > > +     if (dma_fence_is_signaled(task->done_fence)) {
> > > +             DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > > +     }
> >
> > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define,
> > and stop selling bandaid like this because you have options like locating
> > the reasons behind timeout.
>
> This chang do look like to aim for 2FPS apps. Maybe 500ms is too short
> for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses
> 10s timeout. So increasing the timeout seems to be an equivalent and better
> way?

Indeed, 500ms might be too optimistic for the sort of applications that
users expect to run on this hardware currently. For a closer reference
though, other embedded drivers like v3d and panfrost still set it to
500ms. Note that this patch is exactly the same as what exists in
Panfrost today, and it was already discussed with some common arguments
in the patches of this series:
https://patchwork.freedesktop.org/series/120820/

But I would agree to bump the timeout to a higher value for lima. Some
distributions are already doing this with module parameters anyway to
even be able to run some more demanding application stacks on a Mali
400.
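
For reference, a configurable default could be as small as something
like this (parameter name and default value are only illustrative, not
necessarily lima's existing option):

static uint lima_sched_timeout_ms = 10000;
module_param_named(sched_timeout_ms, lima_sched_timeout_ms, uint, 0444);
MODULE_PARM_DESC(sched_timeout_ms, "job scheduler timeout in ms");

	/* lima_sched_pipe_init() then feeds this to the scheduler */
	long timeout = msecs_to_jiffies(lima_sched_timeout_ms);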

Another thing we might consider (probably in a followup patchset to
not delay these fixes forever for the people hitting this issue) is to
configure the Mali hardware watchdog to the value that we would like
as a timeout. That way we would get timeout jobs going through the
same error irq path as other hardware error jobs and might be able to
delete(?)/simplify this software timeout code.
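
Just to illustrate that idea (every name below is made up, the real
register and units would have to come from the hardware documentation):

	/* hypothetical: program the pp watchdog when the job is kicked
	 * off, so a stuck job raises the error irq instead of relying
	 * on the drm_sched software timer. LIMA_PP_WATCHDOG_TIMEOUT is
	 * a made-up name and the value would be in core cycles, not ms.
	 */
	pp_write(LIMA_PP_WATCHDOG_TIMEOUT, watchdog_cycles);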

In the meantime for v2 of this series I'll make the change to account
for the multiple pp irqs. So do you agree it is ok to leave
drm_sched_stop() as it is at least for this series?

Thanks all for the reviews

Erico

* Re: [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-21 15:11       ` Erico Nunes
@ 2024-01-23  1:18         ` Qiang Yu
  0 siblings, 0 replies; 30+ messages in thread
From: Qiang Yu @ 2024-01-23  1:18 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Hillf Danton, lima, linux-kernel, dri-devel, Daniel Vetter, David Airlie

On Sun, Jan 21, 2024 at 11:11 PM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> On Sun, Jan 21, 2024 at 12:20 PM Qiang Yu <yuq825@gmail.com> wrote:
> >
> > On Sun, Jan 21, 2024 at 5:56 PM Hillf Danton <hdanton@sina.com> wrote:
> > >
> > > On Wed, 17 Jan 2024 04:12:10 +0100 Erico Nunes <nunes.erico@gmail.com>
> > > >
> > > > @@ -401,9 +399,33 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> > > >       struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> > > >       struct lima_sched_task *task = to_lima_task(job);
> > > >       struct lima_device *ldev = pipe->ldev;
> > > > +     struct lima_ip *ip = pipe->processor[0];
> > > > +
> > > > +     /*
> > > > +      * If the GPU managed to complete this jobs fence, the timeout is
> > > > +      * spurious. Bail out.
> > > > +      */
> > > > +     if (dma_fence_is_signaled(task->done_fence)) {
> > > > +             DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > > > +             return DRM_GPU_SCHED_STAT_NOMINAL;
> > > > +     }
> > >
> > > Given 500ms in lima_sched_pipe_init(), no timeout is spurious by define,
> > > and stop selling bandaid like this because you have options like locating
> > > the reasons behind timeout.
> >
> > This chang do look like to aim for 2FPS apps. Maybe 500ms is too short
> > for week mali4x0 gpus (2FPS apps appear more likely). AMD/NV GPU uses
> > 10s timeout. So increasing the timeout seems to be an equivalent and better
> > way?
>
> Indeed 500ms might be too optimistic for the sort of applications that
> users expect to run on this hardware currently. For a more similar
> reference though, other embedded drivers like v3d and panfrost do
> still set it to 500ms. Note that this patch is just exactly the same
> as exists in Panfrost today and was already discussed with some common
> arguments in the patches of this series:
> https://patchwork.freedesktop.org/series/120820/
>
> But I would agree to bump the timeout to a higher value for lima. Some
> distributions are already doing this with module parameters anyway to
> even be able to run some more demanding application stacks on a Mali
> 400.
>
> Another thing we might consider (probably in a followup patchset to
> not delay these fixes forever for the people hitting this issue) is to
> configure the Mali hardware watchdog to the value that we would like
> as a timeout. That way we would get timeout jobs going through the
> same error irq path as other hardware error jobs and might be able to
> delete(?)/simplify this software timeout code.
>
This way should be much simpler and more stable.

> In the meantime for v2 of this series I'll make the change to account
> for the multiple pp irqs. So do you agree it is ok to leave
> drm_sched_stop() as it is at least for this series?
>
I'm OK with this.

> Thanks all for the reviews
>
> Erico

* Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-19  1:50       ` Qiang Yu
@ 2024-01-23 23:19         ` Erico Nunes
  2024-01-24  1:03           ` Qiang Yu
  0 siblings, 1 reply; 30+ messages in thread
From: Erico Nunes @ 2024-01-23 23:19 UTC (permalink / raw)
  To: Qiang Yu
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Fri, Jan 19, 2024 at 2:50 AM Qiang Yu <yuq825@gmail.com> wrote:
>
> On Thu, Jan 18, 2024 at 7:14 PM Erico Nunes <nunes.erico@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 2:36 AM Qiang Yu <yuq825@gmail.com> wrote:
> > >
> > > So this is caused by same job trigger both done and timeout handling?
> > > I think a better way to solve this is to make sure only one handler
> > > (done or timeout) process the job instead of just making lima_pm_idle()
> > > unique.
> >
> > It's not very clear to me how to best ensure that, with the drm_sched
> > software timeout and the irq happening potentially at the same time.
> This could be done by stopping scheduler run more job and disable
> GP/PP interrupt. Then after sync irq, there should be no more new
> irq gets in when we handling timeout.
>
> > I think patch 4 in this series describes and covers the most common
> > case that this would be hit. So maybe now this patch could be dropped
> > in favour of just that one.
> Yes.

After dropping the patch while testing in preparation for v2, I managed
to reproduce this issue again.
Looking at a trace, it seems this is what actually happened with the
test workload:
lima_sched_timedout_job starts -> (the fence is not yet signaled, so the
new fence checks do not bail out) -> the irq fires and preempts
lima_sched_timedout_job, the fence is signaled and lima_pm_idle is
called once from lima_sched_pipe_task_done -> lima_sched_timedout_job
continues and calls lima_pm_idle again

So I think we still need this patch to at least prevent the bug with
the current software timeout. If we move to the hardware watchdog
timeout later we might be able to remove it anyway, but that will
still require separate work and testing.
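
To make the guard concrete, the point of this patch is essentially to
keep the lima_pm_idle() transition single-shot per job, roughly like
this (sketch; the flag below is made up and not the actual
implementation):

	/* only the first of the done irq / timeout handler paths wins,
	 * the other one sees the bit already cleared and skips the idle.
	 * "pm_active" is a made-up flag for illustration.
	 */
	if (test_and_clear_bit(0, &pipe->pm_active))
		lima_pm_idle(pipe->ldev);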

Erico

* Re: [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts
  2024-01-23 23:19         ` Erico Nunes
@ 2024-01-24  1:03           ` Qiang Yu
  0 siblings, 0 replies; 30+ messages in thread
From: Qiang Yu @ 2024-01-24  1:03 UTC (permalink / raw)
  To: Erico Nunes
  Cc: Daniel Vetter, lima, linux-kernel, dri-devel, christian.koenig,
	anarsoul, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Sumit Semwal

On Wed, Jan 24, 2024 at 7:19 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> On Fri, Jan 19, 2024 at 2:50 AM Qiang Yu <yuq825@gmail.com> wrote:
> >
> > On Thu, Jan 18, 2024 at 7:14 PM Erico Nunes <nunes.erico@gmail.com> wrote:
> > >
> > > On Thu, Jan 18, 2024 at 2:36 AM Qiang Yu <yuq825@gmail.com> wrote:
> > > >
> > > > So this is caused by same job trigger both done and timeout handling?
> > > > I think a better way to solve this is to make sure only one handler
> > > > (done or timeout) process the job instead of just making lima_pm_idle()
> > > > unique.
> > >
> > > It's not very clear to me how to best ensure that, with the drm_sched
> > > software timeout and the irq happening potentially at the same time.
> > This could be done by stopping scheduler run more job and disable
> > GP/PP interrupt. Then after sync irq, there should be no more new
> > irq gets in when we handling timeout.
> >
> > > I think patch 4 in this series describes and covers the most common
> > > case that this would be hit. So maybe now this patch could be dropped
> > > in favour of just that one.
> > Yes.
>
> After dropping the patch while testing to send v2, I managed to
> reproduce this issue again.
> Looking at a trace it seems that this actually happened with the test workload:
> lima_sched_timedout_job -> (fence by is not signaled and new fence
> check is passed) -> irq happens preempting lima_sched_timedout_job,
> fence is signaled and lima_pm_idle called once at
> lima_sched_pipe_task_done -> lima_sched_timedout_job continues and
> calls lima_pm_idle again
>
> So I think we still need this patch to at least prevent the bug with
> the current software timeout. If we move to the hardware watchdog
> timeout later we might be able to remove it anyway, but that will
> still require separate work and testing.
>
Then you do need to solve the IRQ vs. reset race by disabling the GP/PP
IRQs and syncing them after drm_sched_stop(). I mean you may keep
patch 4 for the spurious timeout, and add a new patch to solve the IRQ
vs. reset race.

Making lima_pm_idle() unique does not solve the race completely; other
reset logic may also conflict if the IRQ and reset are not sequenced.
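
Roughly the sequence I mean, as a sketch (lima_sched_pipe_mask_irq() is
a made-up helper for masking the GP/PP interrupt mask registers):

static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job)
{
	struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
	struct lima_sched_task *task = to_lima_task(job);
	int i;

	/* 1. stop the scheduler so it cannot push another job to the HW */
	drm_sched_stop(&pipe->base, &task->base);

	/* 2. mask GP/PP interrupts so a still-running job cannot raise a
	 * fresh INT while we reset (made-up helper, see above)
	 */
	lima_sched_pipe_mask_irq(pipe);

	/* 3. wait for any handler already in flight; pp can have several irqs */
	for (i = 0; i < pipe->num_processor; i++)
		synchronize_irq(pipe->processor[i]->irq);

	/* 4. only now decide done vs. timeout and do the reset/recovery */
	/* ... existing timeout/reset handling continues here ... */
}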

end of thread

Thread overview: 30+ messages
2024-01-17  3:12 [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Erico Nunes
2024-01-17  3:12 ` [PATCH v1 1/6] drm/lima: fix devfreq refcount imbalance for job timeouts Erico Nunes
2024-01-17 18:13   ` Vasily Khoruzhick
2024-01-18  1:36   ` Qiang Yu
2024-01-18 11:14     ` Erico Nunes
2024-01-19  1:50       ` Qiang Yu
2024-01-23 23:19         ` Erico Nunes
2024-01-24  1:03           ` Qiang Yu
2024-01-17  3:12 ` [PATCH v1 2/6] drm/lima: reset async_reset on pp hard reset Erico Nunes
2024-01-17 18:17   ` Vasily Khoruzhick
2024-01-18  1:56   ` Qiang Yu
2024-01-17  3:12 ` [PATCH v1 3/6] drm/lima: set bus_stop bit before " Erico Nunes
2024-01-17 18:23   ` Vasily Khoruzhick
2024-01-18  2:01   ` Qiang Yu
2024-01-18 11:43     ` Erico Nunes
2024-01-17  3:12 ` [PATCH v1 4/6] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
2024-01-17 18:26   ` Vasily Khoruzhick
2024-01-18  2:46   ` Qiang Yu
2024-01-18 11:38     ` Erico Nunes
2024-01-19  1:43   ` Qiang Yu
2024-01-21  3:04     ` Qiang Yu
2024-01-21  9:56   ` Hillf Danton
2024-01-21 11:20     ` Qiang Yu
2024-01-21 15:11       ` Erico Nunes
2024-01-23  1:18         ` Qiang Yu
2024-01-17  3:12 ` [PATCH v1 5/6] drm/lima: remove guilty drm_sched context handling Erico Nunes
2024-01-17 18:28   ` Vasily Khoruzhick
2024-01-17  3:12 ` [PATCH v1 6/6] drm/lima: improve some pp debug messages Erico Nunes
2024-01-17 18:29   ` Vasily Khoruzhick
2024-01-17  7:22 ` [PATCH v1 0/6] drm/lima: fixes and improvements to error recovery Christian König
