linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery
@ 2024-01-24  2:59 Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 1/8] drm/lima: reset async_reset on pp hard reset Erico Nunes
                   ` (8 more replies)
  0 siblings, 9 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

v1 reference:
https://patchwork.kernel.org/project/dri-devel/cover/20240117031212.1104034-1-nunes.erico@gmail.com/

Changes v1 -> v2:
- Dropped patch 1, which aimed to fix
https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415 .
That will require more testing and an actual fix for the irq/timeout
handler race. It can be solved separately, so I am deferring it to a
followup patch and keeping that issue open.

- Added patches 2 and 4 to extend the "reset time out" fix and the
bus stop bit before hard reset to gp as well.

- Added handling of all processors in synchronize_irq in patch 5 to
cover multiple pp. Dropped unnecessary duplicate fence in patch 5.

- Added patch 7 in v2. After some discussion in patch 4 (v1), it seems
reasonable to bump our timeout value so that we further decrease the
chance of users actually hitting any of these timeouts by default.

- Reworked patch 8 in v2. Since the work is no longer focused only on
pp, I included the change for the other blocks as well.

- Collected some reviews and acks in unmodified patches.


Erico Nunes (8):
  drm/lima: reset async_reset on pp hard reset
  drm/lima: reset async_reset on gp hard reset
  drm/lima: set pp bus_stop bit before hard reset
  drm/lima: set gp bus_stop bit before hard reset
  drm/lima: handle spurious timeouts due to high irq latency
  drm/lima: remove guilty drm_sched context handling
  drm/lima: increase default job timeout to 10s
  drm/lima: standardize debug messages by ip name

 drivers/gpu/drm/lima/lima_ctx.c      |  2 +-
 drivers/gpu/drm/lima/lima_ctx.h      |  1 -
 drivers/gpu/drm/lima/lima_gp.c       | 39 +++++++++++++++++++++-------
 drivers/gpu/drm/lima/lima_l2_cache.c |  6 +++--
 drivers/gpu/drm/lima/lima_mmu.c      | 18 ++++++-------
 drivers/gpu/drm/lima/lima_pmu.c      |  3 ++-
 drivers/gpu/drm/lima/lima_pp.c       | 37 ++++++++++++++++++++------
 drivers/gpu/drm/lima/lima_sched.c    | 38 ++++++++++++++++++++++-----
 drivers/gpu/drm/lima/lima_sched.h    |  3 +--
 9 files changed, 107 insertions(+), 40 deletions(-)

-- 
2.43.0


^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v2 1/8] drm/lima: reset async_reset on pp hard reset
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 2/8] drm/lima: reset async_reset on gp " Erico Nunes
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

Lima pp jobs use an async reset to avoid having to wait for the soft
reset right after a job. The soft reset is done at the end of a job and
a reset_complete flag is expected to be set at the next job.
However, if the user runs into a job timeout from any application,
a hard reset is issued to the hardware. This hard reset clears the
reset_complete flag, which causes an error message to show up before the
next job.
This is probably harmless for the execution but can be very confusing to
debug, as it blames a reset timeout on the next application to submit a
job.
Reset the async_reset flag when doing the hard reset so that we don't
get that message.
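As an illustration only (mock code, not part of the patch; names below are
stand-ins for the driver's real state in struct lima_ip), the flag handling
described above can be sketched as:

```c
#include <assert.h>
#include <stdbool.h>

/* Mock of the async reset bookkeeping; all names are illustrative. */
struct mock_ip {
	bool async_reset;	/* soft reset queued at end of last job */
};

/* End of a job: queue an async soft reset instead of waiting for it. */
static void mock_job_end(struct mock_ip *ip)
{
	ip->async_reset = true;
}

/* Timeout recovery: the hard reset discards the pending soft reset,
 * so clear the flag (this is the fix) to avoid waiting on a
 * reset_complete that will never arrive. */
static void mock_hard_reset(struct mock_ip *ip)
{
	ip->async_reset = false;
}

/* Next job: true means we would wait on a stale soft reset and print
 * a misleading "reset time out" error blamed on the next application. */
static bool mock_next_job_would_stall(const struct mock_ip *ip)
{
	return ip->async_reset;
}
```

With the fix, a hard reset between jobs leaves no stale wait behind.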

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
---
 drivers/gpu/drm/lima/lima_pp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index a5c95bed08c0..a8f8f63b8295 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -191,6 +191,13 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
 	pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0);
 	pp_write(LIMA_PP_INT_CLEAR, LIMA_PP_IRQ_MASK_ALL);
 	pp_write(LIMA_PP_INT_MASK, LIMA_PP_IRQ_MASK_USED);
+
+	/*
+	 * if there was an async soft reset queued,
+	 * don't wait for it in the next job
+	 */
+	ip->data.async_reset = false;
+
 	return 0;
 }
 
-- 
2.43.0



* [PATCH v2 2/8] drm/lima: reset async_reset on gp hard reset
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 1/8] drm/lima: reset async_reset on pp hard reset Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 3/8] drm/lima: set pp bus_stop bit before " Erico Nunes
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

Lima gp jobs use an async reset to avoid having to wait for the soft
reset right after a job. The soft reset is done at the end of a job and
a reset_complete flag is expected to be set at the next job.
However, if the user runs into a job timeout from any application,
a hard reset is issued to the hardware. This hard reset clears the
reset_complete flag, which causes an error message to show up before the
next job.
This is probably harmless for the execution but can be very confusing to
debug, as it blames a reset timeout on the next application to submit a
job.
Reset the async_reset flag when doing the hard reset so that we don't
get that message.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_gp.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_gp.c b/drivers/gpu/drm/lima/lima_gp.c
index 8dd501b7a3d0..b9a06e701a33 100644
--- a/drivers/gpu/drm/lima/lima_gp.c
+++ b/drivers/gpu/drm/lima/lima_gp.c
@@ -189,6 +189,13 @@ static int lima_gp_hard_reset(struct lima_ip *ip)
 	gp_write(LIMA_GP_PERF_CNT_0_LIMIT, 0);
 	gp_write(LIMA_GP_INT_CLEAR, LIMA_GP_IRQ_MASK_ALL);
 	gp_write(LIMA_GP_INT_MASK, LIMA_GP_IRQ_MASK_USED);
+
+	/*
+	 * if there was an async soft reset queued,
+	 * don't wait for it in the next job
+	 */
+	ip->data.async_reset = false;
+
 	return 0;
 }
 
-- 
2.43.0



* [PATCH v2 3/8] drm/lima: set pp bus_stop bit before hard reset
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 1/8] drm/lima: reset async_reset on pp hard reset Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 2/8] drm/lima: reset async_reset on gp " Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 4/8] drm/lima: set gp " Erico Nunes
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.
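The ordering the patch enforces (stop the bus, poll for the stopped bit, only
then force the reset) can be sketched with a mock of the lima_poll_timeout()
helper; the register bit and all mock_* names below are illustrative, not the
Mali hardware's:

```c
#include <assert.h>
#include <stdbool.h>

#define MOCK_STATUS_BUS_STOPPED 0x4	/* illustrative bit value */

struct mock_pp {
	unsigned int status;
	bool task_running;
};

/* Mock of lima_poll_timeout(): poll a condition a bounded number of
 * times, return 0 on success or a timeout error otherwise. */
static int mock_poll_timeout(bool (*cond)(struct mock_pp *),
			     struct mock_pp *pp, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (cond(pp))
			return 0;
		/* the real helper sleeps between register reads */
	}
	return -1;	/* -ETIMEDOUT in the driver */
}

static bool mock_bus_stopped(struct mock_pp *pp)
{
	/* asking the hardware to stop the bus eventually latches the bit */
	pp->status |= MOCK_STATUS_BUS_STOPPED;
	return pp->status & MOCK_STATUS_BUS_STOPPED;
}

/* The hard reset is only issued once the bus is confirmed stopped, so
 * no in-flight write from a still-running task can race with it. */
static int mock_hard_reset(struct mock_pp *pp)
{
	int ret = mock_poll_timeout(mock_bus_stopped, pp, 10);

	if (ret)
		return ret;	/* bus stop timeout: bail before resetting */
	pp->task_running = false;	/* the actual reset */
	return 0;
}
```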

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
---
 drivers/gpu/drm/lima/lima_pp.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index a8f8f63b8295..ac097dd75072 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -168,6 +168,11 @@ static void lima_pp_write_frame(struct lima_ip *ip, u32 *frame, u32 *wb)
 	}
 }
 
+static int lima_pp_bus_stop_poll(struct lima_ip *ip)
+{
+	return !!(pp_read(LIMA_PP_STATUS) & LIMA_PP_STATUS_BUS_STOPPED);
+}
+
 static int lima_pp_hard_reset_poll(struct lima_ip *ip)
 {
 	pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC01A0000);
@@ -181,6 +186,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
 
 	pp_write(LIMA_PP_PERF_CNT_0_LIMIT, 0xC0FFE000);
 	pp_write(LIMA_PP_INT_MASK, 0);
+
+	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
+	ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
+	if (ret) {
+		dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
+		return ret;
+	}
+
 	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
 	ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
 	if (ret) {
-- 
2.43.0



* [PATCH v2 4/8] drm/lima: set gp bus_stop bit before hard reset
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (2 preceding siblings ...)
  2024-01-24  2:59 ` [PATCH v2 3/8] drm/lima: set pp bus_stop bit before " Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

This is required for reliable hard resets. Otherwise, doing a hard reset
while a task is still running (such as a task which is being stopped by
the drm_sched timeout handler) may result in random mmu write timeouts
or lockups which cause the entire gpu to hang.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_gp.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/lima/lima_gp.c b/drivers/gpu/drm/lima/lima_gp.c
index b9a06e701a33..4355fa7b17f4 100644
--- a/drivers/gpu/drm/lima/lima_gp.c
+++ b/drivers/gpu/drm/lima/lima_gp.c
@@ -166,6 +166,11 @@ static void lima_gp_task_run(struct lima_sched_pipe *pipe,
 	gp_write(LIMA_GP_CMD, cmd);
 }
 
+static int lima_gp_bus_stop_poll(struct lima_ip *ip)
+{
+	return !!(gp_read(LIMA_GP_STATUS) & LIMA_GP_STATUS_BUS_STOPPED);
+}
+
 static int lima_gp_hard_reset_poll(struct lima_ip *ip)
 {
 	gp_write(LIMA_GP_PERF_CNT_0_LIMIT, 0xC01A0000);
@@ -179,6 +184,13 @@ static int lima_gp_hard_reset(struct lima_ip *ip)
 
 	gp_write(LIMA_GP_PERF_CNT_0_LIMIT, 0xC0FFE000);
 	gp_write(LIMA_GP_INT_MASK, 0);
+
+	gp_write(LIMA_GP_CMD, LIMA_GP_CMD_STOP_BUS);
+	ret = lima_poll_timeout(ip, lima_gp_bus_stop_poll, 10, 100);
+	if (ret) {
+		dev_err(dev->dev, "%s bus stop timeout\n", lima_ip_name(ip));
+		return ret;
+	}
 	gp_write(LIMA_GP_CMD, LIMA_GP_CMD_RESET);
 	ret = lima_poll_timeout(ip, lima_gp_hard_reset_poll, 10, 100);
 	if (ret) {
-- 
2.43.0



* [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (3 preceding siblings ...)
  2024-01-24  2:59 ` [PATCH v2 4/8] drm/lima: set gp " Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24 12:38   ` Qiang Yu
  2024-01-24  2:59 ` [PATCH v2 6/8] drm/lima: remove guilty drm_sched context handling Erico Nunes
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

There are several unexplained and unreproduced cases of rendering
timeouts with lima, for which one theory is high IRQ latency coming from
somewhere else in the system.
This kind of occurrence may cause applications to trigger unnecessary
GPU resets, or even to hang if they hit an issue in the recovery path.
Panfrost already does some special handling to account for such
"spurious timeouts"; it makes sense to have this in lima too, to reduce
the chance that they hit users.
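The recovery decision the patch implements can be modeled as below;
mock_synchronize_irq() stands in for synchronize_irq() flushing a delayed IRQ
handler, and all mock_* names are illustrative rather than the driver's:

```c
#include <assert.h>
#include <stdbool.h>

enum mock_stat { MOCK_STAT_NOMINAL, MOCK_STAT_RESET };

struct mock_task {
	bool fence_signaled;	/* dma_fence_is_signaled() stand-in */
	bool irq_pending;	/* completion IRQ delayed by another handler */
};

/* Flushing the IRQ handler delivers the delayed completion, if any. */
static void mock_synchronize_irq(struct mock_task *t)
{
	if (t->irq_pending) {
		t->irq_pending = false;
		t->fence_signaled = true;
	}
}

static enum mock_stat mock_timedout_job(struct mock_task *t)
{
	/* fence already signaled: the timeout is spurious, bail out */
	if (t->fence_signaled)
		return MOCK_STAT_NOMINAL;

	/* flush the (possibly delayed) IRQ handler and re-check */
	mock_synchronize_irq(t);
	if (t->fence_signaled)
		return MOCK_STAT_NOMINAL;	/* high IRQ latency case */

	/* genuine hang: proceed with scheduler stop and hard reset */
	return MOCK_STAT_RESET;
}
```

Only the last case reaches drm_sched_stop() and the hard reset path.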

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_sched.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index c3bf8cda8498..814428564637 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0 OR MIT
 /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
 
+#include <linux/hardirq.h>
 #include <linux/iosys-map.h>
 #include <linux/kthread.h>
 #include <linux/slab.h>
@@ -401,9 +402,35 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
 	struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
 	struct lima_sched_task *task = to_lima_task(job);
 	struct lima_device *ldev = pipe->ldev;
+	struct lima_ip *ip = pipe->processor[0];
+	int i;
+
+	/*
+	 * If the GPU managed to complete this job's fence, the timeout is
+	 * spurious. Bail out.
+	 */
+	if (dma_fence_is_signaled(task->fence)) {
+		DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
+		return DRM_GPU_SCHED_STAT_NOMINAL;
+	}
+
+	/*
+	 * Lima IRQ handler may take a long time to process an interrupt
+	 * if there is another IRQ handler hogging the processing.
+	 * In order to catch such cases and not report spurious Lima job
+	 * timeouts, synchronize the IRQ handler and re-check the fence
+	 * status.
+	 */
+	for (i = 0; i < pipe->num_processor; i++)
+		synchronize_irq(pipe->processor[i]->irq);
+
+	if (dma_fence_is_signaled(task->fence)) {
+		DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
+		return DRM_GPU_SCHED_STAT_NOMINAL;
+	}
 
 	if (!pipe->error)
-		DRM_ERROR("lima job timeout\n");
+		DRM_ERROR("%s job timeout\n", lima_ip_name(ip));
 
 	drm_sched_stop(&pipe->base, &task->base);
 
@@ -417,8 +444,6 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
 	if (pipe->bcast_mmu)
 		lima_mmu_page_fault_resume(pipe->bcast_mmu);
 	else {
-		int i;
-
 		for (i = 0; i < pipe->num_mmu; i++)
 			lima_mmu_page_fault_resume(pipe->mmu[i]);
 	}
-- 
2.43.0



* [PATCH v2 6/8] drm/lima: remove guilty drm_sched context handling
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (4 preceding siblings ...)
  2024-01-24  2:59 ` [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 7/8] drm/lima: increase default job timeout to 10s Erico Nunes
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

Marking the context as guilty currently causes the application that hit
a single timeout to have its rendering context stopped entirely. All
jobs submitted later from the guilty context are dropped.

Lima runs on fairly underpowered hardware by modern standards and it is
not entirely unreasonable that a rendering job may time out occasionally
due to high system load or a too demanding application stack. In this
case it would generally be preferable to report the error but try to
keep the application going.

Other similar embedded GPU drivers don't make use of the guilty context
flag. Now that there are reliability improvements to the lima timeout
recovery handling, drop the guilty context handling to let applications
keep running in this case.

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
Acked-by: Christian König <christian.koenig@amd.com>
Reviewed-by: Vasily Khoruzhick <anarsoul@gmail.com>
---
 drivers/gpu/drm/lima/lima_ctx.c   | 2 +-
 drivers/gpu/drm/lima/lima_ctx.h   | 1 -
 drivers/gpu/drm/lima/lima_sched.c | 5 ++---
 drivers/gpu/drm/lima/lima_sched.h | 3 +--
 4 files changed, 4 insertions(+), 7 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_ctx.c b/drivers/gpu/drm/lima/lima_ctx.c
index 8389f2d7d021..0e668fc1e0f9 100644
--- a/drivers/gpu/drm/lima/lima_ctx.c
+++ b/drivers/gpu/drm/lima/lima_ctx.c
@@ -19,7 +19,7 @@ int lima_ctx_create(struct lima_device *dev, struct lima_ctx_mgr *mgr, u32 *id)
 	kref_init(&ctx->refcnt);
 
 	for (i = 0; i < lima_pipe_num; i++) {
-		err = lima_sched_context_init(dev->pipe + i, ctx->context + i, &ctx->guilty);
+		err = lima_sched_context_init(dev->pipe + i, ctx->context + i);
 		if (err)
 			goto err_out0;
 	}
diff --git a/drivers/gpu/drm/lima/lima_ctx.h b/drivers/gpu/drm/lima/lima_ctx.h
index 74e2be09090f..5b1063ce968b 100644
--- a/drivers/gpu/drm/lima/lima_ctx.h
+++ b/drivers/gpu/drm/lima/lima_ctx.h
@@ -13,7 +13,6 @@ struct lima_ctx {
 	struct kref refcnt;
 	struct lima_device *dev;
 	struct lima_sched_context context[lima_pipe_num];
-	atomic_t guilty;
 
 	/* debug info */
 	char pname[TASK_COMM_LEN];
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index 814428564637..c2e78605e43e 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -154,13 +154,12 @@ void lima_sched_task_fini(struct lima_sched_task *task)
 }
 
 int lima_sched_context_init(struct lima_sched_pipe *pipe,
-			    struct lima_sched_context *context,
-			    atomic_t *guilty)
+			    struct lima_sched_context *context)
 {
 	struct drm_gpu_scheduler *sched = &pipe->base;
 
 	return drm_sched_entity_init(&context->base, DRM_SCHED_PRIORITY_NORMAL,
-				     &sched, 1, guilty);
+				     &sched, 1, NULL);
 }
 
 void lima_sched_context_fini(struct lima_sched_pipe *pipe,
diff --git a/drivers/gpu/drm/lima/lima_sched.h b/drivers/gpu/drm/lima/lima_sched.h
index 6a11764d87b3..6bd4f3b70109 100644
--- a/drivers/gpu/drm/lima/lima_sched.h
+++ b/drivers/gpu/drm/lima/lima_sched.h
@@ -91,8 +91,7 @@ int lima_sched_task_init(struct lima_sched_task *task,
 void lima_sched_task_fini(struct lima_sched_task *task);
 
 int lima_sched_context_init(struct lima_sched_pipe *pipe,
-			    struct lima_sched_context *context,
-			    atomic_t *guilty);
+			    struct lima_sched_context *context);
 void lima_sched_context_fini(struct lima_sched_pipe *pipe,
 			     struct lima_sched_context *context);
 struct dma_fence *lima_sched_context_queue_task(struct lima_sched_task *task);
-- 
2.43.0



* [PATCH v2 7/8] drm/lima: increase default job timeout to 10s
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (5 preceding siblings ...)
  2024-01-24  2:59 ` [PATCH v2 6/8] drm/lima: remove guilty drm_sched context handling Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-24  2:59 ` [PATCH v2 8/8] drm/lima: standardize debug messages by ip name Erico Nunes
  2024-01-30  1:07 ` [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Qiang Yu
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

The previous 500ms default timeout was fairly optimistic and could be
hit by real world applications. Many distributions targeting devices
with a Mali-4xx already bumped this timeout to a higher limit.
We can be generous here with a high value such as 10s, since this should
mostly catch buggy jobs like infinite loop shaders, and these don't
seem to happen very often in real applications.
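The timeout selection in the patch is just a module-parameter override with a
new default; lima_sched_timeout_ms is the real parameter name, while the
helper below is only a sketch of the expression being changed:

```c
#include <assert.h>

/* A non-positive value means "use the driver default"; a positive
 * value set via the lima_sched_timeout_ms module parameter wins. */
static unsigned int mock_pick_timeout_ms(int param_ms)
{
	return param_ms > 0 ? param_ms : 10000;	/* default was 500 before */
}
```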

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_sched.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index c2e78605e43e..00b19adfc888 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -505,7 +505,7 @@ static void lima_sched_recover_work(struct work_struct *work)
 int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 {
 	unsigned int timeout = lima_sched_timeout_ms > 0 ?
-			       lima_sched_timeout_ms : 500;
+			       lima_sched_timeout_ms : 10000;
 
 	pipe->fence_context = dma_fence_context_alloc(1);
 	spin_lock_init(&pipe->fence_lock);
-- 
2.43.0



* [PATCH v2 8/8] drm/lima: standardize debug messages by ip name
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (6 preceding siblings ...)
  2024-01-24  2:59 ` [PATCH v2 7/8] drm/lima: increase default job timeout to 10s Erico Nunes
@ 2024-01-24  2:59 ` Erico Nunes
  2024-01-30  1:07 ` [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Qiang Yu
  8 siblings, 0 replies; 14+ messages in thread
From: Erico Nunes @ 2024-01-24  2:59 UTC (permalink / raw)
  To: Qiang Yu, anarsoul, christian.koenig, dri-devel, lima
  Cc: Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	David Airlie, Daniel Vetter, Sumit Semwal, linux-kernel,
	Erico Nunes

Some debug messages carried the ip name, some included "lima", and some
included both a block prefix and then the numbered ip name again.
Make the messages more consistent by always looking up and showing the
ip name first.
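The convention can be illustrated with plain snprintf(); lima_ip_name() is
the driver's real helper, while the mock below only imitates its output:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for lima_ip_name(): returns the numbered ip name. */
static const char *mock_ip_name(int id)
{
	return id == 0 ? "gp" : "pp0";
}

/* Standardized form: the looked-up ip name always comes first, with no
 * extra "lima"/"gp"/"pp" prefix duplicating it in the message. */
static void mock_format(char *buf, size_t n, int id, const char *msg)
{
	snprintf(buf, n, "%s %s", mock_ip_name(id), msg);
}
```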

Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
---
 drivers/gpu/drm/lima/lima_gp.c       | 20 +++++++++++---------
 drivers/gpu/drm/lima/lima_l2_cache.c |  6 ++++--
 drivers/gpu/drm/lima/lima_mmu.c      | 18 +++++++++---------
 drivers/gpu/drm/lima/lima_pmu.c      |  3 ++-
 drivers/gpu/drm/lima/lima_pp.c       | 19 ++++++++++---------
 5 files changed, 36 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/lima/lima_gp.c b/drivers/gpu/drm/lima/lima_gp.c
index 4355fa7b17f4..6b354e2fb61d 100644
--- a/drivers/gpu/drm/lima/lima_gp.c
+++ b/drivers/gpu/drm/lima/lima_gp.c
@@ -34,11 +34,11 @@ static irqreturn_t lima_gp_irq_handler(int irq, void *data)
 	if (state & LIMA_GP_IRQ_MASK_ERROR) {
 		if ((state & LIMA_GP_IRQ_MASK_ERROR) ==
 		    LIMA_GP_IRQ_PLBU_OUT_OF_MEM) {
-			dev_dbg(dev->dev, "gp out of heap irq status=%x\n",
-				status);
+			dev_dbg(dev->dev, "%s out of heap irq status=%x\n",
+				lima_ip_name(ip), status);
 		} else {
-			dev_err(dev->dev, "gp error irq state=%x status=%x\n",
-				state, status);
+			dev_err(dev->dev, "%s error irq state=%x status=%x\n",
+				lima_ip_name(ip), state, status);
 			if (task)
 				task->recoverable = false;
 		}
@@ -89,7 +89,8 @@ static int lima_gp_soft_reset_async_wait(struct lima_ip *ip)
 				 v & LIMA_GP_IRQ_RESET_COMPLETED,
 				 0, 100);
 	if (err) {
-		dev_err(dev->dev, "gp soft reset time out\n");
+		dev_err(dev->dev, "%s soft reset time out\n",
+			lima_ip_name(ip));
 		return err;
 	}
 
@@ -194,7 +195,7 @@ static int lima_gp_hard_reset(struct lima_ip *ip)
 	gp_write(LIMA_GP_CMD, LIMA_GP_CMD_RESET);
 	ret = lima_poll_timeout(ip, lima_gp_hard_reset_poll, 10, 100);
 	if (ret) {
-		dev_err(dev->dev, "gp hard reset timeout\n");
+		dev_err(dev->dev, "%s hard reset timeout\n", lima_ip_name(ip));
 		return ret;
 	}
 
@@ -220,8 +221,9 @@ static void lima_gp_task_error(struct lima_sched_pipe *pipe)
 {
 	struct lima_ip *ip = pipe->processor[0];
 
-	dev_err(ip->dev->dev, "gp task error int_state=%x status=%x\n",
-		gp_read(LIMA_GP_INT_STAT), gp_read(LIMA_GP_STATUS));
+	dev_err(ip->dev->dev, "%s task error int_state=%x status=%x\n",
+		lima_ip_name(ip), gp_read(LIMA_GP_INT_STAT),
+		gp_read(LIMA_GP_STATUS));
 
 	lima_gp_hard_reset(ip);
 }
@@ -324,7 +326,7 @@ int lima_gp_init(struct lima_ip *ip)
 	err = devm_request_irq(dev->dev, ip->irq, lima_gp_irq_handler,
 			       IRQF_SHARED, lima_ip_name(ip), ip);
 	if (err) {
-		dev_err(dev->dev, "gp %s fail to request irq\n",
+		dev_err(dev->dev, "%s fail to request irq\n",
 			lima_ip_name(ip));
 		return err;
 	}
diff --git a/drivers/gpu/drm/lima/lima_l2_cache.c b/drivers/gpu/drm/lima/lima_l2_cache.c
index c4080a02957b..184106ce55f8 100644
--- a/drivers/gpu/drm/lima/lima_l2_cache.c
+++ b/drivers/gpu/drm/lima/lima_l2_cache.c
@@ -21,7 +21,8 @@ static int lima_l2_cache_wait_idle(struct lima_ip *ip)
 				 !(v & LIMA_L2_CACHE_STATUS_COMMAND_BUSY),
 				 0, 1000);
 	if (err) {
-		dev_err(dev->dev, "l2 cache wait command timeout\n");
+		dev_err(dev->dev, "%s wait command timeout\n",
+			lima_ip_name(ip));
 		return err;
 	}
 	return 0;
@@ -83,7 +84,8 @@ int lima_l2_cache_init(struct lima_ip *ip)
 	spin_lock_init(&ip->data.lock);
 
 	size = l2_cache_read(LIMA_L2_CACHE_SIZE);
-	dev_info(dev->dev, "l2 cache %uK, %u-way, %ubyte cache line, %ubit external bus\n",
+	dev_info(dev->dev, "%s %uK, %u-way, %ubyte cache line, %ubit external bus\n",
+		 lima_ip_name(ip),
 		 1 << (((size >> 16) & 0xff) - 10),
 		 1 << ((size >> 8) & 0xff),
 		 1 << (size & 0xff),
diff --git a/drivers/gpu/drm/lima/lima_mmu.c b/drivers/gpu/drm/lima/lima_mmu.c
index a1ae6c252dc2..e18317c5ca8c 100644
--- a/drivers/gpu/drm/lima/lima_mmu.c
+++ b/drivers/gpu/drm/lima/lima_mmu.c
@@ -22,7 +22,8 @@
 				  cond, 0, 100);	     \
 	if (__ret)					     \
 		dev_err(dev->dev,			     \
-			"mmu command %x timeout\n", cmd);    \
+			"%s command %x timeout\n",           \
+			lima_ip_name(ip), cmd);              \
 	__ret;						     \
 })
 
@@ -40,14 +41,13 @@ static irqreturn_t lima_mmu_irq_handler(int irq, void *data)
 	if (status & LIMA_MMU_INT_PAGE_FAULT) {
 		u32 fault = mmu_read(LIMA_MMU_PAGE_FAULT_ADDR);
 
-		dev_err(dev->dev, "mmu page fault at 0x%x from bus id %d of type %s on %s\n",
-			fault, LIMA_MMU_STATUS_BUS_ID(status),
-			status & LIMA_MMU_STATUS_PAGE_FAULT_IS_WRITE ? "write" : "read",
-			lima_ip_name(ip));
+		dev_err(dev->dev, "%s page fault at 0x%x from bus id %d of type %s\n",
+			lima_ip_name(ip), fault, LIMA_MMU_STATUS_BUS_ID(status),
+			status & LIMA_MMU_STATUS_PAGE_FAULT_IS_WRITE ? "write" : "read");
 	}
 
 	if (status & LIMA_MMU_INT_READ_BUS_ERROR)
-		dev_err(dev->dev, "mmu %s irq bus error\n", lima_ip_name(ip));
+		dev_err(dev->dev, "%s irq bus error\n", lima_ip_name(ip));
 
 	/* mask all interrupts before resume */
 	mmu_write(LIMA_MMU_INT_MASK, 0);
@@ -102,14 +102,14 @@ int lima_mmu_init(struct lima_ip *ip)
 
 	mmu_write(LIMA_MMU_DTE_ADDR, 0xCAFEBABE);
 	if (mmu_read(LIMA_MMU_DTE_ADDR) != 0xCAFEB000) {
-		dev_err(dev->dev, "mmu %s dte write test fail\n", lima_ip_name(ip));
+		dev_err(dev->dev, "%s dte write test fail\n", lima_ip_name(ip));
 		return -EIO;
 	}
 
 	err = devm_request_irq(dev->dev, ip->irq, lima_mmu_irq_handler,
 			       IRQF_SHARED, lima_ip_name(ip), ip);
 	if (err) {
-		dev_err(dev->dev, "mmu %s fail to request irq\n", lima_ip_name(ip));
+		dev_err(dev->dev, "%s fail to request irq\n", lima_ip_name(ip));
 		return err;
 	}
 
@@ -152,7 +152,7 @@ void lima_mmu_page_fault_resume(struct lima_ip *ip)
 	u32 v;
 
 	if (status & LIMA_MMU_STATUS_PAGE_FAULT_ACTIVE) {
-		dev_info(dev->dev, "mmu resume\n");
+		dev_info(dev->dev, "%s resume\n", lima_ip_name(ip));
 
 		mmu_write(LIMA_MMU_INT_MASK, 0);
 		mmu_write(LIMA_MMU_DTE_ADDR, 0xCAFEBABE);
diff --git a/drivers/gpu/drm/lima/lima_pmu.c b/drivers/gpu/drm/lima/lima_pmu.c
index e397e1146e96..113cb9b215cd 100644
--- a/drivers/gpu/drm/lima/lima_pmu.c
+++ b/drivers/gpu/drm/lima/lima_pmu.c
@@ -21,7 +21,8 @@ static int lima_pmu_wait_cmd(struct lima_ip *ip)
 				 v, v & LIMA_PMU_INT_CMD_MASK,
 				 100, 100000);
 	if (err) {
-		dev_err(dev->dev, "timeout wait pmu cmd\n");
+		dev_err(dev->dev, "%s timeout wait pmu cmd\n",
+			lima_ip_name(ip));
 		return err;
 	}
 
diff --git a/drivers/gpu/drm/lima/lima_pp.c b/drivers/gpu/drm/lima/lima_pp.c
index ac097dd75072..d0d2db0ef1ce 100644
--- a/drivers/gpu/drm/lima/lima_pp.c
+++ b/drivers/gpu/drm/lima/lima_pp.c
@@ -26,8 +26,8 @@ static void lima_pp_handle_irq(struct lima_ip *ip, u32 state)
 	if (state & LIMA_PP_IRQ_MASK_ERROR) {
 		u32 status = pp_read(LIMA_PP_STATUS);
 
-		dev_err(dev->dev, "pp error irq state=%x status=%x\n",
-			state, status);
+		dev_err(dev->dev, "%s error irq state=%x status=%x\n",
+			lima_ip_name(ip), state, status);
 
 		pipe->error = true;
 
@@ -125,7 +125,7 @@ static int lima_pp_soft_reset_async_wait_one(struct lima_ip *ip)
 
 	ret = lima_poll_timeout(ip, lima_pp_soft_reset_poll, 0, 100);
 	if (ret) {
-		dev_err(dev->dev, "pp %s reset time out\n", lima_ip_name(ip));
+		dev_err(dev->dev, "%s reset time out\n", lima_ip_name(ip));
 		return ret;
 	}
 
@@ -190,14 +190,14 @@ static int lima_pp_hard_reset(struct lima_ip *ip)
 	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_STOP_BUS);
 	ret = lima_poll_timeout(ip, lima_pp_bus_stop_poll, 10, 100);
 	if (ret) {
-		dev_err(dev->dev, "pp %s bus stop timeout\n", lima_ip_name(ip));
+		dev_err(dev->dev, "%s bus stop timeout\n", lima_ip_name(ip));
 		return ret;
 	}
 
 	pp_write(LIMA_PP_CTRL, LIMA_PP_CTRL_FORCE_RESET);
 	ret = lima_poll_timeout(ip, lima_pp_hard_reset_poll, 10, 100);
 	if (ret) {
-		dev_err(dev->dev, "pp hard reset timeout\n");
+		dev_err(dev->dev, "%s hard reset timeout\n", lima_ip_name(ip));
 		return ret;
 	}
 
@@ -274,7 +274,7 @@ int lima_pp_init(struct lima_ip *ip)
 	err = devm_request_irq(dev->dev, ip->irq, lima_pp_irq_handler,
 			       IRQF_SHARED, lima_ip_name(ip), ip);
 	if (err) {
-		dev_err(dev->dev, "pp %s fail to request irq\n",
+		dev_err(dev->dev, "%s fail to request irq\n",
 			lima_ip_name(ip));
 		return err;
 	}
@@ -309,7 +309,7 @@ int lima_pp_bcast_init(struct lima_ip *ip)
 	err = devm_request_irq(dev->dev, ip->irq, lima_pp_bcast_irq_handler,
 			       IRQF_SHARED, lima_ip_name(ip), ip);
 	if (err) {
-		dev_err(dev->dev, "pp %s fail to request irq\n",
+		dev_err(dev->dev, "%s fail to request irq\n",
 			lima_ip_name(ip));
 		return err;
 	}
@@ -423,8 +423,9 @@ static void lima_pp_task_error(struct lima_sched_pipe *pipe)
 	for (i = 0; i < pipe->num_processor; i++) {
 		struct lima_ip *ip = pipe->processor[i];
 
-		dev_err(ip->dev->dev, "pp task error %d int_state=%x status=%x\n",
-			i, pp_read(LIMA_PP_INT_STATUS), pp_read(LIMA_PP_STATUS));
+		dev_err(ip->dev->dev, "%s task error %d int_state=%x status=%x\n",
+			lima_ip_name(ip), i, pp_read(LIMA_PP_INT_STATUS),
+			pp_read(LIMA_PP_STATUS));
 
 		lima_pp_hard_reset(ip);
 	}
-- 
2.43.0



* Re: [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-24  2:59 ` [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
@ 2024-01-24 12:38   ` Qiang Yu
  2024-01-29 22:55     ` Erico Nunes
  0 siblings, 1 reply; 14+ messages in thread
From: Qiang Yu @ 2024-01-24 12:38 UTC (permalink / raw)
  To: Erico Nunes
  Cc: anarsoul, christian.koenig, dri-devel, lima, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
	Sumit Semwal, linux-kernel

On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> There are several unexplained and unreproduced cases of rendering
> timeouts with lima, for which one theory is high IRQ latency coming from
> somewhere else in the system.
> This kind of occurrence may cause applications to trigger unnecessary
> GPU resets, or even to hang if they hit an issue in the recovery path.
> Panfrost already does some special handling to account for such
> "spurious timeouts"; it makes sense to have this in lima too, to
> reduce the chance that it hits users.
>
> Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> ---
>  drivers/gpu/drm/lima/lima_sched.c | 31 ++++++++++++++++++++++++++++---
>  1 file changed, 28 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index c3bf8cda8498..814428564637 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -1,6 +1,7 @@
>  // SPDX-License-Identifier: GPL-2.0 OR MIT
>  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
>
> +#include <linux/hardirq.h>
>  #include <linux/iosys-map.h>
>  #include <linux/kthread.h>
>  #include <linux/slab.h>
> @@ -401,9 +402,35 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
>         struct lima_sched_task *task = to_lima_task(job);
>         struct lima_device *ldev = pipe->ldev;
> +       struct lima_ip *ip = pipe->processor[0];
> +       int i;
> +
> +       /*
> +        * If the GPU managed to complete this job's fence, the timeout is
> +        * spurious. Bail out.
> +        */
> +       if (dma_fence_is_signaled(task->fence)) {
> +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
> +
> +       /*
> +        * Lima IRQ handler may take a long time to process an interrupt
> +        * if there is another IRQ handler hogging the processing.
> +        * In order to catch such cases and not report spurious Lima job
> +        * timeouts, synchronize the IRQ handler and re-check the fence
> +        * status.
> +        */
> +       for (i = 0; i < pipe->num_processor; i++)
> +               synchronize_irq(pipe->processor[i]->irq);
> +
I have a question: this timeout handler will also be called on a GP/PP
error IRQ. If we call synchronize_irq() from within the IRQ handler,
will we block ourselves here?

> +       if (dma_fence_is_signaled(task->fence)) {
> +               DRM_WARN("%s unexpectedly high interrupt latency\n", lima_ip_name(ip));
> +               return DRM_GPU_SCHED_STAT_NOMINAL;
> +       }
>
>         if (!pipe->error)
> -               DRM_ERROR("lima job timeout\n");
> +               DRM_ERROR("%s job timeout\n", lima_ip_name(ip));
>
>         drm_sched_stop(&pipe->base, &task->base);
>
> @@ -417,8 +444,6 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
>         if (pipe->bcast_mmu)
>                 lima_mmu_page_fault_resume(pipe->bcast_mmu);
>         else {
> -               int i;
> -
>                 for (i = 0; i < pipe->num_mmu; i++)
>                         lima_mmu_page_fault_resume(pipe->mmu[i]);
>         }
> --
> 2.43.0
>


* Re: [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-24 12:38   ` Qiang Yu
@ 2024-01-29 22:55     ` Erico Nunes
  2024-01-30  1:05       ` Qiang Yu
  0 siblings, 1 reply; 14+ messages in thread
From: Erico Nunes @ 2024-01-29 22:55 UTC (permalink / raw)
  To: Qiang Yu
  Cc: anarsoul, christian.koenig, dri-devel, lima, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
	Sumit Semwal, linux-kernel

On Wed, Jan 24, 2024 at 1:38 PM Qiang Yu <yuq825@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <nunes.erico@gmail.com> wrote:
> >
> > There are several unexplained and unreproduced cases of rendering
> > timeouts with lima, for which one theory is high IRQ latency coming from
> > somewhere else in the system.
> > This kind of occurrence may cause applications to trigger unnecessary
> > GPU resets, or even to hang if they hit an issue in the recovery path.
> > Panfrost already does some special handling to account for such
> > "spurious timeouts"; it makes sense to have this in lima too, to
> > reduce the chance that it hits users.
> >
> > Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> > ---
> >  drivers/gpu/drm/lima/lima_sched.c | 31 ++++++++++++++++++++++++++++---
> >  1 file changed, 28 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> > index c3bf8cda8498..814428564637 100644
> > --- a/drivers/gpu/drm/lima/lima_sched.c
> > +++ b/drivers/gpu/drm/lima/lima_sched.c
> > @@ -1,6 +1,7 @@
> >  // SPDX-License-Identifier: GPL-2.0 OR MIT
> >  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
> >
> > +#include <linux/hardirq.h>
> >  #include <linux/iosys-map.h>
> >  #include <linux/kthread.h>
> >  #include <linux/slab.h>
> > @@ -401,9 +402,35 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> >         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> >         struct lima_sched_task *task = to_lima_task(job);
> >         struct lima_device *ldev = pipe->ldev;
> > +       struct lima_ip *ip = pipe->processor[0];
> > +       int i;
> > +
> > +       /*
> > +        * If the GPU managed to complete this job's fence, the timeout is
> > +        * spurious. Bail out.
> > +        */
> > +       if (dma_fence_is_signaled(task->fence)) {
> > +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > +               return DRM_GPU_SCHED_STAT_NOMINAL;
> > +       }
> > +
> > +       /*
> > +        * Lima IRQ handler may take a long time to process an interrupt
> > +        * if there is another IRQ handler hogging the processing.
> > +        * In order to catch such cases and not report spurious Lima job
> > +        * timeouts, synchronize the IRQ handler and re-check the fence
> > +        * status.
> > +        */
> > +       for (i = 0; i < pipe->num_processor; i++)
> > +               synchronize_irq(pipe->processor[i]->irq);
> > +
> I have a question: this timeout handler will also be called on a GP/PP
> error IRQ. If we call synchronize_irq() from within the IRQ handler,
> will we block ourselves here?

If I understand correctly, this handler is only called by drm_sched
from a workqueue, not by the gp or pp IRQ, and it does not run in any
IRQ context.
So I think this sort of lockup can't happen here.

I ran some additional tests with both timeouts and actual error IRQs
(locally modified Mesa to produce some errored jobs) and was not able
to cause any lockup related to this.

Erico


* Re: [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency
  2024-01-29 22:55     ` Erico Nunes
@ 2024-01-30  1:05       ` Qiang Yu
  0 siblings, 0 replies; 14+ messages in thread
From: Qiang Yu @ 2024-01-30  1:05 UTC (permalink / raw)
  To: Erico Nunes
  Cc: anarsoul, christian.koenig, dri-devel, lima, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
	Sumit Semwal, linux-kernel

On Tue, Jan 30, 2024 at 6:55 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> On Wed, Jan 24, 2024 at 1:38 PM Qiang Yu <yuq825@gmail.com> wrote:
> >
> > On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <nunes.erico@gmail.com> wrote:
> > >
> > > There are several unexplained and unreproduced cases of rendering
> > > timeouts with lima, for which one theory is high IRQ latency coming from
> > > somewhere else in the system.
> > > This kind of occurrence may cause applications to trigger unnecessary
> > > resets of the GPU or even applications to hang if it hits an issue in
> > > the recovery path.
> > > Panfrost already does some special handling to account for such
> > > "spurious timeouts", it makes sense to have this in lima too to reduce
> > > the chance that it hit users.
> > >
> > > Signed-off-by: Erico Nunes <nunes.erico@gmail.com>
> > > ---
> > >  drivers/gpu/drm/lima/lima_sched.c | 31 ++++++++++++++++++++++++++++---
> > >  1 file changed, 28 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> > > index c3bf8cda8498..814428564637 100644
> > > --- a/drivers/gpu/drm/lima/lima_sched.c
> > > +++ b/drivers/gpu/drm/lima/lima_sched.c
> > > @@ -1,6 +1,7 @@
> > >  // SPDX-License-Identifier: GPL-2.0 OR MIT
> > >  /* Copyright 2017-2019 Qiang Yu <yuq825@gmail.com> */
> > >
> > > +#include <linux/hardirq.h>
> > >  #include <linux/iosys-map.h>
> > >  #include <linux/kthread.h>
> > >  #include <linux/slab.h>
> > > @@ -401,9 +402,35 @@ static enum drm_gpu_sched_stat lima_sched_timedout_job(struct drm_sched_job *job
> > >         struct lima_sched_pipe *pipe = to_lima_pipe(job->sched);
> > >         struct lima_sched_task *task = to_lima_task(job);
> > >         struct lima_device *ldev = pipe->ldev;
> > > +       struct lima_ip *ip = pipe->processor[0];
> > > +       int i;
> > > +
> > > +       /*
> > > +        * If the GPU managed to complete this job's fence, the timeout is
> > > +        * spurious. Bail out.
> > > +        */
> > > +       if (dma_fence_is_signaled(task->fence)) {
> > > +               DRM_WARN("%s spurious timeout\n", lima_ip_name(ip));
> > > +               return DRM_GPU_SCHED_STAT_NOMINAL;
> > > +       }
> > > +
> > > +       /*
> > > +        * Lima IRQ handler may take a long time to process an interrupt
> > > +        * if there is another IRQ handler hogging the processing.
> > > +        * In order to catch such cases and not report spurious Lima job
> > > +        * timeouts, synchronize the IRQ handler and re-check the fence
> > > +        * status.
> > > +        */
> > > +       for (i = 0; i < pipe->num_processor; i++)
> > > +               synchronize_irq(pipe->processor[i]->irq);
> > > +
> > I have a question: this timeout handler will also be called on a GP/PP
> > error IRQ. If we call synchronize_irq() from within the IRQ handler,
> > will we block ourselves here?
>
> If I understand correctly, this handler is only called by drm_sched
> from a workqueue, not by the gp or pp IRQ, and it does not run in any
> IRQ context.
> So I think this sort of lockup can't happen here.
>
Oh, right. I misunderstood drm_sched_fault(), which still calls the
timeout handler from a work queue instead of the caller thread.

> I ran some additional tests with both timeouts and actual error IRQs
> (locally modified Mesa to produce some errored jobs) and was not able
> to cause any lockup related to this.
>
> Erico


* Re: [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery
  2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
                   ` (7 preceding siblings ...)
  2024-01-24  2:59 ` [PATCH v2 8/8] drm/lima: standardize debug messages by ip name Erico Nunes
@ 2024-01-30  1:07 ` Qiang Yu
  2024-02-12  8:40   ` Qiang Yu
  8 siblings, 1 reply; 14+ messages in thread
From: Qiang Yu @ 2024-01-30  1:07 UTC (permalink / raw)
  To: Erico Nunes
  Cc: anarsoul, christian.koenig, dri-devel, lima, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
	Sumit Semwal, linux-kernel

Series is Reviewed-by: Qiang Yu <yuq825@gmail.com>

On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <nunes.erico@gmail.com> wrote:
>
> v1 reference:
> https://patchwork.kernel.org/project/dri-devel/cover/20240117031212.1104034-1-nunes.erico@gmail.com/
>
> Changes v1 -> v2:
> - Dropped patch 1 which aimed to fix
> https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415 .
> That will require more testing and an actual fix to the irq/timeout
> handler race. It can be solved separately so I am deferring it to a
> followup patch and keeping that issue open.
>
> - Added patches 2 and 4 to cover "reset time out" and bus stop bit to
> hard reset in gp as well.
>
> - Added handling of all processors in synchronize_irq in patch 5 to
> cover multiple pp. Dropped unnecessary duplicate fence in patch 5.
>
> - Added patch 7 in v2. After some discussion in patch 4 (v1), it seems
> to be reasonable to bump our timeout value so that we further decrease
> the chance of users actually hitting any of these timeouts by default.
>
> - Reworked patch 8 in v2. Since I broadened the work to not only focus
> in pp anymore, I also included the change to the other blocks as well.
>
> - Collected some reviews and acks in unmodified patches.
>
>
> Erico Nunes (8):
>   drm/lima: reset async_reset on pp hard reset
>   drm/lima: reset async_reset on gp hard reset
>   drm/lima: set pp bus_stop bit before hard reset
>   drm/lima: set gp bus_stop bit before hard reset
>   drm/lima: handle spurious timeouts due to high irq latency
>   drm/lima: remove guilty drm_sched context handling
>   drm/lima: increase default job timeout to 10s
>   drm/lima: standardize debug messages by ip name
>
>  drivers/gpu/drm/lima/lima_ctx.c      |  2 +-
>  drivers/gpu/drm/lima/lima_ctx.h      |  1 -
>  drivers/gpu/drm/lima/lima_gp.c       | 39 +++++++++++++++++++++-------
>  drivers/gpu/drm/lima/lima_l2_cache.c |  6 +++--
>  drivers/gpu/drm/lima/lima_mmu.c      | 18 ++++++-------
>  drivers/gpu/drm/lima/lima_pmu.c      |  3 ++-
>  drivers/gpu/drm/lima/lima_pp.c       | 37 ++++++++++++++++++++------
>  drivers/gpu/drm/lima/lima_sched.c    | 38 ++++++++++++++++++++++-----
>  drivers/gpu/drm/lima/lima_sched.h    |  3 +--
>  9 files changed, 107 insertions(+), 40 deletions(-)
>
> --
> 2.43.0
>


* Re: [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery
  2024-01-30  1:07 ` [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Qiang Yu
@ 2024-02-12  8:40   ` Qiang Yu
  0 siblings, 0 replies; 14+ messages in thread
From: Qiang Yu @ 2024-02-12  8:40 UTC (permalink / raw)
  To: Erico Nunes
  Cc: anarsoul, christian.koenig, dri-devel, lima, Maarten Lankhorst,
	Maxime Ripard, Thomas Zimmermann, David Airlie, Daniel Vetter,
	Sumit Semwal, linux-kernel

applied to drm-misc-next

On Tue, Jan 30, 2024 at 9:07 AM Qiang Yu <yuq825@gmail.com> wrote:
>
> Series is Reviewed-by: Qiang Yu <yuq825@gmail.com>
>
> On Wed, Jan 24, 2024 at 11:00 AM Erico Nunes <nunes.erico@gmail.com> wrote:
> >
> > v1 reference:
> > https://patchwork.kernel.org/project/dri-devel/cover/20240117031212.1104034-1-nunes.erico@gmail.com/
> >
> > Changes v1 -> v2:
> > - Dropped patch 1 which aimed to fix
> > https://gitlab.freedesktop.org/mesa/mesa/-/issues/8415 .
> > That will require more testing and an actual fix to the irq/timeout
> > handler race. It can be solved separately so I am deferring it to a
> > followup patch and keeping that issue open.
> >
> > - Added patches 2 and 4 to cover "reset time out" and bus stop bit to
> > hard reset in gp as well.
> >
> > - Added handling of all processors in synchronize_irq in patch 5 to
> > cover multiple pp. Dropped unnecessary duplicate fence in patch 5.
> >
> > - Added patch 7 in v2. After some discussion in patch 4 (v1), it seems
> > to be reasonable to bump our timeout value so that we further decrease
> > the chance of users actually hitting any of these timeouts by default.
> >
> > - Reworked patch 8 in v2. Since I broadened the work to not only focus
> > in pp anymore, I also included the change to the other blocks as well.
> >
> > - Collected some reviews and acks in unmodified patches.
> >
> >
> > Erico Nunes (8):
> >   drm/lima: reset async_reset on pp hard reset
> >   drm/lima: reset async_reset on gp hard reset
> >   drm/lima: set pp bus_stop bit before hard reset
> >   drm/lima: set gp bus_stop bit before hard reset
> >   drm/lima: handle spurious timeouts due to high irq latency
> >   drm/lima: remove guilty drm_sched context handling
> >   drm/lima: increase default job timeout to 10s
> >   drm/lima: standardize debug messages by ip name
> >
> >  drivers/gpu/drm/lima/lima_ctx.c      |  2 +-
> >  drivers/gpu/drm/lima/lima_ctx.h      |  1 -
> >  drivers/gpu/drm/lima/lima_gp.c       | 39 +++++++++++++++++++++-------
> >  drivers/gpu/drm/lima/lima_l2_cache.c |  6 +++--
> >  drivers/gpu/drm/lima/lima_mmu.c      | 18 ++++++-------
> >  drivers/gpu/drm/lima/lima_pmu.c      |  3 ++-
> >  drivers/gpu/drm/lima/lima_pp.c       | 37 ++++++++++++++++++++------
> >  drivers/gpu/drm/lima/lima_sched.c    | 38 ++++++++++++++++++++++-----
> >  drivers/gpu/drm/lima/lima_sched.h    |  3 +--
> >  9 files changed, 107 insertions(+), 40 deletions(-)
> >
> > --
> > 2.43.0
> >


Thread overview: 14+ messages
2024-01-24  2:59 [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Erico Nunes
2024-01-24  2:59 ` [PATCH v2 1/8] drm/lima: reset async_reset on pp hard reset Erico Nunes
2024-01-24  2:59 ` [PATCH v2 2/8] drm/lima: reset async_reset on gp " Erico Nunes
2024-01-24  2:59 ` [PATCH v2 3/8] drm/lima: set pp bus_stop bit before " Erico Nunes
2024-01-24  2:59 ` [PATCH v2 4/8] drm/lima: set gp " Erico Nunes
2024-01-24  2:59 ` [PATCH v2 5/8] drm/lima: handle spurious timeouts due to high irq latency Erico Nunes
2024-01-24 12:38   ` Qiang Yu
2024-01-29 22:55     ` Erico Nunes
2024-01-30  1:05       ` Qiang Yu
2024-01-24  2:59 ` [PATCH v2 6/8] drm/lima: remove guilty drm_sched context handling Erico Nunes
2024-01-24  2:59 ` [PATCH v2 7/8] drm/lima: increase default job timeout to 10s Erico Nunes
2024-01-24  2:59 ` [PATCH v2 8/8] drm/lima: standardize debug messages by ip name Erico Nunes
2024-01-30  1:07 ` [PATCH v2 0/8] drm/lima: fixes and improvements to error recovery Qiang Yu
2024-02-12  8:40   ` Qiang Yu
