dri-devel.lists.freedesktop.org archive mirror
* [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes
@ 2020-09-03 16:48 Yukimasa Sugizaki
  2020-09-03 16:48 ` [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs Yukimasa Sugizaki
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Yukimasa Sugizaki @ 2020-09-03 16:48 UTC (permalink / raw)
  To: dri-devel; +Cc: Yukimasa Sugizaki

From: Yukimasa Sugizaki <ysugi@idein.jp>

Hi,

The current V3D scheduler has two issues: CSD jobs are resubmitted
regardless of whether they previously timed out, and the timer is not
restarted for timed-out CL/CSD jobs (which we wish to continue running).
The second issue is due to a DRM scheduler API change and is fixed in a
way similar to [1].  A kernel command-line option to set the default
timeout value is also added.

I tested this patchset with Piglit and our CSD programs in [2].  Because
it is hard to get the current upstream kernel to work on BCM2711, I used
the kernel from the rpi-5.8.y tree [3].  There are still problems: some
Piglit tests take longer to finish (total runtime increases from 3610
minutes to 3650 minutes), and some result in invalid memory read errors
for unknown reasons:

[17086.230959] v3d fec00000.v3d: MMU error from client CLE (4) at 0xac1000, pte invalid
[17086.238722] v3d fec00000.v3d: MMU error from client CLE (4) at 0x1b61000, pte invalid
[18643.303188] v3d fec00000.v3d: MMU error from client L2T (0) at 0x15bff00, pte invalid
[18655.933748] v3d fec00000.v3d: MMU error from client L2T (0) at 0x15bff00, pte invalid

However, most of the CL/CSD programs now run happily without kernel
warnings or errors.

Regards,
Sugizaki

[1] https://patchwork.kernel.org/patch/11732895/
[2] https://github.com/Idein/py-videocore6
[3] https://github.com/raspberrypi/linux/tree/rpi-5.8.y

Yukimasa Sugizaki (3):
  drm/v3d: Don't resubmit guilty CSD jobs
  drm/v3d: Correctly restart the timer when progress is made
  drm/v3d: Add job timeout module param

 drivers/gpu/drm/v3d/v3d_sched.c | 62 +++++++++++++++++++++++++++++++++--------
 1 file changed, 51 insertions(+), 11 deletions(-)

--
2.7.4

_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs
  2020-09-03 16:48 [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes Yukimasa Sugizaki
@ 2020-09-03 16:48 ` Yukimasa Sugizaki
  2021-02-04 13:54   ` Chema Casanova
  2020-09-03 16:48 ` [PATCH 2/3] drm/v3d: Correctly restart the timer when progress is made Yukimasa Sugizaki
  2020-09-03 16:48 ` [PATCH 3/3] drm/v3d: Add job timeout module param Yukimasa Sugizaki
  2 siblings, 1 reply; 11+ messages in thread
From: Yukimasa Sugizaki @ 2020-09-03 16:48 UTC (permalink / raw)
  To: dri-devel; +Cc: David Airlie, Yukimasa Sugizaki

From: Yukimasa Sugizaki <ysugi@idein.jp>

The previous code misses a check for the timeout error set by
drm_sched_resubmit_jobs(), which results in an infinite GPU reset loop
once a timeout occurs:

[  178.799106] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  178.807836] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[  179.839132] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  179.847865] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[  180.879146] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  180.887925] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[  181.919188] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  181.928002] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
...

This commit adds the check for timeout as in v3d_{bin,render}_job_run():

[   66.408962] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[   66.417734] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[   66.428296] [drm] Skipping CSD job resubmission due to previous error (-125)

where -125 is -ECANCELED.  Users currently have no way other than
inspecting dmesg to check whether the timeout has occurred.

Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 0747614a78f0..001216f22017 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -226,6 +226,17 @@ v3d_csd_job_run(struct drm_sched_job *sched_job)
 	struct dma_fence *fence;
 	int i;

+	/* This error is set to -ECANCELED by drm_sched_resubmit_jobs() if this
+	 * job timed out more than sched_job->sched->hang_limit times.
+	 */
+	int error = sched_job->s_fence->finished.error;
+
+	if (unlikely(error < 0)) {
+		DRM_WARN("Skipping CSD job resubmission due to previous error (%d)\n",
+			 error);
+		return ERR_PTR(error);
+	}
+
 	v3d->csd_job = job;

 	v3d_invalidate_caches(v3d);
--
2.7.4



* [PATCH 2/3] drm/v3d: Correctly restart the timer when progress is made
  2020-09-03 16:48 [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes Yukimasa Sugizaki
  2020-09-03 16:48 ` [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs Yukimasa Sugizaki
@ 2020-09-03 16:48 ` Yukimasa Sugizaki
  2020-09-03 16:48 ` [PATCH 3/3] drm/v3d: Add job timeout module param Yukimasa Sugizaki
  2 siblings, 0 replies; 11+ messages in thread
From: Yukimasa Sugizaki @ 2020-09-03 16:48 UTC (permalink / raw)
  To: dri-devel
  Cc: Emily Deng, David Airlie, Christian König, Yukimasa Sugizaki

From: Yukimasa Sugizaki <ysugi@idein.jp>

The V3D scheduler wants a timed-out job to continue running if it has
made progress.  However, the current DRM scheduler removes the timed-out
job from ring_mirror_list, so the timer is not restarted automatically,
resulting in an infinite timeout.  We need to stop and restart the DRM
scheduler to rearm the timer.

Fixes: 135517d3565b ("drm/scheduler: Avoid accessing freed bad job.")
Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 001216f22017..feef0c749e68 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -312,9 +312,24 @@ v3d_cl_job_timedout(struct drm_sched_job *sched_job, enum v3d_queue q,
 	u32 ctca = V3D_CORE_READ(0, V3D_CLE_CTNCA(q));
 	u32 ctra = V3D_CORE_READ(0, V3D_CLE_CTNRA(q));

+	/* If we've made progress, skip reset and let the timer get
+	 * rearmed.
+	 */
 	if (*timedout_ctca != ctca || *timedout_ctra != ctra) {
 		*timedout_ctca = ctca;
 		*timedout_ctra = ctra;
+
+		/* Because the timed-out job has been removed from
+		 * ring_mirror_list in drm_sched_job_timedout(), we need to
+		 * stop and restart the scheduler to rearm the timer.
+		 * Holding the reset_lock is necessary for concurrent
+		 * v3d_gpu_reset_for_timeout().
+		 */
+		mutex_lock(&v3d->reset_lock);
+		drm_sched_stop(sched_job->sched, sched_job);
+		drm_sched_start(sched_job->sched, sched_job);
+		mutex_unlock(&v3d->reset_lock);
+
 		return;
 	}

@@ -359,6 +374,18 @@ v3d_csd_job_timedout(struct drm_sched_job *sched_job)
 	 */
 	if (job->timedout_batches != batches) {
 		job->timedout_batches = batches;
+
+		/* Because the timed-out job has been removed from
+		 * ring_mirror_list in drm_sched_job_timedout(), we need to
+		 * stop and restart the scheduler to rearm the timer.
+		 * Holding the reset_lock is necessary for concurrent
+		 * v3d_gpu_reset_for_timeout().
+		 */
+		mutex_lock(&v3d->reset_lock);
+		drm_sched_stop(sched_job->sched, sched_job);
+		drm_sched_start(sched_job->sched, sched_job);
+		mutex_unlock(&v3d->reset_lock);
+
 		return;
 	}

--
2.7.4



* [PATCH 3/3] drm/v3d: Add job timeout module param
  2020-09-03 16:48 [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes Yukimasa Sugizaki
  2020-09-03 16:48 ` [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs Yukimasa Sugizaki
  2020-09-03 16:48 ` [PATCH 2/3] drm/v3d: Correctly restart the timer when progress is made Yukimasa Sugizaki
@ 2020-09-03 16:48 ` Yukimasa Sugizaki
  2021-02-04 18:09   ` Chema Casanova
  2 siblings, 1 reply; 11+ messages in thread
From: Yukimasa Sugizaki @ 2020-09-03 16:48 UTC (permalink / raw)
  To: dri-devel; +Cc: David Airlie, Yukimasa Sugizaki

From: Yukimasa Sugizaki <ysugi@idein.jp>

The default timeout is 500 ms, which is too short for some workloads,
including Piglit.  Adding this parameter helps users run heavier
tasks.

Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index feef0c749e68..983efb018560 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -19,11 +19,17 @@
  */

 #include <linux/kthread.h>
+#include <linux/moduleparam.h>

 #include "v3d_drv.h"
 #include "v3d_regs.h"
 #include "v3d_trace.h"

+static uint timeout = 500;
+module_param(timeout, uint, 0444);
+MODULE_PARM_DESC(timeout,
+	"Timeout for a job in ms (0 means infinity and default is 500 ms)");
+
 static struct v3d_job *
 to_v3d_job(struct drm_sched_job *sched_job)
 {
@@ -432,13 +438,13 @@ v3d_sched_init(struct v3d_dev *v3d)
 {
 	int hw_jobs_limit = 1;
 	int job_hang_limit = 0;
-	int hang_limit_ms = 500;
+	long timeout_jiffies = timeout ?
+			msecs_to_jiffies(timeout) : MAX_SCHEDULE_TIMEOUT;
 	int ret;

 	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
 			     &v3d_bin_sched_ops,
-			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     hw_jobs_limit, job_hang_limit, timeout_jiffies,
 			     "v3d_bin");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
@@ -447,8 +453,7 @@ v3d_sched_init(struct v3d_dev *v3d)

 	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
 			     &v3d_render_sched_ops,
-			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     hw_jobs_limit, job_hang_limit, timeout_jiffies,
 			     "v3d_render");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
@@ -459,8 +464,7 @@ v3d_sched_init(struct v3d_dev *v3d)

 	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
 			     &v3d_tfu_sched_ops,
-			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     hw_jobs_limit, job_hang_limit, timeout_jiffies,
 			     "v3d_tfu");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
@@ -472,8 +476,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	if (v3d_has_csd(v3d)) {
 		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
 				     &v3d_csd_sched_ops,
-				     hw_jobs_limit, job_hang_limit,
-				     msecs_to_jiffies(hang_limit_ms),
+				     hw_jobs_limit, job_hang_limit, timeout_jiffies,
 				     "v3d_csd");
 		if (ret) {
 			dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
@@ -484,8 +487,7 @@ v3d_sched_init(struct v3d_dev *v3d)

 		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
 				     &v3d_cache_clean_sched_ops,
-				     hw_jobs_limit, job_hang_limit,
-				     msecs_to_jiffies(hang_limit_ms),
+				     hw_jobs_limit, job_hang_limit, timeout_jiffies,
 				     "v3d_cache_clean");
 		if (ret) {
 			dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
--
2.7.4



* Re: [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs
  2020-09-03 16:48 ` [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs Yukimasa Sugizaki
@ 2021-02-04 13:54   ` Chema Casanova
  0 siblings, 0 replies; 11+ messages in thread
From: Chema Casanova @ 2021-02-04 13:54 UTC (permalink / raw)
  To: Yukimasa Sugizaki, dri-devel; +Cc: David Airlie

I've tested the patch and confirmed that it applies correctly over drm-next.

I've also confirmed that the timeout happens with the test case
described by the developer.

https://github.com/raspberrypi/linux/pull/3816#issuecomment-682251862

Considering this is my first review of a patch on the v3d kernel side, I
think this patch is fine.

Reviewed-by: Jose Maria Casanova Crespo <jmcasanova@igalia.com>

On 3/9/20 18:48, Yukimasa Sugizaki wrote:
> From: Yukimasa Sugizaki <ysugi@idein.jp>
>
> The previous code misses a check for the timeout error set by
> drm_sched_resubmit_jobs(), which results in an infinite GPU reset loop
> if once a timeout occurs:
>
> [  178.799106] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
> [  178.807836] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
> [  179.839132] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
> [  179.847865] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
> [  180.879146] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
> [  180.887925] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
> [  181.919188] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
> [  181.928002] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
> ...
>
> This commit adds the check for timeout as in v3d_{bin,render}_job_run():
>
> [   66.408962] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
> [   66.417734] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
> [   66.428296] [drm] Skipping CSD job resubmission due to previous error (-125)
>
> , where -125 is -ECANCELED, though users currently have no way other
> than inspecting the dmesg to check if the timeout has occurred.
>
> Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
> ---
>   drivers/gpu/drm/v3d/v3d_sched.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
>
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 0747614a78f0..001216f22017 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -226,6 +226,17 @@ v3d_csd_job_run(struct drm_sched_job *sched_job)
>   	struct dma_fence *fence;
>   	int i;
>
> +	/* This error is set to -ECANCELED by drm_sched_resubmit_jobs() if this
> +	 * job timed out more than sched_job->sched->hang_limit times.
> +	 */
> +	int error = sched_job->s_fence->finished.error;
> +
> +	if (unlikely(error < 0)) {
> +		DRM_WARN("Skipping CSD job resubmission due to previous error (%d)\n",
> +			 error);
> +		return ERR_PTR(error);
> +	}
> +
>   	v3d->csd_job = job;
>
>   	v3d_invalidate_caches(v3d);
> --
> 2.7.4
>


* Re: [PATCH 3/3] drm/v3d: Add job timeout module param
  2020-09-03 16:48 ` [PATCH 3/3] drm/v3d: Add job timeout module param Yukimasa Sugizaki
@ 2021-02-04 18:09   ` Chema Casanova
  2021-02-04 19:34     ` Eric Anholt
  0 siblings, 1 reply; 11+ messages in thread
From: Chema Casanova @ 2021-02-04 18:09 UTC (permalink / raw)
  To: Yukimasa Sugizaki, dri-devel; +Cc: David Airlie

On 3/9/20 18:48, Yukimasa Sugizaki wrote:
> From: Yukimasa Sugizaki <ysugi@idein.jp>
>
> The default timeout is 500 ms which is too short for some workloads
> including Piglit.  Adding this parameter will help users to run heavier
> tasks.
>
> Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
> ---
>   drivers/gpu/drm/v3d/v3d_sched.c | 24 +++++++++++++-----------
>   1 file changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index feef0c749e68..983efb018560 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -19,11 +19,17 @@
>    */
>
>   #include <linux/kthread.h>
> +#include <linux/moduleparam.h>
>
>   #include "v3d_drv.h"
>   #include "v3d_regs.h"
>   #include "v3d_trace.h"
>
> +static uint timeout = 500;
> +module_param(timeout, uint, 0444);
> +MODULE_PARM_DESC(timeout,
> +	"Timeout for a job in ms (0 means infinity and default is 500 ms)");
> +

I think the parameter definition could be included in v3d_drv.c as
other drivers do.  Then we would have all future parameters together.
In that case we would also need to include the extern declaration in
v3d_drv.h for the driver variable.

The param name could be more descriptive, like "sched_timeout_ms" in
the lima driver.

I wonder if it would make sense to have different timeout parameters
for the different types of jobs we have.  At least, this series comes
from the need for additional time to complete compute jobs.  This is
what amdgpu does, but we probably don't need something that complex.

I think it would also make sense to increase our default value for
compute jobs.

What do you think?

In any case, I think that having this general scheduling timeout
parameter as other drivers do is a good idea.
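If the parameter moved into v3d_drv.c as suggested, a minimal sketch
might look like the following (the variable and parameter names here are
assumptions mirroring the lima driver, not code from this series):

```c
/* v3d_drv.c -- illustrative sketch only, names assumed */
int v3d_sched_timeout_ms = 500;
module_param_named(sched_timeout_ms, v3d_sched_timeout_ms, int, 0444);
MODULE_PARM_DESC(sched_timeout_ms,
		 "Scheduler timeout for a job in ms (0 means infinity, default 500)");
```

with a matching `extern int v3d_sched_timeout_ms;` in v3d_drv.h so that
v3d_sched_init() can read it.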

Regards,

Chema Casanova

>   static struct v3d_job *
>   to_v3d_job(struct drm_sched_job *sched_job)
>   {
> @@ -432,13 +438,13 @@ v3d_sched_init(struct v3d_dev *v3d)
>   {
>   	int hw_jobs_limit = 1;
>   	int job_hang_limit = 0;
> -	int hang_limit_ms = 500;
> +	long timeout_jiffies = timeout ?
> +			msecs_to_jiffies(timeout) : MAX_SCHEDULE_TIMEOUT;
>   	int ret;
>
>   	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
>   			     &v3d_bin_sched_ops,
> -			     hw_jobs_limit, job_hang_limit,
> -			     msecs_to_jiffies(hang_limit_ms),
> +			     hw_jobs_limit, job_hang_limit, timeout_jiffies,
>   			     "v3d_bin");
>   	if (ret) {
>   		dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
> @@ -447,8 +453,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>
>   	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
>   			     &v3d_render_sched_ops,
> -			     hw_jobs_limit, job_hang_limit,
> -			     msecs_to_jiffies(hang_limit_ms),
> +			     hw_jobs_limit, job_hang_limit, timeout_jiffies,
>   			     "v3d_render");
>   	if (ret) {
>   		dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
> @@ -459,8 +464,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>
>   	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
>   			     &v3d_tfu_sched_ops,
> -			     hw_jobs_limit, job_hang_limit,
> -			     msecs_to_jiffies(hang_limit_ms),
> +			     hw_jobs_limit, job_hang_limit, timeout_jiffies,
>   			     "v3d_tfu");
>   	if (ret) {
>   		dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
> @@ -472,8 +476,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>   	if (v3d_has_csd(v3d)) {
>   		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
>   				     &v3d_csd_sched_ops,
> -				     hw_jobs_limit, job_hang_limit,
> -				     msecs_to_jiffies(hang_limit_ms),
> +				     hw_jobs_limit, job_hang_limit, timeout_jiffies,
>   				     "v3d_csd");
>   		if (ret) {
>   			dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
> @@ -484,8 +487,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>
>   		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
>   				     &v3d_cache_clean_sched_ops,
> -				     hw_jobs_limit, job_hang_limit,
> -				     msecs_to_jiffies(hang_limit_ms),
> +				     hw_jobs_limit, job_hang_limit, timeout_jiffies,
>   				     "v3d_cache_clean");
>   		if (ret) {
>   			dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
> --
> 2.7.4
>


* Re: [PATCH 3/3] drm/v3d: Add job timeout module param
  2021-02-04 18:09   ` Chema Casanova
@ 2021-02-04 19:34     ` Eric Anholt
  2021-02-05 12:28       ` Yukimasa Sugizaki
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Anholt @ 2021-02-04 19:34 UTC (permalink / raw)
  To: Chema Casanova; +Cc: David Airlie, DRI Development, Yukimasa Sugizaki

On Thu, Feb 4, 2021 at 10:09 AM Chema Casanova <jmcasanova@igalia.com> wrote:
>
> On 3/9/20 18:48, Yukimasa Sugizaki wrote:
> > From: Yukimasa Sugizaki <ysugi@idein.jp>
> >
> > The default timeout is 500 ms which is too short for some workloads
> > including Piglit.  Adding this parameter will help users to run heavier
> > tasks.
> >
> > Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
> > ---
> >   drivers/gpu/drm/v3d/v3d_sched.c | 24 +++++++++++++-----------
> >   1 file changed, 13 insertions(+), 11 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> > index feef0c749e68..983efb018560 100644
> > --- a/drivers/gpu/drm/v3d/v3d_sched.c
> > +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> > @@ -19,11 +19,17 @@
> >    */
> >
> >   #include <linux/kthread.h>
> > +#include <linux/moduleparam.h>
> >
> >   #include "v3d_drv.h"
> >   #include "v3d_regs.h"
> >   #include "v3d_trace.h"
> >
> > +static uint timeout = 500;
> > +module_param(timeout, uint, 0444);
> > +MODULE_PARM_DESC(timeout,
> > +     "Timeout for a job in ms (0 means infinity and default is 500 ms)");
> > +
>
> I think that  parameter definition could be included at v3d_drv.c as
> other drivers do. Then we would have all future parameters together. In
> that case we would need also to include the extern definition at
> v3d_drv.h for the driver variable.
>
> The param name could be more descriptive like "sched_timeout_ms" in the
> lima driver.
>
> I wonder if it would make sense to have different timeouts parameters
> for different type of jobs we have. At least this series come from the
> need additional time to complete compute jobs. This is what amdgpu does,
> but we probably don't need something such complex.
>
> I think it would also make sense to increase our default value for the
> compute jobs.
>
> What do you think?
>
> In any case I think that having this general scheduling timeout
> parameter as other drivers do is a good idea.

Having a param for being able to test whether the job completes
eventually is a good idea, but if tests are triggering timeouts, then
you probably need to investigate what's going on in the driver -- it's
not like you want 0.5-second unpreemptible jobs to be easy to produce.

Also, for CS, I wonder if we have another reg besides CSD_CURRENT_CFG4
we could be looking at for "we're making progress on the job".  Half a
second with no discernible progress sounds like a bug.


* Re: [PATCH 3/3] drm/v3d: Add job timeout module param
  2021-02-04 19:34     ` Eric Anholt
@ 2021-02-05 12:28       ` Yukimasa Sugizaki
  2021-02-10 17:59         ` Chema Casanova
  0 siblings, 1 reply; 11+ messages in thread
From: Yukimasa Sugizaki @ 2021-02-05 12:28 UTC (permalink / raw)
  To: Chema Casanova, Eric Anholt; +Cc: David Airlie, DRI Development

On 05/02/2021, Eric Anholt <eric@anholt.net> wrote:
> On Thu, Feb 4, 2021 at 10:09 AM Chema Casanova <jmcasanova@igalia.com>
> wrote:
>>
>> On 3/9/20 18:48, Yukimasa Sugizaki wrote:
>> > From: Yukimasa Sugizaki <ysugi@idein.jp>
>> >
>> > The default timeout is 500 ms which is too short for some workloads
>> > including Piglit.  Adding this parameter will help users to run heavier
>> > tasks.
>> >
>> > Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
>> > ---
>> >   drivers/gpu/drm/v3d/v3d_sched.c | 24 +++++++++++++-----------
>> >   1 file changed, 13 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/drivers/gpu/drm/v3d/v3d_sched.c
>> > b/drivers/gpu/drm/v3d/v3d_sched.c
>> > index feef0c749e68..983efb018560 100644
>> > --- a/drivers/gpu/drm/v3d/v3d_sched.c
>> > +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>> > @@ -19,11 +19,17 @@
>> >    */
>> >
>> >   #include <linux/kthread.h>
>> > +#include <linux/moduleparam.h>
>> >
>> >   #include "v3d_drv.h"
>> >   #include "v3d_regs.h"
>> >   #include "v3d_trace.h"
>> >
>> > +static uint timeout = 500;
>> > +module_param(timeout, uint, 0444);
>> > +MODULE_PARM_DESC(timeout,
>> > +     "Timeout for a job in ms (0 means infinity and default is 500
>> > ms)");
>> > +
>>
>> I think that  parameter definition could be included at v3d_drv.c as
>> other drivers do. Then we would have all future parameters together. In
>> that case we would need also to include the extern definition at
>> v3d_drv.h for the driver variable.
>>
>> The param name could be more descriptive like "sched_timeout_ms" in the
>> lima driver.
>>
>> I wonder if it would make sense to have different timeouts parameters
>> for different type of jobs we have. At least this series come from the
>> need additional time to complete compute jobs. This is what amdgpu does,
>> but we probably don't need something such complex.
>>
>> I think it would also make sense to increase our default value for the
>> compute jobs.
>>
>> What do you think?
>>
>> In any case I think that having this general scheduling timeout
>> parameter as other drivers do is a good idea.

I agree with your observations.
I'm going to add bin_timeout_ms, render_timeout_ms, tfu_timeout_ms,
v3d_timeout_ms, and cache_clean_timeout_ms parameters to v3d_drv.c
instead of the single timeout parameter.
Though not sophisticated, this should be enough.

> Having a param for being able to test if the job completes eventually
> is a good idea, but if tests are triggering timeouts, then you
> probably need to investigate what's going on in the driver -- it's not
> like you want .5second unpreemptible jobs to be easy to produce.
>
> Also, for CS, I wonder if we have another reg besides CSD_CURRENT_CFG4
> we could be looking at for "we're making progress on the job".  Half a
> second with no discernible progress sounds like a bug.

If binning/render/TFU/cache_clean jobs keep running for over 0.5
seconds, then it should be a bug, because they normally complete within
that period.
However, a CSD job can do anything.
For example, SaschaWillems's ray tracing example requires about 0.7
seconds to compose a frame with a compute shader with Igalia's Vulkan
driver.
Another example is our matrix computation library for QPU:
https://github.com/Idein/qmkl6
It requires about 1.1 seconds to multiply two 2048x2048 matrices.
Because in general we do not know what the expected behavior of a QPU
program is, we also cannot detect the progress the program is making.
This is why we want the timeout parameters to be modifiable.

Regards,
Y. Sugizaki.


* Re: [PATCH 3/3] drm/v3d: Add job timeout module param
  2021-02-05 12:28       ` Yukimasa Sugizaki
@ 2021-02-10 17:59         ` Chema Casanova
  2021-02-11  6:31           ` Yukimasa Sugizaki
  0 siblings, 1 reply; 11+ messages in thread
From: Chema Casanova @ 2021-02-10 17:59 UTC (permalink / raw)
  To: Yukimasa Sugizaki, Eric Anholt; +Cc: David Airlie, DRI Development


On 5/2/21 13:28, Yukimasa Sugizaki wrote:
> On 05/02/2021, Eric Anholt <eric@anholt.net> wrote:
>> On Thu, Feb 4, 2021 at 10:09 AM Chema Casanova <jmcasanova@igalia.com>
>> wrote:
>>> On 3/9/20 18:48, Yukimasa Sugizaki wrote:
>>>> From: Yukimasa Sugizaki <ysugi@idein.jp>
>>>>
>>>> The default timeout is 500 ms which is too short for some workloads
>>>> including Piglit.  Adding this parameter will help users to run heavier
>>>> tasks.
>>>>
>>>> Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
>>>> ---
>>>>    drivers/gpu/drm/v3d/v3d_sched.c | 24 +++++++++++++-----------
>>>>    1 file changed, 13 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c
>>>> b/drivers/gpu/drm/v3d/v3d_sched.c
>>>> index feef0c749e68..983efb018560 100644
>>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>>> @@ -19,11 +19,17 @@
>>>>     */
>>>>
>>>>    #include <linux/kthread.h>
>>>> +#include <linux/moduleparam.h>
>>>>
>>>>    #include "v3d_drv.h"
>>>>    #include "v3d_regs.h"
>>>>    #include "v3d_trace.h"
>>>>
>>>> +static uint timeout = 500;
>>>> +module_param(timeout, uint, 0444);
>>>> +MODULE_PARM_DESC(timeout,
>>>> +     "Timeout for a job in ms (0 means infinity and default is 500
>>>> ms)");
>>>> +
>>> I think that  parameter definition could be included at v3d_drv.c as
>>> other drivers do. Then we would have all future parameters together. In
>>> that case we would need also to include the extern definition at
>>> v3d_drv.h for the driver variable.
>>>
>>> The param name could be more descriptive like "sched_timeout_ms" in the
>>> lima driver.
>>>
>>> I wonder if it would make sense to have different timeouts parameters
>>> for different type of jobs we have. At least this series come from the
>>> need additional time to complete compute jobs. This is what amdgpu does,
>>> but we probably don't need something such complex.
>>>
>>> I think it would also make sense to increase our default value for the
>>> compute jobs.
>>>
>>> What do you think?
>>>
>>> In any case I think that having this general scheduling timeout
>>> parameter as other drivers do is a good idea.
> I agree with your observations.
> I'm going to add bin_timeout_ms, render_timeout_ms, tfu_timeout_ms,
> v3d_timeout_ms, and cache_clean_timeout_ms parameters to v3d_drv.c
> instead of the sole timeout parameter.
> Though not sophisticated, this should be enough.

>> Having a param for being able to test if the job completes eventually
>> is a good idea, but if tests are triggering timeouts, then you
>> probably need to investigate what's going on in the driver -- it's not
>> like you want .5second unpreemptible jobs to be easy to produce.
>>
>> Also, for CS, I wonder if we have another reg besides CSD_CURRENT_CFG4
>> we could be looking at for "we're making progress on the job".  Half a
>> second with no discernible progress sounds like a bug.
> If binning/render/TFU/cache_clean jobs keep running over 0.5 seconds,
> then it should be a bug because they normally complete processing
> within the period.
> However, a CSD job can do anything.
> For example, SaschaWillems's ray tracing example requires about 0.7
> seconds to compose a frame with compute shader with Igalia's Vulkan
> driver.

I've been checking the values of the registers with SaschaWillems's 
computeraytracing on job submission, timeout evaluation and job destroy. 
The driver detects progress: CSD_CURRENT_CFG4 decreases as expected 
based on the number of pending batches. The debug info below starts at 
262144 - 1, and we can see it ends at 0xffffffff. So, I think that 
using CSD_CURRENT_CFG4 is working as expected. We can also see that 
CSD_CURRENT_ID0 and CSD_CURRENT_ID1 could be used to track progress, as 
their values change for the X, Y and Z components of the current 
workgroup ID, but comparisons on CSD_CURRENT_CFG4 are simpler.

[10661.766782] [drm] CSD launched: timeout_batches (0x00000000) | cfg4 (0x0003fff9) | id0 (0x00000080) | id1 (0x00000000) | status (0x00000056)
[10662.272345] [drm] CSD timeout check: timeout_batches (0x00000000) -> cfg4 (0x0002dfaf) | id0 (0x00050000) | id1 (0x00000024) | status (0x00000056)
[10662.792324] [drm] CSD timeout check: timeout_batches (0x0002dfaf) -> cfg4 (0x00018c1f) | id0 (0x003e0000) | id1 (0x0000004e) | status (0x00000056)
[10663.302350] [drm] CSD timeout check: timeout_batches (0x00018c1f) -> cfg4 (0x0000724f) | id0 (0x005b0000) | id1 (0x00000071) | status (0x00000056)
[10663.487139] [drm] CSD free: timeout_batches (0x0000724f) -> cfg4 (0xffffffff) | id0 (0x00000000) | id1 (0x00010000) | status (0x00000060)

> Another example is our matrix computation library for QPU:
> https://github.com/Idein/qmkl6
> It requires about 1.1 seconds to multiply two 2048x2048 matrices.
> Because in general we do not know what the expected behavior of a
> QPU program is, we also cannot detect the progress the program is
> making.
> This is why we want the timeout parameters to be modifiable.

In the case of qmkl6, all jobs are launched with CSD_CURRENT_CFG4 
initialized to 7 (the number of batches minus 1), so when the job 
starts, all the batches are already executing. The HW therefore cannot 
report any progress, and CFG4 reports that there are no pending batches 
queued with ~0 (0xffffffff).

This is the trace of the v3d_submit_csd_ioctl calls showing CFG4:

      sscal-qmkl6-26222   [002] .... 11978.399780: v3d_submit_csd_ioctl: dev=1, CFG4 0x00000007, CFG5 0x00560000, CFG6 0x00280000
      sscal-qmkl6-26222   [002] .... 11978.518960: v3d_submit_csd_ioctl: dev=1, CFG4 0x00000007, CFG5 0x00560000, CFG6 0x00280000

Here is an example of a process finishing before reaching the timeout:

[10339.374376] [drm] CSD launched: timeout_batches (0x00000000) | cfg4 (0x00000001) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b34)
[10339.470097] [drm] CSD free: timeout_batches (0x00000000) -> cfg4 (0xffffffff) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b40)

Here is an example of a job that reaches the timeout and triggers a RESET:

[10339.980849] [drm] CSD launched: timeout_batches (0x00000000) | cfg4 (0x00000000) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b44)
[10340.497731] [drm] CSD timeout: timeout_batches (0x00000000) -> cfg4 (0xffffffff) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b44)
[10341.007815] [drm] CSD timeout RESET: timeout_batches (0xffffffff) | cfg4 (0xffffffff) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b44)
[10341.008143] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[10341.008241] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[10341.008311] [drm] Skipping CSD job resubmission due to previous error (-125)
[10341.008375] [drm] CSD free: timeout_batches (0xffffffff) -> cfg4 (0x00000000) | id0 (0x00000000) | id1 (0x00000000) | status (0x00000000)

In the general case we need to reach the timeout at least twice before 
resetting the GPU: jobs start with timeout_batches at 0; at the first 
timeout evaluation the value is updated to 0xffffffff; at the next 
evaluation there is no progress, since all batches have already been 
submitted, and we reset the GPU.
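The progress check being discussed can be modeled in a few lines of standalone C (a sketch of the mechanism, not the actual driver code): the timeout handler snapshots CSD_CURRENT_CFG4 into timeout_batches, and only declares the job hung when the value stops changing between evaluations.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Standalone model of the CSD timeout check.  cfg4 counts down as
 * batches are issued and reads back 0xffffffff once every batch has
 * been submitted to the QPUs. */
struct csd_timeout_state {
	uint32_t timeout_batches; /* cfg4 snapshot from the last check */
};

/* Returns true when the job should be considered hung, i.e. no
 * progress was observed since the previous timeout evaluation. */
bool csd_timedout(struct csd_timeout_state *s, uint32_t cfg4)
{
	if (cfg4 != s->timeout_batches) {
		s->timeout_batches = cfg4; /* progress: rearm the timer */
		return false;
	}
	return true; /* no progress: reset the GPU */
}
```

Replaying the log values above against this model shows why at least two timeout periods elapse before a reset: the first evaluation after the last batch is issued still sees cfg4 change (to 0xffffffff), and only the second sees it unchanged.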

I don't know whether the algorithms in qmkl6 or py-videocore6 could be 
launched using more batches. But if they cannot, adding the timeout 
module parameter seems to be our first solution. It would probably be 
enough to add a general "timeout_ms" parameter for debugging any kind 
of job, plus a specific "csd_timeout_ms" that only applies to CSD jobs 
and could be used by systems that need to execute them. When the CSD 
timeout is reached we could notify with a DRM_WARN suggesting 
increasing csd_timeout_ms.

Kind regards,

Chema Casanova


_______________________________________________
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


* Re: [PATCH 3/3] drm/v3d: Add job timeout module param
  2021-02-10 17:59         ` Chema Casanova
@ 2021-02-11  6:31           ` Yukimasa Sugizaki
  0 siblings, 0 replies; 11+ messages in thread
From: Yukimasa Sugizaki @ 2021-02-11  6:31 UTC (permalink / raw)
  To: Chema Casanova; +Cc: David Airlie, DRI Development

On 11/02/2021, Chema Casanova <jmcasanova@igalia.com> wrote:
>
> On 5/2/21 13:28, Yukimasa Sugizaki wrote:
>> On 05/02/2021, Eric Anholt <eric@anholt.net> wrote:
>>> On Thu, Feb 4, 2021 at 10:09 AM Chema Casanova <jmcasanova@igalia.com>
>>> wrote:
>>>> On 3/9/20 18:48, Yukimasa Sugizaki wrote:
>>>>> From: Yukimasa Sugizaki <ysugi@idein.jp>
>>>>>
>>>>> The default timeout is 500 ms which is too short for some workloads
>>>>> including Piglit.  Adding this parameter will help users to run
>>>>> heavier
>>>>> tasks.
>>>>>
>>>>> Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
>>>>> ---
>>>>>    drivers/gpu/drm/v3d/v3d_sched.c | 24 +++++++++++++-----------
>>>>>    1 file changed, 13 insertions(+), 11 deletions(-)
>>>>>
>>>>> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c
>>>>> b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>> index feef0c749e68..983efb018560 100644
>>>>> --- a/drivers/gpu/drm/v3d/v3d_sched.c
>>>>> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
>>>>> @@ -19,11 +19,17 @@
>>>>>     */
>>>>>
>>>>>    #include <linux/kthread.h>
>>>>> +#include <linux/moduleparam.h>
>>>>>
>>>>>    #include "v3d_drv.h"
>>>>>    #include "v3d_regs.h"
>>>>>    #include "v3d_trace.h"
>>>>>
>>>>> +static uint timeout = 500;
>>>>> +module_param(timeout, uint, 0444);
>>>>> +MODULE_PARM_DESC(timeout,
>>>>> +     "Timeout for a job in ms (0 means infinity and default is 500
>>>>> ms)");
>>>>> +
>>>> I think that  parameter definition could be included at v3d_drv.c as
>>>> other drivers do. Then we would have all future parameters together. In
>>>> that case we would need also to include the extern definition at
>>>> v3d_drv.h for the driver variable.
>>>>
>>>> The param name could be more descriptive like "sched_timeout_ms" in the
>>>> lima driver.
>>>>
>>>> I wonder if it would make sense to have different timeout parameters
>>>> for the different types of jobs we have. At least this series comes
>>>> from the need for additional time to complete compute jobs. This is
>>>> what amdgpu does, but we probably don't need something that complex.
>>>>
>>>> I think it would also make sense to increase our default value for the
>>>> compute jobs.
>>>>
>>>> What do you think?
>>>>
>>>> In any case I think that having this general scheduling timeout
>>>> parameter as other drivers do is a good idea.
>> I agree with your observations.
>> I'm going to add bin_timeout_ms, render_timeout_ms, tfu_timeout_ms,
>> v3d_timeout_ms, and cache_clean_timeout_ms parameters to v3d_drv.c
>> instead of the sole timeout parameter.
>> Though not sophisticated, this should be enough.
>
>>> Having a param for being able to test if the job completes eventually
>>> is a good idea, but if tests are triggering timeouts, then you
>>> probably need to investigate what's going on in the driver -- it's not
>>> like you want 0.5-second unpreemptible jobs to be easy to produce.
>>>
>>> Also, for CS, I wonder if we have another reg besides CSD_CURRENT_CFG4
>>> we could be looking at for "we're making progress on the job".  Half a
>>> second with no discernible progress sounds like a bug.
>> If binning/render/TFU/cache_clean jobs keep running for over 0.5
>> seconds, then it is likely a bug, because they normally complete
>> within that period.
>> However, a CSD job can do anything.
>> For example, SaschaWillems's ray tracing example takes about 0.7
>> seconds to compose a frame with a compute shader using Igalia's
>> Vulkan driver.
>
> I've been checking the values of the registers with SaschaWillems's
> computeraytracing on job submission, timeout evaluation and job destroy.
> The driver detects progress: CSD_CURRENT_CFG4 decreases as expected
> based on the number of pending batches. The debug info below starts at
> 262144 - 1, and we can see it ends at 0xffffffff. So, I think
> that using CSD_CURRENT_CFG4 is working as expected. We can also see
> that CSD_CURRENT_ID0 and CSD_CURRENT_ID1 could be used to track
> progress, as their values change for the X, Y and Z components of the
> current workgroup ID, but comparisons on CSD_CURRENT_CFG4 are simpler.
>
> [10661.766782] [drm] CSD launched: timeout_batches (0x00000000) | cfg4
> (0x0003fff9) | id0 (0x00000080) | id1 (0x00000000) | status (0x00000056)
> [10662.272345] [drm] CSD timeout check: timeout_batches (0x00000000) ->
> cfg4 (0x0002dfaf) | id0 (0x00050000) | id1 (0x00000024) | status
> (0x00000056)
> [10662.792324] [drm] CSD timeout check: timeout_batches (0x0002dfaf) ->
> cfg4 (0x00018c1f) | id0 (0x003e0000) | id1 (0x0000004e) | status
> (0x00000056)
> [10663.302350] [drm] CSD timeout check: timeout_batches (0x00018c1f) ->
> cfg4 (0x0000724f) | id0 (0x005b0000) | id1 (0x00000071) | status
> (0x00000056)
> [10663.487139] [drm] CSD free: timeout_batches (0x0000724f) -> cfg4
> (0xffffffff) | id0 (0x00000000) | id1 (0x00010000) | status (0x00000060)
>
>> Another example is our matrix computation library for QPU:
>> https://github.com/Idein/qmkl6
>> It requires about 1.1 seconds to multiply two 2048x2048 matrices.
>> Because in general we do not know what the expected behavior of a
>> QPU program is, we also cannot detect the progress the program is
>> making.
>> This is why we want the timeout parameters to be modifiable.
>
> In the case of qmkl6, all jobs are launched with CSD_CURRENT_CFG4
> initialized to 7 (the number of batches minus 1), so when the job
> starts, all the batches are already executing. The HW therefore cannot
> report any progress, and CFG4 reports that there are no pending batches
> queued with ~0 (0xffffffff).
>
> This is the trace of v3d_submit_csd_ioctl calls showing CFG4.
>
>       sscal-qmkl6-26222   [002] .... 11978.399780: v3d_submit_csd_ioctl:
> dev=1, CFG4 0x00000007, CFG5 0x00560000, CFG6 0x00280000
>       sscal-qmkl6-26222   [002] .... 11978.518960: v3d_submit_csd_ioctl:
> dev=1, CFG4 0x00000007, CFG5 0x00560000, CFG6 0x00280000
>
> Here is an example of a process finishing before reaching the timeout:
>
> [10339.374376] [drm] CSD launched: timeout_batches (0x00000000) | cfg4
> (0x00000001) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b34)
> [10339.470097] [drm] CSD free: timeout_batches (0x00000000) -> cfg4
> (0xffffffff) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b40)
>
> Here is an example of a job that reaches the timeout and triggers a RESET:
>
> [10339.980849] [drm] CSD launched: timeout_batches (0x00000000) | cfg4
> (0x00000000) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b44)
> [10340.497731] [drm] CSD timeout: timeout_batches (0x00000000) -> cfg4
> (0xffffffff) | id0 (0x00080800) | id1 (0x00000000) | status (0x00000b44)
> [10341.007815] [drm] CSD timeout RESET: timeout_batches (0xffffffff) |
> cfg4 (0xffffffff) | id0 (0x00080800) | id1 (0x00000000) | status
> (0x00000b44)
> [10341.008143] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting
> GPU for hang.
> [10341.008241] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR*
> V3D_ERR_STAT: 0x00001000
> [10341.008311] [drm] Skipping CSD job resubmission due to previous error
> (-125)
> [10341.008375] [drm] CSD free: timeout_batches (0xffffffff) -> cfg4
> (0x00000000) | id0 (0x00000000) | id1 (0x00000000) | status (0x00000000)
>
> In the general case we need to reach the timeout at least twice before
> resetting the GPU: jobs start with timeout_batches at 0; at the first
> timeout evaluation the value is updated to 0xffffffff; at the next
> evaluation there is no progress, since all batches have already been
> submitted, and we reset the GPU.
>
> I don't know whether the algorithms in qmkl6 or py-videocore6 could be
> launched using more batches. But if they cannot, adding the timeout
> module parameter seems to be our first solution. It would probably be
> enough to add a general "timeout_ms" parameter for debugging any kind
> of job, plus a specific "csd_timeout_ms" that only applies to CSD jobs
> and could be used by systems that need to execute them. When the CSD
> timeout is reached we could notify with a DRM_WARN suggesting
> increasing csd_timeout_ms.
>
> Kind regards,
>
> Chema Casanova

Yes, our CSD programs perform their computation with just eight batches
(one batch per QPU), for two reasons: one is efficiency, and the other
is that we cannot always divide the problem so that it runs within 500
milliseconds.
So we too think it makes sense to add the timeout parameters.
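To make the batch-count argument concrete, here is a rough standalone model of my reading of the thread (not driver code): CFG4 is programmed with the number of batches minus one, and with eight QPUs a dispatch of eight or fewer batches is issued in full immediately, so the timeout handler can never observe CFG4 counting down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Per this thread: one batch can execute per QPU, and there are 8
 * QPUs on this part. */
#define QPU_COUNT 8u

/* CFG4 is written with the number of batches minus one. */
uint32_t cfg4_for_batches(uint32_t num_batches)
{
	return num_batches - 1;
}

/* Progress is only observable via CFG4 when there are more batches
 * than can be in flight at once; otherwise CFG4 drops straight to
 * ~0 and successive timeout checks see no change. */
bool progress_observable(uint32_t num_batches)
{
	return num_batches > QPU_COUNT;
}
```

This matches the traces above: qmkl6 submits with CFG4 = 0x00000007 (eight batches, no observable progress), while the computeraytracing demo starts at 262144 - 1 and counts down visibly.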

Thank you for the suggestion on the parameters.
Though I've finished implementing them, we are facing an issue where,
very occasionally, v3d_irq() is invoked by a finished CSD job during
the reset performed by v3d_gpu_reset_for_timeout().
In this situation, v3d_irq() reads V3D_CTL_INT_STS as 0xdeadbeef (no
response), and the remaining operations cause a kernel panic.
I suspect this can be fixed by holding v3d->reset_lock in v3d_irq(),
but we have not finished testing it yet.
I appreciate your patience.
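The shape of the guard being tested might look like the following. This is purely an illustrative standalone model of the race, not the actual fix: the real change would involve taking v3d->reset_lock in v3d_irq(), and the flag here only stands in for that synchronization.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* While the GPU is held in reset, MMIO reads return a dead-bus
 * pattern instead of real status bits. */
#define DEAD_READ 0xdeadbeefu

/* Model of an IRQ handler racing with a GPU reset: bail out rather
 * than act on garbage interrupt status.  In the driver, the
 * reset_in_progress condition would be provided by v3d->reset_lock
 * rather than a plain flag. */
bool irq_should_bail(bool reset_in_progress, uint32_t int_sts)
{
	return reset_in_progress || int_sts == DEAD_READ;
}
```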

Regards,
Y. Sugizaki.


* [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs
  2020-09-04  8:15 [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes Yukimasa Sugizaki
@ 2020-09-04  8:15 ` Yukimasa Sugizaki
  0 siblings, 0 replies; 11+ messages in thread
From: Yukimasa Sugizaki @ 2020-09-04  8:15 UTC (permalink / raw)
  To: dri-devel; +Cc: David Airlie, Yukimasa Sugizaki

The previous code misses a check for the timeout error set by
drm_sched_resubmit_jobs(), which results in an infinite GPU reset loop
once a timeout occurs:

[  178.799106] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  178.807836] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[  179.839132] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  179.847865] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[  180.879146] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  180.887925] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[  181.919188] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[  181.928002] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
...

This commit adds the check for the timeout error, as in v3d_{bin,render}_job_run():

[   66.408962] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* Resetting GPU for hang.
[   66.417734] v3d fec00000.v3d: [drm:v3d_reset [v3d]] *ERROR* V3D_ERR_STAT: 0x00001000
[   66.428296] [drm] Skipping CSD job resubmission due to previous error (-125)

where -125 is -ECANCELED. Users currently have no way other than
inspecting dmesg to check whether the timeout occurred.

Signed-off-by: Yukimasa Sugizaki <ysugi@idein.jp>
---
 drivers/gpu/drm/v3d/v3d_sched.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 0747614a78f0..001216f22017 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -226,6 +226,17 @@ v3d_csd_job_run(struct drm_sched_job *sched_job)
 	struct dma_fence *fence;
 	int i;

+	/* This error is set to -ECANCELED by drm_sched_resubmit_jobs() if this
+	 * job timed out more than sched_job->sched->hang_limit times.
+	 */
+	int error = sched_job->s_fence->finished.error;
+
+	if (unlikely(error < 0)) {
+		DRM_WARN("Skipping CSD job resubmission due to previous error (%d)\n",
+			 error);
+		return ERR_PTR(error);
+	}
+
 	v3d->csd_job = job;

 	v3d_invalidate_caches(v3d);
--
2.7.4



end of thread, other threads:[~2021-02-11  6:31 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-03 16:48 [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes Yukimasa Sugizaki
2020-09-03 16:48 ` [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs Yukimasa Sugizaki
2021-02-04 13:54   ` Chema Casanova
2020-09-03 16:48 ` [PATCH 2/3] drm/v3d: Correctly restart the timer when progress is made Yukimasa Sugizaki
2020-09-03 16:48 ` [PATCH 3/3] drm/v3d: Add job timeout module param Yukimasa Sugizaki
2021-02-04 18:09   ` Chema Casanova
2021-02-04 19:34     ` Eric Anholt
2021-02-05 12:28       ` Yukimasa Sugizaki
2021-02-10 17:59         ` Chema Casanova
2021-02-11  6:31           ` Yukimasa Sugizaki
2020-09-04  8:15 [PATCH 0/3] drm/v3d: CL/CSD job timeout fixes Yukimasa Sugizaki
2020-09-04  8:15 ` [PATCH 1/3] drm/v3d: Don't resubmit guilty CSD jobs Yukimasa Sugizaki
