* [PATCH v3 00/15] drm/panfrost: Misc improvements
@ 2021-06-25 13:33 Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr Boris Brezillon
                   ` (14 more replies)
  0 siblings, 15 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Hello,

This is a merge of [1] and [2] since the second series depends on
patches in the preparatory series.

The main change in this v3 is the addition of patches 1 and 9, which
simplify the reset synchronisation as suggested by Daniel.

Also addressed Steve's comments, and IGT tests are now passing reliably
(which doesn't guarantee much, but that's still an improvement since
pan-reset was unreliable with v2).

Regards,

Boris

Boris Brezillon (14):
  drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  drm/panfrost: Drop the pfdev argument passed to
    panfrost_exception_name()
  drm/panfrost: Expose exception types to userspace
  drm/panfrost: Do the exception -> string translation using a table
  drm/panfrost: Expose a helper to trigger a GPU reset
  drm/panfrost: Use a threaded IRQ for job interrupts
  drm/panfrost: Simplify the reset serialization logic
  drm/panfrost: Make sure job interrupts are masked before resetting
  drm/panfrost: Disable the AS on unhandled page faults
  drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  drm/panfrost: Don't reset the GPU on job faults unless we really have
    to
  drm/panfrost: Kill in-flight jobs on FD close

Steven Price (1):
  drm/panfrost: Queue jobs on the hardware

 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c  |   2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c    |   3 +-
 drivers/gpu/drm/lima/lima_sched.c          |   3 +-
 drivers/gpu/drm/panfrost/panfrost_device.c | 139 +++--
 drivers/gpu/drm/panfrost/panfrost_device.h |  15 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c    |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    | 630 +++++++++++++++------
 drivers/gpu/drm/panfrost/panfrost_mmu.c    |  41 +-
 drivers/gpu/drm/panfrost/panfrost_regs.h   |   3 -
 drivers/gpu/drm/scheduler/sched_main.c     |   6 +-
 drivers/gpu/drm/v3d/v3d_sched.c            |  10 +-
 include/drm/gpu_scheduler.h                |   5 +-
 include/uapi/drm/panfrost_drm.h            |  71 +++
 13 files changed, 679 insertions(+), 251 deletions(-)

-- 
2.31.1



* [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 15:07   ` Steven Price
  2021-06-25 13:33 ` [PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate Boris Brezillon
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize the
timeout work with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.
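
As a rough illustration (hypothetical "mydrv" names, nothing here is
taken from an existing driver), a driver that wants serialized timeout
handling would allocate one ordered workqueue and hand it to every
scheduler it creates:

	/* One ordered workqueue shared by all job queues of the device,
	 * so their timeout handlers can never run concurrently.
	 */
	struct workqueue_struct *tdr_wq;
	int ret;

	tdr_wq = alloc_ordered_workqueue("mydrv-tdr", 0);
	if (!tdr_wq)
		return -ENOMEM;

	ret = drm_sched_init(&queue->sched, &mydrv_sched_ops,
			     hw_submission, hang_limit,
			     msecs_to_jiffies(timeout_ms), tdr_wq,
			     NULL, "mydrv-queue");

Passing NULL for the new parameter keeps the current behaviour (the
timeout work is queued on system_wq).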

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c         |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c    |  6 +++++-
 drivers/gpu/drm/v3d/v3d_sched.c           | 10 +++++-----
 include/drm/gpu_scheduler.h               |  5 ++++-
 7 files changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..532636ea20bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 
 	r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
 			   num_hw_submission, amdgpu_job_hang_limit,
-			   timeout, sched_score, ring->name);
+			   timeout, NULL, sched_score, ring->name);
 	if (r) {
 		DRM_ERROR("Failed to create scheduler on ring %s.\n",
 			  ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 19826e504efc..feb6da1b6ceb 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 
 	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-			     msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
+			     msecs_to_jiffies(500), NULL, NULL,
+			     dev_name(gpu->dev));
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index ecf3267334ff..dba8329937a3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
 	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
-			      lima_job_hang_limit, msecs_to_jiffies(timeout),
+			      lima_job_hang_limit,
+			      msecs_to_jiffies(timeout), NULL,
 			      NULL, name);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 682f2161b999..8ff79fd49577 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
 		ret = drm_sched_init(&js->queue[j].sched,
 				     &panfrost_sched_ops,
-				     1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
+				     1, 0,
+				     msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
 				     NULL, "pan_js");
 		if (ret) {
 			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c0a2f8f8d472..a937d0529944 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -837,6 +837,8 @@ static int drm_sched_main(void *param)
  * @hw_submission: number of hw submissions that can be in flight
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
+ * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
+ *		used
  * @score: optional score atomic shared with other schedulers
  * @name: name used for debugging
  *
@@ -844,7 +846,8 @@ static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
-		   unsigned hw_submission, unsigned hang_limit, long timeout,
+		   unsigned hw_submission, unsigned hang_limit,
+		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name)
 {
 	int i, ret;
@@ -852,6 +855,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
 	sched->timeout = timeout;
+	sched->timeout_wq = timeout_wq ? : system_wq;
 	sched->hang_limit = hang_limit;
 	sched->score = score ? score : &sched->_score;
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 8992480c88fa..a39bdd5cfc4f 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -402,7 +402,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
 			     &v3d_bin_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_bin");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
@@ -412,7 +412,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
 			     &v3d_render_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_render");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
@@ -424,7 +424,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
 			     &v3d_tfu_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_tfu");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
@@ -437,7 +437,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
 				     &v3d_csd_sched_ops,
 				     hw_jobs_limit, job_hang_limit,
-				     msecs_to_jiffies(hang_limit_ms),
+				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_csd");
 		if (ret) {
 			dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
@@ -449,7 +449,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
 				     &v3d_cache_clean_sched_ops,
 				     hw_jobs_limit, job_hang_limit,
-				     msecs_to_jiffies(hang_limit_ms),
+				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_cache_clean");
 		if (ret) {
 			dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 10225a0a35d0..d4cdc906709e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -269,6 +269,7 @@ struct drm_sched_backend_ops {
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
+ * @timeout_wq: workqueue used to queue @work_tdr
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
  * @thread: the kthread on which the scheduler which run.
@@ -293,6 +294,7 @@ struct drm_gpu_scheduler {
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
+	struct workqueue_struct		*timeout_wq;
 	struct delayed_work		work_tdr;
 	struct task_struct		*thread;
 	struct list_head		pending_list;
@@ -306,7 +308,8 @@ struct drm_gpu_scheduler {
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
-		   uint32_t hw_submission, unsigned hang_limit, long timeout,
+		   uint32_t hw_submission, unsigned hang_limit,
+		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name);
 
 void drm_sched_fini(struct drm_gpu_scheduler *sched);
-- 
2.31.1



* [PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:45   ` Alyssa Rosenzweig
  2021-06-25 13:33 ` [PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition Boris Brezillon
                   ` (12 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

If the fence creation fails, we can return the error pointer directly.
The core will update the fence error accordingly.
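
For context, the relevant scheduler-core logic (paraphrased from
drm_sched_main(), not part of this patch) is roughly:

	fence = sched->ops->run_job(sched_job);
	if (!IS_ERR_OR_NULL(fence)) {
		/* wait for the hardware fence as usual */
	} else if (IS_ERR(fence)) {
		/* propagate the error to the job's finished fence */
		dma_fence_set_error(&s_fence->finished, PTR_ERR(fence));
	}

so returning the ERR_PTR() lets the error reach the finished fence
instead of being silently dropped.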

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 8ff79fd49577..d6c9698bca3b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -355,7 +355,7 @@ static struct dma_fence *panfrost_job_run(struct drm_sched_job *sched_job)
 
 	fence = panfrost_fence_create(pfdev, slot);
 	if (IS_ERR(fence))
-		return NULL;
+		return fence;
 
 	if (job->done_fence)
 		dma_fence_put(job->done_fence);
-- 
2.31.1



* [PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:39   ` Alyssa Rosenzweig
  2021-06-25 13:33 ` [PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name() Boris Brezillon
                   ` (11 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Exception types will be defined as an enum in panfrost_drm.h so userspace
can use the same definitions if needed.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_regs.h | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_regs.h b/drivers/gpu/drm/panfrost/panfrost_regs.h
index eddaa62ad8b0..151cfebd80a0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_regs.h
+++ b/drivers/gpu/drm/panfrost/panfrost_regs.h
@@ -261,9 +261,6 @@
 #define JS_COMMAND_SOFT_STOP_1		0x06	/* Execute SOFT_STOP if JOB_CHAIN_FLAG is 1 */
 #define JS_COMMAND_HARD_STOP_1		0x07	/* Execute HARD_STOP if JOB_CHAIN_FLAG is 1 */
 
-#define JS_STATUS_EVENT_ACTIVE		0x08
-
-
 /* MMU regs */
 #define MMU_INT_RAWSTAT			0x2000
 #define MMU_INT_CLEAR			0x2004
-- 
2.31.1



* [PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (2 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:45   ` Alyssa Rosenzweig
  2021-06-25 13:33 ` [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace Boris Brezillon
                   ` (10 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Currently unused. We'll add it back if we need per-GPU definitions.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 2 +-
 drivers/gpu/drm/panfrost/panfrost_device.h | 2 +-
 drivers/gpu/drm/panfrost/panfrost_gpu.c    | 2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    | 2 +-
 drivers/gpu/drm/panfrost/panfrost_mmu.c    | 2 +-
 5 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index fbcf5edbe367..bce6b0aff05e 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,7 +292,7 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
 	panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 exception_code)
+const char *panfrost_exception_name(u32 exception_code)
 {
 	switch (exception_code) {
 		/* Non-Fault Status code */
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 4c6bdea5537b..ade8a1974ee9 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -172,6 +172,6 @@ void panfrost_device_reset(struct panfrost_device *pfdev);
 int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
-const char *panfrost_exception_name(struct panfrost_device *pfdev, u32 exception_code);
+const char *panfrost_exception_name(u32 exception_code);
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_gpu.c b/drivers/gpu/drm/panfrost/panfrost_gpu.c
index 2aae636f1cf5..ec59f15940fb 100644
--- a/drivers/gpu/drm/panfrost/panfrost_gpu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_gpu.c
@@ -33,7 +33,7 @@ static irqreturn_t panfrost_gpu_irq_handler(int irq, void *data)
 		address |= gpu_read(pfdev, GPU_FAULT_ADDRESS_LO);
 
 		dev_warn(pfdev->dev, "GPU Fault 0x%08x (%s) at 0x%016llx\n",
-			 fault_status & 0xFF, panfrost_exception_name(pfdev, fault_status),
+			 fault_status & 0xFF, panfrost_exception_name(fault_status),
 			 address);
 
 		if (state & GPU_IRQ_MULTIPLE_FAULT)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index d6c9698bca3b..3cd1aec6c261 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -500,7 +500,7 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 
 			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
 				j,
-				panfrost_exception_name(pfdev, job_read(pfdev, JS_STATUS(j))),
+				panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
 				job_read(pfdev, JS_HEAD_LO(j)),
 				job_read(pfdev, JS_TAIL_LO(j)));
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index d76dff201ea6..b4f0c673cd7f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -676,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
 				"TODO",
 				fault_status,
 				(fault_status & (1 << 10) ? "DECODER FAULT" : "SLAVE FAULT"),
-				exception_type, panfrost_exception_name(pfdev, exception_type),
+				exception_type, panfrost_exception_name(exception_type),
 				access_type, access_type_name(pfdev, fault_status),
 				source_id);
 
-- 
2.31.1



* [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (3 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name() Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:42   ` Alyssa Rosenzweig
  2021-06-25 13:33 ` [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table Boris Brezillon
                   ` (9 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Job headers contain an exception type field which might be read and
converted to a human-readable string by tracing tools. Let's expose
the exception types as an enum so we share the same definitions.

v3:
* Add missing values
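
With these definitions in the uAPI header, a hypothetical userspace
tracing tool (illustration only, not an existing tool) can translate
the raw value without duplicating the numbers:

	#include <drm/panfrost_drm.h>

	static const char *exception_str(__u32 type)
	{
		switch (type) {
		case DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT:
			return "JOB_BUS_FAULT";
		case DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1:
			return "TRANSLATION_FAULT_1";
		case DRM_PANFROST_EXCEPTION_OOM:
			return "OUT_OF_MEMORY";
		/* ... */
		default:
			return "UNKNOWN";
		}
	}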

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 include/uapi/drm/panfrost_drm.h | 71 +++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
index ec19db1eead8..899cd6d952d4 100644
--- a/include/uapi/drm/panfrost_drm.h
+++ b/include/uapi/drm/panfrost_drm.h
@@ -223,6 +223,77 @@ struct drm_panfrost_madvise {
 	__u32 retained;       /* out, whether backing store still exists */
 };
 
+/* The exception types */
+
+enum drm_panfrost_exception_type {
+	DRM_PANFROST_EXCEPTION_OK = 0x00,
+	DRM_PANFROST_EXCEPTION_DONE = 0x01,
+	DRM_PANFROST_EXCEPTION_INTERRUPTED = 0x02,
+	DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
+	DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
+	DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
+	DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
+	DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
+	DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
+	DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
+	DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
+	DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
+	DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
+	DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
+	DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
+	DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
+	DRM_PANFROST_EXCEPTION_INSTR_TYPE_MISMATCH = 0x52,
+	DRM_PANFROST_EXCEPTION_INSTR_OPERAND_FAULT = 0x53,
+	DRM_PANFROST_EXCEPTION_INSTR_TLS_FAULT = 0x54,
+	DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
+	DRM_PANFROST_EXCEPTION_INSTR_ALIGN_FAULT = 0x56,
+	DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
+	DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
+	DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
+	DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
+	DRM_PANFROST_EXCEPTION_OOM = 0x60,
+	DRM_PANFROST_EXCEPTION_OOM_AFBC = 0x61,
+	DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
+	DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
+	DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
+	DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
+	DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
+	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
+	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
+	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
+	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
+	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
+	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
+	DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
+	DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
+	DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
+	DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
+	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
+	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
+	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
+	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
+	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
+	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
+	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
+	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
+	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
+	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
+};
+
 #if defined(__cplusplus)
 }
 #endif
-- 
2.31.1



* [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (4 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:41   ` Alyssa Rosenzweig
  2021-06-25 15:35   ` Steven Price
  2021-06-25 13:33 ` [PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset Boris Brezillon
                   ` (8 subsequent siblings)
  14 siblings, 2 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Do the exception -> string translation using a table. This way we get
rid of those magic numbers and can easily add new fields if we need
to attach extra information to exception types.

v3:
* Drop the error field
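
For reference, PANFROST_EXCEPTION() relies on designated array
initializers, so an entry such as:

	PANFROST_EXCEPTION(JOB_BUS_FAULT),

expands to roughly:

	[DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT] = { .name = "JOB_BUS_FAULT" },

which keeps the table sparse but indexable by exception code.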

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.c | 130 +++++++++++++--------
 1 file changed, 83 insertions(+), 47 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index bce6b0aff05e..736854542b05 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -292,55 +292,91 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
 	panfrost_clk_fini(pfdev);
 }
 
-const char *panfrost_exception_name(u32 exception_code)
-{
-	switch (exception_code) {
-		/* Non-Fault Status code */
-	case 0x00: return "NOT_STARTED/IDLE/OK";
-	case 0x01: return "DONE";
-	case 0x02: return "INTERRUPTED";
-	case 0x03: return "STOPPED";
-	case 0x04: return "TERMINATED";
-	case 0x08: return "ACTIVE";
-		/* Job exceptions */
-	case 0x40: return "JOB_CONFIG_FAULT";
-	case 0x41: return "JOB_POWER_FAULT";
-	case 0x42: return "JOB_READ_FAULT";
-	case 0x43: return "JOB_WRITE_FAULT";
-	case 0x44: return "JOB_AFFINITY_FAULT";
-	case 0x48: return "JOB_BUS_FAULT";
-	case 0x50: return "INSTR_INVALID_PC";
-	case 0x51: return "INSTR_INVALID_ENC";
-	case 0x52: return "INSTR_TYPE_MISMATCH";
-	case 0x53: return "INSTR_OPERAND_FAULT";
-	case 0x54: return "INSTR_TLS_FAULT";
-	case 0x55: return "INSTR_BARRIER_FAULT";
-	case 0x56: return "INSTR_ALIGN_FAULT";
-	case 0x58: return "DATA_INVALID_FAULT";
-	case 0x59: return "TILE_RANGE_FAULT";
-	case 0x5A: return "ADDR_RANGE_FAULT";
-	case 0x60: return "OUT_OF_MEMORY";
-		/* GPU exceptions */
-	case 0x80: return "DELAYED_BUS_FAULT";
-	case 0x88: return "SHAREABILITY_FAULT";
-		/* MMU exceptions */
-	case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
-	case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
-	case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
-	case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
-	case 0xC8: return "PERMISSION_FAULT";
-	case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
-	case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
-	case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
-	case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
-	case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
-	case 0xD8: return "ACCESS_FLAG";
-	case 0xD9 ... 0xDF: return "ACCESS_FLAG";
-	case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
-	case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
+#define PANFROST_EXCEPTION(id) \
+	[DRM_PANFROST_EXCEPTION_ ## id] = { \
+		.name = #id, \
 	}
 
-	return "UNKNOWN";
+struct panfrost_exception_info {
+	const char *name;
+};
+
+static const struct panfrost_exception_info panfrost_exception_infos[] = {
+	PANFROST_EXCEPTION(OK),
+	PANFROST_EXCEPTION(DONE),
+	PANFROST_EXCEPTION(INTERRUPTED),
+	PANFROST_EXCEPTION(STOPPED),
+	PANFROST_EXCEPTION(TERMINATED),
+	PANFROST_EXCEPTION(KABOOM),
+	PANFROST_EXCEPTION(EUREKA),
+	PANFROST_EXCEPTION(ACTIVE),
+	PANFROST_EXCEPTION(JOB_CONFIG_FAULT),
+	PANFROST_EXCEPTION(JOB_POWER_FAULT),
+	PANFROST_EXCEPTION(JOB_READ_FAULT),
+	PANFROST_EXCEPTION(JOB_WRITE_FAULT),
+	PANFROST_EXCEPTION(JOB_AFFINITY_FAULT),
+	PANFROST_EXCEPTION(JOB_BUS_FAULT),
+	PANFROST_EXCEPTION(INSTR_INVALID_PC),
+	PANFROST_EXCEPTION(INSTR_INVALID_ENC),
+	PANFROST_EXCEPTION(INSTR_TYPE_MISMATCH),
+	PANFROST_EXCEPTION(INSTR_OPERAND_FAULT),
+	PANFROST_EXCEPTION(INSTR_TLS_FAULT),
+	PANFROST_EXCEPTION(INSTR_BARRIER_FAULT),
+	PANFROST_EXCEPTION(INSTR_ALIGN_FAULT),
+	PANFROST_EXCEPTION(DATA_INVALID_FAULT),
+	PANFROST_EXCEPTION(TILE_RANGE_FAULT),
+	PANFROST_EXCEPTION(ADDR_RANGE_FAULT),
+	PANFROST_EXCEPTION(IMPRECISE_FAULT),
+	PANFROST_EXCEPTION(OOM),
+	PANFROST_EXCEPTION(OOM_AFBC),
+	PANFROST_EXCEPTION(UNKNOWN),
+	PANFROST_EXCEPTION(DELAYED_BUS_FAULT),
+	PANFROST_EXCEPTION(GPU_SHAREABILITY_FAULT),
+	PANFROST_EXCEPTION(SYS_SHAREABILITY_FAULT),
+	PANFROST_EXCEPTION(GPU_CACHEABILITY_FAULT),
+	PANFROST_EXCEPTION(TRANSLATION_FAULT_0),
+	PANFROST_EXCEPTION(TRANSLATION_FAULT_1),
+	PANFROST_EXCEPTION(TRANSLATION_FAULT_2),
+	PANFROST_EXCEPTION(TRANSLATION_FAULT_3),
+	PANFROST_EXCEPTION(TRANSLATION_FAULT_4),
+	PANFROST_EXCEPTION(TRANSLATION_FAULT_IDENTITY),
+	PANFROST_EXCEPTION(PERM_FAULT_0),
+	PANFROST_EXCEPTION(PERM_FAULT_1),
+	PANFROST_EXCEPTION(PERM_FAULT_2),
+	PANFROST_EXCEPTION(PERM_FAULT_3),
+	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_0),
+	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_1),
+	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_2),
+	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_3),
+	PANFROST_EXCEPTION(ACCESS_FLAG_0),
+	PANFROST_EXCEPTION(ACCESS_FLAG_1),
+	PANFROST_EXCEPTION(ACCESS_FLAG_2),
+	PANFROST_EXCEPTION(ACCESS_FLAG_3),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN0),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN1),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN2),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN3),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
+	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
+	PANFROST_EXCEPTION(MEM_ATTR_FAULT_0),
+	PANFROST_EXCEPTION(MEM_ATTR_FAULT_1),
+	PANFROST_EXCEPTION(MEM_ATTR_FAULT_2),
+	PANFROST_EXCEPTION(MEM_ATTR_FAULT_3),
+	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_0),
+	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_1),
+	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_2),
+	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_3),
+};
+
+const char *panfrost_exception_name(u32 exception_code)
+{
+	if (WARN_ON(exception_code >= ARRAY_SIZE(panfrost_exception_infos) ||
+		    !panfrost_exception_infos[exception_code].name))
+		return "Unknown exception type";
+
+	return panfrost_exception_infos[exception_code].name;
 }
 
 void panfrost_device_reset(struct panfrost_device *pfdev)
-- 
2.31.1



* [PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (5 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:46   ` Alyssa Rosenzweig
  2021-06-25 13:33 ` [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts Boris Brezillon
                   ` (7 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Expose a helper to trigger a GPU reset so we can easily trigger reset
operations outside the job timeout handler.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.h | 8 ++++++++
 drivers/gpu/drm/panfrost/panfrost_job.c    | 4 +---
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index ade8a1974ee9..6024eaf34ba0 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -174,4 +174,12 @@ int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
 
+static inline void
+panfrost_device_schedule_reset(struct panfrost_device *pfdev)
+{
+	/* Schedule a reset if there's no reset in progress. */
+	if (!atomic_xchg(&pfdev->reset.pending, 1))
+		schedule_work(&pfdev->reset.work);
+}
+
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 3cd1aec6c261..be8f68f63974 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -458,9 +458,7 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
 		return DRM_GPU_SCHED_STAT_NOMINAL;
 
-	/* Schedule a reset if there's no reset in progress. */
-	if (!atomic_xchg(&pfdev->reset.pending, 1))
-		schedule_work(&pfdev->reset.work);
+	panfrost_device_schedule_reset(pfdev);
 
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
-- 
2.31.1



* [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (6 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:47   ` Alyssa Rosenzweig
  2021-06-25 15:40   ` Steven Price
  2021-06-25 13:33 ` [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic Boris Brezillon
                   ` (6 subsequent siblings)
  14 siblings, 2 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

This should avoid spending too much time in hard interrupt context when
the GPU is under heavy use.

v3:
* Don't take the job_lock in panfrost_job_handle_irq()

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++++++++++++++++++-------
 1 file changed, 38 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index be8f68f63974..e0c479e67304 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops panfrost_sched_ops = {
 	.free_job = panfrost_job_free
 };
 
-static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 {
-	struct panfrost_device *pfdev = data;
-	u32 status = job_read(pfdev, JOB_INT_STAT);
 	int j;
 
 	dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
 
-	if (!status)
-		return IRQ_NONE;
-
-	pm_runtime_mark_last_busy(pfdev->dev);
-
 	for (j = 0; status; j++) {
 		u32 mask = MK_JS_MASK(j);
 
@@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 		if (status & JOB_INT_MASK_DONE(j)) {
 			struct panfrost_job *job;
 
-			spin_lock(&pfdev->js->job_lock);
 			job = pfdev->jobs[j];
 			/* Only NULL if job timeout occurred */
 			if (job) {
@@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 				dma_fence_signal_locked(job->done_fence);
 				pm_runtime_put_autosuspend(pfdev->dev);
 			}
-			spin_unlock(&pfdev->js->job_lock);
 		}
 
 		status &= ~mask;
 	}
+}
 
+static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
+{
+	struct panfrost_device *pfdev = data;
+	u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
+
+	while (status) {
+		pm_runtime_mark_last_busy(pfdev->dev);
+
+		spin_lock(&pfdev->js->job_lock);
+		panfrost_job_handle_irq(pfdev, status);
+		spin_unlock(&pfdev->js->job_lock);
+		status = job_read(pfdev, JOB_INT_RAWSTAT);
+	}
+
+	job_write(pfdev, JOB_INT_MASK,
+		  GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
+		  GENMASK(NUM_JOB_SLOTS - 1, 0));
 	return IRQ_HANDLED;
 }
 
+static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
+{
+	struct panfrost_device *pfdev = data;
+	u32 status = job_read(pfdev, JOB_INT_STAT);
+
+	if (!status)
+		return IRQ_NONE;
+
+	job_write(pfdev, JOB_INT_MASK, 0);
+	return IRQ_WAKE_THREAD;
+}
+
 static void panfrost_reset(struct work_struct *work)
 {
 	struct panfrost_device *pfdev = container_of(work,
 						     struct panfrost_device,
 						     reset.work);
-	unsigned long flags;
 	unsigned int i;
 	bool cookie;
 
@@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
 	/* All timers have been stopped, we can safely reset the pending state. */
 	atomic_set(&pfdev->reset.pending, 0);
 
-	spin_lock_irqsave(&pfdev->js->job_lock, flags);
+	spin_lock(&pfdev->js->job_lock);
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		if (pfdev->jobs[i]) {
 			pm_runtime_put_noidle(pfdev->dev);
@@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
 			pfdev->jobs[i] = NULL;
 		}
 	}
-	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
+	spin_unlock(&pfdev->js->job_lock);
 
 	panfrost_device_reset(pfdev);
 
@@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 	if (irq <= 0)
 		return -ENODEV;
 
-	ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
-			       IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
+	ret = devm_request_threaded_irq(pfdev->dev, irq,
+					panfrost_job_irq_handler,
+					panfrost_job_irq_handler_thread,
+					IRQF_SHARED, KBUILD_MODNAME "-job",
+					pfdev);
 	if (ret) {
 		dev_err(pfdev->dev, "failed to request job irq");
 		return ret;
-- 
2.31.1



* [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (7 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 15:42   ` Steven Price
  2021-06-25 16:22   ` Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting Boris Brezillon
                   ` (5 subsequent siblings)
  14 siblings, 2 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Daniel Vetter, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Boris Brezillon, Robin Murphy

Now that we can pass our own workqueue to drm_sched_init(), we can use
an ordered workqueue for both the scheduler timeout (tdr) work and our own
reset work (which we use when the reset is not caused by a fault/timeout
on a specific job, like when the AS_ACTIVE bit is stuck). This
guarantees that the timeout handlers and the reset handler can't run
concurrently, which drastically simplifies the locking.
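
The guarantee comes purely from the ordered workqueue semantics. As a
minimal illustration (generic example, not Panfrost code):

	/* An ordered workqueue executes at most one work item at a
	 * time, in queueing order, regardless of the number of CPUs.
	 */
	struct workqueue_struct *wq = alloc_ordered_workqueue("example", 0);

	queue_work(wq, &reset_work);	/* runs first */
	queue_work(wq, &timeout_work);	/* starts only after reset_work returns */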

Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   6 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    | 185 ++++++++-------------
 2 files changed, 71 insertions(+), 120 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index 6024eaf34ba0..bfe32907ba6b 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -108,6 +108,7 @@ struct panfrost_device {
 	struct mutex sched_lock;
 
 	struct {
+		struct workqueue_struct *wq;
 		struct work_struct work;
 		atomic_t pending;
 	} reset;
@@ -177,9 +178,8 @@ const char *panfrost_exception_name(u32 exception_code);
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
 {
-	/* Schedule a reset if there's no reset in progress. */
-	if (!atomic_xchg(&pfdev->reset.pending, 1))
-		schedule_work(&pfdev->reset.work);
+	atomic_set(&pfdev->reset.pending, 1);
+	queue_work(pfdev->reset.wq, &pfdev->reset.work);
 }
 
 #endif
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index e0c479e67304..88d34fd781e8 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -25,17 +25,8 @@
 #define job_write(dev, reg, data) writel(data, dev->iomem + (reg))
 #define job_read(dev, reg) readl(dev->iomem + (reg))
 
-enum panfrost_queue_status {
-	PANFROST_QUEUE_STATUS_ACTIVE,
-	PANFROST_QUEUE_STATUS_STOPPED,
-	PANFROST_QUEUE_STATUS_STARTING,
-	PANFROST_QUEUE_STATUS_FAULT_PENDING,
-};
-
 struct panfrost_queue_state {
 	struct drm_gpu_scheduler sched;
-	atomic_t status;
-	struct mutex lock;
 	u64 fence_context;
 	u64 emit_seqno;
 };
@@ -379,57 +370,73 @@ void panfrost_job_enable_interrupts(struct panfrost_device *pfdev)
 	job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
-static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
-				    struct drm_sched_job *bad)
+static void panfrost_reset(struct panfrost_device *pfdev,
+			   struct drm_sched_job *bad)
 {
-	enum panfrost_queue_status old_status;
-	bool stopped = false;
+	unsigned int i;
+	bool cookie;
 
-	mutex_lock(&queue->lock);
-	old_status = atomic_xchg(&queue->status,
-				 PANFROST_QUEUE_STATUS_STOPPED);
-	if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
-		goto out;
+	if (WARN_ON(!atomic_read(&pfdev->reset.pending)))
+		return;
+
+	/* Stop the schedulers.
+	 *
+	 * FIXME: We temporarily get out of the dma_fence_signalling section
+	 * because the cleanup path generate lockdep splats when taking locks
+	 * to release job resources. We should rework the code to follow this
+	 * pattern:
+	 *
+	 *	try_lock
+	 *	if (locked)
+	 *		release
+	 *	else
+	 *		schedule_work_to_release_later
+	 */
+	for (i = 0; i < NUM_JOB_SLOTS; i++)
+		drm_sched_stop(&pfdev->js->queue[i].sched, bad);
+
+	cookie = dma_fence_begin_signalling();
 
-	WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
-	drm_sched_stop(&queue->sched, bad);
 	if (bad)
 		drm_sched_increase_karma(bad);
 
-	stopped = true;
+	spin_lock(&pfdev->js->job_lock);
+	for (i = 0; i < NUM_JOB_SLOTS; i++) {
+		if (pfdev->jobs[i]) {
+			pm_runtime_put_noidle(pfdev->dev);
+			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+			pfdev->jobs[i] = NULL;
+		}
+	}
+	spin_unlock(&pfdev->js->job_lock);
 
-	/*
-	 * Set the timeout to max so the timer doesn't get started
-	 * when we return from the timeout handler (restored in
-	 * panfrost_scheduler_start()).
+	panfrost_device_reset(pfdev);
+
+	/* GPU has been reset, we can cancel timeout/fault work that may have
+	 * been queued in the meantime and clear the reset pending bit.
 	 */
-	queue->sched.timeout = MAX_SCHEDULE_TIMEOUT;
+	atomic_set(&pfdev->reset.pending, 0);
+	cancel_work_sync(&pfdev->reset.work);
+	for (i = 0; i < NUM_JOB_SLOTS; i++)
+		cancel_delayed_work(&pfdev->js->queue[i].sched.work_tdr);
 
-out:
-	mutex_unlock(&queue->lock);
 
-	return stopped;
-}
+	/* Now resubmit jobs that were previously queued but didn't have a
+	 * chance to finish.
+	 * FIXME: We temporarily get out of the DMA fence signalling section
+	 * while resubmitting jobs because the job submission logic will
+	 * allocate memory with the GFP_KERNEL flag which can trigger memory
+	 * reclaim and exposes a lock ordering issue.
+	 */
+	dma_fence_end_signalling(cookie);
+	for (i = 0; i < NUM_JOB_SLOTS; i++)
+		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
+	cookie = dma_fence_begin_signalling();
 
-static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
-{
-	enum panfrost_queue_status old_status;
+	for (i = 0; i < NUM_JOB_SLOTS; i++)
+		drm_sched_start(&pfdev->js->queue[i].sched, true);
 
-	mutex_lock(&queue->lock);
-	old_status = atomic_xchg(&queue->status,
-				 PANFROST_QUEUE_STATUS_STARTING);
-	WARN_ON(old_status != PANFROST_QUEUE_STATUS_STOPPED);
-
-	/* Restore the original timeout before starting the scheduler. */
-	queue->sched.timeout = msecs_to_jiffies(JOB_TIMEOUT_MS);
-	drm_sched_resubmit_jobs(&queue->sched);
-	drm_sched_start(&queue->sched, true);
-	old_status = atomic_xchg(&queue->status,
-				 PANFROST_QUEUE_STATUS_ACTIVE);
-	if (old_status == PANFROST_QUEUE_STATUS_FAULT_PENDING)
-		drm_sched_fault(&queue->sched);
-
-	mutex_unlock(&queue->lock);
+	dma_fence_end_signalling(cookie);
 }
 
 static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
@@ -454,11 +461,8 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 		job_read(pfdev, JS_TAIL_LO(js)),
 		sched_job);
 
-	/* Scheduler is already stopped, nothing to do. */
-	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
-		return DRM_GPU_SCHED_STAT_NOMINAL;
-
-	panfrost_device_schedule_reset(pfdev);
+	atomic_set(&pfdev->reset.pending, 1);
+	panfrost_reset(pfdev, sched_job);
 
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
@@ -485,8 +489,6 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 		job_write(pfdev, JOB_INT_CLEAR, mask);
 
 		if (status & JOB_INT_MASK_ERR(j)) {
-			enum panfrost_queue_status old_status;
-
 			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
 			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
@@ -494,19 +496,7 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 				panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
 				job_read(pfdev, JS_HEAD_LO(j)),
 				job_read(pfdev, JS_TAIL_LO(j)));
-
-			/*
-			 * When the queue is being restarted we don't report
-			 * faults directly to avoid races between the timeout
-			 * and reset handlers. panfrost_scheduler_start() will
-			 * call drm_sched_fault() after the queue has been
-			 * started if status == FAULT_PENDING.
-			 */
-			old_status = atomic_cmpxchg(&pfdev->js->queue[j].status,
-						    PANFROST_QUEUE_STATUS_STARTING,
-						    PANFROST_QUEUE_STATUS_FAULT_PENDING);
-			if (old_status == PANFROST_QUEUE_STATUS_ACTIVE)
-				drm_sched_fault(&pfdev->js->queue[j].sched);
+			drm_sched_fault(&pfdev->js->queue[j].sched);
 		}
 
 		if (status & JOB_INT_MASK_DONE(j)) {
@@ -561,56 +551,13 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 	return IRQ_WAKE_THREAD;
 }
 
-static void panfrost_reset(struct work_struct *work)
+static void panfrost_reset_work(struct work_struct *work)
 {
 	struct panfrost_device *pfdev = container_of(work,
 						     struct panfrost_device,
 						     reset.work);
-	unsigned int i;
-	bool cookie;
 
-	cookie = dma_fence_begin_signalling();
-	for (i = 0; i < NUM_JOB_SLOTS; i++) {
-		/*
-		 * We want pending timeouts to be handled before we attempt
-		 * to stop the scheduler. If we don't do that and the timeout
-		 * handler is in flight, it might have removed the bad job
-		 * from the list, and we'll lose this job if the reset handler
-		 * enters the critical section in panfrost_scheduler_stop()
-		 * before the timeout handler.
-		 *
-		 * Timeout is set to MAX_SCHEDULE_TIMEOUT - 1 because we need
-		 * something big enough to make sure the timer will not expire
-		 * before we manage to stop the scheduler, but we can't use
-		 * MAX_SCHEDULE_TIMEOUT because drm_sched_get_cleanup_job()
-		 * considers that as 'timer is not running' and will dequeue
-		 * the job without making sure the timeout handler is not
-		 * running.
-		 */
-		pfdev->js->queue[i].sched.timeout = MAX_SCHEDULE_TIMEOUT - 1;
-		cancel_delayed_work_sync(&pfdev->js->queue[i].sched.work_tdr);
-		panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
-	}
-
-	/* All timers have been stopped, we can safely reset the pending state. */
-	atomic_set(&pfdev->reset.pending, 0);
-
-	spin_lock(&pfdev->js->job_lock);
-	for (i = 0; i < NUM_JOB_SLOTS; i++) {
-		if (pfdev->jobs[i]) {
-			pm_runtime_put_noidle(pfdev->dev);
-			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
-			pfdev->jobs[i] = NULL;
-		}
-	}
-	spin_unlock(&pfdev->js->job_lock);
-
-	panfrost_device_reset(pfdev);
-
-	for (i = 0; i < NUM_JOB_SLOTS; i++)
-		panfrost_scheduler_start(&pfdev->js->queue[i]);
-
-	dma_fence_end_signalling(cookie);
+	panfrost_reset(pfdev, NULL);
 }
 
 int panfrost_job_init(struct panfrost_device *pfdev)
@@ -618,7 +565,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 	struct panfrost_job_slot *js;
 	int ret, j, irq;
 
-	INIT_WORK(&pfdev->reset.work, panfrost_reset);
+	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
 
 	pfdev->js = js = devm_kzalloc(pfdev->dev, sizeof(*js), GFP_KERNEL);
 	if (!js)
@@ -640,9 +587,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 		return ret;
 	}
 
-	for (j = 0; j < NUM_JOB_SLOTS; j++) {
-		mutex_init(&js->queue[j].lock);
+	pfdev->reset.wq = alloc_ordered_workqueue("panfrost-reset", 0);
+	if (!pfdev->reset.wq)
+		return -ENOMEM;
 
+	for (j = 0; j < NUM_JOB_SLOTS; j++) {
 		js->queue[j].fence_context = dma_fence_context_alloc(1);
 
 		ret = drm_sched_init(&js->queue[j].sched,
@@ -664,6 +613,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 	for (j--; j >= 0; j--)
 		drm_sched_fini(&js->queue[j].sched);
 
+	destroy_workqueue(pfdev->reset.wq);
 	return ret;
 }
 
@@ -676,9 +626,10 @@ void panfrost_job_fini(struct panfrost_device *pfdev)
 
 	for (j = 0; j < NUM_JOB_SLOTS; j++) {
 		drm_sched_fini(&js->queue[j].sched);
-		mutex_destroy(&js->queue[j].lock);
 	}
 
+	cancel_work_sync(&pfdev->reset.work);
+	destroy_workqueue(pfdev->reset.wq);
 }
 
 int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
-- 
2.31.1



* [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (8 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 15:55   ` Steven Price
  2021-06-25 13:33 ` [PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults Boris Brezillon
                   ` (4 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

This is not yet needed because we let active jobs be killed by the
reset and we don't really bother making sure they can be restarted.
But once we start adding soft-stop support, controlling when we deal
with the remaining interrupts and making sure those are handled before
the reset is issued gets tricky if we keep job interrupts active.

Let's prepare for that and mask+flush job IRQs before issuing a reset.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 21 +++++++++++++++------
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 88d34fd781e8..0566e2f7e84a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -34,6 +34,7 @@ struct panfrost_queue_state {
 struct panfrost_job_slot {
 	struct panfrost_queue_state queue[NUM_JOB_SLOTS];
 	spinlock_t job_lock;
+	int irq;
 };
 
 static struct panfrost_job *
@@ -400,7 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
 	if (bad)
 		drm_sched_increase_karma(bad);
 
-	spin_lock(&pfdev->js->job_lock);
+	/* Mask job interrupts and synchronize to make sure we won't be
+	 * interrupted during our reset.
+	 */
+	job_write(pfdev, JOB_INT_MASK, 0);
+	synchronize_irq(pfdev->js->irq);
+
+	/* Schedulers are stopped and interrupts are masked+flushed, we don't
+	 * need to protect the 'evict unfinished jobs' lock with the job_lock.
+	 */
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		if (pfdev->jobs[i]) {
 			pm_runtime_put_noidle(pfdev->dev);
@@ -408,7 +417,6 @@ static void panfrost_reset(struct panfrost_device *pfdev,
 			pfdev->jobs[i] = NULL;
 		}
 	}
-	spin_unlock(&pfdev->js->job_lock);
 
 	panfrost_device_reset(pfdev);
 
@@ -504,6 +512,7 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 
 			job = pfdev->jobs[j];
 			/* Only NULL if job timeout occurred */
+			WARN_ON(!job);
 			if (job) {
 				pfdev->jobs[j] = NULL;
 
@@ -563,7 +572,7 @@ static void panfrost_reset_work(struct work_struct *work)
 int panfrost_job_init(struct panfrost_device *pfdev)
 {
 	struct panfrost_job_slot *js;
-	int ret, j, irq;
+	int ret, j;
 
 	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
 
@@ -573,11 +582,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
 	spin_lock_init(&js->job_lock);
 
-	irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
-	if (irq <= 0)
+	js->irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
+	if (js->irq <= 0)
 		return -ENODEV;
 
-	ret = devm_request_threaded_irq(pfdev->dev, irq,
+	ret = devm_request_threaded_irq(pfdev->dev, js->irq,
 					panfrost_job_irq_handler,
 					panfrost_job_irq_handler_thread,
 					IRQF_SHARED, KBUILD_MODNAME "-job",
-- 
2.31.1



* [PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (9 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 16:10   ` Steven Price
  2021-06-25 13:33 ` [PATCH v3 12/15] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck Boris Brezillon
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

If we don't do that, we have to wait for the job timeout to expire
before the faulting jobs get killed.

v3:
* Make sure the AS is re-enabled when new jobs are submitted to the
  context

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
 drivers/gpu/drm/panfrost/panfrost_mmu.c    | 34 ++++++++++++++++++++--
 2 files changed, 32 insertions(+), 3 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index bfe32907ba6b..efe9a675b614 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -96,6 +96,7 @@ struct panfrost_device {
 	spinlock_t as_lock;
 	unsigned long as_in_use_mask;
 	unsigned long as_alloc_mask;
+	unsigned long as_faulty_mask;
 	struct list_head as_lru_list;
 
 	struct panfrost_job_slot *js;
diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index b4f0c673cd7f..65e98c51cb66 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -154,6 +154,7 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, struct panfrost_mmu *mmu)
 	as = mmu->as;
 	if (as >= 0) {
 		int en = atomic_inc_return(&mmu->as_count);
+		u32 mask = BIT(as) | BIT(16 + as);
 
 		/*
 		 * AS can be retained by active jobs or a perfcnt context,
@@ -162,6 +163,18 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, struct panfrost_mmu *mmu)
 		WARN_ON(en >= (NUM_JOB_SLOTS + 1));
 
 		list_move(&mmu->list, &pfdev->as_lru_list);
+
+		if (pfdev->as_faulty_mask & mask) {
+			/* Unhandled pagefault on this AS, the MMU was
+			 * disabled. We need to re-enable the MMU after
+			 * clearing+unmasking the AS interrupts.
+			 */
+			mmu_write(pfdev, MMU_INT_CLEAR, mask);
+			mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+			pfdev->as_faulty_mask &= ~mask;
+			panfrost_mmu_enable(pfdev, mmu);
+		}
+
 		goto out;
 	}
 
@@ -211,6 +224,7 @@ void panfrost_mmu_reset(struct panfrost_device *pfdev)
 	spin_lock(&pfdev->as_lock);
 
 	pfdev->as_alloc_mask = 0;
+	pfdev->as_faulty_mask = 0;
 
 	list_for_each_entry_safe(mmu, mmu_tmp, &pfdev->as_lru_list, list) {
 		mmu->as = -1;
@@ -662,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
 		if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 0xC0)
 			ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
 
-		if (ret)
+		if (ret) {
 			/* terminal fault, print info about the fault */
 			dev_err(pfdev->dev,
 				"Unhandled Page fault in AS%d at VA 0x%016llX\n"
@@ -680,14 +694,28 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
 				access_type, access_type_name(pfdev, fault_status),
 				source_id);
 
+			spin_lock(&pfdev->as_lock);
+			/* Ignore MMU interrupts on this AS until it's been
+			 * re-enabled.
+			 */
+			pfdev->as_faulty_mask |= mask;
+
+			/* Disable the MMU to kill jobs on this AS. */
+			panfrost_mmu_disable(pfdev, as);
+			spin_unlock(&pfdev->as_lock);
+		}
+
 		status &= ~mask;
 
 		/* If we received new MMU interrupts, process them before returning. */
 		if (!status)
-			status = mmu_read(pfdev, MMU_INT_RAWSTAT);
+			status = mmu_read(pfdev, MMU_INT_RAWSTAT) & ~pfdev->as_faulty_mask;
 	}
 
-	mmu_write(pfdev, MMU_INT_MASK, ~0);
+	spin_lock(&pfdev->as_lock);
+	mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
+	spin_unlock(&pfdev->as_lock);
+
 	return IRQ_HANDLED;
 };
 
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 12/15] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (10 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 13/15] drm/panfrost: Don't reset the GPU on job faults unless we really have to Boris Brezillon
                   ` (2 subsequent siblings)
  14 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

Things are unlikely to resolve until we reset the GPU. Let's not wait
for other faults or timeouts to happen before triggering this reset.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
Reviewed-by: Steven Price <steven.price@arm.com>
---
 drivers/gpu/drm/panfrost/panfrost_mmu.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
index 65e98c51cb66..5267c3a1f02f 100644
--- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
+++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
@@ -36,8 +36,11 @@ static int wait_ready(struct panfrost_device *pfdev, u32 as_nr)
 	ret = readl_relaxed_poll_timeout_atomic(pfdev->iomem + AS_STATUS(as_nr),
 		val, !(val & AS_STATUS_AS_ACTIVE), 10, 1000);
 
-	if (ret)
+	if (ret) {
+		/* The GPU hung, let's trigger a reset */
+		panfrost_device_schedule_reset(pfdev);
 		dev_err(pfdev->dev, "AS_ACTIVE bit stuck\n");
+	}
 
 	return ret;
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 13/15] drm/panfrost: Don't reset the GPU on job faults unless we really have to
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (11 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 12/15] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close Boris Brezillon
  2021-06-25 13:33 ` [PATCH v3 15/15] drm/panfrost: Queue jobs on the hardware Boris Brezillon
  14 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

If we can recover from a fault without a reset, there's no reason to
issue one.

v3:
* Drop the mention of Valhall requiring a reset on JOB_BUS_FAULT
* Set the fence error to -EINVAL instead of having per-exception
  error codes

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.c |  9 +++++++++
 drivers/gpu/drm/panfrost/panfrost_device.h |  2 ++
 drivers/gpu/drm/panfrost/panfrost_job.c    | 16 ++++++++++++++--
 3 files changed, 25 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
index 736854542b05..f4e42009526d 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.c
+++ b/drivers/gpu/drm/panfrost/panfrost_device.c
@@ -379,6 +379,15 @@ const char *panfrost_exception_name(u32 exception_code)
 	return panfrost_exception_infos[exception_code].name;
 }
 
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+				    u32 exception_code)
+{
+	/* Right now, none of the GPUs we support need a reset, but this
+	 * might change.
+	 */
+	return false;
+}
+
 void panfrost_device_reset(struct panfrost_device *pfdev)
 {
 	panfrost_gpu_soft_reset(pfdev);
diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index efe9a675b614..ecbc79ad0006 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -175,6 +175,8 @@ int panfrost_device_resume(struct device *dev);
 int panfrost_device_suspend(struct device *dev);
 
 const char *panfrost_exception_name(u32 exception_code);
+bool panfrost_exception_needs_reset(const struct panfrost_device *pfdev,
+				    u32 exception_code);
 
 static inline void
 panfrost_device_schedule_reset(struct panfrost_device *pfdev)
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 0566e2f7e84a..948bd174ff99 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -497,14 +497,26 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 		job_write(pfdev, JOB_INT_CLEAR, mask);
 
 		if (status & JOB_INT_MASK_ERR(j)) {
+			u32 js_status = job_read(pfdev, JS_STATUS(j));
+
 			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
 			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
 				j,
-				panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
+				panfrost_exception_name(js_status),
 				job_read(pfdev, JS_HEAD_LO(j)),
 				job_read(pfdev, JS_TAIL_LO(j)));
-			drm_sched_fault(&pfdev->js->queue[j].sched);
+
+			/* If we need a reset, signal it to the timeout
+			 * handler, otherwise, update the fence error field and
+			 * signal the job fence.
+			 */
+			if (panfrost_exception_needs_reset(pfdev, js_status)) {
+				drm_sched_fault(&pfdev->js->queue[j].sched);
+			} else {
+				dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
+				status |= JOB_INT_MASK_DONE(j);
+			}
 		}
 
 		if (status & JOB_INT_MASK_DONE(j)) {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (12 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 13/15] drm/panfrost: Don't reset the GPU on job faults unless we really have to Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  2021-06-25 13:43   ` Lucas Stach
  2021-06-25 13:33 ` [PATCH v3 15/15] drm/panfrost: Queue jobs on the hardware Boris Brezillon
  14 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

If the process that submitted these jobs decided to close the FD before
the jobs are done, it probably means it doesn't care about the result.

v3:
* Set fence error to ECANCELED when a TERMINATED exception is received

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_job.c | 43 +++++++++++++++++++++----
 1 file changed, 37 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 948bd174ff99..aa1e6542adde 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -498,14 +498,21 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 
 		if (status & JOB_INT_MASK_ERR(j)) {
 			u32 js_status = job_read(pfdev, JS_STATUS(j));
+			const char *exception_name = panfrost_exception_name(js_status);
 
 			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
 
-			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
-				j,
-				panfrost_exception_name(js_status),
-				job_read(pfdev, JS_HEAD_LO(j)),
-				job_read(pfdev, JS_TAIL_LO(j)));
+			if (js_status < DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
+				dev_dbg(pfdev->dev, "js interrupt, js=%d, status=%s, head=0x%x, tail=0x%x",
+					j, exception_name,
+					job_read(pfdev, JS_HEAD_LO(j)),
+					job_read(pfdev, JS_TAIL_LO(j)));
+			} else {
+				dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
+					j, exception_name,
+					job_read(pfdev, JS_HEAD_LO(j)),
+					job_read(pfdev, JS_TAIL_LO(j)));
+			}
 
 			/* If we need a reset, signal it to the timeout
 			 * handler, otherwise, update the fence error field and
@@ -514,7 +521,16 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
 			if (panfrost_exception_needs_reset(pfdev, js_status)) {
 				drm_sched_fault(&pfdev->js->queue[j].sched);
 			} else {
-				dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
+				int error = 0;
+
+				if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED)
+					error = -ECANCELED;
+				else if (js_status >= DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
+					error = -EINVAL;
+
+				if (error)
+					dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
+
 				status |= JOB_INT_MASK_DONE(j);
 			}
 		}
@@ -673,10 +689,25 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
 
 void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 {
+	struct panfrost_device *pfdev = panfrost_priv->pfdev;
+	unsigned long flags;
 	int i;
 
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
+
+	/* Kill in-flight jobs */
+	spin_lock_irqsave(&pfdev->js->job_lock, flags);
+	for (i = 0; i < NUM_JOB_SLOTS; i++) {
+		struct drm_sched_entity *entity = &panfrost_priv->sched_entity[i];
+		struct panfrost_job *job = pfdev->jobs[i];
+
+		if (!job || job->base.entity != entity)
+			continue;
+
+		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
+	}
+	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 }
 
 int panfrost_job_is_idle(struct panfrost_device *pfdev)
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v3 15/15] drm/panfrost: Queue jobs on the hardware
  2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
                   ` (13 preceding siblings ...)
  2021-06-25 13:33 ` [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close Boris Brezillon
@ 2021-06-25 13:33 ` Boris Brezillon
  14 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 13:33 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig,
	Boris Brezillon, Robin Murphy

From: Steven Price <steven.price@arm.com>

The hardware has a set of '_NEXT' registers that can hold a second job
while the first is executing. Make use of these registers to enqueue a
second job per slot.

v3:
* Fix the done/err job dequeuing logic to get a valid active state
* Only enable the second slot on GPUs supporting jobchain disambiguation
* Split interrupt handling in sub-functions

Signed-off-by: Steven Price <steven.price@arm.com>
Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/panfrost/panfrost_device.h |   2 +-
 drivers/gpu/drm/panfrost/panfrost_job.c    | 473 ++++++++++++++++-----
 2 files changed, 357 insertions(+), 118 deletions(-)

diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
index ecbc79ad0006..65a7b9b08f3a 100644
--- a/drivers/gpu/drm/panfrost/panfrost_device.h
+++ b/drivers/gpu/drm/panfrost/panfrost_device.h
@@ -101,7 +101,7 @@ struct panfrost_device {
 
 	struct panfrost_job_slot *js;
 
-	struct panfrost_job *jobs[NUM_JOB_SLOTS];
+	struct panfrost_job *jobs[NUM_JOB_SLOTS][2];
 	struct list_head scheduled_jobs;
 
 	struct panfrost_perfcnt *perfcnt;
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index aa1e6542adde..0d0011cbe864 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -4,6 +4,7 @@
 #include <linux/delay.h>
 #include <linux/interrupt.h>
 #include <linux/io.h>
+#include <linux/iopoll.h>
 #include <linux/platform_device.h>
 #include <linux/pm_runtime.h>
 #include <linux/dma-resv.h>
@@ -140,9 +141,52 @@ static void panfrost_job_write_affinity(struct panfrost_device *pfdev,
 	job_write(pfdev, JS_AFFINITY_NEXT_HI(js), affinity >> 32);
 }
 
+static u32
+panfrost_get_job_chain_flag(const struct panfrost_job *job)
+{
+	struct panfrost_fence *f = to_panfrost_fence(job->done_fence);
+
+	if (!panfrost_has_hw_feature(job->pfdev, HW_FEATURE_JOBCHAIN_DISAMBIGUATION))
+		return 0;
+
+	return (f->seqno & 1) ? JS_CONFIG_JOB_CHAIN_FLAG : 0;
+}
+
+static struct panfrost_job *
+panfrost_dequeue_job(struct panfrost_device *pfdev, int slot)
+{
+	struct panfrost_job *job = pfdev->jobs[slot][0];
+
+	WARN_ON(!job);
+	pfdev->jobs[slot][0] = pfdev->jobs[slot][1];
+	pfdev->jobs[slot][1] = NULL;
+
+	return job;
+}
+
+static unsigned int
+panfrost_enqueue_job(struct panfrost_device *pfdev, int slot,
+		     struct panfrost_job *job)
+{
+	if (WARN_ON(!job))
+		return 0;
+
+	if (!pfdev->jobs[slot][0]) {
+		pfdev->jobs[slot][0] = job;
+		return 0;
+	}
+
+	WARN_ON(pfdev->jobs[slot][1]);
+	pfdev->jobs[slot][1] = job;
+	WARN_ON(panfrost_get_job_chain_flag(job) ==
+		panfrost_get_job_chain_flag(pfdev->jobs[slot][0]));
+	return 1;
+}
+
 static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 {
 	struct panfrost_device *pfdev = job->pfdev;
+	unsigned int subslot;
 	u32 cfg;
 	u64 jc_head = job->jc;
 	int ret;
@@ -168,7 +212,8 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 	 * start */
 	cfg |= JS_CONFIG_THREAD_PRI(8) |
 		JS_CONFIG_START_FLUSH_CLEAN_INVALIDATE |
-		JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE;
+		JS_CONFIG_END_FLUSH_CLEAN_INVALIDATE |
+		panfrost_get_job_chain_flag(job);
 
 	if (panfrost_has_hw_feature(pfdev, HW_FEATURE_FLUSH_REDUCTION))
 		cfg |= JS_CONFIG_ENABLE_FLUSH_REDUCTION;
@@ -182,10 +227,17 @@ static void panfrost_job_hw_submit(struct panfrost_job *job, int js)
 		job_write(pfdev, JS_FLUSH_ID_NEXT(js), job->flush_id);
 
 	/* GO ! */
-	dev_dbg(pfdev->dev, "JS: Submitting atom %p to js[%d] with head=0x%llx",
-				job, js, jc_head);
 
-	job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+	spin_lock(&pfdev->js->job_lock);
+	subslot = panfrost_enqueue_job(pfdev, js, job);
+	/* Don't queue the job if a reset is in progress */
+	if (!atomic_read(&pfdev->reset.pending)) {
+		job_write(pfdev, JS_COMMAND_NEXT(js), JS_COMMAND_START);
+		dev_dbg(pfdev->dev,
+			"JS: Submitting atom %p to js[%d][%d] with head=0x%llx AS %d",
+			job, js, subslot, jc_head, cfg & 0xf);
+	}
+	spin_unlock(&pfdev->js->job_lock);
 }
 
 static void panfrost_acquire_object_fences(struct drm_gem_object **bos,
@@ -343,7 +395,11 @@ static struct dma_fence *panfrost_job_run(struct drm_sched_job *sched_job)
 	if (unlikely(job->base.s_fence->finished.error))
 		return NULL;
 
-	pfdev->jobs[slot] = job;
+	/* Nothing to execute: can happen if the job has finished while
+	 * we were resetting the GPU.
+	 */
+	if (!job->jc)
+		return NULL;
 
 	fence = panfrost_fence_create(pfdev, slot);
 	if (IS_ERR(fence))
@@ -371,11 +427,218 @@ void panfrost_job_enable_interrupts(struct panfrost_device *pfdev)
 	job_write(pfdev, JOB_INT_MASK, irq_mask);
 }
 
-static void panfrost_reset(struct panfrost_device *pfdev,
-			   struct drm_sched_job *bad)
+static void panfrost_job_handle_err(struct panfrost_device *pfdev,
+				    struct panfrost_job *job,
+				    unsigned int js)
 {
-	unsigned int i;
+	u32 js_status = job_read(pfdev, JS_STATUS(js));
+	const char *exception_name = panfrost_exception_name(js_status);
+	bool signal_fence = true;
+
+	if (js_status < DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
+		dev_dbg(pfdev->dev, "js event, js=%d, status=%s, head=0x%x, tail=0x%x",
+			js, exception_name,
+			job_read(pfdev, JS_HEAD_LO(js)),
+			job_read(pfdev, JS_TAIL_LO(js)));
+	} else {
+		dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
+			js, exception_name,
+			job_read(pfdev, JS_HEAD_LO(js)),
+			job_read(pfdev, JS_TAIL_LO(js)));
+	}
+
+	if (js_status == DRM_PANFROST_EXCEPTION_STOPPED) {
+		/* Update the job head so we can resume */
+		job->jc = job_read(pfdev, JS_TAIL_LO(js)) |
+			  ((u64)job_read(pfdev, JS_TAIL_HI(js)) << 32);
+
+		/* The job will be resumed, don't signal the fence */
+		signal_fence = false;
+	} else if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED) {
+		/* Job has been hard-stopped, flag it as canceled */
+		dma_fence_set_error(job->done_fence, -ECANCELED);
+		job->jc = 0;
+	} else if (js_status >= DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
+		/* We might want to provide finer-grained error code based on
+		 * the exception type, but unconditionally setting to EINVAL
+		 * is good enough for now.
+		 */
+		dma_fence_set_error(job->done_fence, -EINVAL);
+		job->jc = 0;
+	}
+
+	panfrost_mmu_as_put(pfdev, job->file_priv->mmu);
+	panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+
+	if (signal_fence)
+		dma_fence_signal_locked(job->done_fence);
+
+	pm_runtime_put_autosuspend(pfdev->dev);
+
+	if (panfrost_exception_needs_reset(pfdev, js_status)) {
+		atomic_set(&pfdev->reset.pending, 1);
+		drm_sched_fault(&pfdev->js->queue[js].sched);
+	}
+}
+
+static void panfrost_job_handle_done(struct panfrost_device *pfdev,
+				     struct panfrost_job *job)
+{
+	/* Set ->jc to 0 to avoid re-submitting an already finished job (can
+	 * happen when we receive the DONE interrupt while doing a GPU reset).
+	 */
+	job->jc = 0;
+	panfrost_mmu_as_put(pfdev, job->file_priv->mmu);
+	panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+
+	dma_fence_signal_locked(job->done_fence);
+	pm_runtime_put_autosuspend(pfdev->dev);
+}
+
+static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
+{
+	struct panfrost_job *done[NUM_JOB_SLOTS][2] = {};
+	struct panfrost_job *failed[NUM_JOB_SLOTS] = {};
+	u32 js_state = 0, js_events = 0;
+	unsigned int i, j;
+
+	/* First we collect all failed/done jobs. */
+	while (status) {
+		u32 js_state_mask = 0;
+
+		for (j = 0; j < NUM_JOB_SLOTS; j++) {
+			if (status & MK_JS_MASK(j))
+				js_state_mask |= MK_JS_MASK(j);
+
+			if (status & JOB_INT_MASK_DONE(j)) {
+				if (done[j][0])
+					done[j][1] = panfrost_dequeue_job(pfdev, j);
+				else
+					done[j][0] = panfrost_dequeue_job(pfdev, j);
+			}
+
+			if (status & JOB_INT_MASK_ERR(j)) {
+				/* Cancel the next submission. Will be submitted
+				 * after we're done handling this failure if
+				 * there's no reset pending.
+				 */
+				job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
+				failed[j] = panfrost_dequeue_job(pfdev, j);
+			}
+		}
+
+		/* JS_STATE is sampled when JOB_INT_CLEAR is written.
+		 * For each BIT(slot) or BIT(slot + 16) bit written to
+		 * JOB_INT_CLEAR, the corresponding bits in JS_STATE
+		 * (BIT(slot) and BIT(slot + 16)) are updated, but this
+		 * is racy. If we only have one job done at the time we
+		 * read JOB_INT_RAWSTAT but the second job fails before we
+		 * clear the status, we end up with a status containing
+		 * only the DONE bit and consider both jobs as DONE since
+		 * JS_STATE reports both NEXT and CURRENT as inactive.
+		 * To prevent that, let's repeat this clear+read steps
+		 * until status is 0.
+		 */
+		job_write(pfdev, JOB_INT_CLEAR, status);
+		js_state &= ~js_state_mask;
+		js_state |= job_read(pfdev, JOB_INT_JS_STATE) & js_state_mask;
+		js_events |= status;
+		status = job_read(pfdev, JOB_INT_RAWSTAT);
+	}
+
+	/* Then we handle the dequeued jobs. */
+	for (j = 0; j < NUM_JOB_SLOTS; j++) {
+		if (!(js_events & MK_JS_MASK(j)))
+			continue;
+
+		if (failed[j]) {
+			panfrost_job_handle_err(pfdev, failed[j], j);
+		} else if (pfdev->jobs[j][0] && !(js_state & MK_JS_MASK(j))) {
+			/* When the current job doesn't fail, the JM dequeues
+			 * the next job without waiting for an ACK, this means
+			 * we can have 2 jobs dequeued and only catch the
+			 * interrupt when the second one is done. If both slots
+			 * are inactive, but one job remains in pfdev->jobs[j],
+			 * consider it done. Of course that doesn't apply if a
+			 * failure happened since we cancelled execution of the
+			 * job in _NEXT (see above).
+			 */
+			if (WARN_ON(!done[j][0]))
+				done[j][0] = panfrost_dequeue_job(pfdev, j);
+			else
+				done[j][1] = panfrost_dequeue_job(pfdev, j);
+		}
+
+		for (i = 0; i < ARRAY_SIZE(done[0]) && done[j][i]; i++)
+			panfrost_job_handle_done(pfdev, done[j][i]);
+	}
+
+	/* And finally we requeue jobs that were waiting in the second slot
+	 * and have been stopped if we detected a failure on the first slot.
+	 */
+	for (j = 0; j < NUM_JOB_SLOTS; j++) {
+		if (!(js_events & MK_JS_MASK(j)))
+			continue;
+
+		if (!failed[j] || !pfdev->jobs[j][0])
+			continue;
+
+		if (pfdev->jobs[j][0]->jc == 0) {
+			/* The job was cancelled, signal the fence now */
+			struct panfrost_job *canceled = panfrost_dequeue_job(pfdev, j);
+
+			dma_fence_set_error(canceled->done_fence, -ECANCELED);
+			panfrost_job_handle_done(pfdev, canceled);
+		} else if (!atomic_read(&pfdev->reset.pending)) {
+			/* Requeue the job we removed if no reset is pending */
+			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_START);
+		}
+	}
+}
+
+static void panfrost_job_handle_irqs(struct panfrost_device *pfdev)
+{
+	u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
+
+	while (status) {
+		pm_runtime_mark_last_busy(pfdev->dev);
+
+		spin_lock(&pfdev->js->job_lock);
+		panfrost_job_handle_irq(pfdev, status);
+		spin_unlock(&pfdev->js->job_lock);
+		status = job_read(pfdev, JOB_INT_RAWSTAT);
+	}
+}
+
+static u32 panfrost_active_slots(struct panfrost_device *pfdev,
+				 u32 *js_state_mask, u32 js_state)
+{
+	u32 rawstat;
+
+	if (!(js_state & *js_state_mask))
+		return 0;
+
+	rawstat = job_read(pfdev, JOB_INT_RAWSTAT);
+	if (rawstat) {
+		unsigned int i;
+
+		for (i = 0; i < NUM_JOB_SLOTS; i++) {
+			if (rawstat & MK_JS_MASK(i))
+				*js_state_mask &= ~MK_JS_MASK(i);
+		}
+	}
+
+	return js_state & *js_state_mask;
+}
+
+static void
+panfrost_reset(struct panfrost_device *pfdev,
+	       struct drm_sched_job *bad)
+{
+	u32 js_state, js_state_mask = 0xffffffff;
+	unsigned int i, j;
 	bool cookie;
+	int ret;
 
 	if (WARN_ON(!atomic_read(&pfdev->reset.pending)))
 		return;
@@ -407,19 +670,44 @@ static void panfrost_reset(struct panfrost_device *pfdev,
 	job_write(pfdev, JOB_INT_MASK, 0);
 	synchronize_irq(pfdev->js->irq);
 
-	/* Schedulers are stopped and interrupts are masked+flushed, we don't
-	 * need to protect the 'evict unfinished jobs' lock with the job_lock.
-	 */
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
-		if (pfdev->jobs[i]) {
-			pm_runtime_put_noidle(pfdev->dev);
-			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
-			pfdev->jobs[i] = NULL;
-		}
+		/* Cancel the next job and soft-stop the running job. */
+		job_write(pfdev, JS_COMMAND_NEXT(i), JS_COMMAND_NOP);
+		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_SOFT_STOP);
 	}
 
+	/* Wait at most 10ms for soft-stops to complete */
+	ret = readl_poll_timeout(pfdev->iomem + JOB_INT_JS_STATE, js_state,
+				 !panfrost_active_slots(pfdev, &js_state_mask, js_state),
+				 10, 10000);
+
+	if (ret)
+		dev_err(pfdev->dev, "Soft-stop failed\n");
+
+	/* Handle the remaining interrupts before we reset. */
+	panfrost_job_handle_irqs(pfdev);
+
+	/* Remaining interrupts have been handled, but we might still have
+	 * stuck jobs. Let's make sure the PM counters stay balanced by
+	 * manually calling pm_runtime_put_noidle() and
+	 * panfrost_devfreq_record_idle() for each stuck job.
+	 */
+	for (i = 0; i < NUM_JOB_SLOTS; i++) {
+		for (j = 0; j < ARRAY_SIZE(pfdev->jobs[0]) && pfdev->jobs[i][j]; j++) {
+			pm_runtime_put_noidle(pfdev->dev);
+			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
+		}
+	}
+	memset(pfdev->jobs, 0, sizeof(pfdev->jobs));
+
+	/* Proceed with reset now. */
 	panfrost_device_reset(pfdev);
 
+	/* panfrost_device_reset() unmasks job interrupts, but we want to
+	 * keep them masked a bit longer.
+	 */
+	job_write(pfdev, JOB_INT_MASK, 0);
+
 	/* GPU has been reset, we can cancel timeout/fault work that may have
 	 * been queued in the meantime and clear the reset pending bit.
 	 */
@@ -428,7 +716,6 @@ static void panfrost_reset(struct panfrost_device *pfdev,
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		cancel_delayed_work(&pfdev->js->queue[i].sched.work_tdr);
 
-
 	/* Now resubmit jobs that were previously queued but didn't have a
 	 * chance to finish.
 	 * FIXME: We temporarily get out of the DMA fence signalling section
@@ -441,9 +728,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
 		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
 	cookie = dma_fence_begin_signalling();
 
+	/* Restart the schedulers */
 	for (i = 0; i < NUM_JOB_SLOTS; i++)
 		drm_sched_start(&pfdev->js->queue[i].sched, true);
 
+	/* Re-enable job interrupts now that everything has been restarted. */
+	job_write(pfdev, JOB_INT_MASK,
+		  GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
+		  GENMASK(NUM_JOB_SLOTS - 1, 0));
+
 	dma_fence_end_signalling(cookie);
 }
 
@@ -475,6 +768,14 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
 	return DRM_GPU_SCHED_STAT_NOMINAL;
 }
 
+static void panfrost_reset_work(struct work_struct *work)
+{
+	struct panfrost_device *pfdev;
+
+	pfdev = container_of(work, struct panfrost_device, reset.work);
+	panfrost_reset(pfdev, NULL);
+}
+
 static const struct drm_sched_backend_ops panfrost_sched_ops = {
 	.dependency = panfrost_job_dependency,
 	.run_job = panfrost_job_run,
@@ -482,94 +783,11 @@ static const struct drm_sched_backend_ops panfrost_sched_ops = {
 	.free_job = panfrost_job_free
 };
 
-static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
-{
-	int j;
-
-	dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
-
-	for (j = 0; status; j++) {
-		u32 mask = MK_JS_MASK(j);
-
-		if (!(status & mask))
-			continue;
-
-		job_write(pfdev, JOB_INT_CLEAR, mask);
-
-		if (status & JOB_INT_MASK_ERR(j)) {
-			u32 js_status = job_read(pfdev, JS_STATUS(j));
-			const char *exception_name = panfrost_exception_name(js_status);
-
-			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
-
-			if (js_status < DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
-				dev_dbg(pfdev->dev, "js interrupt, js=%d, status=%s, head=0x%x, tail=0x%x",
-					j, exception_name,
-					job_read(pfdev, JS_HEAD_LO(j)),
-					job_read(pfdev, JS_TAIL_LO(j)));
-			} else {
-				dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
-					j, exception_name,
-					job_read(pfdev, JS_HEAD_LO(j)),
-					job_read(pfdev, JS_TAIL_LO(j)));
-			}
-
-			/* If we need a reset, signal it to the timeout
-			 * handler, otherwise, update the fence error field and
-			 * signal the job fence.
-			 */
-			if (panfrost_exception_needs_reset(pfdev, js_status)) {
-				drm_sched_fault(&pfdev->js->queue[j].sched);
-			} else {
-				int error = 0;
-
-				if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED)
-					error = -ECANCELED;
-				else if (js_status >= DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
-					error = -EINVAL;
-
-				if (error)
-					dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
-
-				status |= JOB_INT_MASK_DONE(j);
-			}
-		}
-
-		if (status & JOB_INT_MASK_DONE(j)) {
-			struct panfrost_job *job;
-
-			job = pfdev->jobs[j];
-			/* Only NULL if job timeout occurred */
-			WARN_ON(!job);
-			if (job) {
-				pfdev->jobs[j] = NULL;
-
-				panfrost_mmu_as_put(pfdev, job->file_priv->mmu);
-				panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
-
-				dma_fence_signal_locked(job->done_fence);
-				pm_runtime_put_autosuspend(pfdev->dev);
-			}
-		}
-
-		status &= ~mask;
-	}
-}
-
 static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
 {
 	struct panfrost_device *pfdev = data;
-	u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
-
-	while (status) {
-		pm_runtime_mark_last_busy(pfdev->dev);
-
-		spin_lock(&pfdev->js->job_lock);
-		panfrost_job_handle_irq(pfdev, status);
-		spin_unlock(&pfdev->js->job_lock);
-		status = job_read(pfdev, JOB_INT_RAWSTAT);
-	}
 
+	panfrost_job_handle_irqs(pfdev);
 	job_write(pfdev, JOB_INT_MASK,
 		  GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
 		  GENMASK(NUM_JOB_SLOTS - 1, 0));
@@ -588,26 +806,24 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
 	return IRQ_WAKE_THREAD;
 }
 
-static void panfrost_reset_work(struct work_struct *work)
-{
-	struct panfrost_device *pfdev = container_of(work,
-						     struct panfrost_device,
-						     reset.work);
-
-	panfrost_reset(pfdev, NULL);
-}
-
 int panfrost_job_init(struct panfrost_device *pfdev)
 {
 	struct panfrost_job_slot *js;
+	unsigned int nslots = 2;
 	int ret, j;
 
-	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
+	/* All GPUs can queue two jobs per slot, but without jobchain
+	 * disambiguation stopping the right job in the close path is tricky,
+	 * so let's just advertise one slot in that case.
+	 */
+	if (!panfrost_has_hw_feature(pfdev, HW_FEATURE_JOBCHAIN_DISAMBIGUATION))
+		nslots = 1;
 
 	pfdev->js = js = devm_kzalloc(pfdev->dev, sizeof(*js), GFP_KERNEL);
 	if (!js)
 		return -ENOMEM;
 
+	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
 	spin_lock_init(&js->job_lock);
 
 	js->irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
@@ -633,8 +849,9 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
 		ret = drm_sched_init(&js->queue[j].sched,
 				     &panfrost_sched_ops,
-				     1, 0,
-				     msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
+				     nslots, 0,
+				     msecs_to_jiffies(JOB_TIMEOUT_MS),
+				     pfdev->reset.wq,
 				     NULL, "pan_js");
 		if (ret) {
 			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
@@ -700,12 +917,34 @@ void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
 	spin_lock_irqsave(&pfdev->js->job_lock, flags);
 	for (i = 0; i < NUM_JOB_SLOTS; i++) {
 		struct drm_sched_entity *entity = &panfrost_priv->sched_entity[i];
-		struct panfrost_job *job = pfdev->jobs[i];
+		int j;
 
-		if (!job || job->base.entity != entity)
-			continue;
+		for (j = ARRAY_SIZE(pfdev->jobs[0]) - 1; j >= 0; j--) {
+			struct panfrost_job *job = pfdev->jobs[i][j];
+			u32 cmd;
 
-		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
+			if (!job || job->base.entity != entity)
+				continue;
+
+			if (j == 1) {
+				/* Try to cancel the job before it starts */
+				job_write(pfdev, JS_COMMAND_NEXT(i), JS_COMMAND_NOP);
+				/* Reset the job head so it doesn't get restarted if
+				 * the job in the first slot failed.
+				 */
+				job->jc = 0;
+			}
+
+			if (panfrost_has_hw_feature(pfdev, HW_FEATURE_JOBCHAIN_DISAMBIGUATION)) {
+				cmd = panfrost_get_job_chain_flag(job) ?
+				      JS_COMMAND_HARD_STOP_1 :
+				      JS_COMMAND_HARD_STOP_0;
+			} else {
+				cmd = JS_COMMAND_HARD_STOP;
+			}
+
+			job_write(pfdev, JS_COMMAND(i), cmd);
+		}
 	}
 	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
 }
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition
  2021-06-25 13:33 ` [PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition Boris Brezillon
@ 2021-06-25 13:39   ` Alyssa Rosenzweig
  0 siblings, 0 replies; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:39 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

> Exception types will be defined as an enum in panfrost_drm.h so userspace
> and use the same definitions if needed.

s/and/can/, with that R-b

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table
  2021-06-25 13:33 ` [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table Boris Brezillon
@ 2021-06-25 13:41   ` Alyssa Rosenzweig
  2021-06-25 15:35   ` Steven Price
  1 sibling, 0 replies; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:41 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

R-b

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace
  2021-06-25 13:33 ` [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace Boris Brezillon
@ 2021-06-25 13:42   ` Alyssa Rosenzweig
  2021-06-25 14:21     ` Boris Brezillon
  0 siblings, 1 reply; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:42 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

I'm not convinced. Right now most of our UABI is pleasantly
GPU-agnostic. With this suddenly there's divergence between Midgard and
Bifrost uABI. With that drawback in mind, could you explain the benefit?

On Fri, Jun 25, 2021 at 03:33:17PM +0200, Boris Brezillon wrote:
> Job headers contain an exception type field which might be read and
> converted to a human readable string by tracing tools. Let's expose
> the exception type as an enum so we share the same definition.
> 
> v3:
> * Add missing values
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  include/uapi/drm/panfrost_drm.h | 71 +++++++++++++++++++++++++++++++++
>  1 file changed, 71 insertions(+)
> 
> diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> index ec19db1eead8..899cd6d952d4 100644
> --- a/include/uapi/drm/panfrost_drm.h
> +++ b/include/uapi/drm/panfrost_drm.h
> @@ -223,6 +223,77 @@ struct drm_panfrost_madvise {
>  	__u32 retained;       /* out, whether backing store still exists */
>  };
>  
> +/* The exception types */
> +
> +enum drm_panfrost_exception_type {
> +	DRM_PANFROST_EXCEPTION_OK = 0x00,
> +	DRM_PANFROST_EXCEPTION_DONE = 0x01,
> +	DRM_PANFROST_EXCEPTION_INTERRUPTED = 0x02,
> +	DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
> +	DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
> +	DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
> +	DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
> +	DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
> +	DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
> +	DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
> +	DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
> +	DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
> +	DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
> +	DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
> +	DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
> +	DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
> +	DRM_PANFROST_EXCEPTION_INSTR_TYPE_MISMATCH = 0x52,
> +	DRM_PANFROST_EXCEPTION_INSTR_OPERAND_FAULT = 0x53,
> +	DRM_PANFROST_EXCEPTION_INSTR_TLS_FAULT = 0x54,
> +	DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
> +	DRM_PANFROST_EXCEPTION_INSTR_ALIGN_FAULT = 0x56,
> +	DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
> +	DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
> +	DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
> +	DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
> +	DRM_PANFROST_EXCEPTION_OOM = 0x60,
> +	DRM_PANFROST_EXCEPTION_OOM_AFBC = 0x61,
> +	DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
> +	DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
> +	DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
> +	DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
> +	DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
> +};
> +
>  #if defined(__cplusplus)
>  }
>  #endif
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close
  2021-06-25 13:33 ` [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close Boris Brezillon
@ 2021-06-25 13:43   ` Lucas Stach
  2021-06-25 14:46     ` Boris Brezillon
  0 siblings, 1 reply; 39+ messages in thread
From: Lucas Stach @ 2021-06-25 13:43 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Steven Price, Rob Herring, Alyssa Rosenzweig, Robin Murphy

Am Freitag, dem 25.06.2021 um 15:33 +0200 schrieb Boris Brezillon:
> If the process that submitted these jobs decided to close the FD before
> the jobs are done, it probably means it doesn't care about the result.
> 
> v3:
> * Set fence error to ECANCELED when a TERMINATED exception is received
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 43 +++++++++++++++++++++----
>  1 file changed, 37 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 948bd174ff99..aa1e6542adde 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -498,14 +498,21 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  
>  		if (status & JOB_INT_MASK_ERR(j)) {
>  			u32 js_status = job_read(pfdev, JS_STATUS(j));
> +			const char *exception_name = panfrost_exception_name(js_status);
>  
>  			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
>  
> -			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
> -				j,
> -				panfrost_exception_name(js_status),
> -				job_read(pfdev, JS_HEAD_LO(j)),
> -				job_read(pfdev, JS_TAIL_LO(j)));
> +			if (js_status < DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
> +				dev_dbg(pfdev->dev, "js interrupt, js=%d, status=%s, head=0x%x, tail=0x%x",
> +					j, exception_name,
> +					job_read(pfdev, JS_HEAD_LO(j)),
> +					job_read(pfdev, JS_TAIL_LO(j)));
> +			} else {
> +				dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
> +					j, exception_name,
> +					job_read(pfdev, JS_HEAD_LO(j)),
> +					job_read(pfdev, JS_TAIL_LO(j)));
> +			}
>  
>  			/* If we need a reset, signal it to the timeout
>  			 * handler, otherwise, update the fence error field and
> @@ -514,7 +521,16 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  			if (panfrost_exception_needs_reset(pfdev, js_status)) {
>  				drm_sched_fault(&pfdev->js->queue[j].sched);
>  			} else {
> -				dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
> +				int error = 0;
> +
> +				if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED)
> +					error = -ECANCELED;
> +				else if (js_status >= DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
> +					error = -EINVAL;
> +
> +				if (error)
> +					dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
> +
>  				status |= JOB_INT_MASK_DONE(j);
>  			}
>  		}
> @@ -673,10 +689,25 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
>  
>  void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
>  {
> +	struct panfrost_device *pfdev = panfrost_priv->pfdev;
> +	unsigned long flags;
>  	int i;
>  
>  	for (i = 0; i < NUM_JOB_SLOTS; i++)
>  		drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
> +
> +	/* Kill in-flight jobs */
> +	spin_lock_irqsave(&pfdev->js->job_lock, flags);

Micro-optimization, but this code is never called from IRQ context, so
a spin_lock_irq would do here, no need to save/restore flags.
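
A minimal sketch of that variant, reusing the names from the quoted
hunk (only an illustration, assuming the close path always runs in
process context):

	spin_lock_irq(&pfdev->js->job_lock);
	for (i = 0; i < NUM_JOB_SLOTS; i++) {
		struct panfrost_job *job = pfdev->jobs[i];

		if (!job || job->base.entity != &panfrost_priv->sched_entity[i])
			continue;

		/* Hard-stop the in-flight job owned by the closing FD. */
		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
	}
	spin_unlock_irq(&pfdev->js->job_lock);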

Regards,
Lucas

> +	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> +		struct drm_sched_entity *entity = &panfrost_priv->sched_entity[i];
> +		struct panfrost_job *job = pfdev->jobs[i];
> +
> +		if (!job || job->base.entity != entity)
> +			continue;
> +
> +		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
> +	}
> +	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
>  }
>  
>  int panfrost_job_is_idle(struct panfrost_device *pfdev)



^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name()
  2021-06-25 13:33 ` [PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name() Boris Brezillon
@ 2021-06-25 13:45   ` Alyssa Rosenzweig
  0 siblings, 0 replies; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:45 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

> Currently unused. We'll add it back if we need per-GPU definitions.

We will, if not for Valhall v9, certainly for Valhall v10 with the CSF..

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate
  2021-06-25 13:33 ` [PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate Boris Brezillon
@ 2021-06-25 13:45   ` Alyssa Rosenzweig
  0 siblings, 0 replies; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:45 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

R-b

On Fri, Jun 25, 2021 at 03:33:14PM +0200, Boris Brezillon wrote:
> If the fence creation fails, we can return the error pointer directly.
> The core will update the fence error accordingly.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> Reviewed-by: Steven Price <steven.price@arm.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 8ff79fd49577..d6c9698bca3b 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -355,7 +355,7 @@ static struct dma_fence *panfrost_job_run(struct drm_sched_job *sched_job)
>  
>  	fence = panfrost_fence_create(pfdev, slot);
>  	if (IS_ERR(fence))
> -		return NULL;
> +		return fence;
>  
>  	if (job->done_fence)
>  		dma_fence_put(job->done_fence);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset
  2021-06-25 13:33 ` [PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset Boris Brezillon
@ 2021-06-25 13:46   ` Alyssa Rosenzweig
  0 siblings, 0 replies; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:46 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

R-b

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts
  2021-06-25 13:33 ` [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts Boris Brezillon
@ 2021-06-25 13:47   ` Alyssa Rosenzweig
  2021-06-25 14:37     ` Boris Brezillon
  2021-06-25 15:40   ` Steven Price
  1 sibling, 1 reply; 39+ messages in thread
From: Alyssa Rosenzweig @ 2021-06-25 13:47 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

A-b, but could you explain the context? Thanks

On Fri, Jun 25, 2021 at 03:33:20PM +0200, Boris Brezillon wrote:
> This should avoid switching to interrupt context when the GPU is under
> heavy use.
> 
> v3:
> * Don't take the job_lock in panfrost_job_handle_irq()
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++++++++++++++++++-------
>  1 file changed, 38 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index be8f68f63974..e0c479e67304 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops panfrost_sched_ops = {
>  	.free_job = panfrost_job_free
>  };
>  
> -static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> +static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  {
> -	struct panfrost_device *pfdev = data;
> -	u32 status = job_read(pfdev, JOB_INT_STAT);
>  	int j;
>  
>  	dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
>  
> -	if (!status)
> -		return IRQ_NONE;
> -
> -	pm_runtime_mark_last_busy(pfdev->dev);
> -
>  	for (j = 0; status; j++) {
>  		u32 mask = MK_JS_MASK(j);
>  
> @@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
>  		if (status & JOB_INT_MASK_DONE(j)) {
>  			struct panfrost_job *job;
>  
> -			spin_lock(&pfdev->js->job_lock);
>  			job = pfdev->jobs[j];
>  			/* Only NULL if job timeout occurred */
>  			if (job) {
> @@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
>  				dma_fence_signal_locked(job->done_fence);
>  				pm_runtime_put_autosuspend(pfdev->dev);
>  			}
> -			spin_unlock(&pfdev->js->job_lock);
>  		}
>  
>  		status &= ~mask;
>  	}
> +}
>  
> +static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
> +{
> +	struct panfrost_device *pfdev = data;
> +	u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
> +
> +	while (status) {
> +		pm_runtime_mark_last_busy(pfdev->dev);
> +
> +		spin_lock(&pfdev->js->job_lock);
> +		panfrost_job_handle_irq(pfdev, status);
> +		spin_unlock(&pfdev->js->job_lock);
> +		status = job_read(pfdev, JOB_INT_RAWSTAT);
> +	}
> +
> +	job_write(pfdev, JOB_INT_MASK,
> +		  GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
> +		  GENMASK(NUM_JOB_SLOTS - 1, 0));
>  	return IRQ_HANDLED;
>  }
>  
> +static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> +{
> +	struct panfrost_device *pfdev = data;
> +	u32 status = job_read(pfdev, JOB_INT_STAT);
> +
> +	if (!status)
> +		return IRQ_NONE;
> +
> +	job_write(pfdev, JOB_INT_MASK, 0);
> +	return IRQ_WAKE_THREAD;
> +}
> +
>  static void panfrost_reset(struct work_struct *work)
>  {
>  	struct panfrost_device *pfdev = container_of(work,
>  						     struct panfrost_device,
>  						     reset.work);
> -	unsigned long flags;
>  	unsigned int i;
>  	bool cookie;
>  
> @@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
>  	/* All timers have been stopped, we can safely reset the pending state. */
>  	atomic_set(&pfdev->reset.pending, 0);
>  
> -	spin_lock_irqsave(&pfdev->js->job_lock, flags);
> +	spin_lock(&pfdev->js->job_lock);
>  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>  		if (pfdev->jobs[i]) {
>  			pm_runtime_put_noidle(pfdev->dev);
> @@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
>  			pfdev->jobs[i] = NULL;
>  		}
>  	}
> -	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> +	spin_unlock(&pfdev->js->job_lock);
>  
>  	panfrost_device_reset(pfdev);
>  
> @@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  	if (irq <= 0)
>  		return -ENODEV;
>  
> -	ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
> -			       IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
> +	ret = devm_request_threaded_irq(pfdev->dev, irq,
> +					panfrost_job_irq_handler,
> +					panfrost_job_irq_handler_thread,
> +					IRQF_SHARED, KBUILD_MODNAME "-job",
> +					pfdev);
>  	if (ret) {
>  		dev_err(pfdev->dev, "failed to request job irq");
>  		return ret;
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace
  2021-06-25 13:42   ` Alyssa Rosenzweig
@ 2021-06-25 14:21     ` Boris Brezillon
  2021-06-25 15:32       ` Steven Price
  0 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 14:21 UTC (permalink / raw)
  To: Alyssa Rosenzweig
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 09:42:08 -0400
Alyssa Rosenzweig <alyssa@collabora.com> wrote:

> I'm not convinced. Right now most of our UABI is pleasantly
> GPU-agnostic. With this suddenly there's divergence between Midgard and
> Bifrost uABI.

Hm, I don't see why. I mean the exception types seem to be the same,
there are just some that are not used on Midgard and some that are not
used on Bifrost. Are there any collisions I didn't notice?

> With that drawback in mind, could you explain the benefit?

Well, I thought having these definitions in a central place would be a
good thing given they're not expected to change even if they might
be per-GPU. I don't know if that changes with CSF, maybe the exception
codes are no longer set in stone and can change with FW update...

> 
> On Fri, Jun 25, 2021 at 03:33:17PM +0200, Boris Brezillon wrote:
> > Job headers contain an exception type field which might be read and
> > converted to a human readable string by tracing tools. Let's expose
> > the exception type as an enum so we share the same definition.
> > 
> > v3:
> > * Add missing values
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  include/uapi/drm/panfrost_drm.h | 71 +++++++++++++++++++++++++++++++++
> >  1 file changed, 71 insertions(+)
> > 
> > diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
> > index ec19db1eead8..899cd6d952d4 100644
> > --- a/include/uapi/drm/panfrost_drm.h
> > +++ b/include/uapi/drm/panfrost_drm.h
> > @@ -223,6 +223,77 @@ struct drm_panfrost_madvise {
> >  	__u32 retained;       /* out, whether backing store still exists */
> >  };
> >  
> > +/* The exception types */
> > +
> > +enum drm_panfrost_exception_type {
> > +	DRM_PANFROST_EXCEPTION_OK = 0x00,
> > +	DRM_PANFROST_EXCEPTION_DONE = 0x01,
> > +	DRM_PANFROST_EXCEPTION_INTERRUPTED = 0x02,
> > +	DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
> > +	DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
> > +	DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
> > +	DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
> > +	DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
> > +	DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
> > +	DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
> > +	DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
> > +	DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
> > +	DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
> > +	DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
> > +	DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
> > +	DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
> > +	DRM_PANFROST_EXCEPTION_INSTR_TYPE_MISMATCH = 0x52,
> > +	DRM_PANFROST_EXCEPTION_INSTR_OPERAND_FAULT = 0x53,
> > +	DRM_PANFROST_EXCEPTION_INSTR_TLS_FAULT = 0x54,
> > +	DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
> > +	DRM_PANFROST_EXCEPTION_INSTR_ALIGN_FAULT = 0x56,
> > +	DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
> > +	DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
> > +	DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
> > +	DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
> > +	DRM_PANFROST_EXCEPTION_OOM = 0x60,
> > +	DRM_PANFROST_EXCEPTION_OOM_AFBC = 0x61,
> > +	DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
> > +	DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
> > +	DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
> > +	DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
> > +	DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
> > +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
> > +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
> > +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
> > +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
> > +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
> > +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
> > +	DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
> > +	DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
> > +	DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
> > +	DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
> > +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
> > +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
> > +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
> > +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
> > +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
> > +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
> > +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
> > +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
> > +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
> > +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
> > +};
> > +
> >  #if defined(__cplusplus)
> >  }
> >  #endif
> > -- 
> > 2.31.1
> >   


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts
  2021-06-25 13:47   ` Alyssa Rosenzweig
@ 2021-06-25 14:37     ` Boris Brezillon
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 14:37 UTC (permalink / raw)
  To: Alyssa Rosenzweig
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 09:47:59 -0400
Alyssa Rosenzweig <alyssa@collabora.com> wrote:

> A-b, but could you explain the context? Thanks

The rationale behind this change is the complexity added to the
interrupt handler in patch 15. That means we might spend more time in
interrupt context after that patch and block other things on the system
while we dequeue job irqs. Moving things to a thread also helps
performance when the GPU executes jobs faster than the CPU can queue
them: in that case we keep switching back and forth between interrupt
and non-interrupt context, which has a cost.

One drawback is increased latency when a job event arrives while the
thread is idle, since the thread has to be woken up in that case.
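
For reference, the generic shape of that split looks roughly like the
sketch below (illustrative foo_* names, not the actual panfrost
handlers, which are in the hunks quoted underneath): the hard handler
only masks the IRQ and wakes the thread, and the thread drains events
in process context before unmasking again.

static irqreturn_t foo_irq_handler(int irq, void *data)
{
	struct foo_device *fdev = data;

	if (!foo_read(fdev, FOO_INT_STAT))
		return IRQ_NONE;

	/* Mask the IRQ so it doesn't fire again while the thread runs. */
	foo_write(fdev, FOO_INT_MASK, 0);
	return IRQ_WAKE_THREAD;
}

static irqreturn_t foo_irq_handler_thread(int irq, void *data)
{
	struct foo_device *fdev = data;
	u32 status;

	/* Drain all pending events in process context. */
	while ((status = foo_read(fdev, FOO_INT_RAWSTAT)))
		foo_handle_events(fdev, status);

	/* Re-enable the IRQ before going back to sleep. */
	foo_write(fdev, FOO_INT_MASK, FOO_INT_ALL);
	return IRQ_HANDLED;
}

/* registered with:
 *	devm_request_threaded_irq(dev, irq, foo_irq_handler,
 *				  foo_irq_handler_thread, IRQF_SHARED,
 *				  "foo-job", fdev);
 */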

> 
> On Fri, Jun 25, 2021 at 03:33:20PM +0200, Boris Brezillon wrote:
> > This should avoid switching to interrupt context when the GPU is under
> > heavy use.
> > 
> > v3:
> > * Don't take the job_lock in panfrost_job_handle_irq()
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++++++++++++++++++-------
> >  1 file changed, 38 insertions(+), 15 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index be8f68f63974..e0c479e67304 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops panfrost_sched_ops = {
> >  	.free_job = panfrost_job_free
> >  };
> >  
> > -static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> > +static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
> >  {
> > -	struct panfrost_device *pfdev = data;
> > -	u32 status = job_read(pfdev, JOB_INT_STAT);
> >  	int j;
> >  
> >  	dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
> >  
> > -	if (!status)
> > -		return IRQ_NONE;
> > -
> > -	pm_runtime_mark_last_busy(pfdev->dev);
> > -
> >  	for (j = 0; status; j++) {
> >  		u32 mask = MK_JS_MASK(j);
> >  
> > @@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> >  		if (status & JOB_INT_MASK_DONE(j)) {
> >  			struct panfrost_job *job;
> >  
> > -			spin_lock(&pfdev->js->job_lock);
> >  			job = pfdev->jobs[j];
> >  			/* Only NULL if job timeout occurred */
> >  			if (job) {
> > @@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> >  				dma_fence_signal_locked(job->done_fence);
> >  				pm_runtime_put_autosuspend(pfdev->dev);
> >  			}
> > -			spin_unlock(&pfdev->js->job_lock);
> >  		}
> >  
> >  		status &= ~mask;
> >  	}
> > +}
> >  
> > +static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
> > +{
> > +	struct panfrost_device *pfdev = data;
> > +	u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
> > +
> > +	while (status) {
> > +		pm_runtime_mark_last_busy(pfdev->dev);
> > +
> > +		spin_lock(&pfdev->js->job_lock);
> > +		panfrost_job_handle_irq(pfdev, status);
> > +		spin_unlock(&pfdev->js->job_lock);
> > +		status = job_read(pfdev, JOB_INT_RAWSTAT);
> > +	}
> > +
> > +	job_write(pfdev, JOB_INT_MASK,
> > +		  GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
> > +		  GENMASK(NUM_JOB_SLOTS - 1, 0));
> >  	return IRQ_HANDLED;
> >  }
> >  
> > +static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> > +{
> > +	struct panfrost_device *pfdev = data;
> > +	u32 status = job_read(pfdev, JOB_INT_STAT);
> > +
> > +	if (!status)
> > +		return IRQ_NONE;
> > +
> > +	job_write(pfdev, JOB_INT_MASK, 0);
> > +	return IRQ_WAKE_THREAD;
> > +}
> > +
> >  static void panfrost_reset(struct work_struct *work)
> >  {
> >  	struct panfrost_device *pfdev = container_of(work,
> >  						     struct panfrost_device,
> >  						     reset.work);
> > -	unsigned long flags;
> >  	unsigned int i;
> >  	bool cookie;
> >  
> > @@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
> >  	/* All timers have been stopped, we can safely reset the pending state. */
> >  	atomic_set(&pfdev->reset.pending, 0);
> >  
> > -	spin_lock_irqsave(&pfdev->js->job_lock, flags);
> > +	spin_lock(&pfdev->js->job_lock);
> >  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> >  		if (pfdev->jobs[i]) {
> >  			pm_runtime_put_noidle(pfdev->dev);
> > @@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
> >  			pfdev->jobs[i] = NULL;
> >  		}
> >  	}
> > -	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> > +	spin_unlock(&pfdev->js->job_lock);
> >  
> >  	panfrost_device_reset(pfdev);
> >  
> > @@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
> >  	if (irq <= 0)
> >  		return -ENODEV;
> >  
> > -	ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
> > -			       IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
> > +	ret = devm_request_threaded_irq(pfdev->dev, irq,
> > +					panfrost_job_irq_handler,
> > +					panfrost_job_irq_handler_thread,
> > +					IRQF_SHARED, KBUILD_MODNAME "-job",
> > +					pfdev);
> >  	if (ret) {
> >  		dev_err(pfdev->dev, "failed to request job irq");
> >  		return ret;
> > -- 
> > 2.31.1
> >   


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close
  2021-06-25 13:43   ` Lucas Stach
@ 2021-06-25 14:46     ` Boris Brezillon
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 14:46 UTC (permalink / raw)
  To: Lucas Stach
  Cc: Tomeu Vizoso, dri-devel, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 15:43:45 +0200
Lucas Stach <l.stach@pengutronix.de> wrote:

> Am Freitag, dem 25.06.2021 um 15:33 +0200 schrieb Boris Brezillon:
> > If the process who submitted these jobs decided to close the FD before
> > the jobs are done it probably means it doesn't care about the result.
> > 
> > v3:
> > * Set fence error to ECANCELED when a TERMINATED exception is received
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_job.c | 43 +++++++++++++++++++++----
> >  1 file changed, 37 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index 948bd174ff99..aa1e6542adde 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -498,14 +498,21 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
> >  
> >  		if (status & JOB_INT_MASK_ERR(j)) {
> >  			u32 js_status = job_read(pfdev, JS_STATUS(j));
> > +			const char *exception_name = panfrost_exception_name(js_status);
> >  
> >  			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
> >  
> > -			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
> > -				j,
> > -				panfrost_exception_name(js_status),
> > -				job_read(pfdev, JS_HEAD_LO(j)),
> > -				job_read(pfdev, JS_TAIL_LO(j)));
> > +			if (js_status < DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT) {
> > +				dev_dbg(pfdev->dev, "js interrupt, js=%d, status=%s, head=0x%x, tail=0x%x",
> > +					j, exception_name,
> > +					job_read(pfdev, JS_HEAD_LO(j)),
> > +					job_read(pfdev, JS_TAIL_LO(j)));
> > +			} else {
> > +				dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
> > +					j, exception_name,
> > +					job_read(pfdev, JS_HEAD_LO(j)),
> > +					job_read(pfdev, JS_TAIL_LO(j)));
> > +			}
> >  
> >  			/* If we need a reset, signal it to the timeout
> >  			 * handler, otherwise, update the fence error field and
> > @@ -514,7 +521,16 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
> >  			if (panfrost_exception_needs_reset(pfdev, js_status)) {
> >  				drm_sched_fault(&pfdev->js->queue[j].sched);
> >  			} else {
> > -				dma_fence_set_error(pfdev->jobs[j]->done_fence, -EINVAL);
> > +				int error = 0;
> > +
> > +				if (js_status == DRM_PANFROST_EXCEPTION_TERMINATED)
> > +					error = -ECANCELED;
> > +				else if (js_status >= DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT)
> > +					error = -EINVAL;
> > +
> > +				if (error)
> > +					dma_fence_set_error(pfdev->jobs[j]->done_fence, error);
> > +
> >  				status |= JOB_INT_MASK_DONE(j);
> >  			}
> >  		}
> > @@ -673,10 +689,25 @@ int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
> >  
> >  void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
> >  {
> > +	struct panfrost_device *pfdev = panfrost_priv->pfdev;
> > +	unsigned long flags;
> >  	int i;
> >  
> >  	for (i = 0; i < NUM_JOB_SLOTS; i++)
> >  		drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);
> > +
> > +	/* Kill in-flight jobs */
> > +	spin_lock_irqsave(&pfdev->js->job_lock, flags);  
> 
> Micro-optimization, but this code is never called from IRQ context, so
> a spin_lock_irq would do here, no need to save/restore flags.

Ah, right, I moved patches around. This patch was before the 'move to
threaded-irq' one in v2, but now that it's coming after, we can use a
regular lock here.
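
Concretely, with the irqsave dance dropped, the close path would
presumably end up looking like the sketch below (not the final code,
just the variant discussed here, assuming job_lock is only ever taken
from process context once the job IRQ is threaded):

void panfrost_job_close(struct panfrost_file_priv *panfrost_priv)
{
	struct panfrost_device *pfdev = panfrost_priv->pfdev;
	int i;

	for (i = 0; i < NUM_JOB_SLOTS; i++)
		drm_sched_entity_destroy(&panfrost_priv->sched_entity[i]);

	/* Kill in-flight jobs. job_lock is only taken from process
	 * context now, so a plain spin_lock is enough here.
	 */
	spin_lock(&pfdev->js->job_lock);
	for (i = 0; i < NUM_JOB_SLOTS; i++) {
		struct panfrost_job *job = pfdev->jobs[i];

		if (!job || job->base.entity != &panfrost_priv->sched_entity[i])
			continue;

		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
	}
	spin_unlock(&pfdev->js->job_lock);
}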

> 
> Regards,
> Lucas
> 
> > +	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> > +		struct drm_sched_entity *entity = &panfrost_priv->sched_entity[i];
> > +		struct panfrost_job *job = pfdev->jobs[i];
> > +
> > +		if (!job || job->base.entity != entity)
> > +			continue;
> > +
> > +		job_write(pfdev, JS_COMMAND(i), JS_COMMAND_HARD_STOP);
> > +	}
> > +	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> >  }
> >  
> >  int panfrost_job_is_idle(struct panfrost_device *pfdev)  
> 
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  2021-06-25 13:33 ` [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr Boris Brezillon
@ 2021-06-25 15:07   ` Steven Price
  2021-06-25 15:18     ` Boris Brezillon
  0 siblings, 1 reply; 39+ messages in thread
From: Steven Price @ 2021-06-25 15:07 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 14:33, Boris Brezillon wrote:
> Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> reset. This leads to extra complexity when we need to synchronize timeout
> works with the reset work. One solution to address that is to have an
> ordered workqueue at the driver level that will be used by the different
> schedulers to queue their timeout work. Thanks to the serialization
> provided by the ordered workqueue we are guaranteed that timeout
> handlers are executed sequentially, and can thus easily reset the GPU
> from the timeout handler without extra synchronization.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

I feel like I'm missing something here - I can't see where
sched->timeout_wq is ever actually used in this series. There's clearly
no point passing it into the drm core if the drm core never accesses it.
AFAICT the changes are all in patch 9 and that doesn't depend on this one.

Steve

> ---
>  drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
>  drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
>  drivers/gpu/drm/lima/lima_sched.c         |  3 ++-
>  drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
>  drivers/gpu/drm/scheduler/sched_main.c    |  6 +++++-
>  drivers/gpu/drm/v3d/v3d_sched.c           | 10 +++++-----
>  include/drm/gpu_scheduler.h               |  5 ++++-
>  7 files changed, 21 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> index 47ea46859618..532636ea20bc 100644
> --- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> +++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
> @@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
>  
>  	r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
>  			   num_hw_submission, amdgpu_job_hang_limit,
> -			   timeout, sched_score, ring->name);
> +			   timeout, NULL, sched_score, ring->name);
>  	if (r) {
>  		DRM_ERROR("Failed to create scheduler on ring %s.\n",
>  			  ring->name);
> diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> index 19826e504efc..feb6da1b6ceb 100644
> --- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> +++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
> @@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
>  
>  	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
>  			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
> -			     msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
> +			     msecs_to_jiffies(500), NULL, NULL,
> +			     dev_name(gpu->dev));
>  	if (ret)
>  		return ret;
>  
> diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
> index ecf3267334ff..dba8329937a3 100644
> --- a/drivers/gpu/drm/lima/lima_sched.c
> +++ b/drivers/gpu/drm/lima/lima_sched.c
> @@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
>  	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
>  
>  	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
> -			      lima_job_hang_limit, msecs_to_jiffies(timeout),
> +			      lima_job_hang_limit,
> +			      msecs_to_jiffies(timeout), NULL,
>  			      NULL, name);
>  }
>  
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 682f2161b999..8ff79fd49577 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  
>  		ret = drm_sched_init(&js->queue[j].sched,
>  				     &panfrost_sched_ops,
> -				     1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
> +				     1, 0,
> +				     msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
>  				     NULL, "pan_js");
>  		if (ret) {
>  			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
> diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
> index c0a2f8f8d472..a937d0529944 100644
> --- a/drivers/gpu/drm/scheduler/sched_main.c
> +++ b/drivers/gpu/drm/scheduler/sched_main.c
> @@ -837,6 +837,8 @@ static int drm_sched_main(void *param)
>   * @hw_submission: number of hw submissions that can be in flight
>   * @hang_limit: number of times to allow a job to hang before dropping it
>   * @timeout: timeout value in jiffies for the scheduler
> + * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
> + *		used
>   * @score: optional score atomic shared with other schedulers
>   * @name: name used for debugging
>   *
> @@ -844,7 +846,8 @@ static int drm_sched_main(void *param)
>   */
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   const struct drm_sched_backend_ops *ops,
> -		   unsigned hw_submission, unsigned hang_limit, long timeout,
> +		   unsigned hw_submission, unsigned hang_limit,
> +		   long timeout, struct workqueue_struct *timeout_wq,
>  		   atomic_t *score, const char *name)
>  {
>  	int i, ret;
> @@ -852,6 +855,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
>  	sched->hw_submission_limit = hw_submission;
>  	sched->name = name;
>  	sched->timeout = timeout;
> +	sched->timeout_wq = timeout_wq ? : system_wq;
>  	sched->hang_limit = hang_limit;
>  	sched->score = score ? score : &sched->_score;
>  	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
> diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
> index 8992480c88fa..a39bdd5cfc4f 100644
> --- a/drivers/gpu/drm/v3d/v3d_sched.c
> +++ b/drivers/gpu/drm/v3d/v3d_sched.c
> @@ -402,7 +402,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
>  			     &v3d_bin_sched_ops,
>  			     hw_jobs_limit, job_hang_limit,
> -			     msecs_to_jiffies(hang_limit_ms),
> +			     msecs_to_jiffies(hang_limit_ms), NULL,
>  			     NULL, "v3d_bin");
>  	if (ret) {
>  		dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
> @@ -412,7 +412,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
>  			     &v3d_render_sched_ops,
>  			     hw_jobs_limit, job_hang_limit,
> -			     msecs_to_jiffies(hang_limit_ms),
> +			     msecs_to_jiffies(hang_limit_ms), NULL,
>  			     NULL, "v3d_render");
>  	if (ret) {
>  		dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
> @@ -424,7 +424,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
>  			     &v3d_tfu_sched_ops,
>  			     hw_jobs_limit, job_hang_limit,
> -			     msecs_to_jiffies(hang_limit_ms),
> +			     msecs_to_jiffies(hang_limit_ms), NULL,
>  			     NULL, "v3d_tfu");
>  	if (ret) {
>  		dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
> @@ -437,7 +437,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
>  				     &v3d_csd_sched_ops,
>  				     hw_jobs_limit, job_hang_limit,
> -				     msecs_to_jiffies(hang_limit_ms),
> +				     msecs_to_jiffies(hang_limit_ms), NULL,
>  				     NULL, "v3d_csd");
>  		if (ret) {
>  			dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
> @@ -449,7 +449,7 @@ v3d_sched_init(struct v3d_dev *v3d)
>  		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
>  				     &v3d_cache_clean_sched_ops,
>  				     hw_jobs_limit, job_hang_limit,
> -				     msecs_to_jiffies(hang_limit_ms),
> +				     msecs_to_jiffies(hang_limit_ms), NULL,
>  				     NULL, "v3d_cache_clean");
>  		if (ret) {
>  			dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
> diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
> index 10225a0a35d0..d4cdc906709e 100644
> --- a/include/drm/gpu_scheduler.h
> +++ b/include/drm/gpu_scheduler.h
> @@ -269,6 +269,7 @@ struct drm_sched_backend_ops {
>   *                 finished.
>   * @hw_rq_count: the number of jobs currently in the hardware queue.
>   * @job_id_count: used to assign unique id to the each job.
> + * @timeout_wq: workqueue used to queue @work_tdr
>   * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
>   *            timeout interval is over.
>   * @thread: the kthread on which the scheduler which run.
> @@ -293,6 +294,7 @@ struct drm_gpu_scheduler {
>  	wait_queue_head_t		job_scheduled;
>  	atomic_t			hw_rq_count;
>  	atomic64_t			job_id_count;
> +	struct workqueue_struct		*timeout_wq;
>  	struct delayed_work		work_tdr;
>  	struct task_struct		*thread;
>  	struct list_head		pending_list;
> @@ -306,7 +308,8 @@ struct drm_gpu_scheduler {
>  
>  int drm_sched_init(struct drm_gpu_scheduler *sched,
>  		   const struct drm_sched_backend_ops *ops,
> -		   uint32_t hw_submission, unsigned hang_limit, long timeout,
> +		   uint32_t hw_submission, unsigned hang_limit,
> +		   long timeout, struct workqueue_struct *timeout_wq,
>  		   atomic_t *score, const char *name);
>  
>  void drm_sched_fini(struct drm_gpu_scheduler *sched);
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr
  2021-06-25 15:07   ` Steven Price
@ 2021-06-25 15:18     ` Boris Brezillon
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 15:18 UTC (permalink / raw)
  To: Steven Price
  Cc: Tomeu Vizoso, dri-devel, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 16:07:03 +0100
Steven Price <steven.price@arm.com> wrote:

> On 25/06/2021 14:33, Boris Brezillon wrote:
> > Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
> > reset. This leads to extra complexity when we need to synchronize timeout
> > works with the reset work. One solution to address that is to have an
> > ordered workqueue at the driver level that will be used by the different
> > schedulers to queue their timeout work. Thanks to the serialization
> > provided by the ordered workqueue we are guaranteed that timeout
> > handlers are executed sequentially, and can thus easily reset the GPU
> > from the timeout handler without extra synchronization.
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>  
> 
> I feel like I'm missing something here - I can't see where
> sched->timeout_wq is ever actually used in this series. There's clearly
> no point passing it into the drm core if the drm core never accesses it.
> AFAICT the changes are all in patch 9 and that doesn't depend on this one.

Oops, indeed, I forgot to patch sched_main.c to use the timeout_wq (below
is a version doing that). We really need a way to trigger this sort of
race...

--->8---
From 18bb739da5a5fc3e36d2c4378408c6938198993c Mon Sep 17 00:00:00 2001
From: Boris Brezillon <boris.brezillon@collabora.com>
Date: Wed, 23 Jun 2021 16:14:01 +0200
Subject: [PATCH] drm/sched: Allow using a dedicated workqueue for the
 timeout/fault tdr

Mali Midgard/Bifrost GPUs have 3 hardware queues but only a global GPU
reset. This leads to extra complexity when we need to synchronize timeout
works with the reset work. One solution to address that is to have an
ordered workqueue at the driver level that will be used by the different
schedulers to queue their timeout work. Thanks to the serialization
provided by the ordered workqueue we are guaranteed that timeout
handlers are executed sequentially, and can thus easily reset the GPU
from the timeout handler without extra synchronization.

Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
---
 drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c |  2 +-
 drivers/gpu/drm/etnaviv/etnaviv_sched.c   |  3 ++-
 drivers/gpu/drm/lima/lima_sched.c         |  3 ++-
 drivers/gpu/drm/panfrost/panfrost_job.c   |  3 ++-
 drivers/gpu/drm/scheduler/sched_main.c    | 14 +++++++++-----
 drivers/gpu/drm/v3d/v3d_sched.c           | 10 +++++-----
 include/drm/gpu_scheduler.h               |  5 ++++-
 7 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
index 47ea46859618..532636ea20bc 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_fence.c
@@ -488,7 +488,7 @@ int amdgpu_fence_driver_init_ring(struct amdgpu_ring *ring,
 
 	r = drm_sched_init(&ring->sched, &amdgpu_sched_ops,
 			   num_hw_submission, amdgpu_job_hang_limit,
-			   timeout, sched_score, ring->name);
+			   timeout, NULL, sched_score, ring->name);
 	if (r) {
 		DRM_ERROR("Failed to create scheduler on ring %s.\n",
 			  ring->name);
diff --git a/drivers/gpu/drm/etnaviv/etnaviv_sched.c b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
index 19826e504efc..feb6da1b6ceb 100644
--- a/drivers/gpu/drm/etnaviv/etnaviv_sched.c
+++ b/drivers/gpu/drm/etnaviv/etnaviv_sched.c
@@ -190,7 +190,8 @@ int etnaviv_sched_init(struct etnaviv_gpu *gpu)
 
 	ret = drm_sched_init(&gpu->sched, &etnaviv_sched_ops,
 			     etnaviv_hw_jobs_limit, etnaviv_job_hang_limit,
-			     msecs_to_jiffies(500), NULL, dev_name(gpu->dev));
+			     msecs_to_jiffies(500), NULL, NULL,
+			     dev_name(gpu->dev));
 	if (ret)
 		return ret;
 
diff --git a/drivers/gpu/drm/lima/lima_sched.c b/drivers/gpu/drm/lima/lima_sched.c
index ecf3267334ff..dba8329937a3 100644
--- a/drivers/gpu/drm/lima/lima_sched.c
+++ b/drivers/gpu/drm/lima/lima_sched.c
@@ -508,7 +508,8 @@ int lima_sched_pipe_init(struct lima_sched_pipe *pipe, const char *name)
 	INIT_WORK(&pipe->recover_work, lima_sched_recover_work);
 
 	return drm_sched_init(&pipe->base, &lima_sched_ops, 1,
-			      lima_job_hang_limit, msecs_to_jiffies(timeout),
+			      lima_job_hang_limit,
+			      msecs_to_jiffies(timeout), NULL,
 			      NULL, name);
 }
 
diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
index 682f2161b999..8ff79fd49577 100644
--- a/drivers/gpu/drm/panfrost/panfrost_job.c
+++ b/drivers/gpu/drm/panfrost/panfrost_job.c
@@ -626,7 +626,8 @@ int panfrost_job_init(struct panfrost_device *pfdev)
 
 		ret = drm_sched_init(&js->queue[j].sched,
 				     &panfrost_sched_ops,
-				     1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
+				     1, 0,
+				     msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
 				     NULL, "pan_js");
 		if (ret) {
 			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);
diff --git a/drivers/gpu/drm/scheduler/sched_main.c b/drivers/gpu/drm/scheduler/sched_main.c
index c0a2f8f8d472..3e180f0d4305 100644
--- a/drivers/gpu/drm/scheduler/sched_main.c
+++ b/drivers/gpu/drm/scheduler/sched_main.c
@@ -232,7 +232,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
 {
 	if (sched->timeout != MAX_SCHEDULE_TIMEOUT &&
 	    !list_empty(&sched->pending_list))
-		schedule_delayed_work(&sched->work_tdr, sched->timeout);
+		queue_delayed_work(sched->timeout_wq, &sched->work_tdr, sched->timeout);
 }
 
 /**
@@ -244,7 +244,7 @@ static void drm_sched_start_timeout(struct drm_gpu_scheduler *sched)
  */
 void drm_sched_fault(struct drm_gpu_scheduler *sched)
 {
-	mod_delayed_work(system_wq, &sched->work_tdr, 0);
+	mod_delayed_work(sched->timeout_wq, &sched->work_tdr, 0);
 }
 EXPORT_SYMBOL(drm_sched_fault);
 
@@ -270,7 +270,7 @@ unsigned long drm_sched_suspend_timeout(struct drm_gpu_scheduler *sched)
 	 * Modify the timeout to an arbitrarily large value. This also prevents
 	 * the timeout to be restarted when new submissions arrive
 	 */
-	if (mod_delayed_work(system_wq, &sched->work_tdr, MAX_SCHEDULE_TIMEOUT)
+	if (mod_delayed_work(sched->timeout_wq, &sched->work_tdr, MAX_SCHEDULE_TIMEOUT)
 			&& time_after(sched_timeout, now))
 		return sched_timeout - now;
 	else
@@ -294,7 +294,7 @@ void drm_sched_resume_timeout(struct drm_gpu_scheduler *sched,
 	if (list_empty(&sched->pending_list))
 		cancel_delayed_work(&sched->work_tdr);
 	else
-		mod_delayed_work(system_wq, &sched->work_tdr, remaining);
+		mod_delayed_work(sched->timeout_wq, &sched->work_tdr, remaining);
 
 	spin_unlock(&sched->job_list_lock);
 }
@@ -837,6 +837,8 @@ static int drm_sched_main(void *param)
  * @hw_submission: number of hw submissions that can be in flight
  * @hang_limit: number of times to allow a job to hang before dropping it
  * @timeout: timeout value in jiffies for the scheduler
+ * @timeout_wq: workqueue to use for timeout work. If NULL, the system_wq is
+ *		used
  * @score: optional score atomic shared with other schedulers
  * @name: name used for debugging
  *
@@ -844,7 +846,8 @@ static int drm_sched_main(void *param)
  */
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
-		   unsigned hw_submission, unsigned hang_limit, long timeout,
+		   unsigned hw_submission, unsigned hang_limit,
+		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name)
 {
 	int i, ret;
@@ -852,6 +855,7 @@ int drm_sched_init(struct drm_gpu_scheduler *sched,
 	sched->hw_submission_limit = hw_submission;
 	sched->name = name;
 	sched->timeout = timeout;
+	sched->timeout_wq = timeout_wq ? : system_wq;
 	sched->hang_limit = hang_limit;
 	sched->score = score ? score : &sched->_score;
 	for (i = DRM_SCHED_PRIORITY_MIN; i < DRM_SCHED_PRIORITY_COUNT; i++)
diff --git a/drivers/gpu/drm/v3d/v3d_sched.c b/drivers/gpu/drm/v3d/v3d_sched.c
index 8992480c88fa..a39bdd5cfc4f 100644
--- a/drivers/gpu/drm/v3d/v3d_sched.c
+++ b/drivers/gpu/drm/v3d/v3d_sched.c
@@ -402,7 +402,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	ret = drm_sched_init(&v3d->queue[V3D_BIN].sched,
 			     &v3d_bin_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_bin");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create bin scheduler: %d.", ret);
@@ -412,7 +412,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	ret = drm_sched_init(&v3d->queue[V3D_RENDER].sched,
 			     &v3d_render_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_render");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create render scheduler: %d.",
@@ -424,7 +424,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 	ret = drm_sched_init(&v3d->queue[V3D_TFU].sched,
 			     &v3d_tfu_sched_ops,
 			     hw_jobs_limit, job_hang_limit,
-			     msecs_to_jiffies(hang_limit_ms),
+			     msecs_to_jiffies(hang_limit_ms), NULL,
 			     NULL, "v3d_tfu");
 	if (ret) {
 		dev_err(v3d->drm.dev, "Failed to create TFU scheduler: %d.",
@@ -437,7 +437,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		ret = drm_sched_init(&v3d->queue[V3D_CSD].sched,
 				     &v3d_csd_sched_ops,
 				     hw_jobs_limit, job_hang_limit,
-				     msecs_to_jiffies(hang_limit_ms),
+				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_csd");
 		if (ret) {
 			dev_err(v3d->drm.dev, "Failed to create CSD scheduler: %d.",
@@ -449,7 +449,7 @@ v3d_sched_init(struct v3d_dev *v3d)
 		ret = drm_sched_init(&v3d->queue[V3D_CACHE_CLEAN].sched,
 				     &v3d_cache_clean_sched_ops,
 				     hw_jobs_limit, job_hang_limit,
-				     msecs_to_jiffies(hang_limit_ms),
+				     msecs_to_jiffies(hang_limit_ms), NULL,
 				     NULL, "v3d_cache_clean");
 		if (ret) {
 			dev_err(v3d->drm.dev, "Failed to create CACHE_CLEAN scheduler: %d.",
diff --git a/include/drm/gpu_scheduler.h b/include/drm/gpu_scheduler.h
index 10225a0a35d0..d4cdc906709e 100644
--- a/include/drm/gpu_scheduler.h
+++ b/include/drm/gpu_scheduler.h
@@ -269,6 +269,7 @@ struct drm_sched_backend_ops {
  *                 finished.
  * @hw_rq_count: the number of jobs currently in the hardware queue.
  * @job_id_count: used to assign unique id to the each job.
+ * @timeout_wq: workqueue used to queue @work_tdr
  * @work_tdr: schedules a delayed call to @drm_sched_job_timedout after the
  *            timeout interval is over.
  * @thread: the kthread on which the scheduler which run.
@@ -293,6 +294,7 @@ struct drm_gpu_scheduler {
 	wait_queue_head_t		job_scheduled;
 	atomic_t			hw_rq_count;
 	atomic64_t			job_id_count;
+	struct workqueue_struct		*timeout_wq;
 	struct delayed_work		work_tdr;
 	struct task_struct		*thread;
 	struct list_head		pending_list;
@@ -306,7 +308,8 @@ struct drm_gpu_scheduler {
 
 int drm_sched_init(struct drm_gpu_scheduler *sched,
 		   const struct drm_sched_backend_ops *ops,
-		   uint32_t hw_submission, unsigned hang_limit, long timeout,
+		   uint32_t hw_submission, unsigned hang_limit,
+		   long timeout, struct workqueue_struct *timeout_wq,
 		   atomic_t *score, const char *name);
 
 void drm_sched_fini(struct drm_gpu_scheduler *sched);
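
For completeness, the driver-side half (wired up in the panfrost
patches later in this series) then boils down to creating an ordered
workqueue and handing it to drm_sched_init() as the new timeout_wq
argument. A rough sketch of that wiring (exact parameters and names
differ in the actual patches):

	pfdev->reset.wq = alloc_ordered_workqueue("panfrost-reset", 0);
	if (!pfdev->reset.wq)
		return -ENOMEM;

	/* The driver's own reset work goes on the same ordered workqueue,
	 * so it is serialized with the scheduler timeout handlers.
	 */
	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);

	ret = drm_sched_init(&js->queue[j].sched, &panfrost_sched_ops,
			     1, 0, msecs_to_jiffies(JOB_TIMEOUT_MS),
			     pfdev->reset.wq,
			     NULL, "pan_js");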

^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace
  2021-06-25 14:21     ` Boris Brezillon
@ 2021-06-25 15:32       ` Steven Price
  2021-06-25 15:40         ` Boris Brezillon
  0 siblings, 1 reply; 39+ messages in thread
From: Steven Price @ 2021-06-25 15:32 UTC (permalink / raw)
  To: Boris Brezillon, Alyssa Rosenzweig
  Cc: Tomeu Vizoso, dri-devel, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 15:21, Boris Brezillon wrote:
> On Fri, 25 Jun 2021 09:42:08 -0400
> Alyssa Rosenzweig <alyssa@collabora.com> wrote:
> 
>> I'm not convinced. Right now most of our UABI is pleasantly
>> GPU-agnostic. With this suddenly there's divergence between Midgard and
>> Bifrost uABI.
> 
> Hm, I don't see why. I mean the exception types seem to be the same,
> > there are just some that are not used on Midgard and some that are not
> used on Bifrost. Are there any collisions I didn't notice?

I think the real question is: why are we exporting them if user space
doesn't want them ;) Should this be in an internal header file at least
until someone actually requests they be available to user space?

>> With that drawback in mind, could you explain the benefit?
> 
> Well, I thought having these definitions in a central place would be a
> good thing given they're not expected to change even if they might
> be per-GPU. I don't know if that changes with CSF, maybe the exception
> codes are no longer set in stone and can change with FW update...

CSF certainly means the firmware controls a lot more of this sort of
thing but AFAIK the exception types still fit in the same scheme.

Steve

>>
>> On Fri, Jun 25, 2021 at 03:33:17PM +0200, Boris Brezillon wrote:
>>> Job headers contain an exception type field which might be read and
>>> converted to a human readable string by tracing tools. Let's expose
>>> the exception type as an enum so we share the same definition.
>>>
>>> v3:
>>> * Add missing values
>>>
>>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
>>> ---
>>>  include/uapi/drm/panfrost_drm.h | 71 +++++++++++++++++++++++++++++++++
>>>  1 file changed, 71 insertions(+)
>>>
>>> diff --git a/include/uapi/drm/panfrost_drm.h b/include/uapi/drm/panfrost_drm.h
>>> index ec19db1eead8..899cd6d952d4 100644
>>> --- a/include/uapi/drm/panfrost_drm.h
>>> +++ b/include/uapi/drm/panfrost_drm.h
>>> @@ -223,6 +223,77 @@ struct drm_panfrost_madvise {
>>>  	__u32 retained;       /* out, whether backing store still exists */
>>>  };
>>>  
>>> +/* The exception types */
>>> +
>>> +enum drm_panfrost_exception_type {
>>> +	DRM_PANFROST_EXCEPTION_OK = 0x00,
>>> +	DRM_PANFROST_EXCEPTION_DONE = 0x01,
>>> +	DRM_PANFROST_EXCEPTION_INTERRUPTED = 0x02,
>>> +	DRM_PANFROST_EXCEPTION_STOPPED = 0x03,
>>> +	DRM_PANFROST_EXCEPTION_TERMINATED = 0x04,
>>> +	DRM_PANFROST_EXCEPTION_KABOOM = 0x05,
>>> +	DRM_PANFROST_EXCEPTION_EUREKA = 0x06,
>>> +	DRM_PANFROST_EXCEPTION_ACTIVE = 0x08,
>>> +	DRM_PANFROST_EXCEPTION_JOB_CONFIG_FAULT = 0x40,
>>> +	DRM_PANFROST_EXCEPTION_JOB_POWER_FAULT = 0x41,
>>> +	DRM_PANFROST_EXCEPTION_JOB_READ_FAULT = 0x42,
>>> +	DRM_PANFROST_EXCEPTION_JOB_WRITE_FAULT = 0x43,
>>> +	DRM_PANFROST_EXCEPTION_JOB_AFFINITY_FAULT = 0x44,
>>> +	DRM_PANFROST_EXCEPTION_JOB_BUS_FAULT = 0x48,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_INVALID_PC = 0x50,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_INVALID_ENC = 0x51,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_TYPE_MISMATCH = 0x52,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_OPERAND_FAULT = 0x53,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_TLS_FAULT = 0x54,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_BARRIER_FAULT = 0x55,
>>> +	DRM_PANFROST_EXCEPTION_INSTR_ALIGN_FAULT = 0x56,
>>> +	DRM_PANFROST_EXCEPTION_DATA_INVALID_FAULT = 0x58,
>>> +	DRM_PANFROST_EXCEPTION_TILE_RANGE_FAULT = 0x59,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_RANGE_FAULT = 0x5a,
>>> +	DRM_PANFROST_EXCEPTION_IMPRECISE_FAULT = 0x5b,
>>> +	DRM_PANFROST_EXCEPTION_OOM = 0x60,
>>> +	DRM_PANFROST_EXCEPTION_OOM_AFBC = 0x61,
>>> +	DRM_PANFROST_EXCEPTION_UNKNOWN = 0x7f,
>>> +	DRM_PANFROST_EXCEPTION_DELAYED_BUS_FAULT = 0x80,
>>> +	DRM_PANFROST_EXCEPTION_GPU_SHAREABILITY_FAULT = 0x88,
>>> +	DRM_PANFROST_EXCEPTION_SYS_SHAREABILITY_FAULT = 0x89,
>>> +	DRM_PANFROST_EXCEPTION_GPU_CACHEABILITY_FAULT = 0x8a,
>>> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_0 = 0xc0,
>>> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_1 = 0xc1,
>>> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_2 = 0xc2,
>>> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_3 = 0xc3,
>>> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_4 = 0xc4,
>>> +	DRM_PANFROST_EXCEPTION_TRANSLATION_FAULT_IDENTITY = 0xc7,
>>> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_0 = 0xc8,
>>> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_1 = 0xc9,
>>> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_2 = 0xca,
>>> +	DRM_PANFROST_EXCEPTION_PERM_FAULT_3 = 0xcb,
>>> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_0 = 0xd0,
>>> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_1 = 0xd1,
>>> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_2 = 0xd2,
>>> +	DRM_PANFROST_EXCEPTION_TRANSTAB_BUS_FAULT_3 = 0xd3,
>>> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_0 = 0xd8,
>>> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_1 = 0xd9,
>>> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_2 = 0xda,
>>> +	DRM_PANFROST_EXCEPTION_ACCESS_FLAG_3 = 0xdb,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN0 = 0xe0,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN1 = 0xe1,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN2 = 0xe2,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_IN3 = 0xe3,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT0 = 0xe4,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT1 = 0xe5,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT2 = 0xe6,
>>> +	DRM_PANFROST_EXCEPTION_ADDR_SIZE_FAULT_OUT3 = 0xe7,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_0 = 0xe8,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_1 = 0xe9,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_2 = 0xea,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_FAULT_3 = 0xeb,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_0 = 0xec,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_1 = 0xed,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_2 = 0xee,
>>> +	DRM_PANFROST_EXCEPTION_MEM_ATTR_NONCACHE_3 = 0xef,
>>> +};
>>> +
>>>  #if defined(__cplusplus)
>>>  }
>>>  #endif
>>> -- 
>>> 2.31.1
>>>   
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table
  2021-06-25 13:33 ` [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table Boris Brezillon
  2021-06-25 13:41   ` Alyssa Rosenzweig
@ 2021-06-25 15:35   ` Steven Price
  1 sibling, 0 replies; 39+ messages in thread
From: Steven Price @ 2021-06-25 15:35 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 14:33, Boris Brezillon wrote:
> Do the exception -> string translation using a table. This way we get
> rid of those magic numbers and can easily add new fields if we need
> to attach extra information to exception types.
> 
> v3:
> * Drop the error field
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panfrost/panfrost_device.c | 130 +++++++++++++--------
>  1 file changed, 83 insertions(+), 47 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.c b/drivers/gpu/drm/panfrost/panfrost_device.c
> index bce6b0aff05e..736854542b05 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.c
> @@ -292,55 +292,91 @@ void panfrost_device_fini(struct panfrost_device *pfdev)
>  	panfrost_clk_fini(pfdev);
>  }
>  
> -const char *panfrost_exception_name(u32 exception_code)
> -{
> -	switch (exception_code) {
> -		/* Non-Fault Status code */
> -	case 0x00: return "NOT_STARTED/IDLE/OK";
> -	case 0x01: return "DONE";
> -	case 0x02: return "INTERRUPTED";
> -	case 0x03: return "STOPPED";
> -	case 0x04: return "TERMINATED";
> -	case 0x08: return "ACTIVE";
> -		/* Job exceptions */
> -	case 0x40: return "JOB_CONFIG_FAULT";
> -	case 0x41: return "JOB_POWER_FAULT";
> -	case 0x42: return "JOB_READ_FAULT";
> -	case 0x43: return "JOB_WRITE_FAULT";
> -	case 0x44: return "JOB_AFFINITY_FAULT";
> -	case 0x48: return "JOB_BUS_FAULT";
> -	case 0x50: return "INSTR_INVALID_PC";
> -	case 0x51: return "INSTR_INVALID_ENC";
> -	case 0x52: return "INSTR_TYPE_MISMATCH";
> -	case 0x53: return "INSTR_OPERAND_FAULT";
> -	case 0x54: return "INSTR_TLS_FAULT";
> -	case 0x55: return "INSTR_BARRIER_FAULT";
> -	case 0x56: return "INSTR_ALIGN_FAULT";
> -	case 0x58: return "DATA_INVALID_FAULT";
> -	case 0x59: return "TILE_RANGE_FAULT";
> -	case 0x5A: return "ADDR_RANGE_FAULT";
> -	case 0x60: return "OUT_OF_MEMORY";
> -		/* GPU exceptions */
> -	case 0x80: return "DELAYED_BUS_FAULT";
> -	case 0x88: return "SHAREABILITY_FAULT";
> -		/* MMU exceptions */
> -	case 0xC1: return "TRANSLATION_FAULT_LEVEL1";
> -	case 0xC2: return "TRANSLATION_FAULT_LEVEL2";
> -	case 0xC3: return "TRANSLATION_FAULT_LEVEL3";
> -	case 0xC4: return "TRANSLATION_FAULT_LEVEL4";
> -	case 0xC8: return "PERMISSION_FAULT";
> -	case 0xC9 ... 0xCF: return "PERMISSION_FAULT";
> -	case 0xD1: return "TRANSTAB_BUS_FAULT_LEVEL1";
> -	case 0xD2: return "TRANSTAB_BUS_FAULT_LEVEL2";
> -	case 0xD3: return "TRANSTAB_BUS_FAULT_LEVEL3";
> -	case 0xD4: return "TRANSTAB_BUS_FAULT_LEVEL4";
> -	case 0xD8: return "ACCESS_FLAG";
> -	case 0xD9 ... 0xDF: return "ACCESS_FLAG";
> -	case 0xE0 ... 0xE7: return "ADDRESS_SIZE_FAULT";
> -	case 0xE8 ... 0xEF: return "MEMORY_ATTRIBUTES_FAULT";
> +#define PANFROST_EXCEPTION(id) \
> +	[DRM_PANFROST_EXCEPTION_ ## id] = { \
> +		.name = #id, \
>  	}
>  
> -	return "UNKNOWN";
> +struct panfrost_exception_info {
> +	const char *name;
> +};
> +
> +static const struct panfrost_exception_info panfrost_exception_infos[] = {
> +	PANFROST_EXCEPTION(OK),
> +	PANFROST_EXCEPTION(DONE),
> +	PANFROST_EXCEPTION(INTERRUPTED),
> +	PANFROST_EXCEPTION(STOPPED),
> +	PANFROST_EXCEPTION(TERMINATED),
> +	PANFROST_EXCEPTION(KABOOM),
> +	PANFROST_EXCEPTION(EUREKA),
> +	PANFROST_EXCEPTION(ACTIVE),
> +	PANFROST_EXCEPTION(JOB_CONFIG_FAULT),
> +	PANFROST_EXCEPTION(JOB_POWER_FAULT),
> +	PANFROST_EXCEPTION(JOB_READ_FAULT),
> +	PANFROST_EXCEPTION(JOB_WRITE_FAULT),
> +	PANFROST_EXCEPTION(JOB_AFFINITY_FAULT),
> +	PANFROST_EXCEPTION(JOB_BUS_FAULT),
> +	PANFROST_EXCEPTION(INSTR_INVALID_PC),
> +	PANFROST_EXCEPTION(INSTR_INVALID_ENC),
> +	PANFROST_EXCEPTION(INSTR_TYPE_MISMATCH),
> +	PANFROST_EXCEPTION(INSTR_OPERAND_FAULT),
> +	PANFROST_EXCEPTION(INSTR_TLS_FAULT),
> +	PANFROST_EXCEPTION(INSTR_BARRIER_FAULT),
> +	PANFROST_EXCEPTION(INSTR_ALIGN_FAULT),
> +	PANFROST_EXCEPTION(DATA_INVALID_FAULT),
> +	PANFROST_EXCEPTION(TILE_RANGE_FAULT),
> +	PANFROST_EXCEPTION(ADDR_RANGE_FAULT),
> +	PANFROST_EXCEPTION(IMPRECISE_FAULT),
> +	PANFROST_EXCEPTION(OOM),
> +	PANFROST_EXCEPTION(OOM_AFBC),
> +	PANFROST_EXCEPTION(UNKNOWN),
> +	PANFROST_EXCEPTION(DELAYED_BUS_FAULT),
> +	PANFROST_EXCEPTION(GPU_SHAREABILITY_FAULT),
> +	PANFROST_EXCEPTION(SYS_SHAREABILITY_FAULT),
> +	PANFROST_EXCEPTION(GPU_CACHEABILITY_FAULT),
> +	PANFROST_EXCEPTION(TRANSLATION_FAULT_0),
> +	PANFROST_EXCEPTION(TRANSLATION_FAULT_1),
> +	PANFROST_EXCEPTION(TRANSLATION_FAULT_2),
> +	PANFROST_EXCEPTION(TRANSLATION_FAULT_3),
> +	PANFROST_EXCEPTION(TRANSLATION_FAULT_4),
> +	PANFROST_EXCEPTION(TRANSLATION_FAULT_IDENTITY),
> +	PANFROST_EXCEPTION(PERM_FAULT_0),
> +	PANFROST_EXCEPTION(PERM_FAULT_1),
> +	PANFROST_EXCEPTION(PERM_FAULT_2),
> +	PANFROST_EXCEPTION(PERM_FAULT_3),
> +	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_0),
> +	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_1),
> +	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_2),
> +	PANFROST_EXCEPTION(TRANSTAB_BUS_FAULT_3),
> +	PANFROST_EXCEPTION(ACCESS_FLAG_0),
> +	PANFROST_EXCEPTION(ACCESS_FLAG_1),
> +	PANFROST_EXCEPTION(ACCESS_FLAG_2),
> +	PANFROST_EXCEPTION(ACCESS_FLAG_3),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN0),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN1),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN2),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_IN3),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT0),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT1),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT2),
> +	PANFROST_EXCEPTION(ADDR_SIZE_FAULT_OUT3),
> +	PANFROST_EXCEPTION(MEM_ATTR_FAULT_0),
> +	PANFROST_EXCEPTION(MEM_ATTR_FAULT_1),
> +	PANFROST_EXCEPTION(MEM_ATTR_FAULT_2),
> +	PANFROST_EXCEPTION(MEM_ATTR_FAULT_3),
> +	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_0),
> +	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_1),
> +	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_2),
> +	PANFROST_EXCEPTION(MEM_ATTR_NONCACHE_3),
> +};
> +
> +const char *panfrost_exception_name(u32 exception_code)
> +{
> +	if (WARN_ON(exception_code >= ARRAY_SIZE(panfrost_exception_infos) ||
> +		    !panfrost_exception_infos[exception_code].name))
> +		return "Unknown exception type";
> +
> +	return panfrost_exception_infos[exception_code].name;
>  }
>  
>  void panfrost_device_reset(struct panfrost_device *pfdev)
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace
  2021-06-25 15:32       ` Steven Price
@ 2021-06-25 15:40         ` Boris Brezillon
  0 siblings, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 15:40 UTC (permalink / raw)
  To: Steven Price
  Cc: Tomeu Vizoso, dri-devel, Alyssa Rosenzweig, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 16:32:27 +0100
Steven Price <steven.price@arm.com> wrote:

> On 25/06/2021 15:21, Boris Brezillon wrote:
> > On Fri, 25 Jun 2021 09:42:08 -0400
> > Alyssa Rosenzweig <alyssa@collabora.com> wrote:
> >   
> >> I'm not convinced. Right now most of our UABI is pleasantly
> >> GPU-agnostic. With this suddenly there's divergence between Midgard and
> >> Bifrost uABI.  
> > 
> > Hm, I don't see why. I mean the exception types seem to be the same,
> > there are just some that are not used on Midgard and some that are not
> > used on Bifrost. Are there any collisions I didn't notice?  
> 
> I think the real question is: why are we exporting them if user space
> doesn't want them ;) Should this be in an internal header file at least
> until someone actually requests they be available to user space?

Alright, I'll move it to panfrost_device.h (or panfrost_regs.h) then.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts
  2021-06-25 13:33 ` [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts Boris Brezillon
  2021-06-25 13:47   ` Alyssa Rosenzweig
@ 2021-06-25 15:40   ` Steven Price
  1 sibling, 0 replies; 39+ messages in thread
From: Steven Price @ 2021-06-25 15:40 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 14:33, Boris Brezillon wrote:
> This should avoid switching to interrupt context when the GPU is under
> heavy use.
> 
> v3:
> * Don't take the job_lock in panfrost_job_handle_irq()
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 53 ++++++++++++++++++-------
>  1 file changed, 38 insertions(+), 15 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index be8f68f63974..e0c479e67304 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -470,19 +470,12 @@ static const struct drm_sched_backend_ops panfrost_sched_ops = {
>  	.free_job = panfrost_job_free
>  };
>  
> -static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> +static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  {
> -	struct panfrost_device *pfdev = data;
> -	u32 status = job_read(pfdev, JOB_INT_STAT);
>  	int j;
>  
>  	dev_dbg(pfdev->dev, "jobslot irq status=%x\n", status);
>  
> -	if (!status)
> -		return IRQ_NONE;
> -
> -	pm_runtime_mark_last_busy(pfdev->dev);
> -
>  	for (j = 0; status; j++) {
>  		u32 mask = MK_JS_MASK(j);
>  
> @@ -519,7 +512,6 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
>  		if (status & JOB_INT_MASK_DONE(j)) {
>  			struct panfrost_job *job;
>  
> -			spin_lock(&pfdev->js->job_lock);
>  			job = pfdev->jobs[j];
>  			/* Only NULL if job timeout occurred */
>  			if (job) {
> @@ -531,21 +523,49 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
>  				dma_fence_signal_locked(job->done_fence);
>  				pm_runtime_put_autosuspend(pfdev->dev);
>  			}
> -			spin_unlock(&pfdev->js->job_lock);
>  		}
>  
>  		status &= ~mask;
>  	}
> +}
>  
> +static irqreturn_t panfrost_job_irq_handler_thread(int irq, void *data)
> +{
> +	struct panfrost_device *pfdev = data;
> +	u32 status = job_read(pfdev, JOB_INT_RAWSTAT);
> +
> +	while (status) {
> +		pm_runtime_mark_last_busy(pfdev->dev);
> +
> +		spin_lock(&pfdev->js->job_lock);
> +		panfrost_job_handle_irq(pfdev, status);
> +		spin_unlock(&pfdev->js->job_lock);
> +		status = job_read(pfdev, JOB_INT_RAWSTAT);
> +	}
> +
> +	job_write(pfdev, JOB_INT_MASK,
> +		  GENMASK(16 + NUM_JOB_SLOTS - 1, 16) |
> +		  GENMASK(NUM_JOB_SLOTS - 1, 0));
>  	return IRQ_HANDLED;
>  }
>  
> +static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
> +{
> +	struct panfrost_device *pfdev = data;
> +	u32 status = job_read(pfdev, JOB_INT_STAT);
> +
> +	if (!status)
> +		return IRQ_NONE;
> +
> +	job_write(pfdev, JOB_INT_MASK, 0);
> +	return IRQ_WAKE_THREAD;
> +}
> +
>  static void panfrost_reset(struct work_struct *work)
>  {
>  	struct panfrost_device *pfdev = container_of(work,
>  						     struct panfrost_device,
>  						     reset.work);
> -	unsigned long flags;
>  	unsigned int i;
>  	bool cookie;
>  
> @@ -575,7 +595,7 @@ static void panfrost_reset(struct work_struct *work)
>  	/* All timers have been stopped, we can safely reset the pending state. */
>  	atomic_set(&pfdev->reset.pending, 0);
>  
> -	spin_lock_irqsave(&pfdev->js->job_lock, flags);
> +	spin_lock(&pfdev->js->job_lock);
>  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>  		if (pfdev->jobs[i]) {
>  			pm_runtime_put_noidle(pfdev->dev);
> @@ -583,7 +603,7 @@ static void panfrost_reset(struct work_struct *work)
>  			pfdev->jobs[i] = NULL;
>  		}
>  	}
> -	spin_unlock_irqrestore(&pfdev->js->job_lock, flags);
> +	spin_unlock(&pfdev->js->job_lock);
>  
>  	panfrost_device_reset(pfdev);
>  
> @@ -610,8 +630,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  	if (irq <= 0)
>  		return -ENODEV;
>  
> -	ret = devm_request_irq(pfdev->dev, irq, panfrost_job_irq_handler,
> -			       IRQF_SHARED, KBUILD_MODNAME "-job", pfdev);
> +	ret = devm_request_threaded_irq(pfdev->dev, irq,
> +					panfrost_job_irq_handler,
> +					panfrost_job_irq_handler_thread,
> +					IRQF_SHARED, KBUILD_MODNAME "-job",
> +					pfdev);
>  	if (ret) {
>  		dev_err(pfdev->dev, "failed to request job irq");
>  		return ret;
> 


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic
  2021-06-25 13:33 ` [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic Boris Brezillon
@ 2021-06-25 15:42   ` Steven Price
  2021-06-25 16:22   ` Boris Brezillon
  1 sibling, 0 replies; 39+ messages in thread
From: Steven Price @ 2021-06-25 15:42 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Daniel Vetter, Rob Herring, Alyssa Rosenzweig,
	Robin Murphy

On 25/06/2021 14:33, Boris Brezillon wrote:
> Now that we can pass our own workqueue to drm_sched_init(), we can use

Except that part has somehow slipped through to patch 15:

> @@ -633,8 +849,9 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  
>  		ret = drm_sched_init(&js->queue[j].sched,
>  				     &panfrost_sched_ops,
> -				     1, 0,
> -				     msecs_to_jiffies(JOB_TIMEOUT_MS), NULL,
> +				     nslots, 0,
> +				     msecs_to_jiffies(JOB_TIMEOUT_MS),
> +				     pfdev->reset.wq,
>  				     NULL, "pan_js");
>  		if (ret) {
>  			dev_err(pfdev->dev, "Failed to create scheduler: %d.", ret);

Steve

> an ordered workqueue for both the scheduler timeout tdr and our own
> reset work (which we use when the reset is not caused by a fault/timeout
> on a specific job, like when we have AS_ACTIVE bit stuck). This
> guarantees that the timeout handlers and reset handler can't run
> concurrently which drastically simplifies the locking.
> 
> Suggested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_device.h |   6 +-
>  drivers/gpu/drm/panfrost/panfrost_job.c    | 185 ++++++++-------------
>  2 files changed, 71 insertions(+), 120 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> index 6024eaf34ba0..bfe32907ba6b 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> @@ -108,6 +108,7 @@ struct panfrost_device {
>  	struct mutex sched_lock;
>  
>  	struct {
> +		struct workqueue_struct *wq;
>  		struct work_struct work;
>  		atomic_t pending;
>  	} reset;
> @@ -177,9 +178,8 @@ const char *panfrost_exception_name(u32 exception_code);
>  static inline void
>  panfrost_device_schedule_reset(struct panfrost_device *pfdev)
>  {
> -	/* Schedule a reset if there's no reset in progress. */
> -	if (!atomic_xchg(&pfdev->reset.pending, 1))
> -		schedule_work(&pfdev->reset.work);
> +	atomic_set(&pfdev->reset.pending, 1);
> +	queue_work(pfdev->reset.wq, &pfdev->reset.work);
>  }
>  
>  #endif
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index e0c479e67304..88d34fd781e8 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -25,17 +25,8 @@
>  #define job_write(dev, reg, data) writel(data, dev->iomem + (reg))
>  #define job_read(dev, reg) readl(dev->iomem + (reg))
>  
> -enum panfrost_queue_status {
> -	PANFROST_QUEUE_STATUS_ACTIVE,
> -	PANFROST_QUEUE_STATUS_STOPPED,
> -	PANFROST_QUEUE_STATUS_STARTING,
> -	PANFROST_QUEUE_STATUS_FAULT_PENDING,
> -};
> -
>  struct panfrost_queue_state {
>  	struct drm_gpu_scheduler sched;
> -	atomic_t status;
> -	struct mutex lock;
>  	u64 fence_context;
>  	u64 emit_seqno;
>  };
> @@ -379,57 +370,73 @@ void panfrost_job_enable_interrupts(struct panfrost_device *pfdev)
>  	job_write(pfdev, JOB_INT_MASK, irq_mask);
>  }
>  
> -static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
> -				    struct drm_sched_job *bad)
> +static void panfrost_reset(struct panfrost_device *pfdev,
> +			   struct drm_sched_job *bad)
>  {
> -	enum panfrost_queue_status old_status;
> -	bool stopped = false;
> +	unsigned int i;
> +	bool cookie;
>  
> -	mutex_lock(&queue->lock);
> -	old_status = atomic_xchg(&queue->status,
> -				 PANFROST_QUEUE_STATUS_STOPPED);
> -	if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
> -		goto out;
> +	if (WARN_ON(!atomic_read(&pfdev->reset.pending)))
> +		return;
> +
> +	/* Stop the schedulers.
> +	 *
> +	 * FIXME: We temporarily get out of the dma_fence_signalling section
> +	 * because the cleanup path generate lockdep splats when taking locks
> +	 * to release job resources. We should rework the code to follow this
> +	 * pattern:
> +	 *
> +	 *	try_lock
> +	 *	if (locked)
> +	 *		release
> +	 *	else
> +	 *		schedule_work_to_release_later
> +	 */
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		drm_sched_stop(&pfdev->js->queue[i].sched, bad);
> +
> +	cookie = dma_fence_begin_signalling();
>  
> -	WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
> -	drm_sched_stop(&queue->sched, bad);
>  	if (bad)
>  		drm_sched_increase_karma(bad);
>  
> -	stopped = true;
> +	spin_lock(&pfdev->js->job_lock);
> +	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> +		if (pfdev->jobs[i]) {
> +			pm_runtime_put_noidle(pfdev->dev);
> +			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
> +			pfdev->jobs[i] = NULL;
> +		}
> +	}
> +	spin_unlock(&pfdev->js->job_lock);
>  
> -	/*
> -	 * Set the timeout to max so the timer doesn't get started
> -	 * when we return from the timeout handler (restored in
> -	 * panfrost_scheduler_start()).
> +	panfrost_device_reset(pfdev);
> +
> +	/* GPU has been reset, we can cancel timeout/fault work that may have
> +	 * been queued in the meantime and clear the reset pending bit.
>  	 */
> -	queue->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> +	atomic_set(&pfdev->reset.pending, 0);
> +	cancel_work_sync(&pfdev->reset.work);
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		cancel_delayed_work(&pfdev->js->queue[i].sched.work_tdr);
>  
> -out:
> -	mutex_unlock(&queue->lock);
>  
> -	return stopped;
> -}
> +	/* Now resubmit jobs that were previously queued but didn't have a
> +	 * chance to finish.
> +	 * FIXME: We temporarily get out of the DMA fence signalling section
> +	 * while resubmitting jobs because the job submission logic will
> +	 * allocate memory with the GFP_KERNEL flag which can trigger memory
> +	 * reclaim and exposes a lock ordering issue.
> +	 */
> +	dma_fence_end_signalling(cookie);
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
> +	cookie = dma_fence_begin_signalling();
>  
> -static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
> -{
> -	enum panfrost_queue_status old_status;
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		drm_sched_start(&pfdev->js->queue[i].sched, true);
>  
> -	mutex_lock(&queue->lock);
> -	old_status = atomic_xchg(&queue->status,
> -				 PANFROST_QUEUE_STATUS_STARTING);
> -	WARN_ON(old_status != PANFROST_QUEUE_STATUS_STOPPED);
> -
> -	/* Restore the original timeout before starting the scheduler. */
> -	queue->sched.timeout = msecs_to_jiffies(JOB_TIMEOUT_MS);
> -	drm_sched_resubmit_jobs(&queue->sched);
> -	drm_sched_start(&queue->sched, true);
> -	old_status = atomic_xchg(&queue->status,
> -				 PANFROST_QUEUE_STATUS_ACTIVE);
> -	if (old_status == PANFROST_QUEUE_STATUS_FAULT_PENDING)
> -		drm_sched_fault(&queue->sched);
> -
> -	mutex_unlock(&queue->lock);
> +	dma_fence_end_signalling(cookie);
>  }
>  
>  static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
> @@ -454,11 +461,8 @@ static enum drm_gpu_sched_stat panfrost_job_timedout(struct drm_sched_job
>  		job_read(pfdev, JS_TAIL_LO(js)),
>  		sched_job);
>  
> -	/* Scheduler is already stopped, nothing to do. */
> -	if (!panfrost_scheduler_stop(&pfdev->js->queue[js], sched_job))
> -		return DRM_GPU_SCHED_STAT_NOMINAL;
> -
> -	panfrost_device_schedule_reset(pfdev);
> +	atomic_set(&pfdev->reset.pending, 1);
> +	panfrost_reset(pfdev, sched_job);
>  
>  	return DRM_GPU_SCHED_STAT_NOMINAL;
>  }
> @@ -485,8 +489,6 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  		job_write(pfdev, JOB_INT_CLEAR, mask);
>  
>  		if (status & JOB_INT_MASK_ERR(j)) {
> -			enum panfrost_queue_status old_status;
> -
>  			job_write(pfdev, JS_COMMAND_NEXT(j), JS_COMMAND_NOP);
>  
>  			dev_err(pfdev->dev, "js fault, js=%d, status=%s, head=0x%x, tail=0x%x",
> @@ -494,19 +496,7 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  				panfrost_exception_name(job_read(pfdev, JS_STATUS(j))),
>  				job_read(pfdev, JS_HEAD_LO(j)),
>  				job_read(pfdev, JS_TAIL_LO(j)));
> -
> -			/*
> -			 * When the queue is being restarted we don't report
> -			 * faults directly to avoid races between the timeout
> -			 * and reset handlers. panfrost_scheduler_start() will
> -			 * call drm_sched_fault() after the queue has been
> -			 * started if status == FAULT_PENDING.
> -			 */
> -			old_status = atomic_cmpxchg(&pfdev->js->queue[j].status,
> -						    PANFROST_QUEUE_STATUS_STARTING,
> -						    PANFROST_QUEUE_STATUS_FAULT_PENDING);
> -			if (old_status == PANFROST_QUEUE_STATUS_ACTIVE)
> -				drm_sched_fault(&pfdev->js->queue[j].sched);
> +			drm_sched_fault(&pfdev->js->queue[j].sched);
>  		}
>  
>  		if (status & JOB_INT_MASK_DONE(j)) {
> @@ -561,56 +551,13 @@ static irqreturn_t panfrost_job_irq_handler(int irq, void *data)
>  	return IRQ_WAKE_THREAD;
>  }
>  
> -static void panfrost_reset(struct work_struct *work)
> +static void panfrost_reset_work(struct work_struct *work)
>  {
>  	struct panfrost_device *pfdev = container_of(work,
>  						     struct panfrost_device,
>  						     reset.work);
> -	unsigned int i;
> -	bool cookie;
>  
> -	cookie = dma_fence_begin_signalling();
> -	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> -		/*
> -		 * We want pending timeouts to be handled before we attempt
> -		 * to stop the scheduler. If we don't do that and the timeout
> -		 * handler is in flight, it might have removed the bad job
> -		 * from the list, and we'll lose this job if the reset handler
> -		 * enters the critical section in panfrost_scheduler_stop()
> -		 * before the timeout handler.
> -		 *
> -		 * Timeout is set to MAX_SCHEDULE_TIMEOUT - 1 because we need
> -		 * something big enough to make sure the timer will not expire
> -		 * before we manage to stop the scheduler, but we can't use
> -		 * MAX_SCHEDULE_TIMEOUT because drm_sched_get_cleanup_job()
> -		 * considers that as 'timer is not running' and will dequeue
> -		 * the job without making sure the timeout handler is not
> -		 * running.
> -		 */
> -		pfdev->js->queue[i].sched.timeout = MAX_SCHEDULE_TIMEOUT - 1;
> -		cancel_delayed_work_sync(&pfdev->js->queue[i].sched.work_tdr);
> -		panfrost_scheduler_stop(&pfdev->js->queue[i], NULL);
> -	}
> -
> -	/* All timers have been stopped, we can safely reset the pending state. */
> -	atomic_set(&pfdev->reset.pending, 0);
> -
> -	spin_lock(&pfdev->js->job_lock);
> -	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> -		if (pfdev->jobs[i]) {
> -			pm_runtime_put_noidle(pfdev->dev);
> -			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
> -			pfdev->jobs[i] = NULL;
> -		}
> -	}
> -	spin_unlock(&pfdev->js->job_lock);
> -
> -	panfrost_device_reset(pfdev);
> -
> -	for (i = 0; i < NUM_JOB_SLOTS; i++)
> -		panfrost_scheduler_start(&pfdev->js->queue[i]);
> -
> -	dma_fence_end_signalling(cookie);
> +	panfrost_reset(pfdev, NULL);
>  }
>  
>  int panfrost_job_init(struct panfrost_device *pfdev)
> @@ -618,7 +565,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  	struct panfrost_job_slot *js;
>  	int ret, j, irq;
>  
> -	INIT_WORK(&pfdev->reset.work, panfrost_reset);
> +	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
>  
>  	pfdev->js = js = devm_kzalloc(pfdev->dev, sizeof(*js), GFP_KERNEL);
>  	if (!js)
> @@ -640,9 +587,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  		return ret;
>  	}
>  
> -	for (j = 0; j < NUM_JOB_SLOTS; j++) {
> -		mutex_init(&js->queue[j].lock);
> +	pfdev->reset.wq = alloc_ordered_workqueue("panfrost-reset", 0);
> +	if (!pfdev->reset.wq)
> +		return -ENOMEM;
>  
> +	for (j = 0; j < NUM_JOB_SLOTS; j++) {
>  		js->queue[j].fence_context = dma_fence_context_alloc(1);
>  
>  		ret = drm_sched_init(&js->queue[j].sched,
> @@ -664,6 +613,7 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  	for (j--; j >= 0; j--)
>  		drm_sched_fini(&js->queue[j].sched);
>  
> +	destroy_workqueue(pfdev->reset.wq);
>  	return ret;
>  }
>  
> @@ -676,9 +626,10 @@ void panfrost_job_fini(struct panfrost_device *pfdev)
>  
>  	for (j = 0; j < NUM_JOB_SLOTS; j++) {
>  		drm_sched_fini(&js->queue[j].sched);
> -		mutex_destroy(&js->queue[j].lock);
>  	}
>  
> +	cancel_work_sync(&pfdev->reset.work);
> +	destroy_workqueue(pfdev->reset.wq);
>  }
>  
>  int panfrost_job_open(struct panfrost_file_priv *panfrost_priv)
> 



* Re: [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting
  2021-06-25 13:33 ` [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting Boris Brezillon
@ 2021-06-25 15:55   ` Steven Price
  2021-06-25 16:02     ` Boris Brezillon
  0 siblings, 1 reply; 39+ messages in thread
From: Steven Price @ 2021-06-25 15:55 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 14:33, Boris Brezillon wrote:
> This is not yet needed because we let active jobs be killed during
> the reset and we don't really bother making sure they can be restarted.
> But once we start adding soft-stop support, controlling when we deal
> with the remaining interrupts and making sure those are handled before
> the reset is issued gets tricky if we keep job interrupts active.
> 
> Let's prepare for that and mask+flush job IRQs before issuing a reset.
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> ---
>  drivers/gpu/drm/panfrost/panfrost_job.c | 21 +++++++++++++++------
>  1 file changed, 15 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> index 88d34fd781e8..0566e2f7e84a 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> @@ -34,6 +34,7 @@ struct panfrost_queue_state {
>  struct panfrost_job_slot {
>  	struct panfrost_queue_state queue[NUM_JOB_SLOTS];
>  	spinlock_t job_lock;
> +	int irq;
>  };
>  
>  static struct panfrost_job *
> @@ -400,7 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
>  	if (bad)
>  		drm_sched_increase_karma(bad);
>  
> -	spin_lock(&pfdev->js->job_lock);

I'm not sure it's safe to remove this lock as this protects the
pfdev->jobs array: I can't see what would prevent panfrost_job_close()
running at the same time without the lock. Am I missing something?

> +	/* Mask job interrupts and synchronize to make sure we won't be
> +	 * interrupted during our reset.
> +	 */
> +	job_write(pfdev, JOB_INT_MASK, 0);
> +	synchronize_irq(pfdev->js->irq);
> +
> +	/* Schedulers are stopped and interrupts are masked+flushed, we don't
> +	 * need to protect the 'evict unfinished jobs' loop with the job_lock.
> +	 */
>  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>  		if (pfdev->jobs[i]) {
>  			pm_runtime_put_noidle(pfdev->dev);
> @@ -408,7 +417,6 @@ static void panfrost_reset(struct panfrost_device *pfdev,
>  			pfdev->jobs[i] = NULL;
>  		}
>  	}
> -	spin_unlock(&pfdev->js->job_lock);
>  
>  	panfrost_device_reset(pfdev);
>  
> @@ -504,6 +512,7 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>  
>  			job = pfdev->jobs[j];
>  			/* Only NULL if job timeout occurred */
> +			WARN_ON(!job);

Was this WARN_ON intentional?

Steve

>  			if (job) {
>  				pfdev->jobs[j] = NULL;
>  
> @@ -563,7 +572,7 @@ static void panfrost_reset_work(struct work_struct *work)
>  int panfrost_job_init(struct panfrost_device *pfdev)
>  {
>  	struct panfrost_job_slot *js;
> -	int ret, j, irq;
> +	int ret, j;
>  
>  	INIT_WORK(&pfdev->reset.work, panfrost_reset_work);
>  
> @@ -573,11 +582,11 @@ int panfrost_job_init(struct panfrost_device *pfdev)
>  
>  	spin_lock_init(&js->job_lock);
>  
> -	irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
> -	if (irq <= 0)
> +	js->irq = platform_get_irq_byname(to_platform_device(pfdev->dev), "job");
> +	if (js->irq <= 0)
>  		return -ENODEV;
>  
> -	ret = devm_request_threaded_irq(pfdev->dev, irq,
> +	ret = devm_request_threaded_irq(pfdev->dev, js->irq,
>  					panfrost_job_irq_handler,
>  					panfrost_job_irq_handler_thread,
>  					IRQF_SHARED, KBUILD_MODNAME "-job",
> 



* Re: [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting
  2021-06-25 15:55   ` Steven Price
@ 2021-06-25 16:02     ` Boris Brezillon
  2021-06-25 16:11       ` Steven Price
  0 siblings, 1 reply; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 16:02 UTC (permalink / raw)
  To: Steven Price
  Cc: Tomeu Vizoso, dri-devel, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 16:55:12 +0100
Steven Price <steven.price@arm.com> wrote:

> On 25/06/2021 14:33, Boris Brezillon wrote:
> > This is not yet needed because we let active jobs be killed during
> > the reset and we don't really bother making sure they can be restarted.
> > But once we start adding soft-stop support, controlling when we deal
> > with the remaining interrupts and making sure those are handled before
> > the reset is issued gets tricky if we keep job interrupts active.
> > 
> > Let's prepare for that and mask+flush job IRQs before issuing a reset.
> > 
> > Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
> > ---
> >  drivers/gpu/drm/panfrost/panfrost_job.c | 21 +++++++++++++++------
> >  1 file changed, 15 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
> > index 88d34fd781e8..0566e2f7e84a 100644
> > --- a/drivers/gpu/drm/panfrost/panfrost_job.c
> > +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
> > @@ -34,6 +34,7 @@ struct panfrost_queue_state {
> >  struct panfrost_job_slot {
> >  	struct panfrost_queue_state queue[NUM_JOB_SLOTS];
> >  	spinlock_t job_lock;
> > +	int irq;
> >  };
> >  
> >  static struct panfrost_job *
> > @@ -400,7 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
> >  	if (bad)
> >  		drm_sched_increase_karma(bad);
> >  
> > -	spin_lock(&pfdev->js->job_lock);  
> 
> I'm not sure it's safe to remove this lock as this protects the
> pfdev->jobs array: I can't see what would prevent panfrost_job_close()
> running at the same time without the lock. Am I missing something?

Ah, you're right, I'll add it back.
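
Something like this in panfrost_reset(), i.e. keep the IRQ masking but
put the eviction loop back under job_lock (just a sketch reusing the
calls already in the patch, not necessarily the final placement):

	job_write(pfdev, JOB_INT_MASK, 0);
	synchronize_irq(pfdev->js->irq);

	spin_lock(&pfdev->js->job_lock);
	for (i = 0; i < NUM_JOB_SLOTS; i++) {
		if (pfdev->jobs[i]) {
			pm_runtime_put_noidle(pfdev->dev);
			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
			pfdev->jobs[i] = NULL;
		}
	}
	spin_unlock(&pfdev->js->job_lock);

The "we don't need to protect the loop" comment would of course go away
again with that change.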

> 
> > +	/* Mask job interrupts and synchronize to make sure we won't be
> > +	 * interrupted during our reset.
> > +	 */
> > +	job_write(pfdev, JOB_INT_MASK, 0);
> > +	synchronize_irq(pfdev->js->irq);
> > +
> > +	/* Schedulers are stopped and interrupts are masked+flushed, we don't
> > +	 * need to protect the 'evict unfinished jobs' lock with the job_lock.
> > +	 */
> >  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> >  		if (pfdev->jobs[i]) {
> >  			pm_runtime_put_noidle(pfdev->dev);
> > @@ -408,7 +417,6 @@ static void panfrost_reset(struct panfrost_device *pfdev,
> >  			pfdev->jobs[i] = NULL;
> >  		}
> >  	}
> > -	spin_unlock(&pfdev->js->job_lock);
> >  
> >  	panfrost_device_reset(pfdev);
> >  
> > @@ -504,6 +512,7 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
> >  
> >  			job = pfdev->jobs[j];
> >  			/* Only NULL if job timeout occurred */
> > +			WARN_ON(!job);  
> 
> Was this WARN_ON intentional?

Yes, now that we mask and synchronize the irq in the reset, I don't see
any reason why we would end up with an event but no job to attach this
event to, but maybe I missed something.


* Re: [PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults
  2021-06-25 13:33 ` [PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults Boris Brezillon
@ 2021-06-25 16:10   ` Steven Price
  0 siblings, 0 replies; 39+ messages in thread
From: Steven Price @ 2021-06-25 16:10 UTC (permalink / raw)
  To: Boris Brezillon, dri-devel
  Cc: Tomeu Vizoso, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 14:33, Boris Brezillon wrote:
> If we don't do that, we have to wait for the job timeout to expire
> before the faulting jobs get killed.
> 
> v3:
> * Make sure the AS is re-enabled when new jobs are submitted to the
>   context
> 
> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>

Reviewed-by: Steven Price <steven.price@arm.com>

> ---
>  drivers/gpu/drm/panfrost/panfrost_device.h |  1 +
>  drivers/gpu/drm/panfrost/panfrost_mmu.c    | 34 ++++++++++++++++++++--
>  2 files changed, 32 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/gpu/drm/panfrost/panfrost_device.h b/drivers/gpu/drm/panfrost/panfrost_device.h
> index bfe32907ba6b..efe9a675b614 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_device.h
> +++ b/drivers/gpu/drm/panfrost/panfrost_device.h
> @@ -96,6 +96,7 @@ struct panfrost_device {
>  	spinlock_t as_lock;
>  	unsigned long as_in_use_mask;
>  	unsigned long as_alloc_mask;
> +	unsigned long as_faulty_mask;
>  	struct list_head as_lru_list;
>  
>  	struct panfrost_job_slot *js;
> diff --git a/drivers/gpu/drm/panfrost/panfrost_mmu.c b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> index b4f0c673cd7f..65e98c51cb66 100644
> --- a/drivers/gpu/drm/panfrost/panfrost_mmu.c
> +++ b/drivers/gpu/drm/panfrost/panfrost_mmu.c
> @@ -154,6 +154,7 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, struct panfrost_mmu *mmu)
>  	as = mmu->as;
>  	if (as >= 0) {
>  		int en = atomic_inc_return(&mmu->as_count);
> +		u32 mask = BIT(as) | BIT(16 + as);
>  
>  		/*
>  		 * AS can be retained by active jobs or a perfcnt context,
> @@ -162,6 +163,18 @@ u32 panfrost_mmu_as_get(struct panfrost_device *pfdev, struct panfrost_mmu *mmu)
>  		WARN_ON(en >= (NUM_JOB_SLOTS + 1));
>  
>  		list_move(&mmu->list, &pfdev->as_lru_list);
> +
> +		if (pfdev->as_faulty_mask & mask) {
> +			/* Unhandled pagefault on this AS, the MMU was
> +			 * disabled. We need to re-enable the MMU after
> +			 * clearing+unmasking the AS interrupts.
> +			 */
> +			mmu_write(pfdev, MMU_INT_CLEAR, mask);
> +			mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
> +			pfdev->as_faulty_mask &= ~mask;
> +			panfrost_mmu_enable(pfdev, mmu);
> +		}
> +
>  		goto out;
>  	}
>  
> @@ -211,6 +224,7 @@ void panfrost_mmu_reset(struct panfrost_device *pfdev)
>  	spin_lock(&pfdev->as_lock);
>  
>  	pfdev->as_alloc_mask = 0;
> +	pfdev->as_faulty_mask = 0;
>  
>  	list_for_each_entry_safe(mmu, mmu_tmp, &pfdev->as_lru_list, list) {
>  		mmu->as = -1;
> @@ -662,7 +676,7 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
>  		if ((status & mask) == BIT(as) && (exception_type & 0xF8) == 0xC0)
>  			ret = panfrost_mmu_map_fault_addr(pfdev, as, addr);
>  
> -		if (ret)
> +		if (ret) {
>  			/* terminal fault, print info about the fault */
>  			dev_err(pfdev->dev,
>  				"Unhandled Page fault in AS%d at VA 0x%016llX\n"
> @@ -680,14 +694,28 @@ static irqreturn_t panfrost_mmu_irq_handler_thread(int irq, void *data)
>  				access_type, access_type_name(pfdev, fault_status),
>  				source_id);
>  
> +			spin_lock(&pfdev->as_lock);
> +			/* Ignore MMU interrupts on this AS until it's been
> +			 * re-enabled.
> +			 */
> +			pfdev->as_faulty_mask |= mask;
> +
> +			/* Disable the MMU to kill jobs on this AS. */
> +			panfrost_mmu_disable(pfdev, as);
> +			spin_unlock(&pfdev->as_lock);
> +		}
> +
>  		status &= ~mask;
>  
>  		/* If we received new MMU interrupts, process them before returning. */
>  		if (!status)
> -			status = mmu_read(pfdev, MMU_INT_RAWSTAT);
> +			status = mmu_read(pfdev, MMU_INT_RAWSTAT) & ~pfdev->as_faulty_mask;
>  	}
>  
> -	mmu_write(pfdev, MMU_INT_MASK, ~0);
> +	spin_lock(&pfdev->as_lock);
> +	mmu_write(pfdev, MMU_INT_MASK, ~pfdev->as_faulty_mask);
> +	spin_unlock(&pfdev->as_lock);
> +
>  	return IRQ_HANDLED;
>  };
>  
> 



* Re: [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting
  2021-06-25 16:02     ` Boris Brezillon
@ 2021-06-25 16:11       ` Steven Price
  0 siblings, 0 replies; 39+ messages in thread
From: Steven Price @ 2021-06-25 16:11 UTC (permalink / raw)
  To: Boris Brezillon
  Cc: Tomeu Vizoso, dri-devel, Rob Herring, Alyssa Rosenzweig, Robin Murphy

On 25/06/2021 17:02, Boris Brezillon wrote:
> On Fri, 25 Jun 2021 16:55:12 +0100
> Steven Price <steven.price@arm.com> wrote:
> 
>> On 25/06/2021 14:33, Boris Brezillon wrote:
>>> This is not yet needed because we let active jobs be killed during
>>> the reset and we don't really bother making sure they can be restarted.
>>> But once we start adding soft-stop support, controlling when we deal
>>> with the remaining interrupts and making sure those are handled before
>>> the reset is issued gets tricky if we keep job interrupts active.
>>>
>>> Let's prepare for that and mask+flush job IRQs before issuing a reset.
>>>
>>> Signed-off-by: Boris Brezillon <boris.brezillon@collabora.com>
>>> ---
>>>  drivers/gpu/drm/panfrost/panfrost_job.c | 21 +++++++++++++++------
>>>  1 file changed, 15 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/panfrost/panfrost_job.c b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> index 88d34fd781e8..0566e2f7e84a 100644
>>> --- a/drivers/gpu/drm/panfrost/panfrost_job.c
>>> +++ b/drivers/gpu/drm/panfrost/panfrost_job.c
>>> @@ -34,6 +34,7 @@ struct panfrost_queue_state {
>>>  struct panfrost_job_slot {
>>>  	struct panfrost_queue_state queue[NUM_JOB_SLOTS];
>>>  	spinlock_t job_lock;
>>> +	int irq;
>>>  };
>>>  
>>>  static struct panfrost_job *
>>> @@ -400,7 +401,15 @@ static void panfrost_reset(struct panfrost_device *pfdev,
>>>  	if (bad)
>>>  		drm_sched_increase_karma(bad);
>>>  
>>> -	spin_lock(&pfdev->js->job_lock);  
>>
>> I'm not sure it's safe to remove this lock as this protects the
>> pfdev->jobs array: I can't see what would prevent panfrost_job_close()
>> running at the same time without the lock. Am I missing something?
> 
> Ah, you're right, I'll add it back.
> 
>>
>>> +	/* Mask job interrupts and synchronize to make sure we won't be
>>> +	 * interrupted during our reset.
>>> +	 */
>>> +	job_write(pfdev, JOB_INT_MASK, 0);
>>> +	synchronize_irq(pfdev->js->irq);
>>> +
>>> +	/* Schedulers are stopped and interrupts are masked+flushed, we don't
>>> +	 * need to protect the 'evict unfinished jobs' loop with the job_lock.
>>> +	 */
>>>  	for (i = 0; i < NUM_JOB_SLOTS; i++) {
>>>  		if (pfdev->jobs[i]) {
>>>  			pm_runtime_put_noidle(pfdev->dev);
>>> @@ -408,7 +417,6 @@ static void panfrost_reset(struct panfrost_device *pfdev,
>>>  			pfdev->jobs[i] = NULL;
>>>  		}
>>>  	}
>>> -	spin_unlock(&pfdev->js->job_lock);
>>>  
>>>  	panfrost_device_reset(pfdev);
>>>  
>>> @@ -504,6 +512,7 @@ static void panfrost_job_handle_irq(struct panfrost_device *pfdev, u32 status)
>>>  
>>>  			job = pfdev->jobs[j];
>>>  			/* Only NULL if job timeout occurred */
>>> +			WARN_ON(!job);  
>>
>> Was this WARN_ON intentional?
> 
> Yes, now that we mask and synchronize the irq in the reset, I don't see
> any reason why we would end up with an event but no job to attach this
> event to, but maybe I missed something.
> 

Ok - but I guess the comment above needs updating then! ;) Job timeouts
are still a thing which definitely can happen!
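
Maybe something along these lines, with the wording just a suggestion:

			job = pfdev->jobs[j];
			/* With the reset path masking JOB_INT_MASK and
			 * calling synchronize_irq() before it clears
			 * pfdev->jobs[], we should never see a NULL job
			 * here anymore, even when a timeout/reset is in
			 * progress.
			 */
			WARN_ON(!job);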

Steve


* Re: [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic
  2021-06-25 13:33 ` [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic Boris Brezillon
  2021-06-25 15:42   ` Steven Price
@ 2021-06-25 16:22   ` Boris Brezillon
  1 sibling, 0 replies; 39+ messages in thread
From: Boris Brezillon @ 2021-06-25 16:22 UTC (permalink / raw)
  To: dri-devel
  Cc: Tomeu Vizoso, Daniel Vetter, Steven Price, Rob Herring,
	Alyssa Rosenzweig, Robin Murphy

On Fri, 25 Jun 2021 15:33:21 +0200
Boris Brezillon <boris.brezillon@collabora.com> wrote:


> @@ -379,57 +370,73 @@ void panfrost_job_enable_interrupts(struct panfrost_device *pfdev)
>  	job_write(pfdev, JOB_INT_MASK, irq_mask);
>  }
>  
> -static bool panfrost_scheduler_stop(struct panfrost_queue_state *queue,
> -				    struct drm_sched_job *bad)
> +static void panfrost_reset(struct panfrost_device *pfdev,
> +			   struct drm_sched_job *bad)
>  {
> -	enum panfrost_queue_status old_status;
> -	bool stopped = false;
> +	unsigned int i;
> +	bool cookie;
>  
> -	mutex_lock(&queue->lock);
> -	old_status = atomic_xchg(&queue->status,
> -				 PANFROST_QUEUE_STATUS_STOPPED);
> -	if (old_status == PANFROST_QUEUE_STATUS_STOPPED)
> -		goto out;
> +	if (WARN_ON(!atomic_read(&pfdev->reset.pending)))
> +		return;
> +
> +	/* Stop the schedulers.
> +	 *
> +	 * FIXME: We temporarily get out of the dma_fence_signalling section
> +	 * because the cleanup path generates lockdep splats when taking locks
> +	 * to release job resources. We should rework the code to follow this
> +	 * pattern:
> +	 *
> +	 *	try_lock
> +	 *	if (locked)
> +	 *		release
> +	 *	else
> +	 *		schedule_work_to_release_later
> +	 */
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		drm_sched_stop(&pfdev->js->queue[i].sched, bad);
> +
> +	cookie = dma_fence_begin_signalling();
>  
> -	WARN_ON(old_status != PANFROST_QUEUE_STATUS_ACTIVE);
> -	drm_sched_stop(&queue->sched, bad);
>  	if (bad)
>  		drm_sched_increase_karma(bad);
>  
> -	stopped = true;
> +	spin_lock(&pfdev->js->job_lock);
> +	for (i = 0; i < NUM_JOB_SLOTS; i++) {
> +		if (pfdev->jobs[i]) {
> +			pm_runtime_put_noidle(pfdev->dev);
> +			panfrost_devfreq_record_idle(&pfdev->pfdevfreq);
> +			pfdev->jobs[i] = NULL;
> +		}
> +	}
> +	spin_unlock(&pfdev->js->job_lock);
>  
> -	/*
> -	 * Set the timeout to max so the timer doesn't get started
> -	 * when we return from the timeout handler (restored in
> -	 * panfrost_scheduler_start()).
> +	panfrost_device_reset(pfdev);
> +
> +	/* GPU has been reset, we can cancel timeout/fault work that may have
> +	 * been queued in the meantime and clear the reset pending bit.
>  	 */
> -	queue->sched.timeout = MAX_SCHEDULE_TIMEOUT;
> +	atomic_set(&pfdev->reset.pending, 0);
> +	cancel_work_sync(&pfdev->reset.work);

This is introducing a deadlock since panfrost_reset() might be called
from the reset handler, and cancel_work_sync() waits for the handler to
return. Unfortunately there's no cancel_work() variant, so I'll just
remove the

	WARN_ON(!atomic_read(&pfdev->reset.pending))

and return directly when the pending bit is cleared.
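
I.e. turn the check at the top of panfrost_reset() into a plain early
return, something like:

	if (!atomic_read(&pfdev->reset.pending))
		return;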

> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		cancel_delayed_work(&pfdev->js->queue[i].sched.work_tdr);
>  
> -out:
> -	mutex_unlock(&queue->lock);
>  
> -	return stopped;
> -}
> +	/* Now resubmit jobs that were previously queued but didn't have a
> +	 * chance to finish.
> +	 * FIXME: We temporarily get out of the DMA fence signalling section
> +	 * while resubmitting jobs because the job submission logic will
> +	 * allocate memory with the GFP_KERNEL flag which can trigger memory
> +	 * reclaim and exposes a lock ordering issue.
> +	 */
> +	dma_fence_end_signalling(cookie);
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		drm_sched_resubmit_jobs(&pfdev->js->queue[i].sched);
> +	cookie = dma_fence_begin_signalling();
>  
> -static void panfrost_scheduler_start(struct panfrost_queue_state *queue)
> -{
> -	enum panfrost_queue_status old_status;
> +	for (i = 0; i < NUM_JOB_SLOTS; i++)
> +		drm_sched_start(&pfdev->js->queue[i].sched, true);
>  
> -	mutex_lock(&queue->lock);
> -	old_status = atomic_xchg(&queue->status,
> -				 PANFROST_QUEUE_STATUS_STARTING);
> -	WARN_ON(old_status != PANFROST_QUEUE_STATUS_STOPPED);
> -
> -	/* Restore the original timeout before starting the scheduler. */
> -	queue->sched.timeout = msecs_to_jiffies(JOB_TIMEOUT_MS);
> -	drm_sched_resubmit_jobs(&queue->sched);
> -	drm_sched_start(&queue->sched, true);
> -	old_status = atomic_xchg(&queue->status,
> -				 PANFROST_QUEUE_STATUS_ACTIVE);
> -	if (old_status == PANFROST_QUEUE_STATUS_FAULT_PENDING)
> -		drm_sched_fault(&queue->sched);
> -
> -	mutex_unlock(&queue->lock);
> +	dma_fence_end_signalling(cookie);
>  }
>  


Thread overview: 39+ messages
2021-06-25 13:33 [PATCH v3 00/15] drm/panfrost: Misc improvements Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 01/15] drm/sched: Allow using a dedicated workqueue for the timeout/fault tdr Boris Brezillon
2021-06-25 15:07   ` Steven Price
2021-06-25 15:18     ` Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 02/15] drm/panfrost: Make ->run_job() return an ERR_PTR() when appropriate Boris Brezillon
2021-06-25 13:45   ` Alyssa Rosenzweig
2021-06-25 13:33 ` [PATCH v3 03/15] drm/panfrost: Get rid of the unused JS_STATUS_EVENT_ACTIVE definition Boris Brezillon
2021-06-25 13:39   ` Alyssa Rosenzweig
2021-06-25 13:33 ` [PATCH v3 04/15] drm/panfrost: Drop the pfdev argument passed to panfrost_exception_name() Boris Brezillon
2021-06-25 13:45   ` Alyssa Rosenzweig
2021-06-25 13:33 ` [PATCH v3 05/15] drm/panfrost: Expose exception types to userspace Boris Brezillon
2021-06-25 13:42   ` Alyssa Rosenzweig
2021-06-25 14:21     ` Boris Brezillon
2021-06-25 15:32       ` Steven Price
2021-06-25 15:40         ` Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 06/15] drm/panfrost: Do the exception -> string translation using a table Boris Brezillon
2021-06-25 13:41   ` Alyssa Rosenzweig
2021-06-25 15:35   ` Steven Price
2021-06-25 13:33 ` [PATCH v3 07/15] drm/panfrost: Expose a helper to trigger a GPU reset Boris Brezillon
2021-06-25 13:46   ` Alyssa Rosenzweig
2021-06-25 13:33 ` [PATCH v3 08/15] drm/panfrost: Use a threaded IRQ for job interrupts Boris Brezillon
2021-06-25 13:47   ` Alyssa Rosenzweig
2021-06-25 14:37     ` Boris Brezillon
2021-06-25 15:40   ` Steven Price
2021-06-25 13:33 ` [PATCH v3 09/15] drm/panfrost: Simplify the reset serialization logic Boris Brezillon
2021-06-25 15:42   ` Steven Price
2021-06-25 16:22   ` Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 10/15] drm/panfrost: Make sure job interrupts are masked before resetting Boris Brezillon
2021-06-25 15:55   ` Steven Price
2021-06-25 16:02     ` Boris Brezillon
2021-06-25 16:11       ` Steven Price
2021-06-25 13:33 ` [PATCH v3 11/15] drm/panfrost: Disable the AS on unhandled page faults Boris Brezillon
2021-06-25 16:10   ` Steven Price
2021-06-25 13:33 ` [PATCH v3 12/15] drm/panfrost: Reset the GPU when the AS_ACTIVE bit is stuck Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 13/15] drm/panfrost: Don't reset the GPU on job faults unless we really have to Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 14/15] drm/panfrost: Kill in-flight jobs on FD close Boris Brezillon
2021-06-25 13:43   ` Lucas Stach
2021-06-25 14:46     ` Boris Brezillon
2021-06-25 13:33 ` [PATCH v3 15/15] drm/panfrost: Queue jobs on the hardware Boris Brezillon
