All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v3 0/8] Improve GPU Recovery
@ 2022-07-30  9:40 ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar,
	AngeloGioacchino Del Regno, Chia-I Wu, Dan Carpenter,
	Daniel Vetter, David Airlie, Dmitry Baryshkov, Nathan Chancellor,
	Philipp Zabel, Sean Paul, Stephen Boyd, Vladimir Lypak,
	Wang Qing, linux-kernel


Recently, I debugged a few device crashes which occured during recovery
after a hangcheck timeout. It looks like there are a few things we can
do to improve our chance at a successful gpu recovery.

First one is to ensure that CX GDSC collapses which clears the internal
states in gpu's CX domain. First 5 patches tries to handle this.

Rest of the patches are to ensure that few internal blocks like CP, GMU
and GBIF are halted properly before proceeding for a snapshot followed by
recovery. Also, handle 'prepare slumber' hfi failure correctly. These
are A6x specific improvements.

This series is rebased on top of [1] which based on linus's master
branch.

[1] https://patchwork.freedesktop.org/series/106860/

Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
  https://patchwork.freedesktop.org/series/106860/
- Use single pm refcount for all active submits

Changes in v2:
- Rebased on msm-next tip

Akhil P Oommen (8):
  drm/msm: Remove unnecessary pm_runtime_get/put
  drm/msm: Take single rpm refcount on behalf of all submits
  drm/msm: Correct pm_runtime votes in recover worker
  drm/msm: Fix cx collapse issue during recovery
  drm/msm/a6xx: Ensure CX collapse during gpu recovery
  drm/msm/adreno: Remove a WARN() during runtime_suspend
  drm/msm/a6xx: Improve gpu recovery sequence
  drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

 drivers/gpu/drm/msm/adreno/a6xx.xml.h      |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c      | 83 +++++++++++++++++++-----------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      | 35 +++++++++++--
 drivers/gpu/drm/msm/adreno/adreno_device.c |  7 ---
 drivers/gpu/drm/msm/msm_gpu.c              | 21 +++++---
 drivers/gpu/drm/msm/msm_gpu.h              |  4 ++
 drivers/gpu/drm/msm/msm_ringbuffer.c       |  4 --
 7 files changed, 106 insertions(+), 52 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 0/8] Improve GPU Recovery
@ 2022-07-30  9:40 ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Wang Qing, Jonathan Marek, Akhil P Oommen, linux-kernel,
	Stephen Boyd, Vladimir Lypak, Abhinav Kumar, Douglas Anderson,
	Nathan Chancellor, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul, Dan Carpenter,
	AngeloGioacchino Del Regno


Recently, I debugged a few device crashes which occured during recovery
after a hangcheck timeout. It looks like there are a few things we can
do to improve our chance at a successful gpu recovery.

First one is to ensure that CX GDSC collapses which clears the internal
states in gpu's CX domain. First 5 patches tries to handle this.

Rest of the patches are to ensure that few internal blocks like CP, GMU
and GBIF are halted properly before proceeding for a snapshot followed by
recovery. Also, handle 'prepare slumber' hfi failure correctly. These
are A6x specific improvements.

This series is rebased on top of [1] which based on linus's master
branch.

[1] https://patchwork.freedesktop.org/series/106860/

Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
  https://patchwork.freedesktop.org/series/106860/
- Use single pm refcount for all active submits

Changes in v2:
- Rebased on msm-next tip

Akhil P Oommen (8):
  drm/msm: Remove unnecessary pm_runtime_get/put
  drm/msm: Take single rpm refcount on behalf of all submits
  drm/msm: Correct pm_runtime votes in recover worker
  drm/msm: Fix cx collapse issue during recovery
  drm/msm/a6xx: Ensure CX collapse during gpu recovery
  drm/msm/adreno: Remove a WARN() during runtime_suspend
  drm/msm/a6xx: Improve gpu recovery sequence
  drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

 drivers/gpu/drm/msm/adreno/a6xx.xml.h      |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c      | 83 +++++++++++++++++++-----------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c      | 35 +++++++++++--
 drivers/gpu/drm/msm/adreno/adreno_device.c |  7 ---
 drivers/gpu/drm/msm/msm_gpu.c              | 21 +++++---
 drivers/gpu/drm/msm/msm_gpu.h              |  4 ++
 drivers/gpu/drm/msm/msm_ringbuffer.c       |  4 --
 7 files changed, 106 insertions(+), 52 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 35+ messages in thread

* [PATCH v3 1/8] drm/msm: Remove unnecessary pm_runtime_get/put
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, linux-kernel

We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
pm_runtime_get/put from msm_job_run().

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 56eecb4..cad4c35 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 		msm_gem_unlock(obj);
 	}
 
-	pm_runtime_get_sync(&gpu->pdev->dev);
-
 	/* TODO move submit path over to using a per-ring lock.. */
 	mutex_lock(&gpu->lock);
 
@@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 
 	mutex_unlock(&gpu->lock);
 
-	pm_runtime_put(&gpu->pdev->dev);
-
 	return dma_fence_get(submit->hw_fence);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 1/8] drm/msm: Remove unnecessary pm_runtime_get/put
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Abhinav Kumar,
	Douglas Anderson, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul

We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
pm_runtime_get/put from msm_job_run().

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 56eecb4..cad4c35 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 		msm_gem_unlock(obj);
 	}
 
-	pm_runtime_get_sync(&gpu->pdev->dev);
-
 	/* TODO move submit path over to using a per-ring lock.. */
 	mutex_lock(&gpu->lock);
 
@@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 
 	mutex_unlock(&gpu->lock);
 
-	pm_runtime_put(&gpu->pdev->dev);
-
 	return dma_fence_get(submit->hw_fence);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, linux-kernel

Instead of separate refcount for each submit, take single rpm refcount
on behalf of all the submits. This makes it easier to drop the rpm
refcount during recovery in an upcoming patch.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index c8cd9bf..e1dd3cc 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
 	mutex_lock(&gpu->active_lock);
 	gpu->active_submits--;
 	WARN_ON(gpu->active_submits < 0);
-	if (!gpu->active_submits)
+	if (!gpu->active_submits) {
 		msm_devfreq_idle(gpu);
-	mutex_unlock(&gpu->active_lock);
+		pm_runtime_put_autosuspend(&gpu->pdev->dev);
+	}
 
-	pm_runtime_put_autosuspend(&gpu->pdev->dev);
+	mutex_unlock(&gpu->active_lock);
 
 	msm_gem_submit_put(submit);
 }
@@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
 
 	/* Update devfreq on transition from idle->active: */
 	mutex_lock(&gpu->active_lock);
-	if (!gpu->active_submits)
+	if (!gpu->active_submits) {
+		pm_runtime_get(&gpu->pdev->dev);
 		msm_devfreq_active(gpu);
+	}
 	gpu->active_submits++;
 	mutex_unlock(&gpu->active_lock);
 
 	gpu->funcs->submit(gpu, submit);
 	gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
 
+	pm_runtime_put(&gpu->pdev->dev);
 	hangcheck_timer_reset(gpu);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Abhinav Kumar,
	Douglas Anderson, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul

Instead of separate refcount for each submit, take single rpm refcount
on behalf of all the submits. This makes it easier to drop the rpm
refcount during recovery in an upcoming patch.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index c8cd9bf..e1dd3cc 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
 	mutex_lock(&gpu->active_lock);
 	gpu->active_submits--;
 	WARN_ON(gpu->active_submits < 0);
-	if (!gpu->active_submits)
+	if (!gpu->active_submits) {
 		msm_devfreq_idle(gpu);
-	mutex_unlock(&gpu->active_lock);
+		pm_runtime_put_autosuspend(&gpu->pdev->dev);
+	}
 
-	pm_runtime_put_autosuspend(&gpu->pdev->dev);
+	mutex_unlock(&gpu->active_lock);
 
 	msm_gem_submit_put(submit);
 }
@@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
 
 	/* Update devfreq on transition from idle->active: */
 	mutex_lock(&gpu->active_lock);
-	if (!gpu->active_submits)
+	if (!gpu->active_submits) {
+		pm_runtime_get(&gpu->pdev->dev);
 		msm_devfreq_active(gpu);
+	}
 	gpu->active_submits++;
 	mutex_unlock(&gpu->active_lock);
 
 	gpu->funcs->submit(gpu, submit);
 	gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
 
+	pm_runtime_put(&gpu->pdev->dev);
 	hangcheck_timer_reset(gpu);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 3/8] drm/msm: Correct pm_runtime votes in recover worker
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, linux-kernel

In the scenario where there is one a single submit which is hung, gpu is
power collapsed when it is retired. Because of this, by the time we call
reover(), gpu state would be already clear. Fix this by correctly
managing the pm runtime votes.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index e1dd3cc..1945efb 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -398,7 +398,6 @@ static void recover_worker(struct kthread_work *work)
 	/* Record the crash state */
 	pm_runtime_get_sync(&gpu->pdev->dev);
 	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
-	pm_runtime_put_sync(&gpu->pdev->dev);
 
 	kfree(cmd);
 	kfree(comm);
@@ -446,6 +445,8 @@ static void recover_worker(struct kthread_work *work)
 		}
 	}
 
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
 	mutex_unlock(&gpu->lock);
 
 	msm_gpu_retire(gpu);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 3/8] drm/msm: Correct pm_runtime votes in recover worker
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Abhinav Kumar,
	Douglas Anderson, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul

In the scenario where there is one a single submit which is hung, gpu is
power collapsed when it is retired. Because of this, by the time we call
reover(), gpu state would be already clear. Fix this by correctly
managing the pm runtime votes.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index e1dd3cc..1945efb 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -398,7 +398,6 @@ static void recover_worker(struct kthread_work *work)
 	/* Record the crash state */
 	pm_runtime_get_sync(&gpu->pdev->dev);
 	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
-	pm_runtime_put_sync(&gpu->pdev->dev);
 
 	kfree(cmd);
 	kfree(comm);
@@ -446,6 +445,8 @@ static void recover_worker(struct kthread_work *work)
 		}
 	}
 
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
 	mutex_unlock(&gpu->lock);
 
 	msm_gpu_retire(gpu);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 4/8] drm/msm: Fix cx collapse issue during recovery
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Chia-I Wu,
	Daniel Vetter, David Airlie, Dmitry Baryshkov, Sean Paul,
	Stephen Boyd, linux-kernel

There are some hardware logic under CX domain. For a successful
recovery, we should ensure cx headswitch collapses to ensure all the
stale states are cleard out. This is especially true to for a6xx family
where we can GMU co-processor.

Currently, cx doesn't collapse due to a devlink between gpu and its
smmu. So the *struct gpu device* needs to be runtime suspended to ensure
that the iommu driver removes its vote on cx gdsc.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

Changes in v3:
- Simplied the pm refcount drop since we have just a single refcount now
for all active submits

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 24 +++++++++++++++++++++---
 drivers/gpu/drm/msm/msm_gpu.c         |  4 +---
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 42ed9a3..1b049c5 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1193,7 +1193,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
-	int i;
+	int i, active_submits;
 
 	adreno_dump_info(gpu);
 
@@ -1210,8 +1210,26 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	 */
 	gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
 
-	gpu->funcs->pm_suspend(gpu);
-	gpu->funcs->pm_resume(gpu);
+	pm_runtime_dont_use_autosuspend(&gpu->pdev->dev);
+
+	/* active_submit won't change until we make a submission */
+	mutex_lock(&gpu->active_lock);
+	active_submits = gpu->active_submits;
+	mutex_unlock(&gpu->active_lock);
+
+	/* Drop the rpm refcount from active submits */
+	if (active_submits)
+		pm_runtime_put(&gpu->pdev->dev);
+
+	/* And the final one from recover worker */
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	pm_runtime_use_autosuspend(&gpu->pdev->dev);
+
+	if (active_submits)
+		pm_runtime_get(&gpu->pdev->dev);
+
+	pm_runtime_get_sync(&gpu->pdev->dev);
 
 	msm_gpu_hw_init(gpu);
 }
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 1945efb..07e55a6 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -426,9 +426,7 @@ static void recover_worker(struct kthread_work *work)
 		/* retire completed submits, plus the one that hung: */
 		retire_submits(gpu);
 
-		pm_runtime_get_sync(&gpu->pdev->dev);
 		gpu->funcs->recover(gpu);
-		pm_runtime_put_sync(&gpu->pdev->dev);
 
 		/*
 		 * Replay all remaining submits starting with highest priority
@@ -445,7 +443,7 @@ static void recover_worker(struct kthread_work *work)
 		}
 	}
 
-	pm_runtime_put_sync(&gpu->pdev->dev);
+	pm_runtime_put(&gpu->pdev->dev);
 
 	mutex_unlock(&gpu->lock);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 4/8] drm/msm: Fix cx collapse issue during recovery
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Stephen Boyd,
	Abhinav Kumar, Douglas Anderson, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul

There are some hardware logic under CX domain. For a successful
recovery, we should ensure cx headswitch collapses to ensure all the
stale states are cleard out. This is especially true to for a6xx family
where we can GMU co-processor.

Currently, cx doesn't collapse due to a devlink between gpu and its
smmu. So the *struct gpu device* needs to be runtime suspended to ensure
that the iommu driver removes its vote on cx gdsc.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

Changes in v3:
- Simplied the pm refcount drop since we have just a single refcount now
for all active submits

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 24 +++++++++++++++++++++---
 drivers/gpu/drm/msm/msm_gpu.c         |  4 +---
 2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 42ed9a3..1b049c5 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1193,7 +1193,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
-	int i;
+	int i, active_submits;
 
 	adreno_dump_info(gpu);
 
@@ -1210,8 +1210,26 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	 */
 	gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
 
-	gpu->funcs->pm_suspend(gpu);
-	gpu->funcs->pm_resume(gpu);
+	pm_runtime_dont_use_autosuspend(&gpu->pdev->dev);
+
+	/* active_submit won't change until we make a submission */
+	mutex_lock(&gpu->active_lock);
+	active_submits = gpu->active_submits;
+	mutex_unlock(&gpu->active_lock);
+
+	/* Drop the rpm refcount from active submits */
+	if (active_submits)
+		pm_runtime_put(&gpu->pdev->dev);
+
+	/* And the final one from recover worker */
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	pm_runtime_use_autosuspend(&gpu->pdev->dev);
+
+	if (active_submits)
+		pm_runtime_get(&gpu->pdev->dev);
+
+	pm_runtime_get_sync(&gpu->pdev->dev);
 
 	msm_gpu_hw_init(gpu);
 }
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 1945efb..07e55a6 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -426,9 +426,7 @@ static void recover_worker(struct kthread_work *work)
 		/* retire completed submits, plus the one that hung: */
 		retire_submits(gpu);
 
-		pm_runtime_get_sync(&gpu->pdev->dev);
 		gpu->funcs->recover(gpu);
-		pm_runtime_put_sync(&gpu->pdev->dev);
 
 		/*
 		 * Replay all remaining submits starting with highest priority
@@ -445,7 +443,7 @@ static void recover_worker(struct kthread_work *work)
 		}
 	}
 
-	pm_runtime_put_sync(&gpu->pdev->dev);
+	pm_runtime_put(&gpu->pdev->dev);
 
 	mutex_unlock(&gpu->lock);
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Chia-I Wu,
	Daniel Vetter, David Airlie, Dmitry Baryshkov, Philipp Zabel,
	Sean Paul, Stephen Boyd, linux-kernel

Because there could be transient votes from other drivers/tz/hyp which
may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
We can use the reset framework to poll for cx gdsc collapse from gpucc
clk driver.

This feature requires support from the platform's gpucc driver.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
  https://patchwork.freedesktop.org/series/106860/

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
 drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
 drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
 3 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 1b049c5..721d5e6 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -10,6 +10,7 @@
 
 #include <linux/bitfield.h>
 #include <linux/devfreq.h>
+#include <linux/reset.h>
 #include <linux/soc/qcom/llcc-qcom.h>
 
 #define GPU_PAS_ID 13
@@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	/* And the final one from recover worker */
 	pm_runtime_put_sync(&gpu->pdev->dev);
 
+	/* Call into gpucc driver to poll for cx gdsc collapse */
+	reset_control_reset(gpu->cx_collapse);
+
 	pm_runtime_use_autosuspend(&gpu->pdev->dev);
 
 	if (active_submits)
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 07e55a6..4a57627 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -14,6 +14,7 @@
 #include <generated/utsrelease.h>
 #include <linux/string_helpers.h>
 #include <linux/devcoredump.h>
+#include <linux/reset.h>
 #include <linux/sched/task.h>
 
 /*
@@ -903,6 +904,9 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	if (IS_ERR(gpu->gpu_cx))
 		gpu->gpu_cx = NULL;
 
+	gpu->cx_collapse = devm_reset_control_get_optional(&pdev->dev,
+			"cx_collapse");
+
 	gpu->pdev = pdev;
 	platform_set_drvdata(pdev, &gpu->adreno_smmu);
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 6def008..ab59fd2 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -13,6 +13,7 @@
 #include <linux/interconnect.h>
 #include <linux/pm_opp.h>
 #include <linux/regulator/consumer.h>
+#include <linux/reset.h>
 
 #include "msm_drv.h"
 #include "msm_fence.h"
@@ -268,6 +269,9 @@ struct msm_gpu {
 	bool hw_apriv;
 
 	struct thermal_cooling_device *cooling;
+
+	/* To poll for cx gdsc collapse during gpu recovery */
+	struct reset_control *cx_collapse;
 };
 
 static inline struct msm_gpu *dev_to_gpu(struct device *dev)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Stephen Boyd,
	Abhinav Kumar, Douglas Anderson, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul

Because there could be transient votes from other drivers/tz/hyp which
may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
We can use the reset framework to poll for cx gdsc collapse from gpucc
clk driver.

This feature requires support from the platform's gpucc driver.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

Changes in v3:
- Use reset interface from gpucc driver to poll for cx gdsc collapse
  https://patchwork.freedesktop.org/series/106860/

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
 drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
 drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
 3 files changed, 12 insertions(+)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 1b049c5..721d5e6 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -10,6 +10,7 @@
 
 #include <linux/bitfield.h>
 #include <linux/devfreq.h>
+#include <linux/reset.h>
 #include <linux/soc/qcom/llcc-qcom.h>
 
 #define GPU_PAS_ID 13
@@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	/* And the final one from recover worker */
 	pm_runtime_put_sync(&gpu->pdev->dev);
 
+	/* Call into gpucc driver to poll for cx gdsc collapse */
+	reset_control_reset(gpu->cx_collapse);
+
 	pm_runtime_use_autosuspend(&gpu->pdev->dev);
 
 	if (active_submits)
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 07e55a6..4a57627 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -14,6 +14,7 @@
 #include <generated/utsrelease.h>
 #include <linux/string_helpers.h>
 #include <linux/devcoredump.h>
+#include <linux/reset.h>
 #include <linux/sched/task.h>
 
 /*
@@ -903,6 +904,9 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	if (IS_ERR(gpu->gpu_cx))
 		gpu->gpu_cx = NULL;
 
+	gpu->cx_collapse = devm_reset_control_get_optional(&pdev->dev,
+			"cx_collapse");
+
 	gpu->pdev = pdev;
 	platform_set_drvdata(pdev, &gpu->adreno_smmu);
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 6def008..ab59fd2 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -13,6 +13,7 @@
 #include <linux/interconnect.h>
 #include <linux/pm_opp.h>
 #include <linux/regulator/consumer.h>
+#include <linux/reset.h>
 
 #include "msm_drv.h"
 #include "msm_fence.h"
@@ -268,6 +269,9 @@ struct msm_gpu {
 	bool hw_apriv;
 
 	struct thermal_cooling_device *cooling;
+
+	/* To poll for cx gdsc collapse during gpu recovery */
+	struct reset_control *cx_collapse;
 };
 
 static inline struct msm_gpu *dev_to_gpu(struct device *dev)
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 6/8] drm/msm/adreno: Remove a WARN() during runtime_suspend
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar,
	AngeloGioacchino Del Regno, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Nathan Chancellor, Sean Paul, Vladimir Lypak,
	linux-kernel

WARN(gpu->active_submits) during runtime_suspend doesn't make sense now
because we force runtime suspend during a gpu recovery when there are
active submissions pending.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/adreno_device.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 8706bcd..d1e1e8e 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -610,13 +610,6 @@ static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
 
-	/*
-	 * We should be holding a runpm ref, which will prevent
-	 * runtime suspend.  In the system suspend path, we've
-	 * already waited for active jobs to complete.
-	 */
-	WARN_ON_ONCE(gpu->active_submits);
-
 	return gpu->funcs->pm_suspend(gpu);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 6/8] drm/msm/adreno: Remove a WARN() during runtime_suspend
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Vladimir Lypak,
	Abhinav Kumar, Douglas Anderson, Nathan Chancellor, David Airlie,
	Matthias Kaehlcke, Dmitry Baryshkov, Jordan Crouse, Sean Paul,
	AngeloGioacchino Del Regno

WARN(gpu->active_submits) during runtime_suspend doesn't make sense now
because we force runtime suspend during a gpu recovery when there are
active submissions pending.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/adreno_device.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c
index 8706bcd..d1e1e8e 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_device.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_device.c
@@ -610,13 +610,6 @@ static int adreno_runtime_suspend(struct device *dev)
 {
 	struct msm_gpu *gpu = dev_to_gpu(dev);
 
-	/*
-	 * We should be holding a runpm ref, which will prevent
-	 * runtime suspend.  In the system suspend path, we've
-	 * already waited for active jobs to complete.
-	 */
-	WARN_ON_ONCE(gpu->active_submits);
-
 	return gpu->funcs->pm_suspend(gpu);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 7/8] drm/msm/a6xx: Improve gpu recovery sequence
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Chia-I Wu,
	Dan Carpenter, Daniel Vetter, David Airlie, Dmitry Baryshkov,
	Sean Paul, Stephen Boyd, Wang Qing, linux-kernel

We can do a few more things to improve our chance at a successful gpu
recovery, especially during a hangcheck timeout:
1. Halt CP and GMU core
2. Do RBBM GBIF HALT sequence
3. Do a soft reset of GPU core

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx.xml.h |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 77 +++++++++++++++++++++--------------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c |  7 ++++
 3 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx.xml.h b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
index b03e2c4..beea4a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx.xml.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
@@ -1413,6 +1413,10 @@ static inline uint32_t REG_A6XX_RBBM_PERFCTR_RBBM_SEL(uint32_t i0) { return 0x00
 
 #define REG_A6XX_RBBM_GBIF_CLIENT_QOS_CNTL			0x00000011
 
+#define REG_A6XX_RBBM_GBIF_HALT					0x00000016
+
+#define REG_A6XX_RBBM_GBIF_HALT_ACK				0x00000017
+
 #define REG_A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD			0x0000001c
 #define A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD_WAIT_GPU_IDLE		0x00000001
 
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 9f76f5b..db05942 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -869,9 +869,47 @@ static void a6xx_gmu_rpmh_off(struct a6xx_gmu *gmu)
 		(val & 1), 100, 1000);
 }
 
+#define GBIF_CLIENT_HALT_MASK             BIT(0)
+#define GBIF_ARB_HALT_MASK                BIT(1)
+
+static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
+{
+	struct msm_gpu *gpu = &adreno_gpu->base;
+
+	if (!a6xx_has_gbif(adreno_gpu)) {
+		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
+		spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
+								0xf) == 0xf);
+		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
+
+		return;
+	}
+
+	/* Halt the gx side of GBIF */
+	gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 1);
+	spin_until(gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT_ACK) & 1);
+
+	/* Halt new client requests on GBIF */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
+	spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
+			(GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
+
+	/* Halt all AXI requests on GBIF */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
+	spin_until((gpu_read(gpu,  REG_A6XX_GBIF_HALT_ACK) &
+			(GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
+
+	/* The GBIF halt needs to be explicitly cleared */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
+}
+
 /* Force the GMU off in case it isn't responsive */
 static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
 {
+	struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
+	struct adreno_gpu *adreno_gpu = &a6xx_gpu->base;
+	struct msm_gpu *gpu = &adreno_gpu->base;
+
 	/* Flush all the queues */
 	a6xx_hfi_stop(gmu);
 
@@ -883,6 +921,15 @@ static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
 
 	/* Make sure there are no outstanding RPMh votes */
 	a6xx_gmu_rpmh_off(gmu);
+
+	/* Halt the gmu cm3 core */
+	gmu_write(gmu, REG_A6XX_GMU_CM3_SYSRESET, 1);
+
+	a6xx_bus_clear_pending_transactions(adreno_gpu);
+
+	/* Reset GPU core blocks */
+	gpu_write(gpu, REG_A6XX_RBBM_SW_RESET_CMD, 1);
+	udelay(100);
 }
 
 static void a6xx_gmu_set_initial_freq(struct msm_gpu *gpu, struct a6xx_gmu *gmu)
@@ -1010,36 +1057,6 @@ bool a6xx_gmu_isidle(struct a6xx_gmu *gmu)
 	return true;
 }
 
-#define GBIF_CLIENT_HALT_MASK             BIT(0)
-#define GBIF_ARB_HALT_MASK                BIT(1)
-
-static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
-{
-	struct msm_gpu *gpu = &adreno_gpu->base;
-
-	if (!a6xx_has_gbif(adreno_gpu)) {
-		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
-		spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
-								0xf) == 0xf);
-		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
-
-		return;
-	}
-
-	/* Halt new client requests on GBIF */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
-	spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
-			(GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
-
-	/* Halt all AXI requests on GBIF */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
-	spin_until((gpu_read(gpu,  REG_A6XX_GBIF_HALT_ACK) &
-			(GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
-
-	/* The GBIF halt needs to be explicitly cleared */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
-}
-
 /* Gracefully try to shut down the GMU and by extension the GPU */
 static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 {
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 721d5e6..33c98bb 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -920,6 +920,10 @@ static int hw_init(struct msm_gpu *gpu)
 	/* Make sure the GMU keeps the GPU on while we set it up */
 	a6xx_gmu_set_oob(&a6xx_gpu->gmu, GMU_OOB_GPU_SET);
 
+	/* Clear GBIF halt in case GX domain was not collapsed */
+	if (a6xx_has_gbif(adreno_gpu))
+		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
+
 	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
 
 	/*
@@ -1205,6 +1209,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	if (hang_debug)
 		a6xx_dump(gpu);
 
+	/* Halt SQE first */
+	gpu_write(gpu, REG_A6XX_CP_SQE_CNTL, 3);
+
 	/*
 	 * Turn off keep alive that might have been enabled by the hang
 	 * interrupt
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 7/8] drm/msm/a6xx: Improve gpu recovery sequence
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Stephen Boyd,
	Abhinav Kumar, Douglas Anderson, Wang Qing, David Airlie,
	Matthias Kaehlcke, Dmitry Baryshkov, Jordan Crouse, Sean Paul,
	Dan Carpenter

We can do a few more things to improve our chance at a successful gpu
recovery, especially during a hangcheck timeout:
1. Halt CP and GMU core
2. Do RBBM GBIF HALT sequence
3. Do a soft reset of GPU core

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx.xml.h |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 77 +++++++++++++++++++++--------------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c |  7 ++++
 3 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx.xml.h b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
index b03e2c4..beea4a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx.xml.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
@@ -1413,6 +1413,10 @@ static inline uint32_t REG_A6XX_RBBM_PERFCTR_RBBM_SEL(uint32_t i0) { return 0x00
 
 #define REG_A6XX_RBBM_GBIF_CLIENT_QOS_CNTL			0x00000011
 
+#define REG_A6XX_RBBM_GBIF_HALT					0x00000016
+
+#define REG_A6XX_RBBM_GBIF_HALT_ACK				0x00000017
+
 #define REG_A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD			0x0000001c
 #define A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD_WAIT_GPU_IDLE		0x00000001
 
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 9f76f5b..db05942 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -869,9 +869,47 @@ static void a6xx_gmu_rpmh_off(struct a6xx_gmu *gmu)
 		(val & 1), 100, 1000);
 }
 
+#define GBIF_CLIENT_HALT_MASK             BIT(0)
+#define GBIF_ARB_HALT_MASK                BIT(1)
+
+static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
+{
+	struct msm_gpu *gpu = &adreno_gpu->base;
+
+	if (!a6xx_has_gbif(adreno_gpu)) {
+		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
+		spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
+								0xf) == 0xf);
+		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
+
+		return;
+	}
+
+	/* Halt the gx side of GBIF */
+	gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 1);
+	spin_until(gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT_ACK) & 1);
+
+	/* Halt new client requests on GBIF */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
+	spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
+			(GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
+
+	/* Halt all AXI requests on GBIF */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
+	spin_until((gpu_read(gpu,  REG_A6XX_GBIF_HALT_ACK) &
+			(GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
+
+	/* The GBIF halt needs to be explicitly cleared */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
+}
+
 /* Force the GMU off in case it isn't responsive */
 static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
 {
+	struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
+	struct adreno_gpu *adreno_gpu = &a6xx_gpu->base;
+	struct msm_gpu *gpu = &adreno_gpu->base;
+
 	/* Flush all the queues */
 	a6xx_hfi_stop(gmu);
 
@@ -883,6 +921,15 @@ static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
 
 	/* Make sure there are no outstanding RPMh votes */
 	a6xx_gmu_rpmh_off(gmu);
+
+	/* Halt the gmu cm3 core */
+	gmu_write(gmu, REG_A6XX_GMU_CM3_SYSRESET, 1);
+
+	a6xx_bus_clear_pending_transactions(adreno_gpu);
+
+	/* Reset GPU core blocks */
+	gpu_write(gpu, REG_A6XX_RBBM_SW_RESET_CMD, 1);
+	udelay(100);
 }
 
 static void a6xx_gmu_set_initial_freq(struct msm_gpu *gpu, struct a6xx_gmu *gmu)
@@ -1010,36 +1057,6 @@ bool a6xx_gmu_isidle(struct a6xx_gmu *gmu)
 	return true;
 }
 
-#define GBIF_CLIENT_HALT_MASK             BIT(0)
-#define GBIF_ARB_HALT_MASK                BIT(1)
-
-static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
-{
-	struct msm_gpu *gpu = &adreno_gpu->base;
-
-	if (!a6xx_has_gbif(adreno_gpu)) {
-		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
-		spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
-								0xf) == 0xf);
-		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
-
-		return;
-	}
-
-	/* Halt new client requests on GBIF */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
-	spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
-			(GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
-
-	/* Halt all AXI requests on GBIF */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
-	spin_until((gpu_read(gpu,  REG_A6XX_GBIF_HALT_ACK) &
-			(GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
-
-	/* The GBIF halt needs to be explicitly cleared */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
-}
-
 /* Gracefully try to shut down the GMU and by extension the GPU */
 static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 {
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 721d5e6..33c98bb 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -920,6 +920,10 @@ static int hw_init(struct msm_gpu *gpu)
 	/* Make sure the GMU keeps the GPU on while we set it up */
 	a6xx_gmu_set_oob(&a6xx_gpu->gmu, GMU_OOB_GPU_SET);
 
+	/* Clear GBIF halt in case GX domain was not collapsed */
+	if (a6xx_has_gbif(adreno_gpu))
+		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
+
 	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
 
 	/*
@@ -1205,6 +1209,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	if (hang_debug)
 		a6xx_dump(gpu);
 
+	/* Halt SQE first */
+	gpu_write(gpu, REG_A6XX_CP_SQE_CNTL, 3);
+
 	/*
 	 * Turn off keep alive that might have been enabled by the hang
 	 * interrupt
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 8/8] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure
  2022-07-30  9:40 ` Akhil P Oommen
@ 2022-07-30  9:40   ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Akhil P Oommen, Abhinav Kumar, Dan Carpenter,
	Daniel Vetter, David Airlie, Dmitry Baryshkov, Sean Paul,
	Wang Qing, linux-kernel

When prepare-slumber hfi fails, we should follow a6xx_gmu_force_off()
sequence.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index db05942..3d00ef9 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1082,7 +1082,11 @@ static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 		a6xx_bus_clear_pending_transactions(adreno_gpu);
 
 		/* tell the GMU we want to slumber */
-		a6xx_gmu_notify_slumber(gmu);
+		ret = a6xx_gmu_notify_slumber(gmu);
+		if (ret) {
+			a6xx_gmu_force_off(gmu);
+			return;
+		}
 
 		ret = gmu_poll_timeout(gmu,
 			REG_A6XX_GPU_GMU_AO_GPU_CX_BUSY_STATUS, val,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* [PATCH v3 8/8] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure
@ 2022-07-30  9:40   ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-30  9:40 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Akhil P Oommen, linux-kernel, Abhinav Kumar,
	Douglas Anderson, Wang Qing, David Airlie, Matthias Kaehlcke,
	Dmitry Baryshkov, Jordan Crouse, Sean Paul, Dan Carpenter

When prepare-slumber hfi fails, we should follow a6xx_gmu_force_off()
sequence.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index db05942..3d00ef9 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1082,7 +1082,11 @@ static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 		a6xx_bus_clear_pending_transactions(adreno_gpu);
 
 		/* tell the GMU we want to slumber */
-		a6xx_gmu_notify_slumber(gmu);
+		ret = a6xx_gmu_notify_slumber(gmu);
+		if (ret) {
+			a6xx_gmu_force_off(gmu);
+			return;
+		}
 
 		ret = gmu_poll_timeout(gmu,
 			REG_A6XX_GPU_GMU_AO_GPU_CX_BUSY_STATUS, val,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 1/8] drm/msm: Remove unnecessary pm_runtime_get/put
  2022-07-30  9:40   ` Akhil P Oommen
  (?)
@ 2022-07-31 15:55   ` Rob Clark
  2022-08-01 14:32       ` Akhil P Oommen
  -1 siblings, 1 reply; 35+ messages in thread
From: Rob Clark @ 2022-07-31 15:55 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Sean Paul, linux-kernel

On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
> pm_runtime_get/put from msm_job_run().
>
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
>
> (no changes since v1)
>
>  drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
>  1 file changed, 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
> index 56eecb4..cad4c35 100644
> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
> @@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>                 msm_gem_unlock(obj);
>         }
>
> -       pm_runtime_get_sync(&gpu->pdev->dev);
> -

This is removing a _get_sync() and simply relying on a _get() (async)
in msm_gpu_submit().. that seems pretty likely to go badly?  I think
it should probably replace the _get() in msm_gpu_submit() with
_get_sync() (but also since this is changing position of
resume/suspend vs active_lock, please make sure you test with lockdep
enabled)

BR,
-R

>         /* TODO move submit path over to using a per-ring lock.. */
>         mutex_lock(&gpu->lock);
>
> @@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>
>         mutex_unlock(&gpu->lock);
>
> -       pm_runtime_put(&gpu->pdev->dev);
> -
>         return dma_fence_get(submit->hw_fence);
>  }
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
  2022-07-30  9:40   ` Akhil P Oommen
  (?)
@ 2022-07-31 15:56   ` Rob Clark
  2022-07-31 16:32     ` Akhil P Oommen
  -1 siblings, 1 reply; 35+ messages in thread
From: Rob Clark @ 2022-07-31 15:56 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Sean Paul, linux-kernel

On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> Instead of separate refcount for each submit, take single rpm refcount
> on behalf of all the submits. This makes it easier to drop the rpm
> refcount during recovery in an upcoming patch.
>
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
>
> (no changes since v1)

I see no earlier version of this patch?

>
>  drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index c8cd9bf..e1dd3cc 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
>         mutex_lock(&gpu->active_lock);
>         gpu->active_submits--;
>         WARN_ON(gpu->active_submits < 0);
> -       if (!gpu->active_submits)
> +       if (!gpu->active_submits) {
>                 msm_devfreq_idle(gpu);
> -       mutex_unlock(&gpu->active_lock);
> +               pm_runtime_put_autosuspend(&gpu->pdev->dev);
> +       }
>
> -       pm_runtime_put_autosuspend(&gpu->pdev->dev);
> +       mutex_unlock(&gpu->active_lock);
>
>         msm_gem_submit_put(submit);
>  }
> @@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
>
>         /* Update devfreq on transition from idle->active: */
>         mutex_lock(&gpu->active_lock);
> -       if (!gpu->active_submits)
> +       if (!gpu->active_submits) {
> +               pm_runtime_get(&gpu->pdev->dev);
>                 msm_devfreq_active(gpu);
> +       }
>         gpu->active_submits++;
>         mutex_unlock(&gpu->active_lock);
>
>         gpu->funcs->submit(gpu, submit);
>         gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
>
> +       pm_runtime_put(&gpu->pdev->dev);

this looks unbalanced?

BR,
-R

>         hangcheck_timer_reset(gpu);
>  }
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/8] drm/msm: Fix cx collapse issue during recovery
  2022-07-30  9:40   ` Akhil P Oommen
  (?)
@ 2022-07-31 16:22   ` Rob Clark
  2022-08-01 15:10       ` Akhil P Oommen
  -1 siblings, 1 reply; 35+ messages in thread
From: Rob Clark @ 2022-07-31 16:22 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Chia-I Wu, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, Stephen Boyd,
	linux-kernel

On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> There are some hardware logic under CX domain. For a successful
> recovery, we should ensure cx headswitch collapses to ensure all the
> stale states are cleard out. This is especially true to for a6xx family
> where we can GMU co-processor.
>
> Currently, cx doesn't collapse due to a devlink between gpu and its
> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> that the iommu driver removes its vote on cx gdsc.
>
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
>
> Changes in v3:
> - Simplied the pm refcount drop since we have just a single refcount now
> for all active submits
>
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 24 +++++++++++++++++++++---
>  drivers/gpu/drm/msm/msm_gpu.c         |  4 +---
>  2 files changed, 22 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 42ed9a3..1b049c5 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1193,7 +1193,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  {
>         struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>         struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
> -       int i;
> +       int i, active_submits;
>
>         adreno_dump_info(gpu);
>
> @@ -1210,8 +1210,26 @@ static void a6xx_recover(struct msm_gpu *gpu)
>          */
>         gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>
> -       gpu->funcs->pm_suspend(gpu);
> -       gpu->funcs->pm_resume(gpu);
> +       pm_runtime_dont_use_autosuspend(&gpu->pdev->dev);
> +
> +       /* active_submit won't change until we make a submission */
> +       mutex_lock(&gpu->active_lock);
> +       active_submits = gpu->active_submits;
> +       mutex_unlock(&gpu->active_lock);
> +
> +       /* Drop the rpm refcount from active submits */
> +       if (active_submits)
> +               pm_runtime_put(&gpu->pdev->dev);

Couldn't this race with an incoming submit triggering active_submits
to transition 0 -> 1?  Moving the mutex_unlock() would solve this.

Actually, maybe just move the mutex_unlock() to the end of the entire
sequence.  You could also clear gpu->active_submits and restore it
before unlocking, so you can drop the removal of the WARN_ON_ONCE
(patch 6/8) which should otherwise be squashed into this patch to keep
things bisectable

> +
> +       /* And the final one from recover worker */
> +       pm_runtime_put_sync(&gpu->pdev->dev);
> +
> +       pm_runtime_use_autosuspend(&gpu->pdev->dev);
> +
> +       if (active_submits)
> +               pm_runtime_get(&gpu->pdev->dev);
> +
> +       pm_runtime_get_sync(&gpu->pdev->dev);
>
>         msm_gpu_hw_init(gpu);
>  }
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 1945efb..07e55a6 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -426,9 +426,7 @@ static void recover_worker(struct kthread_work *work)
>                 /* retire completed submits, plus the one that hung: */
>                 retire_submits(gpu);
>
> -               pm_runtime_get_sync(&gpu->pdev->dev);
>                 gpu->funcs->recover(gpu);
> -               pm_runtime_put_sync(&gpu->pdev->dev);

Hmm, could this have some fallout on earlier gens?

I guess I should extend the igt msm_recovery test to run on things
prior to a6xx..

BR,
-R

>
>                 /*
>                  * Replay all remaining submits starting with highest priority
> @@ -445,7 +443,7 @@ static void recover_worker(struct kthread_work *work)
>                 }
>         }
>
> -       pm_runtime_put_sync(&gpu->pdev->dev);
> +       pm_runtime_put(&gpu->pdev->dev);
>
>         mutex_unlock(&gpu->lock);
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
  2022-07-31 15:56   ` Rob Clark
@ 2022-07-31 16:32     ` Akhil P Oommen
  2022-07-31 22:15       ` Rob Clark
  0 siblings, 1 reply; 35+ messages in thread
From: Akhil P Oommen @ 2022-07-31 16:32 UTC (permalink / raw)
  To: Rob Clark
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Sean Paul, linux-kernel

On 7/31/2022 9:26 PM, Rob Clark wrote:
> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> Instead of separate refcount for each submit, take single rpm refcount
>> on behalf of all the submits. This makes it easier to drop the rpm
>> refcount during recovery in an upcoming patch.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> (no changes since v1)
> I see no earlier version of this patch?
>
>>   drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
>>   1 file changed, 8 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>> index c8cd9bf..e1dd3cc 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
>>          mutex_lock(&gpu->active_lock);
>>          gpu->active_submits--;
>>          WARN_ON(gpu->active_submits < 0);
>> -       if (!gpu->active_submits)
>> +       if (!gpu->active_submits) {
>>                  msm_devfreq_idle(gpu);
>> -       mutex_unlock(&gpu->active_lock);
>> +               pm_runtime_put_autosuspend(&gpu->pdev->dev);
>> +       }
>>
>> -       pm_runtime_put_autosuspend(&gpu->pdev->dev);
>> +       mutex_unlock(&gpu->active_lock);
>>
>>          msm_gem_submit_put(submit);
>>   }
>> @@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
>>
>>          /* Update devfreq on transition from idle->active: */
>>          mutex_lock(&gpu->active_lock);
>> -       if (!gpu->active_submits)
>> +       if (!gpu->active_submits) {
>> +               pm_runtime_get(&gpu->pdev->dev);
>>                  msm_devfreq_active(gpu);
>> +       }
>>          gpu->active_submits++;
>>          mutex_unlock(&gpu->active_lock);
>>
>>          gpu->funcs->submit(gpu, submit);
>>          gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
>>
>> +       pm_runtime_put(&gpu->pdev->dev);
> this looks unbalanced?
There is another pm_runtime_get_sync at the top of this function. Just 
before hw_init().
https://elixir.bootlin.com/linux/v5.19-rc8/source/drivers/gpu/drm/msm/msm_gpu.c#L737

-Akhil.
>
> BR,
> -R
>
>>          hangcheck_timer_reset(gpu);
>>   }
>>
>> --
>> 2.7.4
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
  2022-07-31 16:32     ` Akhil P Oommen
@ 2022-07-31 22:15       ` Rob Clark
  2022-08-01 14:35           ` Akhil P Oommen
  0 siblings, 1 reply; 35+ messages in thread
From: Rob Clark @ 2022-07-31 22:15 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Sean Paul, linux-kernel

On Sun, Jul 31, 2022 at 9:33 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 7/31/2022 9:26 PM, Rob Clark wrote:
> > On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >> Instead of separate refcount for each submit, take single rpm refcount
> >> on behalf of all the submits. This makes it easier to drop the rpm
> >> refcount during recovery in an upcoming patch.
> >>
> >> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >> ---
> >>
> >> (no changes since v1)
> > I see no earlier version of this patch?
> >
> >>   drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
> >>   1 file changed, 8 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> >> index c8cd9bf..e1dd3cc 100644
> >> --- a/drivers/gpu/drm/msm/msm_gpu.c
> >> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> >> @@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
> >>          mutex_lock(&gpu->active_lock);
> >>          gpu->active_submits--;
> >>          WARN_ON(gpu->active_submits < 0);
> >> -       if (!gpu->active_submits)
> >> +       if (!gpu->active_submits) {
> >>                  msm_devfreq_idle(gpu);
> >> -       mutex_unlock(&gpu->active_lock);
> >> +               pm_runtime_put_autosuspend(&gpu->pdev->dev);
> >> +       }
> >>
> >> -       pm_runtime_put_autosuspend(&gpu->pdev->dev);
> >> +       mutex_unlock(&gpu->active_lock);
> >>
> >>          msm_gem_submit_put(submit);
> >>   }
> >> @@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
> >>
> >>          /* Update devfreq on transition from idle->active: */
> >>          mutex_lock(&gpu->active_lock);
> >> -       if (!gpu->active_submits)
> >> +       if (!gpu->active_submits) {
> >> +               pm_runtime_get(&gpu->pdev->dev);
> >>                  msm_devfreq_active(gpu);
> >> +       }
> >>          gpu->active_submits++;
> >>          mutex_unlock(&gpu->active_lock);
> >>
> >>          gpu->funcs->submit(gpu, submit);
> >>          gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
> >>
> >> +       pm_runtime_put(&gpu->pdev->dev);
> > this looks unbalanced?
> There is another pm_runtime_get_sync at the top of this function. Just
> before hw_init().
> https://elixir.bootlin.com/linux/v5.19-rc8/source/drivers/gpu/drm/msm/msm_gpu.c#L737

oh, right.. sorry, I was looking at my local stack of WIP patches
which went the opposite direction and moved the runpm into just
msm_job_run().. I'll drop that one

BR,
-R

>
> -Akhil.
> >
> > BR,
> > -R
> >
> >>          hangcheck_timer_reset(gpu);
> >>   }
> >>
> >> --
> >> 2.7.4
> >>
>

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 1/8] drm/msm: Remove unnecessary pm_runtime_get/put
  2022-07-31 15:55   ` Rob Clark
@ 2022-08-01 14:32       ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-01 14:32 UTC (permalink / raw)
  To: Rob Clark
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Sean Paul, linux-kernel

On 7/31/2022 9:25 PM, Rob Clark wrote:
> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
>> pm_runtime_get/put from msm_job_run().
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> (no changes since v1)
>>
>>   drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> index 56eecb4..cad4c35 100644
>> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
>> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> @@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>>                  msm_gem_unlock(obj);
>>          }
>>
>> -       pm_runtime_get_sync(&gpu->pdev->dev);
>> -
> This is removing a _get_sync() and simply relying on a _get() (async)
> in msm_gpu_submit().. that seems pretty likely to go badly?  I think
> it should probably replace the _get() in msm_gpu_submit() with
> _get_sync() (but also since this is changing position of
> resume/suspend vs active_lock, please make sure you test with lockdep
> enabled)
>
> BR,
> -R
As discussed in the other patch, this is correctly handled in 
msm_gpu_submit(). And from active_lock perspective, there is no change 
actually. GPU is ON by the time we touch active_lock in both cases.

-Akhil.
>>          /* TODO move submit path over to using a per-ring lock.. */
>>          mutex_lock(&gpu->lock);
>>
>> @@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>>
>>          mutex_unlock(&gpu->lock);
>>
>> -       pm_runtime_put(&gpu->pdev->dev);
>> -
>>          return dma_fence_get(submit->hw_fence);
>>   }
>>
>> --
>> 2.7.4
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 1/8] drm/msm: Remove unnecessary pm_runtime_get/put
@ 2022-08-01 14:32       ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-01 14:32 UTC (permalink / raw)
  To: Rob Clark
  Cc: Sean Paul, Jonathan Marek, David Airlie, linux-arm-msm,
	Douglas Anderson, dri-devel, Jordan Crouse, Abhinav Kumar,
	Matthias Kaehlcke, Dmitry Baryshkov, Bjorn Andersson, freedreno,
	linux-kernel

On 7/31/2022 9:25 PM, Rob Clark wrote:
> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
>> pm_runtime_get/put from msm_job_run().
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> (no changes since v1)
>>
>>   drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
>>   1 file changed, 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> index 56eecb4..cad4c35 100644
>> --- a/drivers/gpu/drm/msm/msm_ringbuffer.c
>> +++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
>> @@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>>                  msm_gem_unlock(obj);
>>          }
>>
>> -       pm_runtime_get_sync(&gpu->pdev->dev);
>> -
> This is removing a _get_sync() and simply relying on a _get() (async)
> in msm_gpu_submit().. that seems pretty likely to go badly?  I think
> it should probably replace the _get() in msm_gpu_submit() with
> _get_sync() (but also since this is changing position of
> resume/suspend vs active_lock, please make sure you test with lockdep
> enabled)
>
> BR,
> -R
As discussed in the other patch, this is correctly handled in 
msm_gpu_submit(). And from active_lock perspective, there is no change 
actually. GPU is ON by the time we touch active_lock in both cases.

-Akhil.
>>          /* TODO move submit path over to using a per-ring lock.. */
>>          mutex_lock(&gpu->lock);
>>
>> @@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
>>
>>          mutex_unlock(&gpu->lock);
>>
>> -       pm_runtime_put(&gpu->pdev->dev);
>> -
>>          return dma_fence_get(submit->hw_fence);
>>   }
>>
>> --
>> 2.7.4
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
  2022-07-31 22:15       ` Rob Clark
@ 2022-08-01 14:35           ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-01 14:35 UTC (permalink / raw)
  To: Rob Clark
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Sean Paul, linux-kernel

On 8/1/2022 3:45 AM, Rob Clark wrote:
> On Sun, Jul 31, 2022 at 9:33 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> On 7/31/2022 9:26 PM, Rob Clark wrote:
>>> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>> Instead of separate refcount for each submit, take single rpm refcount
>>>> on behalf of all the submits. This makes it easier to drop the rpm
>>>> refcount during recovery in an upcoming patch.
>>>>
>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>> ---
>>>>
>>>> (no changes since v1)
>>> I see no earlier version of this patch?
My bad, that is incorrect. This is a new patch included in the current 
series.

-Akhil.
>>>
>>>>    drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
>>>>    1 file changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>>>> index c8cd9bf..e1dd3cc 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>> @@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
>>>>           mutex_lock(&gpu->active_lock);
>>>>           gpu->active_submits--;
>>>>           WARN_ON(gpu->active_submits < 0);
>>>> -       if (!gpu->active_submits)
>>>> +       if (!gpu->active_submits) {
>>>>                   msm_devfreq_idle(gpu);
>>>> -       mutex_unlock(&gpu->active_lock);
>>>> +               pm_runtime_put_autosuspend(&gpu->pdev->dev);
>>>> +       }
>>>>
>>>> -       pm_runtime_put_autosuspend(&gpu->pdev->dev);
>>>> +       mutex_unlock(&gpu->active_lock);
>>>>
>>>>           msm_gem_submit_put(submit);
>>>>    }
>>>> @@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
>>>>
>>>>           /* Update devfreq on transition from idle->active: */
>>>>           mutex_lock(&gpu->active_lock);
>>>> -       if (!gpu->active_submits)
>>>> +       if (!gpu->active_submits) {
>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>>                   msm_devfreq_active(gpu);
>>>> +       }
>>>>           gpu->active_submits++;
>>>>           mutex_unlock(&gpu->active_lock);
>>>>
>>>>           gpu->funcs->submit(gpu, submit);
>>>>           gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
>>>>
>>>> +       pm_runtime_put(&gpu->pdev->dev);
>>> this looks unbalanced?
>> There is another pm_runtime_get_sync at the top of this function. Just
>> before hw_init().
>> https://elixir.bootlin.com/linux/v5.19-rc8/source/drivers/gpu/drm/msm/msm_gpu.c#L737
> oh, right.. sorry, I was looking at my local stack of WIP patches
> which went the opposite direction and moved the runpm into just
> msm_job_run().. I'll drop that one
>
> BR,
> -R
>
>> -Akhil.
>>> BR,
>>> -R
>>>
>>>>           hangcheck_timer_reset(gpu);
>>>>    }
>>>>
>>>> --
>>>> 2.7.4
>>>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits
@ 2022-08-01 14:35           ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-01 14:35 UTC (permalink / raw)
  To: Rob Clark
  Cc: Sean Paul, Jonathan Marek, David Airlie, linux-arm-msm,
	Douglas Anderson, dri-devel, Jordan Crouse, Abhinav Kumar,
	Matthias Kaehlcke, Dmitry Baryshkov, Bjorn Andersson, freedreno,
	linux-kernel

On 8/1/2022 3:45 AM, Rob Clark wrote:
> On Sun, Jul 31, 2022 at 9:33 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> On 7/31/2022 9:26 PM, Rob Clark wrote:
>>> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>> Instead of separate refcount for each submit, take single rpm refcount
>>>> on behalf of all the submits. This makes it easier to drop the rpm
>>>> refcount during recovery in an upcoming patch.
>>>>
>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>> ---
>>>>
>>>> (no changes since v1)
>>> I see no earlier version of this patch?
My bad, that is incorrect. This is a new patch included in the current 
series.

-Akhil.
>>>
>>>>    drivers/gpu/drm/msm/msm_gpu.c | 12 ++++++++----
>>>>    1 file changed, 8 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>>>> index c8cd9bf..e1dd3cc 100644
>>>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>>>> @@ -663,11 +663,12 @@ static void retire_submit(struct msm_gpu *gpu, struct msm_ringbuffer *ring,
>>>>           mutex_lock(&gpu->active_lock);
>>>>           gpu->active_submits--;
>>>>           WARN_ON(gpu->active_submits < 0);
>>>> -       if (!gpu->active_submits)
>>>> +       if (!gpu->active_submits) {
>>>>                   msm_devfreq_idle(gpu);
>>>> -       mutex_unlock(&gpu->active_lock);
>>>> +               pm_runtime_put_autosuspend(&gpu->pdev->dev);
>>>> +       }
>>>>
>>>> -       pm_runtime_put_autosuspend(&gpu->pdev->dev);
>>>> +       mutex_unlock(&gpu->active_lock);
>>>>
>>>>           msm_gem_submit_put(submit);
>>>>    }
>>>> @@ -756,14 +757,17 @@ void msm_gpu_submit(struct msm_gpu *gpu, struct msm_gem_submit *submit)
>>>>
>>>>           /* Update devfreq on transition from idle->active: */
>>>>           mutex_lock(&gpu->active_lock);
>>>> -       if (!gpu->active_submits)
>>>> +       if (!gpu->active_submits) {
>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>>                   msm_devfreq_active(gpu);
>>>> +       }
>>>>           gpu->active_submits++;
>>>>           mutex_unlock(&gpu->active_lock);
>>>>
>>>>           gpu->funcs->submit(gpu, submit);
>>>>           gpu->cur_ctx_seqno = submit->queue->ctx->seqno;
>>>>
>>>> +       pm_runtime_put(&gpu->pdev->dev);
>>> this looks unbalanced?
>> There is another pm_runtime_get_sync at the top of this function. Just
>> before hw_init().
>> https://elixir.bootlin.com/linux/v5.19-rc8/source/drivers/gpu/drm/msm/msm_gpu.c#L737
> oh, right.. sorry, I was looking at my local stack of WIP patches
> which went the opposite direction and moved the runpm into just
> msm_job_run().. I'll drop that one
>
> BR,
> -R
>
>> -Akhil.
>>> BR,
>>> -R
>>>
>>>>           hangcheck_timer_reset(gpu);
>>>>    }
>>>>
>>>> --
>>>> 2.7.4
>>>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/8] drm/msm: Fix cx collapse issue during recovery
  2022-07-31 16:22   ` Rob Clark
@ 2022-08-01 15:10       ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-01 15:10 UTC (permalink / raw)
  To: Rob Clark
  Cc: freedreno, dri-devel, linux-arm-msm, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Chia-I Wu, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, Stephen Boyd,
	linux-kernel

On 7/31/2022 9:52 PM, Rob Clark wrote:
> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> There are some hardware logic under CX domain. For a successful
>> recovery, we should ensure cx headswitch collapses to ensure all the
>> stale states are cleard out. This is especially true to for a6xx family
>> where we can GMU co-processor.
>>
>> Currently, cx doesn't collapse due to a devlink between gpu and its
>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>> that the iommu driver removes its vote on cx gdsc.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> Changes in v3:
>> - Simplied the pm refcount drop since we have just a single refcount now
>> for all active submits
>>
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 24 +++++++++++++++++++++---
>>   drivers/gpu/drm/msm/msm_gpu.c         |  4 +---
>>   2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 42ed9a3..1b049c5 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1193,7 +1193,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>   {
>>          struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>>          struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
>> -       int i;
>> +       int i, active_submits;
>>
>>          adreno_dump_info(gpu);
>>
>> @@ -1210,8 +1210,26 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>           */
>>          gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>
>> -       gpu->funcs->pm_suspend(gpu);
>> -       gpu->funcs->pm_resume(gpu);
>> +       pm_runtime_dont_use_autosuspend(&gpu->pdev->dev);
>> +
>> +       /* active_submit won't change until we make a submission */
>> +       mutex_lock(&gpu->active_lock);
>> +       active_submits = gpu->active_submits;
>> +       mutex_unlock(&gpu->active_lock);
>> +
>> +       /* Drop the rpm refcount from active submits */
>> +       if (active_submits)
>> +               pm_runtime_put(&gpu->pdev->dev);
> Couldn't this race with an incoming submit triggering active_submits
> to transition 0 -> 1?  Moving the mutex_unlock() would solve this.
>
> Actually, maybe just move the mutex_unlock() to the end of the entire
> sequence.  You could also clear gpu->active_submits and restore it
> before unlocking, so you can drop the removal of the WARN_ON_ONCE
> (patch 6/8) which should otherwise be squashed into this patch to keep
> things bisectable
Because we are holding gpu->lock, there won't be any new submissions to 
gpu. But I agree with your both suggestions.

-Akhil.
>
>> +
>> +       /* And the final one from recover worker */
>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>> +
>> +       pm_runtime_use_autosuspend(&gpu->pdev->dev);
>> +
>> +       if (active_submits)
>> +               pm_runtime_get(&gpu->pdev->dev);
>> +
>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>
>>          msm_gpu_hw_init(gpu);
>>   }
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>> index 1945efb..07e55a6 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -426,9 +426,7 @@ static void recover_worker(struct kthread_work *work)
>>                  /* retire completed submits, plus the one that hung: */
>>                  retire_submits(gpu);
>>
>> -               pm_runtime_get_sync(&gpu->pdev->dev);
>>                  gpu->funcs->recover(gpu);
>> -               pm_runtime_put_sync(&gpu->pdev->dev);
> Hmm, could this have some fallout on earlier gens?
>
> I guess I should extend the igt msm_recovery test to run on things
> prior to a6xx..
>
> BR,
> -R
No, because of patch 3/8 in this series.

-Akhil.
>
>>                  /*
>>                   * Replay all remaining submits starting with highest priority
>> @@ -445,7 +443,7 @@ static void recover_worker(struct kthread_work *work)
>>                  }
>>          }
>>
>> -       pm_runtime_put_sync(&gpu->pdev->dev);
>> +       pm_runtime_put(&gpu->pdev->dev);
>>
>>          mutex_unlock(&gpu->lock);
>>
>> --
>> 2.7.4
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 4/8] drm/msm: Fix cx collapse issue during recovery
@ 2022-08-01 15:10       ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-01 15:10 UTC (permalink / raw)
  To: Rob Clark
  Cc: Sean Paul, Jonathan Marek, David Airlie, linux-arm-msm,
	Stephen Boyd, Douglas Anderson, dri-devel, Jordan Crouse,
	Abhinav Kumar, Matthias Kaehlcke, Dmitry Baryshkov,
	Bjorn Andersson, freedreno, linux-kernel

On 7/31/2022 9:52 PM, Rob Clark wrote:
> On Sat, Jul 30, 2022 at 2:41 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> There are some hardware logic under CX domain. For a successful
>> recovery, we should ensure cx headswitch collapses to ensure all the
>> stale states are cleard out. This is especially true to for a6xx family
>> where we can GMU co-processor.
>>
>> Currently, cx doesn't collapse due to a devlink between gpu and its
>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>> that the iommu driver removes its vote on cx gdsc.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> Changes in v3:
>> - Simplied the pm refcount drop since we have just a single refcount now
>> for all active submits
>>
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 24 +++++++++++++++++++++---
>>   drivers/gpu/drm/msm/msm_gpu.c         |  4 +---
>>   2 files changed, 22 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 42ed9a3..1b049c5 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1193,7 +1193,7 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>   {
>>          struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>>          struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
>> -       int i;
>> +       int i, active_submits;
>>
>>          adreno_dump_info(gpu);
>>
>> @@ -1210,8 +1210,26 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>           */
>>          gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>
>> -       gpu->funcs->pm_suspend(gpu);
>> -       gpu->funcs->pm_resume(gpu);
>> +       pm_runtime_dont_use_autosuspend(&gpu->pdev->dev);
>> +
>> +       /* active_submit won't change until we make a submission */
>> +       mutex_lock(&gpu->active_lock);
>> +       active_submits = gpu->active_submits;
>> +       mutex_unlock(&gpu->active_lock);
>> +
>> +       /* Drop the rpm refcount from active submits */
>> +       if (active_submits)
>> +               pm_runtime_put(&gpu->pdev->dev);
> Couldn't this race with an incoming submit triggering active_submits
> to transition 0 -> 1?  Moving the mutex_unlock() would solve this.
>
> Actually, maybe just move the mutex_unlock() to the end of the entire
> sequence.  You could also clear gpu->active_submits and restore it
> before unlocking, so you can drop the removal of the WARN_ON_ONCE
> (patch 6/8) which should otherwise be squashed into this patch to keep
> things bisectable
Because we are holding gpu->lock, there won't be any new submissions to 
gpu. But I agree with your both suggestions.

-Akhil.
>
>> +
>> +       /* And the final one from recover worker */
>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>> +
>> +       pm_runtime_use_autosuspend(&gpu->pdev->dev);
>> +
>> +       if (active_submits)
>> +               pm_runtime_get(&gpu->pdev->dev);
>> +
>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>
>>          msm_gpu_hw_init(gpu);
>>   }
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
>> index 1945efb..07e55a6 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -426,9 +426,7 @@ static void recover_worker(struct kthread_work *work)
>>                  /* retire completed submits, plus the one that hung: */
>>                  retire_submits(gpu);
>>
>> -               pm_runtime_get_sync(&gpu->pdev->dev);
>>                  gpu->funcs->recover(gpu);
>> -               pm_runtime_put_sync(&gpu->pdev->dev);
> Hmm, could this have some fallout on earlier gens?
>
> I guess I should extend the igt msm_recovery test to run on things
> prior to a6xx..
>
> BR,
> -R
No, because of patch 3/8 in this series.

-Akhil.
>
>>                  /*
>>                   * Replay all remaining submits starting with highest priority
>> @@ -445,7 +443,7 @@ static void recover_worker(struct kthread_work *work)
>>                  }
>>          }
>>
>> -       pm_runtime_put_sync(&gpu->pdev->dev);
>> +       pm_runtime_put(&gpu->pdev->dev);
>>
>>          mutex_unlock(&gpu->lock);
>>
>> --
>> 2.7.4
>>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
  2022-07-30  9:40   ` Akhil P Oommen
@ 2022-08-02  7:14     ` Dmitry Baryshkov
  -1 siblings, 0 replies; 35+ messages in thread
From: Dmitry Baryshkov @ 2022-08-02  7:14 UTC (permalink / raw)
  To: Akhil P Oommen, freedreno, dri-devel, linux-arm-msm, Rob Clark,
	Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Chia-I Wu, Daniel Vetter,
	David Airlie, Philipp Zabel, Sean Paul, Stephen Boyd,
	linux-kernel

On 30/07/2022 12:40, Akhil P Oommen wrote:
> Because there could be transient votes from other drivers/tz/hyp which
> may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
> We can use the reset framework to poll for cx gdsc collapse from gpucc
> clk driver.
> 
> This feature requires support from the platform's gpucc driver.
> 
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
> 
> Changes in v3:
> - Use reset interface from gpucc driver to poll for cx gdsc collapse
>    https://patchwork.freedesktop.org/series/106860/
> 
>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
>   drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
>   drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
>   3 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 1b049c5..721d5e6 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -10,6 +10,7 @@
>   
>   #include <linux/bitfield.h>
>   #include <linux/devfreq.h>
> +#include <linux/reset.h>
>   #include <linux/soc/qcom/llcc-qcom.h>
>   
>   #define GPU_PAS_ID 13
> @@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
>   	/* And the final one from recover worker */
>   	pm_runtime_put_sync(&gpu->pdev->dev);
>   
> +	/* Call into gpucc driver to poll for cx gdsc collapse */
> +	reset_control_reset(gpu->cx_collapse);

Do we have a race between the last pm_runtime_put_sync(), this polling 
and other voters removing their votes beforehand?

> +
>   	pm_runtime_use_autosuspend(&gpu->pdev->dev);
>   
>   	if (active_submits)
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 07e55a6..4a57627 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -14,6 +14,7 @@
>   #include <generated/utsrelease.h>
>   #include <linux/string_helpers.h>
>   #include <linux/devcoredump.h>
> +#include <linux/reset.h>
>   #include <linux/sched/task.h>
>   
>   /*
> @@ -903,6 +904,9 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>   	if (IS_ERR(gpu->gpu_cx))
>   		gpu->gpu_cx = NULL;
>   
> +	gpu->cx_collapse = devm_reset_control_get_optional(&pdev->dev,
> +			"cx_collapse");
> +
>   	gpu->pdev = pdev;
>   	platform_set_drvdata(pdev, &gpu->adreno_smmu);
>   
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index 6def008..ab59fd2 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -13,6 +13,7 @@
>   #include <linux/interconnect.h>
>   #include <linux/pm_opp.h>
>   #include <linux/regulator/consumer.h>
> +#include <linux/reset.h>
>   
>   #include "msm_drv.h"
>   #include "msm_fence.h"
> @@ -268,6 +269,9 @@ struct msm_gpu {
>   	bool hw_apriv;
>   
>   	struct thermal_cooling_device *cooling;
> +
> +	/* To poll for cx gdsc collapse during gpu recovery */
> +	struct reset_control *cx_collapse;
>   };
>   
>   static inline struct msm_gpu *dev_to_gpu(struct device *dev)


-- 
With best wishes
Dmitry

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
@ 2022-08-02  7:14     ` Dmitry Baryshkov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Baryshkov @ 2022-08-02  7:14 UTC (permalink / raw)
  To: Akhil P Oommen, freedreno, dri-devel, linux-arm-msm, Rob Clark,
	Bjorn Andersson
  Cc: Jonathan Marek, David Airlie, linux-kernel, Stephen Boyd,
	Abhinav Kumar, Douglas Anderson, Matthias Kaehlcke,
	Jordan Crouse, Sean Paul

On 30/07/2022 12:40, Akhil P Oommen wrote:
> Because there could be transient votes from other drivers/tz/hyp which
> may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
> We can use the reset framework to poll for cx gdsc collapse from gpucc
> clk driver.
> 
> This feature requires support from the platform's gpucc driver.
> 
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
> 
> Changes in v3:
> - Use reset interface from gpucc driver to poll for cx gdsc collapse
>    https://patchwork.freedesktop.org/series/106860/
> 
>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
>   drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
>   drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
>   3 files changed, 12 insertions(+)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 1b049c5..721d5e6 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -10,6 +10,7 @@
>   
>   #include <linux/bitfield.h>
>   #include <linux/devfreq.h>
> +#include <linux/reset.h>
>   #include <linux/soc/qcom/llcc-qcom.h>
>   
>   #define GPU_PAS_ID 13
> @@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
>   	/* And the final one from recover worker */
>   	pm_runtime_put_sync(&gpu->pdev->dev);
>   
> +	/* Call into gpucc driver to poll for cx gdsc collapse */
> +	reset_control_reset(gpu->cx_collapse);

Do we have a race between the last pm_runtime_put_sync(), this polling 
and other voters removing their votes beforehand?

> +
>   	pm_runtime_use_autosuspend(&gpu->pdev->dev);
>   
>   	if (active_submits)
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 07e55a6..4a57627 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -14,6 +14,7 @@
>   #include <generated/utsrelease.h>
>   #include <linux/string_helpers.h>
>   #include <linux/devcoredump.h>
> +#include <linux/reset.h>
>   #include <linux/sched/task.h>
>   
>   /*
> @@ -903,6 +904,9 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>   	if (IS_ERR(gpu->gpu_cx))
>   		gpu->gpu_cx = NULL;
>   
> +	gpu->cx_collapse = devm_reset_control_get_optional(&pdev->dev,
> +			"cx_collapse");
> +
>   	gpu->pdev = pdev;
>   	platform_set_drvdata(pdev, &gpu->adreno_smmu);
>   
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index 6def008..ab59fd2 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -13,6 +13,7 @@
>   #include <linux/interconnect.h>
>   #include <linux/pm_opp.h>
>   #include <linux/regulator/consumer.h>
> +#include <linux/reset.h>
>   
>   #include "msm_drv.h"
>   #include "msm_fence.h"
> @@ -268,6 +269,9 @@ struct msm_gpu {
>   	bool hw_apriv;
>   
>   	struct thermal_cooling_device *cooling;
> +
> +	/* To poll for cx gdsc collapse during gpu recovery */
> +	struct reset_control *cx_collapse;
>   };
>   
>   static inline struct msm_gpu *dev_to_gpu(struct device *dev)


-- 
With best wishes
Dmitry

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
  2022-08-02  7:14     ` Dmitry Baryshkov
@ 2022-08-03 10:15       ` Akhil P Oommen
  -1 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-03 10:15 UTC (permalink / raw)
  To: Dmitry Baryshkov, freedreno, dri-devel, linux-arm-msm, Rob Clark,
	Bjorn Andersson
  Cc: Jonathan Marek, David Airlie, linux-kernel, Stephen Boyd,
	Abhinav Kumar, Douglas Anderson, Matthias Kaehlcke,
	Jordan Crouse, Sean Paul

On 8/2/2022 12:44 PM, Dmitry Baryshkov wrote:
> On 30/07/2022 12:40, Akhil P Oommen wrote:
>> Because there could be transient votes from other drivers/tz/hyp which
>> may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
>> We can use the reset framework to poll for cx gdsc collapse from gpucc
>> clk driver.
>>
>> This feature requires support from the platform's gpucc driver.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> Changes in v3:
>> - Use reset interface from gpucc driver to poll for cx gdsc collapse
>>    https://patchwork.freedesktop.org/series/106860/
>>
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
>>   drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
>>   drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
>>   3 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 1b049c5..721d5e6 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -10,6 +10,7 @@
>>     #include <linux/bitfield.h>
>>   #include <linux/devfreq.h>
>> +#include <linux/reset.h>
>>   #include <linux/soc/qcom/llcc-qcom.h>
>>     #define GPU_PAS_ID 13
>> @@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>       /* And the final one from recover worker */
>>       pm_runtime_put_sync(&gpu->pdev->dev);
>>   +    /* Call into gpucc driver to poll for cx gdsc collapse */
>> +    reset_control_reset(gpu->cx_collapse);
>
> Do we have a race between the last pm_runtime_put_sync(), this polling 
> and other voters removing their votes beforehand?
I can't see any issue with a race here. reset_control_reset() will 
return immediately in that case.

-Akhil.
>
>> +
>>       pm_runtime_use_autosuspend(&gpu->pdev->dev);
>>         if (active_submits)
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c 
>> b/drivers/gpu/drm/msm/msm_gpu.c
>> index 07e55a6..4a57627 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -14,6 +14,7 @@
>>   #include <generated/utsrelease.h>
>>   #include <linux/string_helpers.h>
>>   #include <linux/devcoredump.h>
>> +#include <linux/reset.h>
>>   #include <linux/sched/task.h>
>>     /*
>> @@ -903,6 +904,9 @@ int msm_gpu_init(struct drm_device *drm, struct 
>> platform_device *pdev,
>>       if (IS_ERR(gpu->gpu_cx))
>>           gpu->gpu_cx = NULL;
>>   +    gpu->cx_collapse = devm_reset_control_get_optional(&pdev->dev,
>> +            "cx_collapse");
>> +
>>       gpu->pdev = pdev;
>>       platform_set_drvdata(pdev, &gpu->adreno_smmu);
>>   diff --git a/drivers/gpu/drm/msm/msm_gpu.h 
>> b/drivers/gpu/drm/msm/msm_gpu.h
>> index 6def008..ab59fd2 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>> @@ -13,6 +13,7 @@
>>   #include <linux/interconnect.h>
>>   #include <linux/pm_opp.h>
>>   #include <linux/regulator/consumer.h>
>> +#include <linux/reset.h>
>>     #include "msm_drv.h"
>>   #include "msm_fence.h"
>> @@ -268,6 +269,9 @@ struct msm_gpu {
>>       bool hw_apriv;
>>         struct thermal_cooling_device *cooling;
>> +
>> +    /* To poll for cx gdsc collapse during gpu recovery */
>> +    struct reset_control *cx_collapse;
>>   };
>>     static inline struct msm_gpu *dev_to_gpu(struct device *dev)
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
@ 2022-08-03 10:15       ` Akhil P Oommen
  0 siblings, 0 replies; 35+ messages in thread
From: Akhil P Oommen @ 2022-08-03 10:15 UTC (permalink / raw)
  To: Dmitry Baryshkov, freedreno, dri-devel, linux-arm-msm, Rob Clark,
	Bjorn Andersson
  Cc: Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Chia-I Wu, Daniel Vetter,
	David Airlie, Philipp Zabel, Sean Paul, Stephen Boyd,
	linux-kernel

On 8/2/2022 12:44 PM, Dmitry Baryshkov wrote:
> On 30/07/2022 12:40, Akhil P Oommen wrote:
>> Because there could be transient votes from other drivers/tz/hyp which
>> may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
>> We can use the reset framework to poll for cx gdsc collapse from gpucc
>> clk driver.
>>
>> This feature requires support from the platform's gpucc driver.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> Changes in v3:
>> - Use reset interface from gpucc driver to poll for cx gdsc collapse
>>    https://patchwork.freedesktop.org/series/106860/
>>
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
>>   drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
>>   drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
>>   3 files changed, 12 insertions(+)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 1b049c5..721d5e6 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -10,6 +10,7 @@
>>     #include <linux/bitfield.h>
>>   #include <linux/devfreq.h>
>> +#include <linux/reset.h>
>>   #include <linux/soc/qcom/llcc-qcom.h>
>>     #define GPU_PAS_ID 13
>> @@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>       /* And the final one from recover worker */
>>       pm_runtime_put_sync(&gpu->pdev->dev);
>>   +    /* Call into gpucc driver to poll for cx gdsc collapse */
>> +    reset_control_reset(gpu->cx_collapse);
>
> Do we have a race between the last pm_runtime_put_sync(), this polling 
> and other voters removing their votes beforehand?
I can't see any issue with a race here. reset_control_reset() will 
return immediately in that case.

-Akhil.
>
>> +
>>       pm_runtime_use_autosuspend(&gpu->pdev->dev);
>>         if (active_submits)
>> diff --git a/drivers/gpu/drm/msm/msm_gpu.c 
>> b/drivers/gpu/drm/msm/msm_gpu.c
>> index 07e55a6..4a57627 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.c
>> +++ b/drivers/gpu/drm/msm/msm_gpu.c
>> @@ -14,6 +14,7 @@
>>   #include <generated/utsrelease.h>
>>   #include <linux/string_helpers.h>
>>   #include <linux/devcoredump.h>
>> +#include <linux/reset.h>
>>   #include <linux/sched/task.h>
>>     /*
>> @@ -903,6 +904,9 @@ int msm_gpu_init(struct drm_device *drm, struct 
>> platform_device *pdev,
>>       if (IS_ERR(gpu->gpu_cx))
>>           gpu->gpu_cx = NULL;
>>   +    gpu->cx_collapse = devm_reset_control_get_optional(&pdev->dev,
>> +            "cx_collapse");
>> +
>>       gpu->pdev = pdev;
>>       platform_set_drvdata(pdev, &gpu->adreno_smmu);
>>   diff --git a/drivers/gpu/drm/msm/msm_gpu.h 
>> b/drivers/gpu/drm/msm/msm_gpu.h
>> index 6def008..ab59fd2 100644
>> --- a/drivers/gpu/drm/msm/msm_gpu.h
>> +++ b/drivers/gpu/drm/msm/msm_gpu.h
>> @@ -13,6 +13,7 @@
>>   #include <linux/interconnect.h>
>>   #include <linux/pm_opp.h>
>>   #include <linux/regulator/consumer.h>
>> +#include <linux/reset.h>
>>     #include "msm_drv.h"
>>   #include "msm_fence.h"
>> @@ -268,6 +269,9 @@ struct msm_gpu {
>>       bool hw_apriv;
>>         struct thermal_cooling_device *cooling;
>> +
>> +    /* To poll for cx gdsc collapse during gpu recovery */
>> +    struct reset_control *cx_collapse;
>>   };
>>     static inline struct msm_gpu *dev_to_gpu(struct device *dev)
>
>


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
  2022-08-03 10:15       ` Akhil P Oommen
@ 2022-08-03 11:09         ` Dmitry Baryshkov
  -1 siblings, 0 replies; 35+ messages in thread
From: Dmitry Baryshkov @ 2022-08-03 11:09 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson,
	Jordan Crouse, Jonathan Marek, Douglas Anderson,
	Matthias Kaehlcke, Abhinav Kumar, Chia-I Wu, Daniel Vetter,
	David Airlie, Philipp Zabel, Sean Paul, Stephen Boyd,
	linux-kernel

On Wed, 3 Aug 2022 at 13:15, Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 8/2/2022 12:44 PM, Dmitry Baryshkov wrote:
> > On 30/07/2022 12:40, Akhil P Oommen wrote:
> >> Because there could be transient votes from other drivers/tz/hyp which
> >> may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
> >> We can use the reset framework to poll for cx gdsc collapse from gpucc
> >> clk driver.
> >>
> >> This feature requires support from the platform's gpucc driver.
> >>
> >> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >> ---
> >>
> >> Changes in v3:
> >> - Use reset interface from gpucc driver to poll for cx gdsc collapse
> >>    https://patchwork.freedesktop.org/series/106860/
> >>
> >>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
> >>   drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
> >>   drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
> >>   3 files changed, 12 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> index 1b049c5..721d5e6 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> @@ -10,6 +10,7 @@
> >>     #include <linux/bitfield.h>
> >>   #include <linux/devfreq.h>
> >> +#include <linux/reset.h>
> >>   #include <linux/soc/qcom/llcc-qcom.h>
> >>     #define GPU_PAS_ID 13
> >> @@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>       /* And the final one from recover worker */
> >>       pm_runtime_put_sync(&gpu->pdev->dev);
> >>   +    /* Call into gpucc driver to poll for cx gdsc collapse */
> >> +    reset_control_reset(gpu->cx_collapse);
> >
> > Do we have a race between the last pm_runtime_put_sync(), this polling
> > and other voters removing their votes beforehand?
> I can't see any issue with a race here. reset_control_reset() will
> return immediately in that case.

Ack, ok then.

Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>


-- 
With best wishes
Dmitry

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery
@ 2022-08-03 11:09         ` Dmitry Baryshkov
  0 siblings, 0 replies; 35+ messages in thread
From: Dmitry Baryshkov @ 2022-08-03 11:09 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Sean Paul, Jonathan Marek, David Airlie, linux-arm-msm,
	Stephen Boyd, Douglas Anderson, dri-devel, Jordan Crouse,
	Abhinav Kumar, Matthias Kaehlcke, Bjorn Andersson, freedreno,
	linux-kernel

On Wed, 3 Aug 2022 at 13:15, Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 8/2/2022 12:44 PM, Dmitry Baryshkov wrote:
> > On 30/07/2022 12:40, Akhil P Oommen wrote:
> >> Because there could be transient votes from other drivers/tz/hyp which
> >> may keep the cx gdsc enabled, we should poll until cx gdsc collapses.
> >> We can use the reset framework to poll for cx gdsc collapse from gpucc
> >> clk driver.
> >>
> >> This feature requires support from the platform's gpucc driver.
> >>
> >> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >> ---
> >>
> >> Changes in v3:
> >> - Use reset interface from gpucc driver to poll for cx gdsc collapse
> >>    https://patchwork.freedesktop.org/series/106860/
> >>
> >>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 4 ++++
> >>   drivers/gpu/drm/msm/msm_gpu.c         | 4 ++++
> >>   drivers/gpu/drm/msm/msm_gpu.h         | 4 ++++
> >>   3 files changed, 12 insertions(+)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> index 1b049c5..721d5e6 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> @@ -10,6 +10,7 @@
> >>     #include <linux/bitfield.h>
> >>   #include <linux/devfreq.h>
> >> +#include <linux/reset.h>
> >>   #include <linux/soc/qcom/llcc-qcom.h>
> >>     #define GPU_PAS_ID 13
> >> @@ -1224,6 +1225,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>       /* And the final one from recover worker */
> >>       pm_runtime_put_sync(&gpu->pdev->dev);
> >>   +    /* Call into gpucc driver to poll for cx gdsc collapse */
> >> +    reset_control_reset(gpu->cx_collapse);
> >
> > Do we have a race between the last pm_runtime_put_sync(), this polling
> > and other voters removing their votes beforehand?
> I can't see any issue with a race here. reset_control_reset() will
> return immediately in that case.

Ack, ok then.

Reviewed-by: Dmitry Baryshkov <dmitry.baryshkov@linaro.org>


-- 
With best wishes
Dmitry

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread, other threads:[~2022-08-03 11:09 UTC | newest]

Thread overview: 35+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-30  9:40 [PATCH v3 0/8] Improve GPU Recovery Akhil P Oommen
2022-07-30  9:40 ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 1/8] drm/msm: Remove unnecessary pm_runtime_get/put Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-07-31 15:55   ` Rob Clark
2022-08-01 14:32     ` Akhil P Oommen
2022-08-01 14:32       ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 2/8] drm/msm: Take single rpm refcount on behalf of all submits Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-07-31 15:56   ` Rob Clark
2022-07-31 16:32     ` Akhil P Oommen
2022-07-31 22:15       ` Rob Clark
2022-08-01 14:35         ` Akhil P Oommen
2022-08-01 14:35           ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 3/8] drm/msm: Correct pm_runtime votes in recover worker Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 4/8] drm/msm: Fix cx collapse issue during recovery Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-07-31 16:22   ` Rob Clark
2022-08-01 15:10     ` Akhil P Oommen
2022-08-01 15:10       ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 5/8] drm/msm/a6xx: Ensure CX collapse during gpu recovery Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-08-02  7:14   ` Dmitry Baryshkov
2022-08-02  7:14     ` Dmitry Baryshkov
2022-08-03 10:15     ` Akhil P Oommen
2022-08-03 10:15       ` Akhil P Oommen
2022-08-03 11:09       ` Dmitry Baryshkov
2022-08-03 11:09         ` Dmitry Baryshkov
2022-07-30  9:40 ` [PATCH v3 6/8] drm/msm/adreno: Remove a WARN() during runtime_suspend Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 7/8] drm/msm/a6xx: Improve gpu recovery sequence Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen
2022-07-30  9:40 ` [PATCH v3 8/8] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure Akhil P Oommen
2022-07-30  9:40   ` Akhil P Oommen

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.