linux-kernel.vger.kernel.org archive mirror
* [PATCH v2 0/7] Improve GPU Recovery
@ 2022-07-09  5:59 Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 1/7] drm/msm: Remove unnecessary pm_runtime_get/put Akhil P Oommen
                   ` (6 more replies)
  0 siblings, 7 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Andy Gross,
	Chia-I Wu, Christian König, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Geert Uytterhoeven, Konrad Dybcio,
	Krzysztof Kozlowski, Rob Herring, Sean Paul, Stephen Boyd,
	Wang Qing, devicetree, linux-kernel


Recently, I debugged a few device crashes which occurred during recovery
after a hangcheck timeout. It looks like there are a few things we can
do to improve our chance at a successful gpu recovery.

The first one is to ensure that the CX GDSC collapses, which clears the
internal states in the gpu's CX domain. The first 5 patches handle this.

The rest of the patches ensure that a few internal blocks like CP, GMU
and GBIF are halted properly before proceeding with a snapshot followed
by recovery, and that a 'prepare slumber' hfi failure is handled
correctly. These are A6xx-specific improvements.

Changes in v2:
- Rebased on msm-next tip

Akhil P Oommen (7):
  drm/msm: Remove unnecessary pm_runtime_get/put
  drm/msm: Correct pm_runtime votes in recover worker
  drm/msm: Fix cx collapse issue during recovery
  drm/msm: Ensure cx gdsc collapse during recovery
  arm64: dts: qcom: sc7280: Update gpu register list
  drm/msm/a6xx: Improve gpu recovery sequence
  drm/msm/a6xx: Handle GMU prepare-slumber hfi failure

 arch/arm64/boot/dts/qcom/sc7280.dtsi  |  6 ++-
 drivers/gpu/drm/msm/adreno/a6xx.xml.h |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 83 ++++++++++++++++++++++-------------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 36 +++++++++++++--
 drivers/gpu/drm/msm/msm_gpu.c         |  9 ++--
 drivers/gpu/drm/msm/msm_gpu.h         |  1 +
 drivers/gpu/drm/msm/msm_ringbuffer.c  |  4 --
 7 files changed, 100 insertions(+), 43 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 25+ messages in thread

* [PATCH v2 1/7] drm/msm: Remove unnecessary pm_runtime_get/put
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 2/7] drm/msm: Correct pm_runtime votes in recover worker Akhil P Oommen
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, linux-kernel

We already enable gpu power from msm_gpu_submit(), so avoid a duplicate
pm_runtime_get/put from msm_job_run().
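
A minimal sketch of the runtime PM pairing this relies on (the helper
names below are made up for illustration; the real logic lives in
msm_gpu.c): the vote is taken once per submit and dropped when the
submit is retired, so msm_job_run() already executes under a vote.

	#include <linux/pm_runtime.h>

	/* Sketch only, not the actual driver code. */
	static void sketch_submit_get(struct msm_gpu *gpu)
	{
		/* taken in msm_gpu_submit(), before the job can run */
		pm_runtime_get_sync(&gpu->pdev->dev);
		gpu->active_submits++;
	}

	static void sketch_submit_put(struct msm_gpu *gpu)
	{
		/* dropped when the submit is retired */
		gpu->active_submits--;
		pm_runtime_put(&gpu->pdev->dev);
	}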

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_ringbuffer.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/msm_ringbuffer.c b/drivers/gpu/drm/msm/msm_ringbuffer.c
index 56eecb4..cad4c35 100644
--- a/drivers/gpu/drm/msm/msm_ringbuffer.c
+++ b/drivers/gpu/drm/msm/msm_ringbuffer.c
@@ -29,8 +29,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 		msm_gem_unlock(obj);
 	}
 
-	pm_runtime_get_sync(&gpu->pdev->dev);
-
 	/* TODO move submit path over to using a per-ring lock.. */
 	mutex_lock(&gpu->lock);
 
@@ -38,8 +36,6 @@ static struct dma_fence *msm_job_run(struct drm_sched_job *job)
 
 	mutex_unlock(&gpu->lock);
 
-	pm_runtime_put(&gpu->pdev->dev);
-
 	return dma_fence_get(submit->hw_fence);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 2/7] drm/msm: Correct pm_runtime votes in recover worker
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 1/7] drm/msm: Remove unnecessary pm_runtime_get/put Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery Akhil P Oommen
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Sean Paul, linux-kernel

In the scenario where a single hung submit is the only one pending, the
gpu is power collapsed when it is retired. Because of this, by the time
we call recover(), the gpu state would already be cleared. Fix this by
correctly managing the pm runtime votes.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/msm_gpu.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index c2bfcf3f..18c1544 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -394,7 +394,6 @@ static void recover_worker(struct kthread_work *work)
 	/* Record the crash state */
 	pm_runtime_get_sync(&gpu->pdev->dev);
 	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
-	pm_runtime_put_sync(&gpu->pdev->dev);
 
 	kfree(cmd);
 	kfree(comm);
@@ -442,6 +441,8 @@ static void recover_worker(struct kthread_work *work)
 		}
 	}
 
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
 	mutex_unlock(&gpu->lock);
 
 	msm_gpu_retire(gpu);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 1/7] drm/msm: Remove unnecessary pm_runtime_get/put Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 2/7] drm/msm: Correct pm_runtime votes in recover worker Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  2022-07-11 23:22   ` Doug Anderson
  2022-07-09  5:59 ` [PATCH v2 4/7] drm/msm: Ensure cx gdsc collapse " Akhil P Oommen
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Chia-I Wu,
	Daniel Vetter, David Airlie, Dmitry Baryshkov, Konrad Dybcio,
	Sean Paul, linux-kernel

There is some hardware logic under the CX domain. For a successful
recovery, we should ensure the cx headswitch collapses so that all the
stale states are cleared out. This is especially true for the a6xx family,
where we have the GMU co-processor.

Currently, cx doesn't collapse due to a devlink between gpu and its
smmu. So the *struct gpu device* needs to be runtime suspended to ensure
that the iommu driver removes its vote on cx gdsc.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
 drivers/gpu/drm/msm/msm_gpu.c         |  2 --
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 4d50110..7ed347c 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	 */
 	gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
 
-	gpu->funcs->pm_suspend(gpu);
-	gpu->funcs->pm_resume(gpu);
+	/*
+	 * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
+	 * First drop the usage count from all active submits
+	 */
+	for (i = gpu->active_submits; i > 0; i--)
+		pm_runtime_put(&gpu->pdev->dev);
+
+	/* And the final one from recover worker */
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	for (i = gpu->active_submits; i > 0; i--)
+		pm_runtime_get(&gpu->pdev->dev);
+
+	pm_runtime_get_sync(&gpu->pdev->dev);
 
 	msm_gpu_hw_init(gpu);
 }
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 18c1544..aa6f34f 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -422,9 +422,7 @@ static void recover_worker(struct kthread_work *work)
 		/* retire completed submits, plus the one that hung: */
 		retire_submits(gpu);
 
-		pm_runtime_get_sync(&gpu->pdev->dev);
 		gpu->funcs->recover(gpu);
-		pm_runtime_put_sync(&gpu->pdev->dev);
 
 		/*
 		 * Replay all remaining submits starting with highest priority
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 4/7] drm/msm: Ensure cx gdsc collapse during recovery
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
                   ` (2 preceding siblings ...)
  2022-07-09  5:59 ` [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list Akhil P Oommen
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Chia-I Wu,
	Daniel Vetter, David Airlie, Dmitry Baryshkov, Konrad Dybcio,
	Sean Paul, Stephen Boyd, linux-kernel

To improve our chance of a successful recovery, we should ensure that
the cx headswitch collapses. The cx headswitch might be kept enabled
through a vote from another driver like iommu or even another hardware
subsystem. So, poll the cx gdscr register to ensure that it collapses
during recovery.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 13 ++++++++++++-
 drivers/gpu/drm/msm/msm_gpu.c         |  4 ++++
 drivers/gpu/drm/msm/msm_gpu.h         |  1 +
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 7ed347c..9aaa469 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1257,11 +1257,15 @@ static void a6xx_dump(struct msm_gpu *gpu)
 #define VBIF_RESET_ACK_TIMEOUT	100
 #define VBIF_RESET_ACK_MASK	0x00f0
 
+#define CX_GDSCR_OFFSET	0x106c
+#define CX_GDSC_ON_MASK	BIT(31)
+
 static void a6xx_recover(struct msm_gpu *gpu)
 {
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
-	int i;
+	int i, ret;
+	u32 val;
 
 	adreno_dump_info(gpu);
 
@@ -1288,6 +1292,13 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	/* And the final one from recover worker */
 	pm_runtime_put_sync(&gpu->pdev->dev);
 
+	if (gpu->gpucc_io) {
+		ret = readl_poll_timeout(gpu->gpucc_io + CX_GDSCR_OFFSET, val,
+			!(val & CX_GDSC_ON_MASK), 100, 500000);
+		if (ret)
+			DRM_DEV_INFO(&gpu->pdev->dev, "cx gdsc didn't collapse\n");
+	}
+
 	for (i = gpu->active_submits; i > 0; i--)
 		pm_runtime_get(&gpu->pdev->dev);
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index aa6f34f..7ada0785 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -865,6 +865,10 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 		goto fail;
 	}
 
+	gpu->gpucc_io = msm_ioremap(pdev, "gpucc");
+	if (IS_ERR(gpu->gpucc_io))
+		gpu->gpucc_io = NULL;
+
 	/* Get Interrupt: */
 	gpu->irq = platform_get_irq(pdev, 0);
 	if (gpu->irq < 0) {
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 4d935fe..1fe498f 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -226,6 +226,7 @@ struct msm_gpu {
 	int global_faults;
 
 	void __iomem *mmio;
+	void __iomem *gpucc_io;
 	int irq;
 
 	struct msm_gem_address_space *aspace;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
                   ` (3 preceding siblings ...)
  2022-07-09  5:59 ` [PATCH v2 4/7] drm/msm: Ensure cx gdsc collapse " Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  2022-07-11 23:27   ` Doug Anderson
  2022-07-09  5:59 ` [PATCH v2 6/7] drm/msm/a6xx: Improve gpu recovery sequence Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 7/7] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure Akhil P Oommen
  6 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Andy Gross,
	Krzysztof Kozlowski, Rob Herring, devicetree, linux-kernel

Update gpu register array with gpucc memory region.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
index e66fc67..defdb25 100644
--- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
+++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
@@ -2228,10 +2228,12 @@
 			compatible = "qcom,adreno-635.0", "qcom,adreno";
 			reg = <0 0x03d00000 0 0x40000>,
 			      <0 0x03d9e000 0 0x1000>,
-			      <0 0x03d61000 0 0x800>;
+			      <0 0x03d61000 0 0x800>,
+			      <0 0x03d90000 0 0x2000>;
 			reg-names = "kgsl_3d0_reg_memory",
 				    "cx_mem",
-				    "cx_dbgc";
+				    "cx_dbgc",
+				    "gpucc";
 			interrupts = <GIC_SPI 300 IRQ_TYPE_LEVEL_HIGH>;
 			iommus = <&adreno_smmu 0 0x401>;
 			operating-points-v2 = <&gpu_opp_table>;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 6/7] drm/msm/a6xx: Improve gpu recovery sequence
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
                   ` (4 preceding siblings ...)
  2022-07-09  5:59 ` [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  2022-07-09  5:59 ` [PATCH v2 7/7] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure Akhil P Oommen
  6 siblings, 0 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Chia-I Wu,
	Christian König, Daniel Vetter, David Airlie,
	Dmitry Baryshkov, Geert Uytterhoeven, Konrad Dybcio, Sean Paul,
	Wang Qing, linux-kernel

We can do a few more things to improve our chance at a successful gpu
recovery, especially during a hangcheck timeout:
1. Halt CP and GMU core
2. Do RBBM GBIF HALT sequence
3. Do a soft reset of GPU core

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx.xml.h |  4 ++
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 77 +++++++++++++++++++++--------------
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c |  7 ++++
 3 files changed, 58 insertions(+), 30 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx.xml.h b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
index b03e2c4..beea4a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx.xml.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx.xml.h
@@ -1413,6 +1413,10 @@ static inline uint32_t REG_A6XX_RBBM_PERFCTR_RBBM_SEL(uint32_t i0) { return 0x00
 
 #define REG_A6XX_RBBM_GBIF_CLIENT_QOS_CNTL			0x00000011
 
+#define REG_A6XX_RBBM_GBIF_HALT					0x00000016
+
+#define REG_A6XX_RBBM_GBIF_HALT_ACK				0x00000017
+
 #define REG_A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD			0x0000001c
 #define A6XX_RBBM_WAIT_FOR_GPU_IDLE_CMD_WAIT_GPU_IDLE		0x00000001
 
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 310a317..ec25f19 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -873,9 +873,47 @@ static void a6xx_gmu_rpmh_off(struct a6xx_gmu *gmu)
 		(val & 1), 100, 1000);
 }
 
+#define GBIF_CLIENT_HALT_MASK             BIT(0)
+#define GBIF_ARB_HALT_MASK                BIT(1)
+
+static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
+{
+	struct msm_gpu *gpu = &adreno_gpu->base;
+
+	if (!a6xx_has_gbif(adreno_gpu)) {
+		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
+		spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
+								0xf) == 0xf);
+		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
+
+		return;
+	}
+
+	/* Halt the gx side of GBIF */
+	gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 1);
+	spin_until(gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT_ACK) & 1);
+
+	/* Halt new client requests on GBIF */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
+	spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
+			(GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
+
+	/* Halt all AXI requests on GBIF */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
+	spin_until((gpu_read(gpu,  REG_A6XX_GBIF_HALT_ACK) &
+			(GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
+
+	/* The GBIF halt needs to be explicitly cleared */
+	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
+}
+
 /* Force the GMU off in case it isn't responsive */
 static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
 {
+	struct a6xx_gpu *a6xx_gpu = container_of(gmu, struct a6xx_gpu, gmu);
+	struct adreno_gpu *adreno_gpu = &a6xx_gpu->base;
+	struct msm_gpu *gpu = &adreno_gpu->base;
+
 	/* Flush all the queues */
 	a6xx_hfi_stop(gmu);
 
@@ -887,6 +925,15 @@ static void a6xx_gmu_force_off(struct a6xx_gmu *gmu)
 
 	/* Make sure there are no outstanding RPMh votes */
 	a6xx_gmu_rpmh_off(gmu);
+
+	/* Halt the gmu cm3 core */
+	gmu_write(gmu, REG_A6XX_GMU_CM3_SYSRESET, 1);
+
+	a6xx_bus_clear_pending_transactions(adreno_gpu);
+
+	/* Reset GPU core blocks */
+	gpu_write(gpu, REG_A6XX_RBBM_SW_RESET_CMD, 1);
+	udelay(100);
 }
 
 static void a6xx_gmu_set_initial_freq(struct msm_gpu *gpu, struct a6xx_gmu *gmu)
@@ -1014,36 +1061,6 @@ bool a6xx_gmu_isidle(struct a6xx_gmu *gmu)
 	return true;
 }
 
-#define GBIF_CLIENT_HALT_MASK             BIT(0)
-#define GBIF_ARB_HALT_MASK                BIT(1)
-
-static void a6xx_bus_clear_pending_transactions(struct adreno_gpu *adreno_gpu)
-{
-	struct msm_gpu *gpu = &adreno_gpu->base;
-
-	if (!a6xx_has_gbif(adreno_gpu)) {
-		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0xf);
-		spin_until((gpu_read(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL1) &
-								0xf) == 0xf);
-		gpu_write(gpu, REG_A6XX_VBIF_XIN_HALT_CTRL0, 0);
-
-		return;
-	}
-
-	/* Halt new client requests on GBIF */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_CLIENT_HALT_MASK);
-	spin_until((gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK) &
-			(GBIF_CLIENT_HALT_MASK)) == GBIF_CLIENT_HALT_MASK);
-
-	/* Halt all AXI requests on GBIF */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, GBIF_ARB_HALT_MASK);
-	spin_until((gpu_read(gpu,  REG_A6XX_GBIF_HALT_ACK) &
-			(GBIF_ARB_HALT_MASK)) == GBIF_ARB_HALT_MASK);
-
-	/* The GBIF halt needs to be explicitly cleared */
-	gpu_write(gpu, REG_A6XX_GBIF_HALT, 0x0);
-}
-
 /* Gracefully try to shut down the GMU and by extension the GPU */
 static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 {
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 9aaa469..60c3033 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -987,6 +987,10 @@ static int hw_init(struct msm_gpu *gpu)
 	/* Make sure the GMU keeps the GPU on while we set it up */
 	a6xx_gmu_set_oob(&a6xx_gpu->gmu, GMU_OOB_GPU_SET);
 
+	/* Clear GBIF halt in case GX domain was not collapsed */
+	if (a6xx_has_gbif(adreno_gpu))
+		gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
+
 	gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
 
 	/*
@@ -1276,6 +1280,9 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	if (hang_debug)
 		a6xx_dump(gpu);
 
+	/* Halt SQE first */
+	gpu_write(gpu, REG_A6XX_CP_SQE_CNTL, 3);
+
 	/*
 	 * Turn off keep alive that might have been enabled by the hang
 	 * interrupt
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* [PATCH v2 7/7] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure
  2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
                   ` (5 preceding siblings ...)
  2022-07-09  5:59 ` [PATCH v2 6/7] drm/msm/a6xx: Improve gpu recovery sequence Akhil P Oommen
@ 2022-07-09  5:59 ` Akhil P Oommen
  6 siblings, 0 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-09  5:59 UTC (permalink / raw)
  To: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson
  Cc: Jonathan Marek, Jordan Crouse, Matthias Kaehlcke,
	Douglas Anderson, Akhil P Oommen, Abhinav Kumar, Daniel Vetter,
	David Airlie, Dmitry Baryshkov, Geert Uytterhoeven,
	Konrad Dybcio, Sean Paul, linux-kernel

When the prepare-slumber hfi fails, we should follow the
a6xx_gmu_force_off() sequence.

Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
---

(no changes since v1)

 drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index ec25f19..e033d6a 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -1086,7 +1086,11 @@ static void a6xx_gmu_shutdown(struct a6xx_gmu *gmu)
 		a6xx_bus_clear_pending_transactions(adreno_gpu);
 
 		/* tell the GMU we want to slumber */
-		a6xx_gmu_notify_slumber(gmu);
+		ret = a6xx_gmu_notify_slumber(gmu);
+		if (ret) {
+			a6xx_gmu_force_off(gmu);
+			return;
+		}
 
 		ret = gmu_poll_timeout(gmu,
 			REG_A6XX_GPU_GMU_AO_GPU_CX_BUSY_STATUS, val,
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-09  5:59 ` [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery Akhil P Oommen
@ 2022-07-11 23:22   ` Doug Anderson
  2022-07-12  5:04     ` [Freedreno] " Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Doug Anderson @ 2022-07-11 23:22 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson,
	Jonathan Marek, Jordan Crouse, Matthias Kaehlcke, Abhinav Kumar,
	Chia-I Wu, Daniel Vetter, David Airlie, Dmitry Baryshkov,
	Konrad Dybcio, Sean Paul, LKML

Hi,

On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> There is some hardware logic under the CX domain. For a successful
> recovery, we should ensure the cx headswitch collapses so that all the
> stale states are cleared out. This is especially true for the a6xx family,
> where we have the GMU co-processor.
>
> Currently, cx doesn't collapse due to a devlink between gpu and its
> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> that the iommu driver removes its vote on cx gdsc.
>
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
>
> (no changes since v1)
>
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>  drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>  2 files changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 4d50110..7ed347c 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>          */
>         gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>
> -       gpu->funcs->pm_suspend(gpu);
> -       gpu->funcs->pm_resume(gpu);
> +       /*
> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> +        * First drop the usage count from all active submits
> +        */
> +       for (i = gpu->active_submits; i > 0; i--)
> +               pm_runtime_put(&gpu->pdev->dev);
> +
> +       /* And the final one from recover worker */
> +       pm_runtime_put_sync(&gpu->pdev->dev);
> +
> +       for (i = gpu->active_submits; i > 0; i--)
> +               pm_runtime_get(&gpu->pdev->dev);
> +
> +       pm_runtime_get_sync(&gpu->pdev->dev);

In response to v1, Rob suggested pm_runtime_force_suspend/resume().
Those seem like they would work to me, too. Why not use them?
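
For reference, a minimal sketch of that alternative (the wrapper name
is made up; whether this is safe against concurrent pm_runtime_get()
callers is what the rest of the thread debates):

	#include <linux/pm_runtime.h>

	static void a6xx_recover_force_collapse(struct msm_gpu *gpu)
	{
		/* drops the device's usage counts and suspends it ... */
		pm_runtime_force_suspend(&gpu->pdev->dev);
		/* ... then restores the counts and resumes it */
		pm_runtime_force_resume(&gpu->pdev->dev);
	}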

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-09  5:59 ` [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list Akhil P Oommen
@ 2022-07-11 23:27   ` Doug Anderson
  2022-07-14  5:40     ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Doug Anderson @ 2022-07-11 23:27 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson,
	Jonathan Marek, Jordan Crouse, Matthias Kaehlcke, Andy Gross,
	Krzysztof Kozlowski, Rob Herring,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, LKML,
	Stephen Boyd

Hi,

On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> Update gpu register array with gpucc memory region.
>
> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> ---
>
> (no changes since v1)
>
>  arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> index e66fc67..defdb25 100644
> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> @@ -2228,10 +2228,12 @@
>                         compatible = "qcom,adreno-635.0", "qcom,adreno";
>                         reg = <0 0x03d00000 0 0x40000>,
>                               <0 0x03d9e000 0 0x1000>,
> -                             <0 0x03d61000 0 0x800>;
> +                             <0 0x03d61000 0 0x800>,
> +                             <0 0x03d90000 0 0x2000>;
>                         reg-names = "kgsl_3d0_reg_memory",
>                                     "cx_mem",
> -                                   "cx_dbgc";
> +                                   "cx_dbgc",
> +                                   "gpucc";

This doesn't seem right. Shouldn't you be coordinating with the
existing gpucc instead of reaching into its registers?

-Doug

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-11 23:22   ` Doug Anderson
@ 2022-07-12  5:04     ` Akhil P Oommen
  2022-07-12 16:44       ` Rob Clark
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-12  5:04 UTC (permalink / raw)
  To: Doug Anderson
  Cc: Sean Paul, Jonathan Marek, David Airlie, linux-arm-msm,
	Konrad Dybcio, Abhinav Kumar, dri-devel, Bjorn Andersson,
	Rob Clark, Matthias Kaehlcke, Daniel Vetter, Dmitry Baryshkov,
	Jordan Crouse, freedreno, Chia-I Wu, LKML

On 7/12/2022 4:52 AM, Doug Anderson wrote:
> Hi,
>
> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> There is some hardware logic under the CX domain. For a successful
>> recovery, we should ensure the cx headswitch collapses so that all the
>> stale states are cleared out. This is especially true for the a6xx family
>> where we have the GMU co-processor.
>>
>> Currently, cx doesn't collapse due to a devlink between gpu and its
>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>> that the iommu driver removes its vote on cx gdsc.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> (no changes since v1)
>>
>>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>   drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>   2 files changed, 14 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> index 4d50110..7ed347c 100644
>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>           */
>>          gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>
>> -       gpu->funcs->pm_suspend(gpu);
>> -       gpu->funcs->pm_resume(gpu);
>> +       /*
>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>> +        * First drop the usage count from all active submits
>> +        */
>> +       for (i = gpu->active_submits; i > 0; i--)
>> +               pm_runtime_put(&gpu->pdev->dev);
>> +
>> +       /* And the final one from recover worker */
>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>> +
>> +       for (i = gpu->active_submits; i > 0; i--)
>> +               pm_runtime_get(&gpu->pdev->dev);
>> +
>> +       pm_runtime_get_sync(&gpu->pdev->dev);
> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> Those seem like they would work to me, too. Why not use them?
Quoting my previous response which I seem to have sent only to Freedreno 
list:

"I believe it is supposed to be used only during system sleep state 
transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to 
fail by disabling RPM here."
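
A sketch of the failure mode meant here, assuming runtime PM stays
disabled for the device between pm_runtime_force_suspend() and the
matching pm_runtime_force_resume():

	/* recover worker: */
	pm_runtime_force_suspend(&gpu->pdev->dev);

	/* meanwhile, another thread: */
	ret = pm_runtime_get_sync(&gpu->pdev->dev);
	if (ret < 0)	/* fails (typically -EACCES) instead of powering up */
		return ret;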

-Akhil

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-12  5:04     ` [Freedreno] " Akhil P Oommen
@ 2022-07-12 16:44       ` Rob Clark
  2022-07-12 19:15         ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Rob Clark @ 2022-07-12 16:44 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Doug Anderson, Sean Paul, Jonathan Marek, David Airlie,
	linux-arm-msm, Konrad Dybcio, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Matthias Kaehlcke, Daniel Vetter,
	Dmitry Baryshkov, Jordan Crouse, freedreno, Chia-I Wu, LKML

On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On 7/12/2022 4:52 AM, Doug Anderson wrote:
> > Hi,
> >
> > On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >> There is some hardware logic under the CX domain. For a successful
> >> recovery, we should ensure the cx headswitch collapses so that all the
> >> stale states are cleared out. This is especially true for the a6xx family
> >> where we have the GMU co-processor.
> >>
> >> Currently, cx doesn't collapse due to a devlink between gpu and its
> >> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> >> that the iommu driver removes its vote on cx gdsc.
> >>
> >> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >> ---
> >>
> >> (no changes since v1)
> >>
> >>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> >>   drivers/gpu/drm/msm/msm_gpu.c         |  2 --
> >>   2 files changed, 14 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> index 4d50110..7ed347c 100644
> >> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>           */
> >>          gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
> >>
> >> -       gpu->funcs->pm_suspend(gpu);
> >> -       gpu->funcs->pm_resume(gpu);
> >> +       /*
> >> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> >> +        * First drop the usage count from all active submits
> >> +        */
> >> +       for (i = gpu->active_submits; i > 0; i--)
> >> +               pm_runtime_put(&gpu->pdev->dev);
> >> +
> >> +       /* And the final one from recover worker */
> >> +       pm_runtime_put_sync(&gpu->pdev->dev);
> >> +
> >> +       for (i = gpu->active_submits; i > 0; i--)
> >> +               pm_runtime_get(&gpu->pdev->dev);
> >> +
> >> +       pm_runtime_get_sync(&gpu->pdev->dev);
> > In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> > Those seem like they would work to me, too. Why not use them?
> Quoting my previous response which I seem to have sent only to Freedreno
> list:
>
> "I believe it is supposed to be used only during system sleep state
> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
> fail by disabling RPM here."

The comment about not wanting other runpm calls to fail is valid.. but
that is also solvable, ie. by holding a lock around runpm calls.
Which I think we need to do anyways, otherwise looping over
gpu->active_submits is racy..

I think pm_runtime_force_suspend/resume() is the least-bad option.. or
at least I'm not seeing any obvious alternative that is better

BR,
-R

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-12 16:44       ` Rob Clark
@ 2022-07-12 19:15         ` Akhil P Oommen
  2022-07-20 18:06           ` Rob Clark
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-12 19:15 UTC (permalink / raw)
  To: Rob Clark
  Cc: Doug Anderson, Sean Paul, Jonathan Marek, David Airlie,
	linux-arm-msm, Konrad Dybcio, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Matthias Kaehlcke, Daniel Vetter,
	Dmitry Baryshkov, Jordan Crouse, freedreno, Chia-I Wu, LKML

On 7/12/2022 10:14 PM, Rob Clark wrote:
> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
> <quic_akhilpo@quicinc.com> wrote:
>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>> Hi,
>>>
>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>> There is some hardware logic under the CX domain. For a successful
>>>> recovery, we should ensure the cx headswitch collapses so that all the
>>>> stale states are cleared out. This is especially true for the a6xx family
>>>> where we have the GMU co-processor.
>>>>
>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>>>> that the iommu driver removes its vote on cx gdsc.
>>>>
>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>> ---
>>>>
>>>> (no changes since v1)
>>>>
>>>>    drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>    drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>    2 files changed, 14 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> index 4d50110..7ed347c 100644
>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>>            */
>>>>           gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>
>>>> -       gpu->funcs->pm_suspend(gpu);
>>>> -       gpu->funcs->pm_resume(gpu);
>>>> +       /*
>>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>> +        * First drop the usage count from all active submits
>>>> +        */
>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>> +               pm_runtime_put(&gpu->pdev->dev);
>>>> +
>>>> +       /* And the final one from recover worker */
>>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>>>> +
>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>> +
>>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>> Those seem like they would work to me, too. Why not use them?
>> Quoting my previous response which I seem to have sent only to Freedreno
>> list:
>>
>> "I believe it is supposed to be used only during system sleep state
>> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
>> fail by disabling RPM here."
> The comment about not wanting other runpm calls to fail is valid.. but
> that is also solvable, ie. by holding a lock around runpm calls.
> Which I think we need to do anyways, otherwise looping over
> gpu->active_submits is racy..
>
> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
> at least I'm not seeing any obvious alternative that is better
>
> BR,
> -R
We are holding gpu->lock here which will block further submissions from 
scheduler. Will active_submits still race?

It is possible that there is another thread which successfully completed 
pm_runtime_get() and, while it accesses the hardware, we pulled the plug 
on the regulator/clock here. That will result in an obvious device crash. 
So I can think of 2 solutions:

1. Wrap *every* pm_runtime_get/put with a mutex (a fuller sketch of this 
follows below). Something like:
             mutex_lock();
             pm_runtime_get();
             <... access hardware here ...>
             pm_runtime_put();
             mutex_unlock();

2. Drop runtime votes from every submit in the recover worker and 
wait/poll for the regulator to collapse in case there are transient votes 
on the regulator from other threads/subsystems.

Option (2) seems simpler to me.  What do you think?
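
A fuller sketch of option (1) above ('rpm_lock' and the wrapper name are
made up for illustration); the recover worker would hold the same mutex
while it drops the votes, so no other thread can be mid-access:

	#include <linux/mutex.h>
	#include <linux/pm_runtime.h>

	static int gpu_hw_access(struct msm_gpu *gpu)
	{
		int ret;

		mutex_lock(&gpu->rpm_lock);
		ret = pm_runtime_resume_and_get(&gpu->pdev->dev);
		if (ret) {
			mutex_unlock(&gpu->rpm_lock);
			return ret;
		}

		/* ... access hardware here ... */

		pm_runtime_put(&gpu->pdev->dev);
		mutex_unlock(&gpu->rpm_lock);
		return 0;
	}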

-Akhil.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-11 23:27   ` Doug Anderson
@ 2022-07-14  5:40     ` Akhil P Oommen
  2022-07-19  4:07       ` [Freedreno] " Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-14  5:40 UTC (permalink / raw)
  To: Doug Anderson, Stephen Boyd, Taniya Das, quic_rjendra
  Cc: freedreno, dri-devel, linux-arm-msm, Rob Clark, Bjorn Andersson,
	Jonathan Marek, Jordan Crouse, Matthias Kaehlcke, Andy Gross,
	Krzysztof Kozlowski, Rob Herring,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS, LKML,
	Stephen Boyd

On 7/12/2022 4:57 AM, Doug Anderson wrote:
> Hi,
>
> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>> Update gpu register array with gpucc memory region.
>>
>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>> ---
>>
>> (no changes since v1)
>>
>>   arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>> index e66fc67..defdb25 100644
>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>> @@ -2228,10 +2228,12 @@
>>                          compatible = "qcom,adreno-635.0", "qcom,adreno";
>>                          reg = <0 0x03d00000 0 0x40000>,
>>                                <0 0x03d9e000 0 0x1000>,
>> -                             <0 0x03d61000 0 0x800>;
>> +                             <0 0x03d61000 0 0x800>,
>> +                             <0 0x03d90000 0 0x2000>;
>>                          reg-names = "kgsl_3d0_reg_memory",
>>                                      "cx_mem",
>> -                                   "cx_dbgc";
>> +                                   "cx_dbgc",
>> +                                   "gpucc";
> This doesn't seem right. Shouldn't you be coordinating with the
> existing gpucc instead of reaching into its registers?
>
> -Doug
IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they 
are vote-able switches. Ideally, we should ensure that the hw has 
collapsed for gpu recovery because there could be transient votes from 
other subsystems like hypervisor using their vote register.

I am not sure how complex the plumbing to gpucc driver would be to allow 
gpu driver to check hw status. OTOH, with this patch, gpu driver does a 
read operation on a gpucc register which is in always-on domain. That 
means we don't need to vote any resource to access this register.

Stephen/Rajendra/Taniya, any suggestion?

-Akhil.



^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-14  5:40     ` Akhil P Oommen
@ 2022-07-19  4:07       ` Akhil P Oommen
  2022-07-19  5:49         ` Stephen Boyd
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-19  4:07 UTC (permalink / raw)
  To: Doug Anderson, Stephen Boyd, Taniya Das, quic_rjendra
  Cc: open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Bjorn Andersson, Rob Herring, Rob Clark, Matthias Kaehlcke,
	Krzysztof Kozlowski, Jordan Crouse, freedreno, LKML

On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> On 7/12/2022 4:57 AM, Doug Anderson wrote:
>> Hi,
>>
>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen 
>> <quic_akhilpo@quicinc.com> wrote:
>>> Update gpu register array with gpucc memory region.
>>>
>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>> ---
>>>
>>> (no changes since v1)
>>>
>>>   arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi 
>>> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> index e66fc67..defdb25 100644
>>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>> @@ -2228,10 +2228,12 @@
>>>                          compatible = "qcom,adreno-635.0", 
>>> "qcom,adreno";
>>>                          reg = <0 0x03d00000 0 0x40000>,
>>>                                <0 0x03d9e000 0 0x1000>,
>>> -                             <0 0x03d61000 0 0x800>;
>>> +                             <0 0x03d61000 0 0x800>,
>>> +                             <0 0x03d90000 0 0x2000>;
>>>                          reg-names = "kgsl_3d0_reg_memory",
>>>                                      "cx_mem",
>>> -                                   "cx_dbgc";
>>> +                                   "cx_dbgc",
>>> +                                   "gpucc";
>> This doesn't seem right. Shouldn't you be coordinating with the
>> existing gpucc instead of reaching into its registers?
>>
>> -Doug
> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they 
> are vote-able switches. Ideally, we should ensure that the hw has 
> collapsed for gpu recovery because there could be transient votes from 
> other subsystems like hypervisor using their vote register.
> 
> I am not sure how complex the plumbing to gpucc driver would be to allow 
> gpu driver to check hw status. OTOH, with this patch, gpu driver does a 
> read operation on a gpucc register which is in always-on domain. That 
> means we don't need to vote any resource to access this register.
> 
> Stephen/Rajendra/Taniya, any suggestion?
> 
> -Akhil.
> 
> 
Gentle ping.

-Akhil


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-19  4:07       ` [Freedreno] " Akhil P Oommen
@ 2022-07-19  5:49         ` Stephen Boyd
  2022-07-19  6:37           ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Stephen Boyd @ 2022-07-19  5:49 UTC (permalink / raw)
  To: Akhil P Oommen, Doug Anderson, Taniya Das, quic_rjendra
  Cc: devicetree, Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Bjorn Andersson, Rob Herring, Rob Clark, Matthias Kaehlcke,
	Krzysztof Kozlowski, Jordan Crouse, freedreno, LKML

Quoting Akhil P Oommen (2022-07-18 21:07:05)
> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> > On 7/12/2022 4:57 AM, Doug Anderson wrote:
> >> Hi,
> >>
> >> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen
> >> <quic_akhilpo@quicinc.com> wrote:
> >>> Update gpu register array with gpucc memory region.
> >>>
> >>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >>> ---
> >>>
> >>> (no changes since v1)
> >>>
> >>>   arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
> >>>   1 file changed, 4 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> index e66fc67..defdb25 100644
> >>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
> >>> @@ -2228,10 +2228,12 @@
> >>>                          compatible = "qcom,adreno-635.0",
> >>> "qcom,adreno";
> >>>                          reg = <0 0x03d00000 0 0x40000>,
> >>>                                <0 0x03d9e000 0 0x1000>,
> >>> -                             <0 0x03d61000 0 0x800>;
> >>> +                             <0 0x03d61000 0 0x800>,
> >>> +                             <0 0x03d90000 0 0x2000>;
> >>>                          reg-names = "kgsl_3d0_reg_memory",
> >>>                                      "cx_mem",
> >>> -                                   "cx_dbgc";
> >>> +                                   "cx_dbgc",
> >>> +                                   "gpucc";
> >> This doesn't seem right. Shouldn't you be coordinating with the
> >> existing gpucc instead of reaching into its registers?
> >>
> > IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
> > are vote-able switches. Ideally, we should ensure that the hw has
> > collapsed for gpu recovery because there could be transient votes from
> > other subsystems like hypervisor using their vote register.
> >
> > I am not sure how complex the plumbing to gpucc driver would be to allow
> > gpu driver to check hw status. OTOH, with this patch, gpu driver does a
> > read operation on a gpucc register which is in always-on domain. That
> > means we don't need to vote any resource to access this register.
> >
> > Stephen/Rajendra/Taniya, any suggestion?

Why can't you assert a gpu reset signal with the reset APIs? This series
seems to jump through a bunch of hoops to get the gdsc and power domain
to "reset" when I don't know why any of that is necessary. Can't we
simply assert a reset to the hardware after recovery completes so the
device is back into a good known POR (power on reset) state?
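
A minimal sketch of what that would look like, assuming (hypothetically)
gpucc exposed a "gpu" reset line; as the reply below notes, no such
register interface exists today:

	#include <linux/delay.h>
	#include <linux/reset.h>

	/* at probe time; "gpu" is a hypothetical reset id */
	struct reset_control *gpu_reset =
		devm_reset_control_get_exclusive(dev, "gpu");

	/* after recovery completes */
	reset_control_assert(gpu_reset);
	usleep_range(100, 200);
	reset_control_deassert(gpu_reset);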

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-19  5:49         ` Stephen Boyd
@ 2022-07-19  6:37           ` Akhil P Oommen
  2022-07-19  7:19             ` Stephen Boyd
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-19  6:37 UTC (permalink / raw)
  To: Stephen Boyd, Doug Anderson, Taniya Das, quic_rjendra
  Cc: devicetree, Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Bjorn Andersson, Rob Herring, Rob Clark, Matthias Kaehlcke,
	Krzysztof Kozlowski, Jordan Crouse, freedreno, LKML

On 7/19/2022 11:19 AM, Stephen Boyd wrote:
> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>> On 7/12/2022 4:57 AM, Doug Anderson wrote:
>>>> Hi,
>>>>
>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen
>>>> <quic_akhilpo@quicinc.com> wrote:
>>>>> Update gpu register array with gpucc memory region.
>>>>>
>>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>>> ---
>>>>>
>>>>> (no changes since v1)
>>>>>
>>>>>    arch/arm64/boot/dts/qcom/sc7280.dtsi | 6 ++++--
>>>>>    1 file changed, 4 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> index e66fc67..defdb25 100644
>>>>> --- a/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> +++ b/arch/arm64/boot/dts/qcom/sc7280.dtsi
>>>>> @@ -2228,10 +2228,12 @@
>>>>>                           compatible = "qcom,adreno-635.0",
>>>>> "qcom,adreno";
>>>>>                           reg = <0 0x03d00000 0 0x40000>,
>>>>>                                 <0 0x03d9e000 0 0x1000>,
>>>>> -                             <0 0x03d61000 0 0x800>;
>>>>> +                             <0 0x03d61000 0 0x800>,
>>>>> +                             <0 0x03d90000 0 0x2000>;
>>>>>                           reg-names = "kgsl_3d0_reg_memory",
>>>>>                                       "cx_mem",
>>>>> -                                   "cx_dbgc";
>>>>> +                                   "cx_dbgc",
>>>>> +                                   "gpucc";
>>>> This doesn't seem right. Shouldn't you be coordinating with the
>>>> existing gpucc instead of reaching into its registers?
>>>>
>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
>>> are vote-able switches. Ideally, we should ensure that the hw has
>>> collapsed for gpu recovery because there could be transient votes from
>>> other subsystems like hypervisor using their vote register.
>>>
>>> I am not sure how complex the plumbing to gpucc driver would be to allow
>>> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
>>> read operation on a gpucc register which is in always-on domain. That
>>> means we don't need to vote any resource to access this register.
>>>
>>> Stephen/Rajendra/Taniya, any suggestion?
> Why can't you assert a gpu reset signal with the reset APIs? This series
> seems to jump through a bunch of hoops to get the gdsc and power domain
> to "reset" when I don't know why any of that is necessary. Can't we
> simply assert a reset to the hardware after recovery completes so the
> device is back into a good known POR (power on reset) state?
That is because there is no register interface to reset GPU CX domain. 
The recommended sequence from HW design folks is to collapse both cx and 
gx gdsc to properly reset gpu/gmu.

-Akhil.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-19  6:37           ` Akhil P Oommen
@ 2022-07-19  7:19             ` Stephen Boyd
  2022-07-19  9:56               ` Rajendra Nayak
  0 siblings, 1 reply; 25+ messages in thread
From: Stephen Boyd @ 2022-07-19  7:19 UTC (permalink / raw)
  To: Akhil P Oommen, Doug Anderson, Taniya Das, quic_rjendra
  Cc: devicetree, Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Bjorn Andersson, Rob Herring, Rob Clark, Matthias Kaehlcke,
	Krzysztof Kozlowski, Jordan Crouse, freedreno, LKML

Quoting Akhil P Oommen (2022-07-18 23:37:16)
> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
> > Quoting Akhil P Oommen (2022-07-18 21:07:05)
> >> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> >>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
> >>> are vote-able switches. Ideally, we should ensure that the hw has
> >>> collapsed for gpu recovery because there could be transient votes from
> >>> other subsystems like hypervisor using their vote register.
> >>>
> >>> I am not sure how complex the plumbing to gpucc driver would be to allow
> >>> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
> >>> read operation on a gpucc register which is in always-on domain. That
> >>> means we don't need to vote any resource to access this register.

Reading between the lines here, you're saying that you have to read the
gdsc register to make sure that the gdsc is in some state? Can you
clarify exactly what you're doing? And how do you know that something
else in the kernel can't cause the register to change after it is read?
It certainly seems like we can't be certain because there is voting
involved.

> >>>
> >>> Stephen/Rajendra/Taniya, any suggestion?
> > Why can't you assert a gpu reset signal with the reset APIs? This series
> > seems to jump through a bunch of hoops to get the gdsc and power domain
> > to "reset" when I don't know why any of that is necessary. Can't we
> > simply assert a reset to the hardware after recovery completes so the
> > device is back into a good known POR (power on reset) state?
> That is because there is no register interface to reset GPU CX domain.
> The recommended sequence from HW design folks is to collapse both cx and
> gx gdsc to properly reset gpu/gmu.
>

Ok. One knee jerk reaction is to treat the gdsc as a reset then and
possibly mux that request along with any power domain on/off so that if
the reset is requested and the power domain is off nothing happens.
Otherwise if the power domain is on then it manually sequences and
controls the two gdscs so that the GPU is reset and then restores the
enable state of the power domain.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-19  7:19             ` Stephen Boyd
@ 2022-07-19  9:56               ` Rajendra Nayak
  2022-07-20  6:04                 ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Rajendra Nayak @ 2022-07-19  9:56 UTC (permalink / raw)
  To: Stephen Boyd, Akhil P Oommen, Doug Anderson, Taniya Das
  Cc: devicetree, Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Bjorn Andersson, Rob Herring, Rob Clark, Matthias Kaehlcke,
	Krzysztof Kozlowski, Jordan Crouse, freedreno, LKML



On 7/19/2022 12:49 PM, Stephen Boyd wrote:
> Quoting Akhil P Oommen (2022-07-18 23:37:16)
>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since they
>>>>> are vote-able switches. Ideally, we should ensure that the hw has
>>>>> collapsed for gpu recovery because there could be transient votes from
>>>>> other subsystems like hypervisor using their vote register.
>>>>>
>>>>> I am not sure how complex the plumbing to gpucc driver would be to allow
>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver does a
>>>>> read operation on a gpucc register which is in always-on domain. That
>>>>> means we don't need to vote any resource to access this register.
> 
> Reading between the lines here, you're saying that you have to read the
> gdsc register to make sure that the gdsc is in some state? Can you
> clarify exactly what you're doing? And how do you know that something
> else in the kernel can't cause the register to change after it is read?
> It certainly seems like we can't be certain because there is voting
> involved.

yes, this looks like the best case effort to get the gpu to recover, but
the kernel driver really has no control to make sure this condition can
always be met (because it depends on other entities like hyp, trustzone etc right?)
Why not just put a worst case polling delay?

> 
>>>>>
>>>>> Stephen/Rajendra/Taniya, any suggestion?
>>> Why can't you assert a gpu reset signal with the reset APIs? This series
>>> seems to jump through a bunch of hoops to get the gdsc and power domain
>>> to "reset" when I don't know why any of that is necessary. Can't we
>>> simply assert a reset to the hardware after recovery completes so the
>>> device is back into a good known POR (power on reset) state?
>> That is because there is no register interface to reset GPU CX domain.
>> The recommended sequence from HW design folks is to collapse both cx and
>> gx gdsc to properly reset gpu/gmu.
>>
> 
> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
> possibly mux that request along with any power domain on/off so that if
> the reset is requested and the power domain is off nothing happens.
> Otherwise if the power domain is on then it manually sequences and
> controls the two gdscs so that the GPU is reset and then restores the
> enable state of the power domain.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-19  9:56               ` Rajendra Nayak
@ 2022-07-20  6:04                 ` Akhil P Oommen
  2022-07-21 16:04                   ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-20  6:04 UTC (permalink / raw)
  To: Rajendra Nayak, Stephen Boyd, Doug Anderson, Taniya Das
  Cc: devicetree, Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Bjorn Andersson, Rob Herring, Rob Clark, Matthias Kaehlcke,
	Krzysztof Kozlowski, Jordan Crouse, freedreno, LKML

On 7/19/2022 3:26 PM, Rajendra Nayak wrote:
>
>
> On 7/19/2022 12:49 PM, Stephen Boyd wrote:
>> Quoting Akhil P Oommen (2022-07-18 23:37:16)
>>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
>>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed since 
>>>>>> they
>>>>>> are vote-able switches. Ideally, we should ensure that the hw has
>>>>>> collapsed for gpu recovery because there could be transient votes 
>>>>>> from
>>>>>> other subsystems like hypervisor using their vote register.
>>>>>>
>>>>>> I am not sure how complex the plumbing to gpucc driver would be 
>>>>>> to allow
>>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver 
>>>>>> does a
>>>>>> read operation on a gpucc register which is in always-on domain. 
>>>>>> That
>>>>>> means we don't need to vote any resource to access this register.
>>
>> Reading between the lines here, you're saying that you have to read the
>> gdsc register to make sure that the gdsc is in some state? Can you
>> clarify exactly what you're doing? And how do you know that something
>> else in the kernel can't cause the register to change after it is read?
>> It certainly seems like we can't be certain because there is voting
>> involved.
From the gpu driver, the cx_gdscr bit[31] (power-off status) register can be
polled to ensure that it *collapsed at least once*. We don't need to
care if something turns the gdsc back ON after that.
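
Roughly something like this (just a sketch; the bit position is from the
point above, while the poll interval and timeout values are placeholders):

#include <linux/bits.h>
#include <linux/iopoll.h>

#define GDSC_POWER_OFF	BIT(31)	/* power-off status bit of the GDSCR */

/* 'gdscr' is an ioremapped pointer to the always-on cx GDSCR; succeed
 * as soon as at least one collapse is observed. */
static int cx_gdsc_poll_collapse(void __iomem *gdscr)
{
	u32 val;

	return readl_poll_timeout(gdscr, val, val & GDSC_POWER_OFF,
				  100, 500 * 1000);
}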

>
> yes, this looks like the best case effort to get the gpu to recover, but
> the kernel driver really has no control to make sure this condition can
> always be met (because it depends on other entities like hyp, 
> trustzone etc right?)
> Why not just put a worst case polling delay?

I didn't get you entirely. Where do you mean to keep the polling delay?
>
>>
>>>>>>
>>>>>> Stephen/Rajendra/Taniya, any suggestion?
>>>> Why can't you assert a gpu reset signal with the reset APIs? This 
>>>> series
>>>> seems to jump through a bunch of hoops to get the gdsc and power 
>>>> domain
>>>> to "reset" when I don't know why any of that is necessary. Can't we
>>>> simply assert a reset to the hardware after recovery completes so the
>>>> device is back into a good known POR (power on reset) state?
>>> That is because there is no register interface to reset GPU CX domain.
>>> The recommended sequence from HW design folks is to collapse both cx 
>>> and
>>> gx gdsc to properly reset gpu/gmu.
>>>
>>
>> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
>> possibly mux that request along with any power domain on/off so that if
>> the reset is requested and the power domain is off nothing happens.
>> Otherwise if the power domain is on then it manually sequences and
>> controls the two gdscs so that the GPU is reset and then restores the
>> enable state of the power domain.
It would be fatal to forcefully pull the plug on the CX gdsc asynchronously
because there might be another gpu/smmu driver thread accessing
registers in the cx domain.

-Akhil.


^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-12 19:15         ` Akhil P Oommen
@ 2022-07-20 18:06           ` Rob Clark
  2022-07-20 20:38             ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Rob Clark @ 2022-07-20 18:06 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Doug Anderson, Sean Paul, Jonathan Marek, David Airlie,
	linux-arm-msm, Konrad Dybcio, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Matthias Kaehlcke, Daniel Vetter,
	Dmitry Baryshkov, Jordan Crouse, freedreno, Chia-I Wu, LKML

On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
<quic_akhilpo@quicinc.com> wrote:
>
> On 7/12/2022 10:14 PM, Rob Clark wrote:
> > On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
> > <quic_akhilpo@quicinc.com> wrote:
> >> On 7/12/2022 4:52 AM, Doug Anderson wrote:
> >>> Hi,
> >>>
> >>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
> >>>> There is some hardware logic under the CX domain. For a successful
> >>>> recovery, we should ensure the cx headswitch collapses so that all the
> >>>> stale states are cleared out. This is especially true for the a6xx
> >>>> family, where we have a GMU co-processor.
> >>>>
> >>>> Currently, cx doesn't collapse due to a devlink between gpu and its
> >>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
> >>>> that the iommu driver removes its vote on cx gdsc.
> >>>>
> >>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
> >>>> ---
> >>>>
> >>>> (no changes since v1)
> >>>>
> >>>>    drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
> >>>>    drivers/gpu/drm/msm/msm_gpu.c         |  2 --
> >>>>    2 files changed, 14 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> index 4d50110..7ed347c 100644
> >>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> >>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
> >>>>            */
> >>>>           gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
> >>>>
> >>>> -       gpu->funcs->pm_suspend(gpu);
> >>>> -       gpu->funcs->pm_resume(gpu);
> >>>> +       /*
> >>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
> >>>> +        * First drop the usage count from all active submits
> >>>> +        */
> >>>> +       for (i = gpu->active_submits; i > 0; i--)
> >>>> +               pm_runtime_put(&gpu->pdev->dev);
> >>>> +
> >>>> +       /* And the final one from recover worker */
> >>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
> >>>> +
> >>>> +       for (i = gpu->active_submits; i > 0; i--)
> >>>> +               pm_runtime_get(&gpu->pdev->dev);
> >>>> +
> >>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
> >>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
> >>> Those seem like they would work to me, too. Why not use them?
> >> Quoting my previous response which I seem to have sent only to Freedreno
> >> list:
> >>
> >> "I believe it is supposed to be used only during system sleep state
> >> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
> >> fail by disabling RPM here."
> > The comment about not wanting other runpm calls to fail is valid.. but
> > that is also solvable, ie. by holding a lock around runpm calls.
> > Which I think we need to do anyway, otherwise looping over
> > gpu->active_submits is racy..
> >
> > I think pm_runtime_force_suspend/resume() is the least-bad option.. or
> > at least I'm not seeing any obvious alternative that is better
> >
> > BR,
> > -R
> We are holding gpu->lock here which will block further submissions from
> scheduler. Will active_submits still race?
>
> It is possible that there is another thread which successfully completed
> pm_runtime_get() and while it is accessing the hardware, we pull the plug on
> regulator/clock here. That will result in an obvious device crash. So I can
> think of 2 solutions:
>
> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>              mutex_lock();
>              pm_runtime_get();
>              << ... access hardware here >>
>              pm_runtime_put();
>              mutex_unlock();
>
> 2. Drop runtime votes from every submit in recover worker and wait/poll
> for regulator to collapse in case there are transient votes on
> regulator  from other threads/subsystems.
>
> Option (2) seems simpler to me.  What do you think?
>

But I think without #1 you could still be racing w/ some other path
that touches the hw, like devfreq, right?  They could be holding a
runpm ref, so even if you loop over active_submits decrementing the
runpm ref, it still doesn't drop to zero.
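
To be clear, #1 would mean wrapping every hw-touching path something
like this (sketch only; 'hw_lock' is a hypothetical mutex, not an
existing msm_gpu field), so the recover worker could hold the same lock
while it drops the votes:

	mutex_lock(&gpu->hw_lock);	/* serialize vs. recovery */
	pm_runtime_get_sync(&gpu->pdev->dev);
	/* ... read/write hw registers ... */
	pm_runtime_put(&gpu->pdev->dev);
	mutex_unlock(&gpu->hw_lock);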

BR,
-R

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-20 18:06           ` Rob Clark
@ 2022-07-20 20:38             ` Akhil P Oommen
  2022-07-22 17:25               ` Akhil P Oommen
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-20 20:38 UTC (permalink / raw)
  To: Rob Clark
  Cc: Doug Anderson, Sean Paul, Jonathan Marek, David Airlie,
	linux-arm-msm, Konrad Dybcio, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Matthias Kaehlcke, Daniel Vetter,
	Dmitry Baryshkov, Jordan Crouse, freedreno, Chia-I Wu, LKML

On 7/20/2022 11:36 PM, Rob Clark wrote:
> On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
> <quic_akhilpo@quicinc.com> wrote:
>> On 7/12/2022 10:14 PM, Rob Clark wrote:
>>> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
>>> <quic_akhilpo@quicinc.com> wrote:
>>>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>>>> Hi,
>>>>>
>>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>>>>>> There is some hardware logic under the CX domain. For a successful
>>>>>> recovery, we should ensure the cx headswitch collapses so that all the
>>>>>> stale states are cleared out. This is especially true for the a6xx
>>>>>> family, where we have a GMU co-processor.
>>>>>>
>>>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>>>> smmu. So the *struct gpu device* needs to be runtime suspended to ensure
>>>>>> that the iommu driver removes its vote on cx gdsc.
>>>>>>
>>>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>>>> ---
>>>>>>
>>>>>> (no changes since v1)
>>>>>>
>>>>>>     drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>>>     drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>>>     2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> index 4d50110..7ed347c 100644
>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu *gpu)
>>>>>>             */
>>>>>>            gmu_write(&a6xx_gpu->gmu, REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>>>
>>>>>> -       gpu->funcs->pm_suspend(gpu);
>>>>>> -       gpu->funcs->pm_resume(gpu);
>>>>>> +       /*
>>>>>> +        * Now drop all the pm_runtime usage count to allow cx gdsc to collapse.
>>>>>> +        * First drop the usage count from all active submits
>>>>>> +        */
>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>> +               pm_runtime_put(&gpu->pdev->dev);
>>>>>> +
>>>>>> +       /* And the final one from recover worker */
>>>>>> +       pm_runtime_put_sync(&gpu->pdev->dev);
>>>>>> +
>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>> +               pm_runtime_get(&gpu->pdev->dev);
>>>>>> +
>>>>>> +       pm_runtime_get_sync(&gpu->pdev->dev);
>>>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>>>> Those seem like they would work to me, too. Why not use them?
>>>> Quoting my previous response which I seem to have sent only to Freedreno
>>>> list:
>>>>
>>>> "I believe it is supposed to be used only during system sleep state
>>>> transitions. Btw, we don't want pm_runtime_get() calls from elsewhere to
>>>> fail by disabling RPM here."
>>> The comment about not wanting other runpm calls to fail is valid.. but
>>> that is also solvable, ie. by holding a lock around runpm calls.
>>> Which I think we need to do anyway, otherwise looping over
>>> gpu->active_submits is racy..
>>>
>>> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
>>> at least I'm not seeing any obvious alternative that is better
>>>
>>> BR,
>>> -R
>> We are holding gpu->lock here which will block further submissions from
>> scheduler. Will active_submits still race?
>>
>> It is possible that there is another thread which successfully completed
>> pm_runtime_get() and while it is accessing the hardware, we pull the plug on
>> regulator/clock here. That will result in an obvious device crash. So I can
>> think of 2 solutions:
>>
>> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>>               mutex_lock();
>>               pm_runtime_get();
>>               << ... access hardware here >>
>>               pm_runtime_put();
>>               mutex_unlock();
>>
>> 2. Drop runtime votes from every submit in recover worker and wait/poll
>> for regulator to collapse in case there are transient votes on
>> regulator  from other threads/subsystems.
>>
>> Option (2) seems simpler to me.  What do you think?
>>
> But I think without #1 you could still be racing w/ some other path
> that touches the hw, like devfreq, right?  They could be holding a
> runpm ref, so even if you loop over active_submits decrementing the
> runpm ref, it still doesn't drop to zero.
>
> BR,
> -R
Yes, you are right. There could be some transient votes from other
threads/drivers/subsystems. This is the reason we need to poll for cx
gdsc collapse in the next patch. Even with #1, it is difficult to
coordinate with the smmu driver and close to impossible with tz/hyp.

-Akhil.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-20  6:04                 ` Akhil P Oommen
@ 2022-07-21 16:04                   ` Akhil P Oommen
  2022-07-22 15:28                     ` Rob Clark
  0 siblings, 1 reply; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-21 16:04 UTC (permalink / raw)
  To: Rajendra Nayak, Stephen Boyd, Doug Anderson, Taniya Das, Bjorn Andersson
  Cc: devicetree, Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Rob Herring, Rob Clark, Matthias Kaehlcke, Krzysztof Kozlowski,
	Jordan Crouse, freedreno, LKML

On 7/20/2022 11:34 AM, Akhil P Oommen wrote:
> On 7/19/2022 3:26 PM, Rajendra Nayak wrote:
>>
>>
>> On 7/19/2022 12:49 PM, Stephen Boyd wrote:
>>> Quoting Akhil P Oommen (2022-07-18 23:37:16)
>>>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
>>>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
>>>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
>>>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed 
>>>>>>> since they
>>>>>>> are vote-able switches. Ideally, we should ensure that the hw has
>>>>>>> collapsed for gpu recovery because there could be transient 
>>>>>>> votes from
>>>>>>> other subsystems like hypervisor using their vote register.
>>>>>>>
>>>>>>> I am not sure how complex the plumbing to gpucc driver would be 
>>>>>>> to allow
>>>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver 
>>>>>>> does a
>>>>>>> read operation on a gpucc register which is in always-on domain. 
>>>>>>> That
>>>>>>> means we don't need to vote any resource to access this register.
>>>
>>> Reading between the lines here, you're saying that you have to read the
>>> gdsc register to make sure that the gdsc is in some state? Can you
>>> clarify exactly what you're doing? And how do you know that something
>>> else in the kernel can't cause the register to change after it is read?
>>> It certainly seems like we can't be certain because there is voting
>>> involved.
> From the gpu driver, the cx_gdscr bit[31] (power-off status) register can be
> polled to ensure that it *collapsed at least once*. We don't need to
> care if something turns the gdsc back ON after that.
>
>>
>> yes, this looks like the best case effort to get the gpu to recover, but
>> the kernel driver really has no control to make sure this condition can
>> always be met (because it depends on other entities like hyp, 
>> trustzone etc right?)
>> Why not just put a worst case polling delay?
>
> I didn't get you entirely. Where do you mean to keep the polling delay?
>>
>>>
>>>>>>>
>>>>>>> Stephen/Rajendra/Taniya, any suggestion?
>>>>> Why can't you assert a gpu reset signal with the reset APIs? This 
>>>>> series
>>>>> seems to jump through a bunch of hoops to get the gdsc and power 
>>>>> domain
>>>>> to "reset" when I don't know why any of that is necessary. Can't we
>>>>> simply assert a reset to the hardware after recovery completes so the
>>>>> device is back into a good known POR (power on reset) state?
>>>> That is because there is no register interface to reset GPU CX domain.
>>>> The recommended sequence from HW design folks is to collapse both 
>>>> cx and
>>>> gx gdsc to properly reset gpu/gmu.
>>>>
>>>
>>> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
>>> possibly mux that request along with any power domain on/off so that if
>>> the reset is requested and the power domain is off nothing happens.
>>> Otherwise if the power domain is on then it manually sequences and
>>> controls the two gdscs so that the GPU is reset and then restores the
>>> enable state of the power domain.
> It would be fatal to forcefully pull the plug on the CX gdsc
> asynchronously because there might be another gpu/smmu driver thread
> accessing registers in the cx domain.
>
> -Akhil.
>
But we can move the cx collapse polling to gpucc and expose it to the gpu
driver using the 'reset' framework. I am not very familiar with the clk
driver, but I did a rough prototype here (untested):
https://zerobin.net/?d34b5f958be3b9b8#NKGzdPy9fgcuOqXZ/XqjI7b8JWcivqe+oSTf4yWHSOU=
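
The rough shape of it: gpucc registers a reset controller whose .reset
op just polls the cx GDSCR power-off bit, and the gpu driver gets a
'cx_collapse' reset line. Something like this (sketch only; the struct
name and register offset are assumptions from the prototype, not final):

#include <linux/bits.h>
#include <linux/regmap.h>
#include <linux/reset-controller.h>

#define CX_GDSCR	0x9108		/* assumed gpucc offset of the cx GDSCR */
#define GDSC_POWER_OFF	BIT(31)

/* stand-in for the clock controller's private data */
struct gpucc_data {
	struct regmap *regmap;
	struct reset_controller_dev rcdev;
};

static int gpucc_cx_collapse_reset(struct reset_controller_dev *rcdev,
				   unsigned long id)
{
	struct gpucc_data *cc = container_of(rcdev, struct gpucc_data, rcdev);
	u32 val;

	/* Poll (worst case 500 ms) until the gdsc reports a collapse */
	return regmap_read_poll_timeout(cc->regmap, CX_GDSCR, val,
					val & GDSC_POWER_OFF, 100, 500 * 1000);
}

static const struct reset_control_ops gpucc_cx_reset_ops = {
	.reset = gpucc_cx_collapse_reset,
};

The gpu driver would then simply request the 'cx_collapse' reset line
and call reset_control_reset() after dropping its rpm votes.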

If this approach is acceptable, I will send it out as a separate series.

-Akhil.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list
  2022-07-21 16:04                   ` Akhil P Oommen
@ 2022-07-22 15:28                     ` Rob Clark
  0 siblings, 0 replies; 25+ messages in thread
From: Rob Clark @ 2022-07-22 15:28 UTC (permalink / raw)
  To: Akhil P Oommen
  Cc: Rajendra Nayak, Stephen Boyd, Doug Anderson, Taniya Das,
	Bjorn Andersson,
	open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
	Jonathan Marek, linux-arm-msm, Andy Gross, dri-devel,
	Rob Herring, Matthias Kaehlcke, Krzysztof Kozlowski,
	Jordan Crouse, freedreno, LKML

On Thu, Jul 21, 2022 at 9:04 AM Akhil P Oommen <quic_akhilpo@quicinc.com> wrote:
>
> On 7/20/2022 11:34 AM, Akhil P Oommen wrote:
> > On 7/19/2022 3:26 PM, Rajendra Nayak wrote:
> >>
> >>
> >> On 7/19/2022 12:49 PM, Stephen Boyd wrote:
> >>> Quoting Akhil P Oommen (2022-07-18 23:37:16)
> >>>> On 7/19/2022 11:19 AM, Stephen Boyd wrote:
> >>>>> Quoting Akhil P Oommen (2022-07-18 21:07:05)
> >>>>>> On 7/14/2022 11:10 AM, Akhil P Oommen wrote:
> >>>>>>> IIUC, qcom gdsc driver doesn't ensure hardware is collapsed
> >>>>>>> since they
> >>>>>>> are vote-able switches. Ideally, we should ensure that the hw has
> >>>>>>> collapsed for gpu recovery because there could be transient
> >>>>>>> votes from
> >>>>>>> other subsystems like hypervisor using their vote register.
> >>>>>>>
> >>>>>>> I am not sure how complex the plumbing to gpucc driver would be
> >>>>>>> to allow
> >>>>>>> gpu driver to check hw status. OTOH, with this patch, gpu driver
> >>>>>>> does a
> >>>>>>> read operation on a gpucc register which is in always-on domain.
> >>>>>>> That
> >>>>>>> means we don't need to vote any resource to access this register.
> >>>
> >>> Reading between the lines here, you're saying that you have to read the
> >>> gdsc register to make sure that the gdsc is in some state? Can you
> >>> clarify exactly what you're doing? And how do you know that something
> >>> else in the kernel can't cause the register to change after it is read?
> >>> It certainly seems like we can't be certain because there is voting
> >>> involved.
> > From the gpu driver, the cx_gdscr bit[31] (power-off status) register can be
> > polled to ensure that it *collapsed at least once*. We don't need to
> > care if something turns the gdsc back ON after that.
> >
> >>
> >> yes, this looks like the best case effort to get the gpu to recover, but
> >> the kernel driver really has no control to make sure this condition can
> >> always be met (because it depends on other entities like hyp,
> >> trustzone etc right?)
> >> Why not just put a worst case polling delay?
> >
> > I didn't get you entirely. Where do you mean to keep the polling delay?
> >>
> >>>
> >>>>>>>
> >>>>>>> Stephen/Rajendra/Taniya, any suggestion?
> >>>>> Why can't you assert a gpu reset signal with the reset APIs? This
> >>>>> series
> >>>>> seems to jump through a bunch of hoops to get the gdsc and power
> >>>>> domain
> >>>>> to "reset" when I don't know why any of that is necessary. Can't we
> >>>>> simply assert a reset to the hardware after recovery completes so the
> >>>>> device is back into a good known POR (power on reset) state?
> >>>> That is because there is no register interface to reset GPU CX domain.
> >>>> The recommended sequence from HW design folks is to collapse both
> >>>> cx and
> >>>> gx gdsc to properly reset gpu/gmu.
> >>>>
> >>>
> >>> Ok. One knee jerk reaction is to treat the gdsc as a reset then and
> >>> possibly mux that request along with any power domain on/off so that if
> >>> the reset is requested and the power domain is off nothing happens.
> >>> Otherwise if the power domain is on then it manually sequences and
> >>> controls the two gdscs so that the GPU is reset and then restores the
> >>> enable state of the power domain.
> > It would be fatal to forcefully pull the plug on the CX gdsc
> > asynchronously because there might be another gpu/smmu driver thread
> > accessing registers in the cx domain.
> >
> > -Akhil.
> >
> But, we can move the cx collapse polling to gpucc and expose it to gpu
> driver using 'reset' framework. I am not very familiar with clk driver,
> but I did a rough prototype here (untested):
> https://zerobin.net/?d34b5f958be3b9b8#NKGzdPy9fgcuOqXZ/XqjI7b8JWcivqe+oSTf4yWHSOU=
>
> If this approach is acceptable, I will send it out as a separate series.
>

I'm not super familiar w/ the reset framework, but this approach seems
like it would avoid needing to play games to work around runpm as
well.  So that seems like a cleaner approach.

BR,
-R

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [Freedreno] [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery
  2022-07-20 20:38             ` Akhil P Oommen
@ 2022-07-22 17:25               ` Akhil P Oommen
  0 siblings, 0 replies; 25+ messages in thread
From: Akhil P Oommen @ 2022-07-22 17:25 UTC (permalink / raw)
  To: Rob Clark
  Cc: Doug Anderson, Sean Paul, Jonathan Marek, David Airlie,
	linux-arm-msm, Konrad Dybcio, Abhinav Kumar, dri-devel,
	Bjorn Andersson, Matthias Kaehlcke, Daniel Vetter,
	Dmitry Baryshkov, Jordan Crouse, freedreno, Chia-I Wu, LKML

On 7/21/2022 2:08 AM, Akhil P Oommen wrote:
> On 7/20/2022 11:36 PM, Rob Clark wrote:
>> On Tue, Jul 12, 2022 at 12:15 PM Akhil P Oommen
>> <quic_akhilpo@quicinc.com> wrote:
>>> On 7/12/2022 10:14 PM, Rob Clark wrote:
>>>> On Mon, Jul 11, 2022 at 10:05 PM Akhil P Oommen
>>>> <quic_akhilpo@quicinc.com> wrote:
>>>>> On 7/12/2022 4:52 AM, Doug Anderson wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Fri, Jul 8, 2022 at 11:00 PM Akhil P Oommen 
>>>>>> <quic_akhilpo@quicinc.com> wrote:
>>>>>>> There is some hardware logic under the CX domain. For a successful
>>>>>>> recovery, we should ensure the cx headswitch collapses so that all
>>>>>>> the stale states are cleared out. This is especially true for the
>>>>>>> a6xx family, where we have a GMU co-processor.
>>>>>>>
>>>>>>> Currently, cx doesn't collapse due to a devlink between gpu and its
>>>>>>> smmu. So the *struct gpu device* needs to be runtime suspended 
>>>>>>> to ensure
>>>>>>> that the iommu driver removes its vote on cx gdsc.
>>>>>>>
>>>>>>> Signed-off-by: Akhil P Oommen <quic_akhilpo@quicinc.com>
>>>>>>> ---
>>>>>>>
>>>>>>> (no changes since v1)
>>>>>>>
>>>>>>>     drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 16 ++++++++++++++--
>>>>>>>     drivers/gpu/drm/msm/msm_gpu.c         |  2 --
>>>>>>>     2 files changed, 14 insertions(+), 4 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
>>>>>>> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> index 4d50110..7ed347c 100644
>>>>>>> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
>>>>>>> @@ -1278,8 +1278,20 @@ static void a6xx_recover(struct msm_gpu 
>>>>>>> *gpu)
>>>>>>>             */
>>>>>>>            gmu_write(&a6xx_gpu->gmu, 
>>>>>>> REG_A6XX_GMU_GMU_PWR_COL_KEEPALIVE, 0);
>>>>>>>
>>>>>>> -       gpu->funcs->pm_suspend(gpu);
>>>>>>> -       gpu->funcs->pm_resume(gpu);
>>>>>>> +       /*
>>>>>>> +        * Now drop all the pm_runtime usage count to allow cx 
>>>>>>> gdsc to collapse.
>>>>>>> +        * First drop the usage count from all active submits
>>>>>>> +        */
>>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>>> + pm_runtime_put(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       /* And the final one from recover worker */
>>>>>>> + pm_runtime_put_sync(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> +       for (i = gpu->active_submits; i > 0; i--)
>>>>>>> + pm_runtime_get(&gpu->pdev->dev);
>>>>>>> +
>>>>>>> + pm_runtime_get_sync(&gpu->pdev->dev);
>>>>>> In response to v1, Rob suggested pm_runtime_force_suspend/resume().
>>>>>> Those seem like they would work to me, too. Why not use them?
>>>>> Quoting my previous response which I seem to have sent only to 
>>>>> Freedreno
>>>>> list:
>>>>>
>>>>> "I believe it is supposed to be used only during system sleep state
>>>>> transitions. Btw, we don't want pm_runtime_get() calls from 
>>>>> elsewhere to
>>>>> fail by disabling RPM here."
>>>> The comment about not wanting other runpm calls to fail is valid.. but
>>>> that is also solvable, ie. by holding a lock around runpm calls.
>>>> Which I think we need to do anyway, otherwise looping over
>>>> gpu->active_submits is racy..
>>>>
>>>> I think pm_runtime_force_suspend/resume() is the least-bad option.. or
>>>> at least I'm not seeing any obvious alternative that is better
>>>>
>>>> BR,
>>>> -R
>>> We are holding gpu->lock here which will block further submissions from
>>> scheduler. Will active_submits still race?
>>>
>>> It is possible that there is another thread which successfully completed
>>> pm_runtime_get() and while it is accessing the hardware, we pull the
>>> plug on regulator/clock here. That will result in an obvious device
>>> crash. So I can think of 2 solutions:
>>>
>>> 1. wrap *every* pm_runtime_get/put with a mutex. Something like:
>>>               mutex_lock();
>>>               pm_runtime_get();
>>>               << ... access hardware here >>
>>>               pm_runtime_put();
>>>               mutex_unlock();
>>>
>>> 2. Drop runtime votes from every submit in recover worker and wait/poll
>>> for regulator to collapse in case there are transient votes on
>>> regulator  from other threads/subsystems.
>>>
>>> Option (2) seems simpler to me.  What do you think?
>>>
>> But I think without #1 you could still be racing w/ some other path
>> that touches the hw, like devfreq, right?  They could be holding a
>> runpm ref, so even if you loop over active_submits decrementing the
>> runpm ref, it still doesn't drop to zero.
>>
>> BR,
>> -R
> Yes, you are right. There could be some transient votes from other
> threads/drivers/subsystems. This is the reason we need to poll for cx
> gdsc collapse in the next patch. Even with #1, it is difficult to
> coordinate with the smmu driver and close to impossible with tz/hyp.
>
> -Akhil.

Rob,

Summarizing my responses:
1. We cannot blindly force the cx headswitch off because that would
impact other gpu driver threads/smmu driver/tz/hyp etc. which access cx
domain registers at the same time.
2. We need to drop all our rpm votes on the 'gpu device' instead of a
single vote on the 'gmu device' because of [1]. Otherwise, the smmu
driver's vote on the cx headswitch will block its collapse forever.

This is the high-level sequence implemented in the current version of
the series; a rough sketch follows below:
1. Drop all rpm votes on the 'gpu device', which will indirectly let the
smmu driver drop its vote on the cx HS.
2. To take care of transient votes from other threads/hyp etc., poll the
cx gdsc hw status register to ensure that it has collapsed at least
once. (We might be able to move this to the gpucc driver depending on
the consensus on the other patch.)
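
Something like this (sketch only; poll_cx_gdsc_collapse() is a
placeholder for whichever polling mechanism we settle on, and gpu->lock
is assumed to be held so that active_submits is stable):

static void a6xx_recover_collapse_cx(struct msm_gpu *gpu)
{
	int i;

	/* 1. Drop every rpm vote on the gpu device; via the devlink this
	 *    also lets the smmu driver release its vote on the cx HS. */
	for (i = gpu->active_submits; i > 0; i--)
		pm_runtime_put(&gpu->pdev->dev);
	pm_runtime_put_sync(&gpu->pdev->dev);	/* recover worker's vote */

	/* 2. Transient voters (hyp/tz etc.) may briefly keep cx up, so
	 *    poll the always-on gpucc status register for a collapse. */
	if (poll_cx_gdsc_collapse(gpu))
		DRM_DEV_ERROR(&gpu->pdev->dev, "cx gdsc didn't collapse\n");

	/* Restore the votes dropped above and power the gpu back up */
	pm_runtime_get_sync(&gpu->pdev->dev);
	for (i = gpu->active_submits; i > 0; i--)
		pm_runtime_get(&gpu->pdev->dev);
}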

[1] https://lkml.org/lkml/2018/8/30/590

-Akhil.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2022-07-22 17:26 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-09  5:59 [PATCH v2 0/7] Improve GPU Recovery Akhil P Oommen
2022-07-09  5:59 ` [PATCH v2 1/7] drm/msm: Remove unnecessary pm_runtime_get/put Akhil P Oommen
2022-07-09  5:59 ` [PATCH v2 2/7] drm/msm: Correct pm_runtime votes in recover worker Akhil P Oommen
2022-07-09  5:59 ` [PATCH v2 3/7] drm/msm: Fix cx collapse issue during recovery Akhil P Oommen
2022-07-11 23:22   ` Doug Anderson
2022-07-12  5:04     ` [Freedreno] " Akhil P Oommen
2022-07-12 16:44       ` Rob Clark
2022-07-12 19:15         ` Akhil P Oommen
2022-07-20 18:06           ` Rob Clark
2022-07-20 20:38             ` Akhil P Oommen
2022-07-22 17:25               ` Akhil P Oommen
2022-07-09  5:59 ` [PATCH v2 4/7] drm/msm: Ensure cx gdsc collapse " Akhil P Oommen
2022-07-09  5:59 ` [PATCH v2 5/7] arm64: dts: qcom: sc7280: Update gpu register list Akhil P Oommen
2022-07-11 23:27   ` Doug Anderson
2022-07-14  5:40     ` Akhil P Oommen
2022-07-19  4:07       ` [Freedreno] " Akhil P Oommen
2022-07-19  5:49         ` Stephen Boyd
2022-07-19  6:37           ` Akhil P Oommen
2022-07-19  7:19             ` Stephen Boyd
2022-07-19  9:56               ` Rajendra Nayak
2022-07-20  6:04                 ` Akhil P Oommen
2022-07-21 16:04                   ` Akhil P Oommen
2022-07-22 15:28                     ` Rob Clark
2022-07-09  5:59 ` [PATCH v2 6/7] drm/msm/a6xx: Improve gpu recovery sequence Akhil P Oommen
2022-07-09  5:59 ` [PATCH v2 7/7] drm/msm/a6xx: Handle GMU prepare-slumber hfi failure Akhil P Oommen
