[PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling

* [PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling
@ 2021-06-01 22:47 ` Rob Clark
  0 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Rob Clark, Abhinav Kumar, Akhil P Oommen,
	AngeloGioacchino Del Regno, Bjorn Andersson, Dave Airlie,
	Douglas Anderson, Eric Anholt,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list:IOMMU DRIVERS,
	Iskren Chernev, Joerg Roedel, Jonathan Marek, Kalyan Thota,
	Konrad Dybcio, Krishna Reddy, Kristian H. Kristensen,
	Laurent Pinchart, Lee Jones, moderated list:ARM SMMU DRIVERS,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list,
	Marijn Suijten, Maxime Ripard, Qinglang Miao, Robin Murphy,
	Sai Prakash Ranjan, Sharat Masetty, Stephen Boyd,
	Thomas Zimmermann, Ville Syrjälä,
	Will Deacon, Zhenzhong Duan

From: Rob Clark <robdclark@chromium.org>

This picks up an earlier series[1] from Jordan, and adds additional
support needed to generate GPU devcore dumps on iova faults.  Original
description:

This is a stack to add an Adreno GPU specific handler for pagefaults. The first
patch starts by wiring up report_iommu_fault() for arm-smmu. The next patch adds
an adreno-smmu-priv function hook to capture a handful of important debugging
registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
third patch to print more detailed information on a page fault, such as the
TTBR0 for the pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even more
extensive GPU side page fault debugging capabilities.

v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
    resume translation after it has had a chance to snapshot the GPU's
    state
v3: Always clear FSR even if the target driver is going to handle resume
v2: Fix comment wording and function pointer check per Rob Clark

[1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcrouse@codeaurora.org/

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
    info
  drm/msm: Improve the a6xx page fault handler

Rob Clark (3):
  iommu/arm-smmu-qcom: Add stall support
  drm/msm: Add crashdump support for stalled SMMU
  drm/msm: devcoredump iommu fault support

 drivers/gpu/drm/msm/adreno/a2xx_gpu.c       |   2 +-
 drivers/gpu/drm/msm/adreno/a3xx_gpu.c       |   2 +-
 drivers/gpu/drm/msm/adreno/a4xx_gpu.c       |   2 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |   9 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 101 +++++++++++++++++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.h       |   2 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  43 +++++++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c     |  15 +++
 drivers/gpu/drm/msm/msm_debugfs.c           |   2 +-
 drivers/gpu/drm/msm/msm_gem.h               |   1 +
 drivers/gpu/drm/msm/msm_gem_submit.c        |   1 +
 drivers/gpu/drm/msm/msm_gpu.c               |  55 ++++++++++-
 drivers/gpu/drm/msm/msm_gpu.h               |  19 +++-
 drivers/gpu/drm/msm/msm_gpummu.c            |   5 +
 drivers/gpu/drm/msm/msm_iommu.c             |  22 ++++-
 drivers/gpu/drm/msm/msm_mmu.h               |   5 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 ++++++++++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |   9 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.h       |   2 +
 include/linux/adreno-smmu-priv.h            |  38 +++++++-
 20 files changed, 354 insertions(+), 31 deletions(-)

-- 
2.31.1
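
Annotation (not part of the original posting): the v4 note describes a
stall, snapshot, resume sequence. The following is a minimal userspace
sketch of that ordering only, assuming a simplified model of a context
bank. Every mock_* name and struct layout is hypothetical and is not the
adreno-smmu-priv interface added by this series; the field names merely
echo the registers mentioned in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical register snapshot; fields echo the registers named above. */
struct mock_fault_info {
	uint64_t ttbr0;
	uint32_t contextidr;
	uint32_t fsynr0;
	uint32_t fsynr1;
};

/* Hypothetical context bank: while stalled, its fault state is held still. */
struct mock_context_bank {
	struct mock_fault_info regs;
	bool stall_enabled;
	bool stalled;
};

/* Hardware side of the model: a fault stalls the bank only if stalling is on. */
void mock_raise_fault(struct mock_context_bank *cb)
{
	assert(cb != NULL);
	cb->stalled = cb->stall_enabled;
}

/* Driver side: copy the registers, but only while the bank is stalled. */
bool mock_snapshot(const struct mock_context_bank *cb,
		   struct mock_fault_info *out)
{
	assert(cb != NULL && out != NULL);
	if (!cb->stalled)
		return false;	/* nothing is holding the state in place */
	*out = cb->regs;
	return true;
}

/* Driver side: release the stall once the snapshot has been taken. */
void mock_resume(struct mock_context_bank *cb)
{
	assert(cb != NULL);
	cb->stalled = false;
}
```

The point of the model is the ordering: the snapshot is only meaningful
between the fault and the resume, which is why the resume is left to the
GPU driver rather than done unconditionally in the SMMU fault path.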


^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] drm/msm/dpu: Delete bonkers code
  2021-06-01 22:47 ` Rob Clark
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Rob Clark, Stephen Boyd, John Stultz, Rob Clark,
	Sean Paul, David Airlie, Daniel Vetter, Abhinav Kumar,
	Maxime Ripard, Thomas Zimmermann, Stephen Boyd, Kalyan Thota,
	Sakari Ailus, Qinglang Miao, Laurent Pinchart, Lee Jones,
	Dmitry Baryshkov, Ville Syrjälä,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

From: Rob Clark <robdclark@chromium.org>

dpu_crtc_atomic_flush() was directly poking its attached planes in a
code path that ended up in dpu_plane_atomic_update(), even if the plane
was not involved in the current atomic update.  While a bit dubious,
this worked before because plane->state would always point to something
valid.  But now using drm_atomic_get_new_plane_state() we could get a
NULL state pointer instead, leading to:

   [   20.873273] Call trace:
   [   20.875740]  dpu_plane_atomic_update+0x5c/0xed0
   [   20.880311]  dpu_plane_restore+0x40/0x88
   [   20.884266]  dpu_crtc_atomic_flush+0xf4/0x208
   [   20.888660]  drm_atomic_helper_commit_planes+0x150/0x238
   [   20.894014]  msm_atomic_commit_tail+0x1d4/0x7a0
   [   20.898579]  commit_tail+0xa4/0x168
   [   20.902102]  drm_atomic_helper_commit+0x164/0x178
   [   20.906841]  drm_atomic_commit+0x54/0x60
   [   20.910798]  drm_atomic_connector_commit_dpms+0x10c/0x118
   [   20.916236]  drm_mode_obj_set_property_ioctl+0x1e4/0x440
   [   20.921588]  drm_connector_property_set_ioctl+0x60/0x88
   [   20.926852]  drm_ioctl_kernel+0xd0/0x120
   [   20.930807]  drm_ioctl+0x21c/0x478
   [   20.934235]  __arm64_sys_ioctl+0xa8/0xe0
   [   20.938193]  invoke_syscall+0x64/0x130
   [   20.941977]  el0_svc_common.constprop.3+0x5c/0xe0
   [   20.946716]  do_el0_svc+0x80/0xa0
   [   20.950058]  el0_svc+0x20/0x30
   [   20.953145]  el0_sync_handler+0x88/0xb0
   [   20.957014]  el0_sync+0x13c/0x140

The reason for the codepath seems dubious; the atomic suspend/resume
helpers should handle the power-collapse case.  If not, the CRTC's
atomic_check() should be adding the planes to the atomic update.

Reported-by: Stephen Boyd <sboyd@kernel.org>
Reported-by: John Stultz <john.stultz@linaro.org>
Fixes: 37418bf14c13 ("drm: Use state helper instead of the plane state pointer")
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c  | 10 ----------
 drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c | 16 ----------------
 drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h |  6 ------
 3 files changed, 32 deletions(-)

diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c
index 7c29976be243..18bc76b7f1a3 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_crtc.c
@@ -648,16 +648,6 @@ static void dpu_crtc_atomic_flush(struct drm_crtc *crtc,
 	if (unlikely(!cstate->num_mixers))
 		return;
 
-	/*
-	 * For planes without commit update, drm framework will not add
-	 * those planes to current state since hardware update is not
-	 * required. However, if those planes were power collapsed since
-	 * last commit cycle, driver has to restore the hardware state
-	 * of those planes explicitly here prior to plane flush.
-	 */
-	drm_atomic_crtc_for_each_plane(plane, crtc)
-		dpu_plane_restore(plane, state);
-
 	/* update performance setting before crtc kickoff */
 	dpu_core_perf_crtc_update(crtc, 1, false);
 
diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c b/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c
index df7f3d3afd8b..7a993547eb75 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.c
@@ -1258,22 +1258,6 @@ static void dpu_plane_atomic_update(struct drm_plane *plane,
 	}
 }
 
-void dpu_plane_restore(struct drm_plane *plane, struct drm_atomic_state *state)
-{
-	struct dpu_plane *pdpu;
-
-	if (!plane || !plane->state) {
-		DPU_ERROR("invalid plane\n");
-		return;
-	}
-
-	pdpu = to_dpu_plane(plane);
-
-	DPU_DEBUG_PLANE(pdpu, "\n");
-
-	dpu_plane_atomic_update(plane, state);
-}
-
 static void dpu_plane_destroy(struct drm_plane *plane)
 {
 	struct dpu_plane *pdpu = plane ? to_dpu_plane(plane) : NULL;
diff --git a/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h b/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h
index 03b6365a750c..34e03ac05f4a 100644
--- a/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h
+++ b/drivers/gpu/drm/msm/disp/dpu1/dpu_plane.h
@@ -84,12 +84,6 @@ bool is_dpu_plane_virtual(struct drm_plane *plane);
 void dpu_plane_get_ctl_flush(struct drm_plane *plane, struct dpu_hw_ctl *ctl,
 		u32 *flush_sspp);
 
-/**
- * dpu_plane_restore - restore hw state if previously power collapsed
- * @plane: Pointer to drm plane structure
- */
-void dpu_plane_restore(struct drm_plane *plane, struct drm_atomic_state *state);
-
 /**
  * dpu_plane_flush - final plane operations before commit flush
  * @plane: Pointer to drm plane structure
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 1/6] iommu/arm-smmu: Add support for driver IOMMU fault handlers
  2021-06-01 22:47 ` Rob Clark
  (?)
  (?)
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Jordan Crouse, Rob Clark, Will Deacon,
	Robin Murphy, Joerg Roedel, Krishna Reddy, Sai Prakash Ranjan,
	moderated list:ARM SMMU DRIVERS, open list:IOMMU DRIVERS,
	open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Call report_iommu_fault() to allow upper-level drivers to register their
own fault handlers.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..b4b32d31fc06 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	int idx = smmu_domain->cfg.cbndx;
+	int ret;
 
 	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
 	if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
 	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
 
-	dev_err_ratelimited(smmu->dev,
-	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
+	ret = report_iommu_fault(domain, NULL, iova,
+		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+
+	if (ret == -ENOSYS)
+		dev_err_ratelimited(smmu->dev,
+		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
 			    fsr, iova, fsynr, cbfrsynra, idx);
 
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
-- 
2.31.1
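
Annotation (not part of the original posting): the diff above reports
the fault first and keeps the generic "Unhandled context fault" message
only for the -ENOSYS case, i.e. when no handler is registered. The
following is a minimal userspace sketch of that dispatch, assuming a
stripped-down domain object. The mock_* functions are hypothetical
stand-ins for iommu_set_fault_handler() and report_iommu_fault(), and
the bit position used for the write-not-read flag is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Userspace stand-ins; every mock_* / MOCK_* name is hypothetical. */
#define MOCK_FAULT_READ		0x0
#define MOCK_FAULT_WRITE	0x1
#define MOCK_FSYNR0_WNR		(1u << 4)	/* assumed write-not-read bit */

struct mock_domain;

typedef int (*mock_fault_handler_t)(struct mock_domain *domain,
				    unsigned long iova, int flags,
				    void *token);

struct mock_domain {
	mock_fault_handler_t handler;	/* NULL until a driver registers one */
	void *token;
};

/* Stand-in for handler registration: remember the driver's callback. */
void mock_set_fault_handler(struct mock_domain *domain,
			    mock_fault_handler_t handler, void *token)
{
	assert(domain != NULL);
	domain->handler = handler;
	domain->token = token;
}

/* Stand-in for fault reporting: -ENOSYS means nobody is listening. */
int mock_report_fault(struct mock_domain *domain, unsigned long iova,
		      int flags)
{
	if (!domain->handler)
		return -ENOSYS;
	return domain->handler(domain, iova, flags, domain->token);
}

/*
 * Shape of the patched fault path: report first, fall back to the
 * generic log only when no handler exists. Returns 1 if the fallback
 * message would have been printed, 0 otherwise.
 */
int mock_context_fault(struct mock_domain *domain, unsigned long iova,
		       unsigned int fsynr)
{
	int flags = (fsynr & MOCK_FSYNR0_WNR) ? MOCK_FAULT_WRITE
					      : MOCK_FAULT_READ;

	return mock_report_fault(domain, iova, flags) == -ENOSYS;
}

/* Example driver-side handler: records what it was told. */
struct mock_fault_log {
	unsigned long iova;
	int flags;
	int count;
};

int mock_gpu_fault_handler(struct mock_domain *domain, unsigned long iova,
			   int flags, void *token)
{
	struct mock_fault_log *log = token;

	(void)domain;
	log->iova = iova;
	log->flags = flags;
	log->count++;
	return 0;
}
```

With no handler registered, mock_context_fault() takes the fallback
path; once mock_gpu_fault_handler is registered, the same fault is
delivered to the driver and the fallback message is skipped.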


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 2/6] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
  2021-06-01 22:47 ` Rob Clark
  (?)
  (?)
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Jordan Crouse, Rob Clark, Will Deacon,
	Robin Murphy, Joerg Roedel, Bjorn Andersson, Vinod Koul,
	Krishna Reddy, Sai Prakash Ranjan,
	moderated list:ARM SMMU DRIVERS, open list:IOMMU DRIVERS,
	open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Add a callback to adreno-smmu-priv that reads the SMMU registers
describing a fault (FSR, FSYNR0, FSYNR1, FAR, CBFRSYNRA, TTBR0 and
CONTEXTIDR) so that the GPU driver can report richer debug
information when a page fault occurs.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 ++++++++++++
 drivers/iommu/arm/arm-smmu/arm-smmu.h      |  2 ++
 include/linux/adreno-smmu-priv.h           | 31 +++++++++++++++++++++-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..b2e31ea84128 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
+static void qcom_adreno_smmu_get_fault_info(const void *cookie,
+		struct adreno_smmu_fault_info *info)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
+	info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
+	info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
+	info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
+	info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
+	info->ttbr0 = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
+	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->cookie = smmu_domain;
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
+	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
 
 	return 0;
 }
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index c31a59d35c64..84c21c4b0691 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
 #define ARM_SMMU_CB_FSYNR0		0x68
 #define ARM_SMMU_FSYNR0_WNR		BIT(4)
 
+#define ARM_SMMU_CB_FSYNR1		0x6c
+
 #define ARM_SMMU_CB_S1_TLBIVA		0x600
 #define ARM_SMMU_CB_S1_TLBIASID		0x610
 #define ARM_SMMU_CB_S1_TLBIVAL		0x620
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index a889f28afb42..53fe32fb9214 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -8,6 +8,32 @@
 
 #include <linux/io-pgtable.h>
 
+/**
+ * struct adreno_smmu_fault_info - container for key fault information
+ *
+ * @far: The faulting IOVA from ARM_SMMU_CB_FAR
+ * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
+ * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
+ * @fsr: The fault status from ARM_SMMU_CB_FSR
+ * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
+ * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR1
+ * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
+ *
+ * This struct passes key page fault information back to the GPU driver
+ * through the get_fault_info function pointer. The GPU driver can use
+ * this information to print informative log messages and to provide
+ * deeper GPU-specific insight into the fault.
+ */
+struct adreno_smmu_fault_info {
+	u64 far;
+	u64 ttbr0;
+	u32 contextidr;
+	u32 fsr;
+	u32 fsynr0;
+	u32 fsynr1;
+	u32 cbfrsynra;
+};
+
 /**
  * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
  *
@@ -17,6 +43,8 @@
  * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank.  A
  *                 NULL config disables TTBR0 translation, otherwise
  *                 TTBR0 translation is enabled with the specified cfg
+ * @get_fault_info: Called by the GPU fault handler to get information about
+ *                  the fault
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -31,6 +59,7 @@ struct adreno_smmu_priv {
     const void *cookie;
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
+    void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
 };
 
-#endif /* __ADRENO_SMMU_PRIV_H */
\ No newline at end of file
+#endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1
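
The callback contract introduced here (an opaque cookie plus a caller-owned
result struct) can be illustrated with a small user-space sketch. Everything
below is an assumption made for illustration: struct fault_info mirrors the
fields of struct adreno_smmu_fault_info from the patch, while struct smmu_priv,
struct fake_context_bank, fake_get_fault_info() and query_fault_info() are
hypothetical names that do not exist in the kernel. Register reads are replaced
by copies from a plain struct.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirrors the fields of struct adreno_smmu_fault_info in the patch. */
struct fault_info {
	uint64_t far;
	uint64_t ttbr0;
	uint32_t contextidr;
	uint32_t fsr;
	uint32_t fsynr0;
	uint32_t fsynr1;
	uint32_t cbfrsynra;
};

/* Hypothetical, reduced version of struct adreno_smmu_priv. */
struct smmu_priv {
	const void *cookie;
	void (*get_fault_info)(const void *cookie, struct fault_info *info);
};

/* Hypothetical register snapshot standing in for a real context bank. */
struct fake_context_bank {
	uint64_t far, ttbr0;
	uint32_t contextidr, fsr, fsynr0, fsynr1, cbfrsynra;
};

/* Provider side: copy the "registers" into the caller's struct. */
static void fake_get_fault_info(const void *cookie, struct fault_info *info)
{
	const struct fake_context_bank *cb = cookie;

	info->far = cb->far;
	info->ttbr0 = cb->ttbr0;
	info->contextidr = cb->contextidr;
	info->fsr = cb->fsr;
	info->fsynr0 = cb->fsynr0;
	info->fsynr1 = cb->fsynr1;
	info->cbfrsynra = cb->cbfrsynra;
}

/*
 * Consumer side: returns a pointer to the filled-in storage, or NULL
 * when the provider did not install the hook.
 */
static struct fault_info *query_fault_info(const struct smmu_priv *priv,
					   struct fault_info *storage)
{
	if (!priv->get_fault_info)
		return NULL;

	priv->get_fault_info(priv->cookie, storage);
	return storage;
}
```

Returning NULL when the hook is absent lets a consumer fall back to a shorter
log message rather than printing uninitialised data. Note that far and ttbr0
are 64 bits wide, so the provider has to preserve the upper half of both.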


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 3/6] drm/msm: Improve the a6xx page fault handler
  2021-06-01 22:47 ` Rob Clark
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Jordan Crouse, Rob Clark, Rob Clark, Sean Paul,
	David Airlie, Daniel Vetter, AngeloGioacchino Del Regno,
	Konrad Dybcio, Kristian H. Kristensen, Marijn Suijten,
	Jonathan Marek, Akhil P Oommen, Sai Prakash Ranjan, Eric Anholt,
	Sharat Masetty, Douglas Anderson,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Use the new adreno-smmu-priv fault info function to read additional
SMMU debug registers. Print the current TTBR0 to help debug
per-instance pagetables, and decode which GPU block generated the
faulting request.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c |  4 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
 drivers/gpu/drm/msm/msm_iommu.c       | 11 +++-
 drivers/gpu/drm/msm/msm_mmu.h         |  4 +-
 4 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index ce13d49e615b..a0eef5d9b89b 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 	return true;
 }
 
-static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
+static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
 {
 	struct msm_gpu *gpu = arg;
 	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
@@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
 			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));
 
-	return -EFAULT;
+	return 0;
 }
 
 static void a5xx_cp_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 23464d735682..094dc17fd20f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -959,18 +959,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	msm_gpu_hw_init(gpu);
 }
 
-static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
+static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
+{
+	static const char *uche_clients[7] = {
+		"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ",
+	};
+	u32 val;
+
+	if (mid < 1 || mid > 3)
+		return "UNKNOWN";
+
+	/*
+	 * The source of the data depends on the mid ID read from FSYNR1
+	 * and the client ID read from the UCHE block.
+	 */
+	val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
+
+	/* mid = 3 is most precise and refers to only one block per client */
+	if (mid == 3)
+		return uche_clients[val & 7];
+
+	/* For mid=2 the source is TP or VFD except when the client id is 0 */
+	if (mid == 2)
+		return ((val & 7) == 0) ? "TP" : "TP|VFD";
+
+	/* For mid=1 just return "UCHE" as a catchall for everything else */
+	return "UCHE";
+}
+
+static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
+{
+	if (id == 0)
+		return "CP";
+	else if (id == 4)
+		return "CCU";
+	else if (id == 6)
+		return "CDP Prefetch";
+
+	return a6xx_uche_fault_block(gpu, id);
+}
+
+#define ARM_SMMU_FSR_TF			BIT(1)
+#define ARM_SMMU_FSR_PF			BIT(3)
+#define ARM_SMMU_FSR_EF			BIT(4)
+
+static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
 {
 	struct msm_gpu *gpu = arg;
+	struct adreno_smmu_fault_info *info = data;
+	const char *type = "UNKNOWN";
 
-	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
+	/*
+	 * Print a default message if we couldn't get the data from the
+	 * adreno-smmu-priv
+	 */
+	if (!info) {
+		pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
 			iova, flags,
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
 
-	return -EFAULT;
+		return 0;
+	}
+
+	if (info->fsr & ARM_SMMU_FSR_TF)
+		type = "TRANSLATION";
+	else if (info->fsr & ARM_SMMU_FSR_PF)
+		type = "PERMISSION";
+	else if (info->fsr & ARM_SMMU_FSR_EF)
+		type = "EXTERNAL";
+
+	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
+			info->ttbr0, iova,
+			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
+			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
+
+	return 0;
 }
 
 static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 50d881794758..6975b95c3c29 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
 		unsigned long iova, int flags, void *arg)
 {
 	struct msm_iommu *iommu = arg;
+	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
+	struct adreno_smmu_fault_info info, *ptr = NULL;
+
+	if (adreno_smmu->get_fault_info) {
+		adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
+		ptr = &info;
+	}
+
 	if (iommu->base.handler)
-		return iommu->base.handler(iommu->base.arg, iova, flags);
+		return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
+
 	pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
 	return 0;
 }
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index 61ade89d9e48..a88f44c3268d 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -26,7 +26,7 @@ enum msm_mmu_type {
 struct msm_mmu {
 	const struct msm_mmu_funcs *funcs;
 	struct device *dev;
-	int (*handler)(void *arg, unsigned long iova, int flags);
+	int (*handler)(void *arg, unsigned long iova, int flags, void *data);
 	void *arg;
 	enum msm_mmu_type type;
 };
@@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
 struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);
 
 static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
-		int (*handler)(void *arg, unsigned long iova, int flags))
+		int (*handler)(void *arg, unsigned long iova, int flags, void *data))
 {
 	mmu->arg = arg;
 	mmu->handler = handler;
-- 
2.31.1
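
The decoding done by the new a6xx handler can be exercised outside the kernel
with a sketch like the one below. It is an illustration under stated
assumptions, not the driver code: fault_type() and fault_block() are
hypothetical names, the FSR bit positions are copied from the defines in this
patch, and the UCHE client value is passed in as a parameter instead of being
read from REG_A6XX_UCHE_CLIENT_PF. The patch indexes a seven-entry table with
(val & 7), which can evaluate to 7; the bounds check in the sketch is an
addition that is not part of the patch.

```c
#include <stdint.h>

/* Fault status bits, using the same positions as the defines in the patch. */
#define FSR_TF	(1u << 1)	/* translation fault */
#define FSR_PF	(1u << 3)	/* permission fault */
#define FSR_EF	(1u << 4)	/* external fault */

/* Maps an FSR value to the fault type string printed by the handler. */
static const char *fault_type(uint32_t fsr)
{
	if (fsr & FSR_TF)
		return "TRANSLATION";
	if (fsr & FSR_PF)
		return "PERMISSION";
	if (fsr & FSR_EF)
		return "EXTERNAL";

	return "UNKNOWN";
}

/*
 * Maps the mid (low byte of FSYNR1) and the raw UCHE client value to a
 * block name, following the rules described in the patch comments.
 */
static const char *fault_block(uint32_t mid, uint32_t client_pf)
{
	static const char *const uche_clients[] = {
		"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ",
	};
	uint32_t client = client_pf & 7;

	switch (mid) {
	case 0:
		return "CP";
	case 4:
		return "CCU";
	case 6:
		return "CDP Prefetch";
	case 3:
		/* Most precise case; bounds check added by this sketch. */
		return client < 7 ? uche_clients[client] : "UNKNOWN";
	case 2:
		return client == 0 ? "TP" : "TP|VFD";
	case 1:
		return "UCHE";
	default:
		return "UNKNOWN";
	}
}
```

Taking the client value as a parameter keeps the decoding logic free of
hardware access, so the table can be checked on any machine.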


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 3/6] drm/msm: Improve the a6xx page fault handler
@ 2021-06-01 22:47   ` Rob Clark
  0 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Rob Clark, open list:DRM DRIVER FOR MSM ADRENO GPU,
	Sai Prakash Ranjan, Douglas Anderson, Akhil P Oommen,
	Jonathan Marek, David Airlie,
	open list:DRM DRIVER FOR MSM ADRENO GPU, Sharat Masetty,
	Konrad Dybcio, Jordan Crouse, Jordan Crouse,
	Kristian H. Kristensen, AngeloGioacchino Del Regno,
	Marijn Suijten, Sean Paul, open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Use the new adreno-smmu-priv fault info function to read additional
SMMU debug registers. Print the current TTBR0 to help debug
per-instance pagetables, and decode which GPU block generated the
faulting request.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c |  4 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
 drivers/gpu/drm/msm/msm_iommu.c       | 11 +++-
 drivers/gpu/drm/msm/msm_mmu.h         |  4 +-
 4 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index ce13d49e615b..a0eef5d9b89b 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 	return true;
 }
 
-static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
+static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
 {
 	struct msm_gpu *gpu = arg;
 	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
@@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
 			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));
 
-	return -EFAULT;
+	return 0;
 }
 
 static void a5xx_cp_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 23464d735682..094dc17fd20f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -959,18 +959,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	msm_gpu_hw_init(gpu);
 }
 
-static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
+static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
+{
+	static const char *uche_clients[8] = {
+		"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ", "UNKNOWN",
+	};
+	u32 val;
+
+	if (mid < 1 || mid > 3)
+		return "UNKNOWN";
+
+	/*
+	 * The source of the data depends on the mid ID read from FSYNR1
+	 * and the client ID read from the UCHE block
+	 */
+	val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
+
+	/* mid = 3 is most precise and refers to only one block per client */
+	if (mid == 3)
+		return uche_clients[val & 7];
+
+	/* For mid=2 the source is TP or VFD except when the client id is 0 */
+	if (mid == 2)
+		return ((val & 7) == 0) ? "TP" : "TP|VFD";
+
+	/* For mid=1 just return "UCHE" as a catchall for everything else */
+	return "UCHE";
+}
+
+static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
+{
+	if (id == 0)
+		return "CP";
+	else if (id == 4)
+		return "CCU";
+	else if (id == 6)
+		return "CDP Prefetch";
+
+	return a6xx_uche_fault_block(gpu, id);
+}
+
+#define ARM_SMMU_FSR_TF			BIT(1)
+#define ARM_SMMU_FSR_PF			BIT(3)
+#define ARM_SMMU_FSR_EF			BIT(4)
+
+static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
 {
 	struct msm_gpu *gpu = arg;
+	struct adreno_smmu_fault_info *info = data;
+	const char *type = "UNKNOWN";
 
-	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
+	/*
+	 * Print a default message if we couldn't get the fault info from
+	 * adreno-smmu-priv
+	 */
+	if (!info) {
+		pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
 			iova, flags,
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
 
-	return -EFAULT;
+		return 0;
+	}
+
+	if (info->fsr & ARM_SMMU_FSR_TF)
+		type = "TRANSLATION";
+	else if (info->fsr & ARM_SMMU_FSR_PF)
+		type = "PERMISSION";
+	else if (info->fsr & ARM_SMMU_FSR_EF)
+		type = "EXTERNAL";
+
+	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
+			info->ttbr0, iova,
+			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
+			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
+
+	return 0;
 }
 
 static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 50d881794758..6975b95c3c29 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
 		unsigned long iova, int flags, void *arg)
 {
 	struct msm_iommu *iommu = arg;
+	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
+	struct adreno_smmu_fault_info info, *ptr = NULL;
+
+	if (adreno_smmu->get_fault_info) {
+		adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
+		ptr = &info;
+	}
+
 	if (iommu->base.handler)
-		return iommu->base.handler(iommu->base.arg, iova, flags);
+		return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
+
 	pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
 	return 0;
 }
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index 61ade89d9e48..a88f44c3268d 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -26,7 +26,7 @@ enum msm_mmu_type {
 struct msm_mmu {
 	const struct msm_mmu_funcs *funcs;
 	struct device *dev;
-	int (*handler)(void *arg, unsigned long iova, int flags);
+	int (*handler)(void *arg, unsigned long iova, int flags, void *data);
 	void *arg;
 	enum msm_mmu_type type;
 };
@@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
 struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);
 
 static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
-		int (*handler)(void *arg, unsigned long iova, int flags))
+		int (*handler)(void *arg, unsigned long iova, int flags, void *data))
 {
 	mmu->arg = arg;
 	mmu->handler = handler;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 4/6] iommu/arm-smmu-qcom: Add stall support
  2021-06-01 22:47 ` Rob Clark
  (?)
  (?)
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Rob Clark, Will Deacon, Robin Murphy,
	Joerg Roedel, Bjorn Andersson, Isaac J. Manjarres,
	Sai Prakash Ranjan, moderated list:ARM SMMU DRIVERS,
	open list:IOMMU DRIVERS, open list

From: Rob Clark <robdclark@chromium.org>

Add, via the adreno-smmu-priv interface, a way for the GPU driver to ask
the SMMU to stall translation on faults and to resume it later, either
retrying or terminating the faulting transaction.

This will be used on the GPU side to "freeze" the GPU while we snapshot
useful state for devcoredump.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
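The bookkeeping this patch adds can be modelled in plain C.  The
following is a sketch under stated assumptions, not the driver code:
struct demo_smmu and the demo_* helpers are hypothetical stand-ins for
struct qcom_smmu and the real register accessors, and the bit positions
are illustrative and should be checked against arm-smmu.h.  It shows
that set_stall() only records a per-context-bank bit, which takes effect
the next time SCTLR is written.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT(n) (1u << (n))

/* Illustrative bit positions (assumption); verify against arm-smmu.h. */
#define SCTLR_CFCFG      BIT(7) /* stall on context fault */
#define SCTLR_HUPCF      BIT(8) /* keep processing after a fault */
#define RESUME_TERMINATE BIT(0) /* terminate instead of retry */

/* Hypothetical model of the state touched by the patch. */
struct demo_smmu {
	uint32_t stall_enabled; /* one bit per context bank */
	uint32_t sctlr[32];     /* stand-in for per-bank SCTLR */
	uint32_t resume[32];    /* last value written to per-bank RESUME */
};

/* Mirrors qcom_adreno_smmu_set_stall(): only the bitmask changes. */
static void demo_set_stall(struct demo_smmu *s, unsigned int cbndx,
			   bool enabled)
{
	if (enabled)
		s->stall_enabled |= BIT(cbndx);
	else
		s->stall_enabled &= ~BIT(cbndx);
}

/* Mirrors qcom_adreno_smmu_write_sctlr(): HUPCF always, CFCFG on request. */
static void demo_write_sctlr(struct demo_smmu *s, unsigned int idx,
			     uint32_t reg)
{
	reg |= SCTLR_HUPCF;

	if (s->stall_enabled & BIT(idx))
		reg |= SCTLR_CFCFG;

	s->sctlr[idx] = reg;
}

/* Mirrors qcom_adreno_smmu_resume_translation(): 0 retries the access. */
static void demo_resume_translation(struct demo_smmu *s, unsigned int cbndx,
				    bool terminate)
{
	s->resume[cbndx] = terminate ? RESUME_TERMINATE : 0;
}
```

This deferred effect is presumably why the kerneldoc asks for
set_stall() to be called before set_ttbr0_cfg().
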
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
 include/linux/adreno-smmu-priv.h           |  7 +++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index b2e31ea84128..61fc645c1325 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -13,6 +13,7 @@ struct qcom_smmu {
 	struct arm_smmu_device smmu;
 	bool bypass_quirk;
 	u8 bypass_cbndx;
+	u32 stall_enabled;
 };
 
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
@@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 		u32 reg)
 {
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+
 	/*
 	 * On the GPU device we want to process subsequent transactions after a
 	 * fault to keep the GPU from hanging
 	 */
 	reg |= ARM_SMMU_SCTLR_HUPCF;
 
+	if (qsmmu->stall_enabled & BIT(idx))
+		reg |= ARM_SMMU_SCTLR_CFCFG;
+
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
@@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
 	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
 }
 
+static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+
+	if (enabled)
+		qsmmu->stall_enabled |= BIT(cfg->cbndx);
+	else
+		qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+}
+
+static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	u32 reg = 0;
+
+	if (terminate)
+		reg |= ARM_SMMU_RESUME_TERMINATE;
+
+	arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
 	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
+	priv->set_stall = qcom_adreno_smmu_set_stall;
+	priv->resume_translation = qcom_adreno_smmu_resume_translation;
 
 	return 0;
 }
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index 53fe32fb9214..c637e0997f6d 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
  *                 TTBR0 translation is enabled with the specified cfg
  * @get_fault_info: Called by the GPU fault handler to get information about
  *                  the fault
+ * @set_stall:     Configure whether stall on fault (CFCFG) is enabled.  Call
+ *                 before set_ttbr0_cfg().  If stalling on fault is enabled,
+ *                 the GPU driver must call resume_translation()
+ * @resume_translation: Resume translation after a fault, either retrying
+ *                 or terminating the faulting transaction
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -60,6 +65,8 @@ struct adreno_smmu_priv {
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
     void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
+    void (*set_stall)(const void *cookie, bool enabled);
+    void (*resume_translation)(const void *cookie, bool terminate);
 };
 
 #endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [PATCH v4 5/6] drm/msm: Add crashdump support for stalled SMMU
  2021-06-01 22:47 ` Rob Clark
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Rob Clark, Rob Clark, Sean Paul, David Airlie,
	Daniel Vetter, Iskren Chernev, Akhil P Oommen,
	AngeloGioacchino Del Regno, Konrad Dybcio,
	Kristian H. Kristensen, Marijn Suijten, Sai Prakash Ranjan,
	Sharat Masetty, Jonathan Marek, Zhenzhong Duan, Lee Jones,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

From: Rob Clark <robdclark@chromium.org>

For collecting devcoredumps with the SMMU stalled after an iova fault,
we need to skip the parts of the GPU state which are normally collected
with the hw crashdumper, since with the SMMU stalled the hw would be
unable to write out the requested state to memory.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
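The resulting a6xx capture flow can be summarised as a small decision
function.  This is an illustrative sketch only: plan_capture(), struct
capture_plan and dumper_init_ok (standing in for a6xx_crashdumper_init()
returning 0) are hypothetical names that do not exist in the driver.

```c
#include <stdbool.h>

/* How the a6xx_reglist registers get read. */
enum capture_path {
	CAPTURE_CRASHDUMPER, /* GPU writes state to memory */
	CAPTURE_CPU_ONLY,    /* CPU reads registers directly */
};

/* Hypothetical summary of what a6xx_gpu_state_get() would collect. */
struct capture_plan {
	enum capture_path registers;
	bool shaders;
	bool clusters;
	bool dbgahb_clusters;
};

/*
 * stalled:        SMMU is stalled after an iova fault, so the GPU cannot
 *                 write crashdumper output to memory.
 * dumper_init_ok: stand-in for a6xx_crashdumper_init() succeeding.
 */
static struct capture_plan plan_capture(bool stalled, bool dumper_init_ok)
{
	/* The crashdumper is only tried when the SMMU is not stalled. */
	bool use_dumper = !stalled && dumper_init_ok;
	struct capture_plan plan = {
		.registers = use_dumper ? CAPTURE_CRASHDUMPER
					: CAPTURE_CPU_ONLY,
		/* These sections are only collected through the crashdumper. */
		.shaders = use_dumper,
		.clusters = use_dumper,
		.dbgahb_clusters = use_dumper,
	};

	return plan;
}
```

So with stalled set, or if the crashdumper fails to initialise, the
a6xx_reglist registers are read by the CPU and the shader and cluster
sections are skipped.
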
 drivers/gpu/drm/msm/adreno/a2xx_gpu.c       |  2 +-
 drivers/gpu/drm/msm/adreno/a3xx_gpu.c       |  2 +-
 drivers/gpu/drm/msm/adreno/a4xx_gpu.c       |  2 +-
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  5 ++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.h       |  2 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 43 ++++++++++++++++-----
 drivers/gpu/drm/msm/msm_debugfs.c           |  2 +-
 drivers/gpu/drm/msm/msm_gpu.c               |  7 ++--
 drivers/gpu/drm/msm/msm_gpu.h               |  2 +-
 9 files changed, 47 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
index bdc989183c64..d2c31fae64fd 100644
--- a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
@@ -434,7 +434,7 @@ static void a2xx_dump(struct msm_gpu *gpu)
 	adreno_dump(gpu);
 }
 
-static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu)
+static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
 {
 	struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
 
diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
index 4534633fe7cd..b1a6f87d74ef 100644
--- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
@@ -464,7 +464,7 @@ static void a3xx_dump(struct msm_gpu *gpu)
 	adreno_dump(gpu);
 }
 
-static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu)
+static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
 {
 	struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
 
diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
index 82bebb40234d..22780a594d6f 100644
--- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
@@ -549,7 +549,7 @@ static const unsigned int a405_registers[] = {
 	~0 /* sentinel */
 };
 
-static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu)
+static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
 {
 	struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
 
diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index a0eef5d9b89b..2e7714b1a17f 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1519,7 +1519,7 @@ static void a5xx_gpu_state_get_hlsq_regs(struct msm_gpu *gpu,
 	msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
 }
 
-static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
+static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
 {
 	struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
 			GFP_KERNEL);
@@ -1536,7 +1536,8 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
 	a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
 
 	/* Get the HLSQ regs with the help of the crashdumper */
-	a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
+	if (!stalled)
+		a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
 
 	a5xx_set_hwcg(gpu, true);
 
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
index ce0610c5256f..e0f06ce4e1a9 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
@@ -86,7 +86,7 @@ unsigned long a6xx_gmu_get_freq(struct msm_gpu *gpu);
 void a6xx_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
 		struct drm_printer *p);
 
-struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu);
+struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled);
 int a6xx_gpu_state_put(struct msm_gpu_state *state);
 
 #endif /* __A6XX_GPU_H__ */
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
index c1699b4f9a89..d0af68a76c4f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
@@ -833,6 +833,21 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
 				a6xx_state, &a6xx_vbif_reglist,
 				&a6xx_state->registers[index++]);
 
+	if (!dumper) {
+		/*
+		 * We can't use the crashdumper when the SMMU is stalled,
+		 * because the GPU has no memory access until we resume
+		 * translation (but we don't want to do that until after
+		 * we have captured as much useful GPU state as possible).
+		 * So instead collect registers via the CPU:
+		 */
+		for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
+			a6xx_get_ahb_gpu_registers(gpu,
+				a6xx_state, &a6xx_reglist[i],
+				&a6xx_state->registers[index++]);
+		return;
+	}
+
 	for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
 		a6xx_get_crashdumper_registers(gpu,
 			a6xx_state, &a6xx_reglist[i],
@@ -903,9 +918,9 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
 	a6xx_state->nr_indexed_regs = count;
 }
 
-struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
+struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
 {
-	struct a6xx_crashdumper dumper = { 0 };
+	struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
 	struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
@@ -928,14 +943,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
 	/* Get the banks of indexed registers */
 	a6xx_get_indexed_registers(gpu, a6xx_state);
 
-	/* Try to initialize the crashdumper */
-	if (!a6xx_crashdumper_init(gpu, &dumper)) {
-		a6xx_get_registers(gpu, a6xx_state, &dumper);
-		a6xx_get_shaders(gpu, a6xx_state, &dumper);
-		a6xx_get_clusters(gpu, a6xx_state, &dumper);
-		a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
+	/*
+	 * Try to initialize the crashdumper, if we are not dumping state
+	 * with the SMMU stalled.  The crashdumper needs memory access to
+	 * write out GPU state, so we need to skip this when the SMMU is
+	 * stalled in response to an iova fault
+	 */
+	if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
+		dumper = &_dumper;
+	}
+
+	a6xx_get_registers(gpu, a6xx_state, dumper);
+
+	if (dumper) {
+		a6xx_get_shaders(gpu, a6xx_state, dumper);
+		a6xx_get_clusters(gpu, a6xx_state, dumper);
+		a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
 
-		msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
+		msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
 	}
 
 	if (snapshot_debugbus)
diff --git a/drivers/gpu/drm/msm/msm_debugfs.c b/drivers/gpu/drm/msm/msm_debugfs.c
index 7a2b53d35e6b..90558e826934 100644
--- a/drivers/gpu/drm/msm/msm_debugfs.c
+++ b/drivers/gpu/drm/msm/msm_debugfs.c
@@ -77,7 +77,7 @@ static int msm_gpu_open(struct inode *inode, struct file *file)
 		goto free_priv;
 
 	pm_runtime_get_sync(&gpu->pdev->dev);
-	show_priv->state = gpu->funcs->gpu_state_get(gpu);
+	show_priv->state = gpu->funcs->gpu_state_get(gpu, false);
 	pm_runtime_put_sync(&gpu->pdev->dev);
 
 	mutex_unlock(&dev->struct_mutex);
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index fa7691cb4614..4d280bf446e6 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -381,7 +381,8 @@ static void msm_gpu_crashstate_get_bo(struct msm_gpu_state *state,
 }
 
 static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
-		struct msm_gem_submit *submit, char *comm, char *cmd)
+		struct msm_gem_submit *submit, char *comm, char *cmd,
+		bool stalled)
 {
 	struct msm_gpu_state *state;
 
@@ -393,7 +394,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
 	if (gpu->crashstate)
 		return;
 
-	state = gpu->funcs->gpu_state_get(gpu);
+	state = gpu->funcs->gpu_state_get(gpu, stalled);
 	if (IS_ERR_OR_NULL(state))
 		return;
 
@@ -519,7 +520,7 @@ static void recover_worker(struct kthread_work *work)
 
 	/* Record the crash state */
 	pm_runtime_get_sync(&gpu->pdev->dev);
-	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
+	msm_gpu_crashstate_capture(gpu, submit, comm, cmd, false);
 	pm_runtime_put_sync(&gpu->pdev->dev);
 
 	kfree(cmd);
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 7a082a12d98f..c15e5fd675d2 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -60,7 +60,7 @@ struct msm_gpu_funcs {
 	void (*debugfs_init)(struct msm_gpu *gpu, struct drm_minor *minor);
 #endif
 	unsigned long (*gpu_busy)(struct msm_gpu *gpu);
-	struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu);
+	struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu, bool stalled);
 	int (*gpu_state_put)(struct msm_gpu_state *state);
 	unsigned long (*gpu_get_freq)(struct msm_gpu *gpu);
 	void (*gpu_set_freq)(struct msm_gpu *gpu, struct dev_pm_opp *opp);
-- 
2.31.1
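
The a6xx change above turns the crashdumper into an optional helper: a NULL
dumper pointer means "read registers from the CPU and skip the sections only
the hardware dumper can produce". A minimal standalone sketch of that pattern
follows. Every sketch_* name is hypothetical and only models the control flow
of a6xx_gpu_state_get(); none of it is the driver's real API, and the
init_fails parameter exists only so the failure path can be exercised.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver types; not the real msm structs. */
struct sketch_dumper { int ready; };

enum sketch_path { SKETCH_PATH_CPU, SKETCH_PATH_CRASHDUMPER };

struct sketch_state {
	enum sketch_path reg_path;	/* how the register list was read */
	bool have_shaders;		/* crashdumper-only section */
	bool have_clusters;		/* crashdumper-only section */
};

/* Returns 0 on success, mirroring the convention of a6xx_crashdumper_init(). */
static int sketch_dumper_init(struct sketch_dumper *d, bool init_fails)
{
	if (init_fails)
		return -1;
	d->ready = 1;
	return 0;
}

/* A NULL dumper selects the CPU read path, as in the patched a6xx_get_registers(). */
static void sketch_get_registers(struct sketch_state *s,
				 const struct sketch_dumper *d)
{
	s->reg_path = d ? SKETCH_PATH_CRASHDUMPER : SKETCH_PATH_CPU;
}

static struct sketch_state sketch_state_get(bool stalled, bool init_fails)
{
	struct sketch_dumper _dumper = { 0 }, *dumper = NULL;
	struct sketch_state s = { SKETCH_PATH_CPU, false, false };

	/* Never touch the hardware dumper while the SMMU is stalled. */
	if (!stalled && !sketch_dumper_init(&_dumper, init_fails))
		dumper = &_dumper;

	/* Registers are always collected, one way or the other. */
	sketch_get_registers(&s, dumper);

	/* Sections that need GPU memory access are skipped without a dumper. */
	if (dumper) {
		s.have_shaders = true;
		s.have_clusters = true;
	}

	return s;
}
```

In this model, sketch_state_get(true, false) takes the CPU path and leaves the
shader and cluster sections empty. The same happens when initialization fails
with the SMMU running, which matches the diff: a6xx_get_registers() is now
called unconditionally, so a failed crashdumper init still yields a register
snapshot.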



* [PATCH v4 6/6] drm/msm: devcoredump iommu fault support
  2021-06-01 22:47 ` Rob Clark
@ 2021-06-01 22:47   ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Jordan Crouse, Rob Clark, Rob Clark, Sean Paul, David Airlie,
	Daniel Vetter, Sai Prakash Ranjan, Jonathan Marek,
	Akhil P Oommen, Eric Anholt, Sharat Masetty, Douglas Anderson,
	Bjorn Andersson, open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

From: Rob Clark <robdclark@chromium.org>

Wire up support to stall the SMMU on iova fault, and collect a
devcoredump snapshot for easier debugging of faults.

Currently this is a6xx-only, mostly because a6xx is so far the only
user of adreno-smmu-priv.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 29 +++++++++++++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 ++++++++
 drivers/gpu/drm/msm/msm_gem.h           |  1 +
 drivers/gpu/drm/msm/msm_gem_submit.c    |  1 +
 drivers/gpu/drm/msm/msm_gpu.c           | 48 +++++++++++++++++++++++++
 drivers/gpu/drm/msm/msm_gpu.h           | 17 +++++++++
 drivers/gpu/drm/msm/msm_gpummu.c        |  5 +++
 drivers/gpu/drm/msm/msm_iommu.c         | 11 ++++++
 drivers/gpu/drm/msm/msm_mmu.h           |  1 +
 9 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 094dc17fd20f..0dcde917e575 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1008,6 +1008,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
 	struct msm_gpu *gpu = arg;
 	struct adreno_smmu_fault_info *info = data;
 	const char *type = "UNKNOWN";
+	const char *block;
+	bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
+
+	/*
+	 * If we aren't going to be resuming later from fault_worker, then do
+	 * it now.
+	 */
+	if (!do_devcoredump) {
+		gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+	}
 
 	/*
 	 * Print a default message if we couldn't get the data from the
@@ -1031,15 +1041,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
 	else if (info->fsr & ARM_SMMU_FSR_EF)
 		type = "EXTERNAL";
 
+	block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
+
 	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
 			info->ttbr0, iova,
-			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
-			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
+			type, block,
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
 
+	if (do_devcoredump) {
+		/* Turn off the hangcheck timer to keep it from bothering us */
+		del_timer(&gpu->hangcheck_timer);
+
+		gpu->fault_info.ttbr0 = info->ttbr0;
+		gpu->fault_info.iova  = iova;
+		gpu->fault_info.flags = flags;
+		gpu->fault_info.type  = type;
+		gpu->fault_info.block = block;
+
+		kthread_queue_work(gpu->worker, &gpu->fault_work);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index cf897297656f..4e88d4407667 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
 			adreno_gpu->info->revn, adreno_gpu->rev.core,
 			adreno_gpu->rev.major, adreno_gpu->rev.minor,
 			adreno_gpu->rev.patchid);
+	/*
+	 * If this is state collected due to iova fault, show fault related info
+	 *
+	 * TTBR0 would not be zero, so this is a good way to distinguish
+	 */
+	if (state->fault_info.ttbr0) {
+		const struct msm_gpu_fault_info *info = &state->fault_info;
+
+		drm_puts(p, "fault-info:\n");
+		drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
+		drm_printf(p, "  - iova=%.16lx\n", info->iova);
+		drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
+		drm_printf(p, "  - type=%s\n", info->type);
+		drm_printf(p, "  - source=%s\n", info->block);
+	}
 
 	drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
 
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 03e2cc2a2ce1..405f8411e395 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -328,6 +328,7 @@ struct msm_gem_submit {
 	struct dma_fence *fence;
 	struct msm_gpu_submitqueue *queue;
 	struct pid *pid;    /* submitting process */
+	bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
 	bool valid;         /* true if no cmdstream patching needed */
 	bool in_rb;         /* "sudo" mode, copy cmds into RB */
 	struct msm_ringbuffer *ring;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 5480852bdeda..44f84bfd0c0e 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
 	submit->cmd = (void *)&submit->bos[nr_bos];
 	submit->queue = queue;
 	submit->ring = gpu->rb[queue->prio];
+	submit->fault_dumped = false;
 
 	/* initially, until copy_from_user() and bo lookup succeeds: */
 	submit->nr_bos = 0;
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 4d280bf446e6..4da2053c1ffb 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -401,6 +401,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
 	/* Fill in the additional crash state information */
 	state->comm = kstrdup(comm, GFP_KERNEL);
 	state->cmd = kstrdup(cmd, GFP_KERNEL);
+	state->fault_info = gpu->fault_info;
 
 	if (submit) {
 		int i, nr = 0;
@@ -573,6 +574,52 @@ static void recover_worker(struct kthread_work *work)
 	msm_gpu_retire(gpu);
 }
 
+static void fault_worker(struct kthread_work *work)
+{
+	struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
+	struct drm_device *dev = gpu->dev;
+	struct msm_gem_submit *submit;
+	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
+	char *comm = NULL, *cmd = NULL;
+
+	mutex_lock(&dev->struct_mutex);
+
+	submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
+	if (submit && submit->fault_dumped)
+		goto resume_smmu;
+
+	if (submit) {
+		struct task_struct *task;
+
+		task = get_pid_task(submit->pid, PIDTYPE_PID);
+		if (task) {
+			comm = kstrdup(task->comm, GFP_KERNEL);
+			cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
+			put_task_struct(task);
+		}
+
+		/*
+		 * When we get GPU iova faults, we can get 1000s of them,
+		 * but we really only want to log the first one.
+		 */
+		submit->fault_dumped = true;
+	}
+
+	/* Record the crash state */
+	pm_runtime_get_sync(&gpu->pdev->dev);
+	msm_gpu_crashstate_capture(gpu, submit, comm, cmd, true);
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	kfree(cmd);
+	kfree(comm);
+
+resume_smmu:
+	memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
+	gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+
+	mutex_unlock(&dev->struct_mutex);
+}
+
 static void hangcheck_timer_reset(struct msm_gpu *gpu)
 {
 	mod_timer(&gpu->hangcheck_timer,
@@ -949,6 +996,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	INIT_LIST_HEAD(&gpu->active_list);
 	kthread_init_work(&gpu->retire_work, retire_worker);
 	kthread_init_work(&gpu->recover_work, recover_worker);
+	kthread_init_work(&gpu->fault_work, fault_worker);
 
 	timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index c15e5fd675d2..8dae601085ee 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -71,6 +71,15 @@ struct msm_gpu_funcs {
 	uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
 };
 
+/* Additional state for iommu faults: */
+struct msm_gpu_fault_info {
+	u64 ttbr0;
+	unsigned long iova;
+	int flags;
+	const char *type;
+	const char *block;
+};
+
 struct msm_gpu {
 	const char *name;
 	struct drm_device *dev;
@@ -135,6 +144,12 @@ struct msm_gpu {
 #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 
+	/* Fault info for most recent iova fault: */
+	struct msm_gpu_fault_info fault_info;
+
+	/* work for handling GPU iova faults: */
+	struct kthread_work fault_work;
+
 	/* work for handling GPU recovery: */
 	struct kthread_work recover_work;
 
@@ -243,6 +258,8 @@ struct msm_gpu_state {
 	char *comm;
 	char *cmd;
 
+	struct msm_gpu_fault_info fault_info;
+
 	int nr_bos;
 	struct msm_gpu_state_bo *bos;
 };
diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
index 379496186c7f..f7d1945e0c9f 100644
--- a/drivers/gpu/drm/msm/msm_gpummu.c
+++ b/drivers/gpu/drm/msm/msm_gpummu.c
@@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
 	return 0;
 }
 
+static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
+{
+}
+
 static void msm_gpummu_destroy(struct msm_mmu *mmu)
 {
 	struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
@@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
 		.map = msm_gpummu_map,
 		.unmap = msm_gpummu_unmap,
 		.destroy = msm_gpummu_destroy,
+		.resume_translation = msm_gpummu_resume_translation,
 };
 
 struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 6975b95c3c29..eed2a762e9dd 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
 	 * the arm-smmu driver as a trigger to set up TTBR0
 	 */
 	if (atomic_inc_return(&iommu->pagetables) == 1) {
+		/* Enable stall on iommu fault: */
+		adreno_smmu->set_stall(adreno_smmu->cookie, true);
+
 		ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
 		if (ret) {
 			free_io_pgtable_ops(pagetable->pgtbl_ops);
@@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
 	return 0;
 }
 
+static void msm_iommu_resume_translation(struct msm_mmu *mmu)
+{
+	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
+
+	adreno_smmu->resume_translation(adreno_smmu->cookie, true);
+}
+
 static void msm_iommu_detach(struct msm_mmu *mmu)
 {
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
@@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
 		.map = msm_iommu_map,
 		.unmap = msm_iommu_unmap,
 		.destroy = msm_iommu_destroy,
+		.resume_translation = msm_iommu_resume_translation,
 };
 
 struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index a88f44c3268d..de158e1bf765 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -15,6 +15,7 @@ struct msm_mmu_funcs {
 			size_t len, int prot);
 	int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
 	void (*destroy)(struct msm_mmu *mmu);
+	void (*resume_translation)(struct msm_mmu *mmu);
 };
 
 enum msm_mmu_type {
-- 
2.31.1
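
The fault path in the patch above splits the work between the fault handler,
which decides whether a dump is wanted, and fault_worker, which captures state
and only then resumes translation. A minimal standalone sketch of that
ordering follows. All sketch_* names are hypothetical, the structs are reduced
models of msm_gpu and msm_gem_submit, and the stall is simulated with a flag
rather than real SMMU state.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, reduced models of the driver structures. */
struct sketch_fault_info { unsigned long long ttbr0; unsigned long iova; };
struct sketch_submit { bool fault_dumped; };

struct sketch_gpu {
	bool smmu_stalled;	/* simulated "stalled on fault" state */
	bool has_crashstate;	/* models gpu->crashstate != NULL */
	bool work_queued;	/* models kthread_queue_work() */
	int dumps_taken;
	struct sketch_fault_info fault_info;
	struct sketch_submit *cur_submit;
};

static void sketch_resume(struct sketch_gpu *gpu)
{
	gpu->smmu_stalled = false;
}

/* Models the handler: resume at once unless a dump will be taken. */
static void sketch_fault_handler(struct sketch_gpu *gpu,
				 const struct sketch_fault_info *info,
				 unsigned long iova)
{
	bool do_devcoredump = info && !gpu->has_crashstate;

	gpu->smmu_stalled = true;	/* the hardware stalled on the fault */

	if (!do_devcoredump) {
		sketch_resume(gpu);
		return;
	}

	gpu->fault_info = *info;
	gpu->fault_info.iova = iova;
	gpu->work_queued = true;
}

/* Models fault_worker: at most one dump per submit, then always resume. */
static void sketch_fault_worker(struct sketch_gpu *gpu)
{
	struct sketch_submit *submit = gpu->cur_submit;

	gpu->work_queued = false;

	if (!(submit && submit->fault_dumped)) {
		if (submit)
			submit->fault_dumped = true;

		/* The capture step bails out if a crashstate already exists. */
		if (!gpu->has_crashstate) {
			gpu->has_crashstate = true;
			gpu->dumps_taken++;
		}
	}

	memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
	sketch_resume(gpu);
}

/* Usage: two back-to-back faults on one submit produce a single dump. */
static int sketch_two_faults(void)
{
	struct sketch_submit sub = { false };
	struct sketch_gpu gpu = { 0 };
	struct sketch_fault_info info = { 0x1000ULL, 0 };
	int i;

	gpu.cur_submit = &sub;
	for (i = 0; i < 2; i++) {
		sketch_fault_handler(&gpu, &info, 0xdead000UL);
		if (gpu.work_queued)
			sketch_fault_worker(&gpu);
	}

	return gpu.smmu_stalled ? -1 : gpu.dumps_taken;
}
```

In this model the second fault never reaches the worker, because a crashstate
already exists and the handler resumes translation immediately. Both exits
leave the simulated SMMU running, which is the invariant the patch keeps:
every stall is paired with exactly one resume.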



* [PATCH v4 6/6] drm/msm: devcoredump iommu fault support
@ 2021-06-01 22:47   ` Rob Clark
  0 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-01 22:47 UTC (permalink / raw)
  To: dri-devel
  Cc: Rob Clark, open list:DRM DRIVER FOR MSM ADRENO GPU,
	Sai Prakash Ranjan, Douglas Anderson, Jonathan Marek,
	David Airlie, open list:DRM DRIVER FOR MSM ADRENO GPU,
	Sharat Masetty, Akhil P Oommen, Jordan Crouse, Bjorn Andersson,
	Sean Paul, open list

From: Rob Clark <robdclark@chromium.org>

Wire up support to stall the SMMU on iova fault, and collect a
devcoredump snapshot for easier debugging of faults.

Currently this is a6xx-only, mostly because a6xx is so far the only
user of adreno-smmu-priv.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 29 +++++++++++++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 ++++++++
 drivers/gpu/drm/msm/msm_gem.h           |  1 +
 drivers/gpu/drm/msm/msm_gem_submit.c    |  1 +
 drivers/gpu/drm/msm/msm_gpu.c           | 48 +++++++++++++++++++++++++
 drivers/gpu/drm/msm/msm_gpu.h           | 17 +++++++++
 drivers/gpu/drm/msm/msm_gpummu.c        |  5 +++
 drivers/gpu/drm/msm/msm_iommu.c         | 11 ++++++
 drivers/gpu/drm/msm/msm_mmu.h           |  1 +
 9 files changed, 126 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 094dc17fd20f..0dcde917e575 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1008,6 +1008,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
 	struct msm_gpu *gpu = arg;
 	struct adreno_smmu_fault_info *info = data;
 	const char *type = "UNKNOWN";
+	const char *block;
+	bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
+
+	/*
+	 * If we aren't going to be resuming later from fault_worker, then do
+	 * it now.
+	 */
+	if (!do_devcoredump) {
+		gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+	}
 
 	/*
 	 * Print a default message if we couldn't get the data from the
@@ -1031,15 +1041,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
 	else if (info->fsr & ARM_SMMU_FSR_EF)
 		type = "EXTERNAL";
 
+	block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
+
 	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
 			info->ttbr0, iova,
-			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
-			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
+			type, block,
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
 
+	if (do_devcoredump) {
+		/* Turn off the hangcheck timer to keep it from bothering us */
+		del_timer(&gpu->hangcheck_timer);
+
+		gpu->fault_info.ttbr0 = info->ttbr0;
+		gpu->fault_info.iova  = iova;
+		gpu->fault_info.flags = flags;
+		gpu->fault_info.type  = type;
+		gpu->fault_info.block = block;
+
+		kthread_queue_work(gpu->worker, &gpu->fault_work);
+	}
+
 	return 0;
 }
 
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index cf897297656f..4e88d4407667 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
 			adreno_gpu->info->revn, adreno_gpu->rev.core,
 			adreno_gpu->rev.major, adreno_gpu->rev.minor,
 			adreno_gpu->rev.patchid);
+	/*
+	 * If this is state collected due to iova fault, show fault related info
+	 *
+	 * TTBR0 would not be zero, so this is a good way to distinguish
+	 */
+	if (state->fault_info.ttbr0) {
+		const struct msm_gpu_fault_info *info = &state->fault_info;
+
+		drm_puts(p, "fault-info:\n");
+		drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
+		drm_printf(p, "  - iova=%.16lx\n", info->iova);
+		drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
+		drm_printf(p, "  - type=%s\n", info->type);
+		drm_printf(p, "  - source=%s\n", info->block);
+	}
 
 	drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
 
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 03e2cc2a2ce1..405f8411e395 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -328,6 +328,7 @@ struct msm_gem_submit {
 	struct dma_fence *fence;
 	struct msm_gpu_submitqueue *queue;
 	struct pid *pid;    /* submitting process */
+	bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
 	bool valid;         /* true if no cmdstream patching needed */
 	bool in_rb;         /* "sudo" mode, copy cmds into RB */
 	struct msm_ringbuffer *ring;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 5480852bdeda..44f84bfd0c0e 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
 	submit->cmd = (void *)&submit->bos[nr_bos];
 	submit->queue = queue;
 	submit->ring = gpu->rb[queue->prio];
+	submit->fault_dumped = false;
 
 	/* initially, until copy_from_user() and bo lookup succeeds: */
 	submit->nr_bos = 0;
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index 4d280bf446e6..4da2053c1ffb 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -401,6 +401,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
 	/* Fill in the additional crash state information */
 	state->comm = kstrdup(comm, GFP_KERNEL);
 	state->cmd = kstrdup(cmd, GFP_KERNEL);
+	state->fault_info = gpu->fault_info;
 
 	if (submit) {
 		int i, nr = 0;
@@ -573,6 +574,52 @@ static void recover_worker(struct kthread_work *work)
 	msm_gpu_retire(gpu);
 }
 
+static void fault_worker(struct kthread_work *work)
+{
+	struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
+	struct drm_device *dev = gpu->dev;
+	struct msm_gem_submit *submit;
+	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
+	char *comm = NULL, *cmd = NULL;
+
+	mutex_lock(&dev->struct_mutex);
+
+	submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
+	if (submit && submit->fault_dumped)
+		goto resume_smmu;
+
+	if (submit) {
+		struct task_struct *task;
+
+		task = get_pid_task(submit->pid, PIDTYPE_PID);
+		if (task) {
+			comm = kstrdup(task->comm, GFP_KERNEL);
+			cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
+			put_task_struct(task);
+		}
+
+		/*
+		 * When we get GPU iova faults, we can get 1000s of them,
+		 * but we really only want to log the first one.
+		 */
+		submit->fault_dumped = true;
+	}
+
+	/* Record the crash state */
+	pm_runtime_get_sync(&gpu->pdev->dev);
+	msm_gpu_crashstate_capture(gpu, submit, comm, cmd, true);
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	kfree(cmd);
+	kfree(comm);
+
+resume_smmu:
+	memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
+	gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+
+	mutex_unlock(&dev->struct_mutex);
+}
+
 static void hangcheck_timer_reset(struct msm_gpu *gpu)
 {
 	mod_timer(&gpu->hangcheck_timer,
@@ -949,6 +996,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	INIT_LIST_HEAD(&gpu->active_list);
 	kthread_init_work(&gpu->retire_work, retire_worker);
 	kthread_init_work(&gpu->recover_work, recover_worker);
+	kthread_init_work(&gpu->fault_work, fault_worker);
 
 	timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index c15e5fd675d2..8dae601085ee 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -71,6 +71,15 @@ struct msm_gpu_funcs {
 	uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
 };
 
+/* Additional state for iommu faults: */
+struct msm_gpu_fault_info {
+	u64 ttbr0;
+	unsigned long iova;
+	int flags;
+	const char *type;
+	const char *block;
+};
+
 struct msm_gpu {
 	const char *name;
 	struct drm_device *dev;
@@ -135,6 +144,12 @@ struct msm_gpu {
 #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 
+	/* Fault info for most recent iova fault: */
+	struct msm_gpu_fault_info fault_info;
+
+	/* work for handling GPU iova faults: */
+	struct kthread_work fault_work;
+
 	/* work for handling GPU recovery: */
 	struct kthread_work recover_work;
 
@@ -243,6 +258,8 @@ struct msm_gpu_state {
 	char *comm;
 	char *cmd;
 
+	struct msm_gpu_fault_info fault_info;
+
 	int nr_bos;
 	struct msm_gpu_state_bo *bos;
 };
diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
index 379496186c7f..f7d1945e0c9f 100644
--- a/drivers/gpu/drm/msm/msm_gpummu.c
+++ b/drivers/gpu/drm/msm/msm_gpummu.c
@@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
 	return 0;
 }
 
+static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
+{
+}
+
 static void msm_gpummu_destroy(struct msm_mmu *mmu)
 {
 	struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
@@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
 		.map = msm_gpummu_map,
 		.unmap = msm_gpummu_unmap,
 		.destroy = msm_gpummu_destroy,
+		.resume_translation = msm_gpummu_resume_translation,
 };
 
 struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 6975b95c3c29..eed2a762e9dd 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
 	 * the arm-smmu driver as a trigger to set up TTBR0
 	 */
 	if (atomic_inc_return(&iommu->pagetables) == 1) {
+		/* Enable stall on iommu fault: */
+		adreno_smmu->set_stall(adreno_smmu->cookie, true);
+
 		ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
 		if (ret) {
 			free_io_pgtable_ops(pagetable->pgtbl_ops);
@@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
 	return 0;
 }
 
+static void msm_iommu_resume_translation(struct msm_mmu *mmu)
+{
+	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
+
+	adreno_smmu->resume_translation(adreno_smmu->cookie, true);
+}
+
 static void msm_iommu_detach(struct msm_mmu *mmu)
 {
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
@@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
 		.map = msm_iommu_map,
 		.unmap = msm_iommu_unmap,
 		.destroy = msm_iommu_destroy,
+		.resume_translation = msm_iommu_resume_translation,
 };
 
 struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index a88f44c3268d..de158e1bf765 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -15,6 +15,7 @@ struct msm_mmu_funcs {
 			size_t len, int prot);
 	int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
 	void (*destroy)(struct msm_mmu *mmu);
+	void (*resume_translation)(struct msm_mmu *mmu);
 };
 
 enum msm_mmu_type {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 5/6] drm/msm: Add crashdump support for stalled SMMU
  2021-06-01 22:47   ` Rob Clark
@ 2021-06-08 15:12     ` Jordan Crouse
  -1 siblings, 0 replies; 32+ messages in thread
From: Jordan Crouse @ 2021-06-08 15:12 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, Rob Clark, Sean Paul, David Airlie, Daniel Vetter,
	Iskren Chernev, Akhil P Oommen, AngeloGioacchino Del Regno,
	Konrad Dybcio, Kristian H. Kristensen, Marijn Suijten,
	Sai Prakash Ranjan, Sharat Masetty, Jonathan Marek,
	Zhenzhong Duan, Lee Jones,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

On Tue, Jun 01, 2021 at 03:47:24PM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> For collecting devcoredumps with the SMMU stalled after an iova fault,
> we need to skip the parts of the GPU state which are normally collected
> with the hw crashdumper, since with the SMMU stalled the hw would be
> unable to write out the requested state to memory.

On a5xx and a6xx you can query RBBM_STATUS3 bit 24 to see if the IOMMU is
stalled.  That could be an alternative option to adding the "stalled"
infrastructure across all targets.
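
A minimal sketch of that check, for illustration only. The register read is
stubbed out as a plain value so the snippet stands alone; the names
SMMU_STALLED_ON_FAULT and gpu_smmu_is_stalled() are made up here, and in the
driver the value would presumably come from a gpu_read() of the target's
RBBM_STATUS3 register:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical name for RBBM_STATUS3 bit 24, as described above */
#define SMMU_STALLED_ON_FAULT (UINT32_C(1) << 24)

/*
 * Hypothetical helper: takes a raw RBBM_STATUS3 value and reports
 * whether the "SMMU stalled" bit is set.  In the driver the argument
 * would be the result of reading the register, not a constant.
 */
static bool gpu_smmu_is_stalled(uint32_t rbbm_status3)
{
	return (rbbm_status3 & SMMU_STALLED_ON_FAULT) != 0;
}
```

With something along those lines, gpu_state_get() could decide on its own
whether the crashdumper is usable instead of taking a "stalled" argument.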

Jordan
>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a2xx_gpu.c       |  2 +-
>  drivers/gpu/drm/msm/adreno/a3xx_gpu.c       |  2 +-
>  drivers/gpu/drm/msm/adreno/a4xx_gpu.c       |  2 +-
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  5 ++-
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.h       |  2 +-
>  drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 43 ++++++++++++++++-----
>  drivers/gpu/drm/msm/msm_debugfs.c           |  2 +-
>  drivers/gpu/drm/msm/msm_gpu.c               |  7 ++--
>  drivers/gpu/drm/msm/msm_gpu.h               |  2 +-
>  9 files changed, 47 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> index bdc989183c64..d2c31fae64fd 100644
> --- a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> @@ -434,7 +434,7 @@ static void a2xx_dump(struct msm_gpu *gpu)
>  	adreno_dump(gpu);
>  }
>  
> -static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu)
> +static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
>  {
>  	struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
>  
> diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> index 4534633fe7cd..b1a6f87d74ef 100644
> --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> @@ -464,7 +464,7 @@ static void a3xx_dump(struct msm_gpu *gpu)
>  	adreno_dump(gpu);
>  }
>  
> -static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu)
> +static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
>  {
>  	struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
>  
> diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> index 82bebb40234d..22780a594d6f 100644
> --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> @@ -549,7 +549,7 @@ static const unsigned int a405_registers[] = {
>  	~0 /* sentinel */
>  };
>  
> -static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu)
> +static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
>  {
>  	struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
>  
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index a0eef5d9b89b..2e7714b1a17f 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1519,7 +1519,7 @@ static void a5xx_gpu_state_get_hlsq_regs(struct msm_gpu *gpu,
>  	msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
>  }
>  
> -static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
> +static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
>  {
>  	struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
>  			GFP_KERNEL);
> @@ -1536,7 +1536,8 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
>  	a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
>  
>  	/* Get the HLSQ regs with the help of the crashdumper */
> -	a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> +	if (!stalled)
> +		a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
>  
>  	a5xx_set_hwcg(gpu, true);
>  
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> index ce0610c5256f..e0f06ce4e1a9 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> @@ -86,7 +86,7 @@ unsigned long a6xx_gmu_get_freq(struct msm_gpu *gpu);
>  void a6xx_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
>  		struct drm_printer *p);
>  
> -struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu);
> +struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled);
>  int a6xx_gpu_state_put(struct msm_gpu_state *state);
>  
>  #endif /* __A6XX_GPU_H__ */
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> index c1699b4f9a89..d0af68a76c4f 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> @@ -833,6 +833,21 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
>  				a6xx_state, &a6xx_vbif_reglist,
>  				&a6xx_state->registers[index++]);
>  
> +	if (!dumper) {
> +		/*
> +		 * We can't use the crashdumper when the SMMU is stalled,
> +		 * because the GPU has no memory access until we resume
> +		 * translation (but we don't want to do that until after
> +		 * we have captured as much useful GPU state as possible).
> +		 * So instead collect registers via the CPU:
> +		 */
> +		for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> +			a6xx_get_ahb_gpu_registers(gpu,
> +				a6xx_state, &a6xx_reglist[i],
> +				&a6xx_state->registers[index++]);
> +		return;
> +	}
> +
>  	for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
>  		a6xx_get_crashdumper_registers(gpu,
>  			a6xx_state, &a6xx_reglist[i],
> @@ -903,9 +918,9 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
>  	a6xx_state->nr_indexed_regs = count;
>  }
>  
> -struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> +struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
>  {
> -	struct a6xx_crashdumper dumper = { 0 };
> +	struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
>  	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>  	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
>  	struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
> @@ -928,14 +943,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
>  	/* Get the banks of indexed registers */
>  	a6xx_get_indexed_registers(gpu, a6xx_state);
>  
> -	/* Try to initialize the crashdumper */
> -	if (!a6xx_crashdumper_init(gpu, &dumper)) {
> -		a6xx_get_registers(gpu, a6xx_state, &dumper);
> -		a6xx_get_shaders(gpu, a6xx_state, &dumper);
> -		a6xx_get_clusters(gpu, a6xx_state, &dumper);
> -		a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
> +	/*
> +	 * Try to initialize the crashdumper, if we are not dumping state
> +	 * with the SMMU stalled.  The crashdumper needs memory access to
> +	 * write out GPU state, so we need to skip this when the SMMU is
> +	 * stalled in response to an iova fault
> +	 */
> +	if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
> +		dumper = &_dumper;
> +	}
> +
> +	a6xx_get_registers(gpu, a6xx_state, dumper);
> +
> +	if (dumper) {
> +		a6xx_get_shaders(gpu, a6xx_state, dumper);
> +		a6xx_get_clusters(gpu, a6xx_state, dumper);
> +		a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
>  
> -		msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> +		msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
>  	}
>  
>  	if (snapshot_debugbus)
> diff --git a/drivers/gpu/drm/msm/msm_debugfs.c b/drivers/gpu/drm/msm/msm_debugfs.c
> index 7a2b53d35e6b..90558e826934 100644
> --- a/drivers/gpu/drm/msm/msm_debugfs.c
> +++ b/drivers/gpu/drm/msm/msm_debugfs.c
> @@ -77,7 +77,7 @@ static int msm_gpu_open(struct inode *inode, struct file *file)
>  		goto free_priv;
>  
>  	pm_runtime_get_sync(&gpu->pdev->dev);
> -	show_priv->state = gpu->funcs->gpu_state_get(gpu);
> +	show_priv->state = gpu->funcs->gpu_state_get(gpu, false);
>  	pm_runtime_put_sync(&gpu->pdev->dev);
>  
>  	mutex_unlock(&dev->struct_mutex);
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index fa7691cb4614..4d280bf446e6 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -381,7 +381,8 @@ static void msm_gpu_crashstate_get_bo(struct msm_gpu_state *state,
>  }
>  
>  static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> -		struct msm_gem_submit *submit, char *comm, char *cmd)
> +		struct msm_gem_submit *submit, char *comm, char *cmd,
> +		bool stalled)
>  {
>  	struct msm_gpu_state *state;
>  
> @@ -393,7 +394,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
>  	if (gpu->crashstate)
>  		return;
>  
> -	state = gpu->funcs->gpu_state_get(gpu);
> +	state = gpu->funcs->gpu_state_get(gpu, stalled);
>  	if (IS_ERR_OR_NULL(state))
>  		return;
>  
> @@ -519,7 +520,7 @@ static void recover_worker(struct kthread_work *work)
>  
>  	/* Record the crash state */
>  	pm_runtime_get_sync(&gpu->pdev->dev);
> -	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
> +	msm_gpu_crashstate_capture(gpu, submit, comm, cmd, false);
>  	pm_runtime_put_sync(&gpu->pdev->dev);
>  
>  	kfree(cmd);
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index 7a082a12d98f..c15e5fd675d2 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -60,7 +60,7 @@ struct msm_gpu_funcs {
>  	void (*debugfs_init)(struct msm_gpu *gpu, struct drm_minor *minor);
>  #endif
>  	unsigned long (*gpu_busy)(struct msm_gpu *gpu);
> -	struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu);
> +	struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu, bool stalled);
>  	int (*gpu_state_put)(struct msm_gpu_state *state);
>  	unsigned long (*gpu_get_freq)(struct msm_gpu *gpu);
>  	void (*gpu_set_freq)(struct msm_gpu *gpu, struct dev_pm_opp *opp);
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v4 6/6] drm/msm: devcoredump iommu fault support
  2021-06-01 22:47   ` Rob Clark
@ 2021-06-08 15:20     ` Jordan Crouse
  -1 siblings, 0 replies; 32+ messages in thread
From: Jordan Crouse @ 2021-06-08 15:20 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, Rob Clark, Sean Paul, David Airlie, Daniel Vetter,
	Sai Prakash Ranjan, Jonathan Marek, Akhil P Oommen, Eric Anholt,
	Sharat Masetty, Douglas Anderson, Bjorn Andersson,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

On Tue, Jun 01, 2021 at 03:47:25PM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Wire up support to stall the SMMU on iova fault, and collect a
> devcoredump snapshot for easier debugging of faults.
> 
> Currently this is a6xx-only, but mostly only because so far it is the
> only one using adreno-smmu-priv.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 29 +++++++++++++--
>  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 ++++++++
>  drivers/gpu/drm/msm/msm_gem.h           |  1 +
>  drivers/gpu/drm/msm/msm_gem_submit.c    |  1 +
>  drivers/gpu/drm/msm/msm_gpu.c           | 48 +++++++++++++++++++++++++
>  drivers/gpu/drm/msm/msm_gpu.h           | 17 +++++++++
>  drivers/gpu/drm/msm/msm_gpummu.c        |  5 +++
>  drivers/gpu/drm/msm/msm_iommu.c         | 11 ++++++
>  drivers/gpu/drm/msm/msm_mmu.h           |  1 +
>  9 files changed, 126 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 094dc17fd20f..0dcde917e575 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1008,6 +1008,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
>  	struct msm_gpu *gpu = arg;
>  	struct adreno_smmu_fault_info *info = data;
>  	const char *type = "UNKNOWN";
> +	const char *block;
> +	bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
> +
> +	/*
> +	 * If we aren't going to be resuming later from fault_worker, then do
> +	 * it now.
> +	 */
> +	if (!do_devcoredump) {
> +		gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> +	}
>  
>  	/*
>  	 * Print a default message if we couldn't get the data from the
> @@ -1031,15 +1041,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
>  	else if (info->fsr & ARM_SMMU_FSR_EF)
>  		type = "EXTERNAL";
>  
> +	block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
> +
>  	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
>  			info->ttbr0, iova,
> -			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> -			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> +			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
> +			type, block,
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
>  
> +	if (do_devcoredump) {
> +		/* Turn off the hangcheck timer to keep it from bothering us */
> +		del_timer(&gpu->hangcheck_timer);
> +
> +		gpu->fault_info.ttbr0 = info->ttbr0;
> +		gpu->fault_info.iova  = iova;
> +		gpu->fault_info.flags = flags;
> +		gpu->fault_info.type  = type;
> +		gpu->fault_info.block = block;
> +
> +		kthread_queue_work(gpu->worker, &gpu->fault_work);
> +	}
> +
>  	return 0;
>  }
>  
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index cf897297656f..4e88d4407667 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
>  			adreno_gpu->info->revn, adreno_gpu->rev.core,
>  			adreno_gpu->rev.major, adreno_gpu->rev.minor,
>  			adreno_gpu->rev.patchid);
> +	/*
> +	 * If this is state collected due to iova fault, show fault related info
> +	 *
> +	 * TTBR0 would not be zero, so this is a good way to distinguish
> +	 */
> +	if (state->fault_info.ttbr0) {
> +		const struct msm_gpu_fault_info *info = &state->fault_info;
> +
> +		drm_puts(p, "fault-info:\n");
> +		drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
> +		drm_printf(p, "  - iova=%.16lx\n", info->iova);
> +		drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
> +		drm_printf(p, "  - type=%s\n", info->type);
> +		drm_printf(p, "  - source=%s\n", info->block);
> +	}
>  
>  	drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
>  
> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> index 03e2cc2a2ce1..405f8411e395 100644
> --- a/drivers/gpu/drm/msm/msm_gem.h
> +++ b/drivers/gpu/drm/msm/msm_gem.h
> @@ -328,6 +328,7 @@ struct msm_gem_submit {
>  	struct dma_fence *fence;
>  	struct msm_gpu_submitqueue *queue;
>  	struct pid *pid;    /* submitting process */
> +	bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
>  	bool valid;         /* true if no cmdstream patching needed */
>  	bool in_rb;         /* "sudo" mode, copy cmds into RB */
>  	struct msm_ringbuffer *ring;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 5480852bdeda..44f84bfd0c0e 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>  	submit->cmd = (void *)&submit->bos[nr_bos];
>  	submit->queue = queue;
>  	submit->ring = gpu->rb[queue->prio];
> +	submit->fault_dumped = false;
>  
>  	/* initially, until copy_from_user() and bo lookup succeeds: */
>  	submit->nr_bos = 0;
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index 4d280bf446e6..4da2053c1ffb 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -401,6 +401,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
>  	/* Fill in the additional crash state information */
>  	state->comm = kstrdup(comm, GFP_KERNEL);
>  	state->cmd = kstrdup(cmd, GFP_KERNEL);
> +	state->fault_info = gpu->fault_info;
>  
>  	if (submit) {
>  		int i, nr = 0;
> @@ -573,6 +574,52 @@ static void recover_worker(struct kthread_work *work)
>  	msm_gpu_retire(gpu);
>  }
>  
> +static void fault_worker(struct kthread_work *work)
> +{
> +	struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
> +	struct drm_device *dev = gpu->dev;
> +	struct msm_gem_submit *submit;
> +	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
> +	char *comm = NULL, *cmd = NULL;
> +
> +	mutex_lock(&dev->struct_mutex);
> +
> +	submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
> +	if (submit && submit->fault_dumped)
> +		goto resume_smmu;
> +
> +	if (submit) {
> +		struct task_struct *task;
> +
> +		task = get_pid_task(submit->pid, PIDTYPE_PID);
> +		if (task) {
> +			comm = kstrdup(task->comm, GFP_KERNEL);
> +			cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
> +			put_task_struct(task);
> +		}
> +
> +		/*
> +		 * When we get GPU iova faults, we can get 1000s of them,
> +		 * but we really only want to log the first one.
> +		 */
> +		submit->fault_dumped = true;
> +	}
> +
> +	/* Record the crash state */
> +	pm_runtime_get_sync(&gpu->pdev->dev);
> +	msm_gpu_crashstate_capture(gpu, submit, comm, cmd, true);

You are going to run the risk of a race here. Once the IOMMU stalls, the
various bits of the GPU pipeline are going to stop, and as soon as one of them
hits the hang cycles threshold it's going to push the big red HANG! button.

It is fine to keep this infrastructure in place, but there needs to be an
escape valve in the hang infrastructure to keep you from double dumping and
also to keep from resetting the GPU if that isn't your intention.

This can be as simple as adding a RBBM_STATUS3 check in the hang function and
returning early, or you could skip the capture state call here and rely on the
hang to be the single entry point into the crashstate capture (with the
appropriate protections to keep from accidentally recovering, of course).
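
To make the first option a bit more concrete, here is a rough user-space
sketch (not driver code) of a hang path that returns early while the SMMU is
stalled. The bit position (24) is the one mentioned for RBBM_STATUS3 on
a5xx/a6xx in the patch 5/6 discussion; mock_gpu, SMMU_STALLED_ON_FAULT,
smmu_is_stalled() and hang_detected() are made-up names for illustration, and
the register read is mocked with a plain struct field.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: RBBM_STATUS3 bit 24 is set while the SMMU is stalled. */
#define SMMU_STALLED_ON_FAULT (UINT32_C(1) << 24)

/* Hypothetical stand-in for the GPU; rbbm_status3 mocks a register read. */
struct mock_gpu {
	uint32_t rbbm_status3;
	int dumps;	/* crash states captured by the hang path */
	int resets;	/* GPU recoveries triggered by the hang path */
};

static bool smmu_is_stalled(const struct mock_gpu *gpu)
{
	assert(gpu);
	return (gpu->rbbm_status3 & SMMU_STALLED_ON_FAULT) != 0;
}

/*
 * Hang path with the escape valve: while the SMMU is stalled, the fault
 * worker owns the dump and the resume, so neither dump nor reset here.
 * Returns true if recovery was started.
 */
static bool hang_detected(struct mock_gpu *gpu)
{
	if (smmu_is_stalled(gpu))
		return false;

	gpu->dumps++;
	gpu->resets++;
	return true;
}
```

With the stall bit set, hang_detected() leaves both counters untouched; once
translation has resumed and the bit is clear, a genuine hang still goes
through the normal dump and recovery.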

Jordan

> +	pm_runtime_put_sync(&gpu->pdev->dev);
> +
> +	kfree(cmd);
> +	kfree(comm);
> +
> +resume_smmu:
> +	memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
> +	gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> +
> +	mutex_unlock(&dev->struct_mutex);
> +}
> +
>  static void hangcheck_timer_reset(struct msm_gpu *gpu)
>  {
>  	mod_timer(&gpu->hangcheck_timer,
> @@ -949,6 +996,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>  	INIT_LIST_HEAD(&gpu->active_list);
>  	kthread_init_work(&gpu->retire_work, retire_worker);
>  	kthread_init_work(&gpu->recover_work, recover_worker);
> +	kthread_init_work(&gpu->fault_work, fault_worker);
>  
>  	timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
>  
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index c15e5fd675d2..8dae601085ee 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -71,6 +71,15 @@ struct msm_gpu_funcs {
>  	uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
>  };
>  
> +/* Additional state for iommu faults: */
> +struct msm_gpu_fault_info {
> +	u64 ttbr0;
> +	unsigned long iova;
> +	int flags;
> +	const char *type;
> +	const char *block;
> +};
> +
>  struct msm_gpu {
>  	const char *name;
>  	struct drm_device *dev;
> @@ -135,6 +144,12 @@ struct msm_gpu {
>  #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
>  	struct timer_list hangcheck_timer;
>  
> +	/* Fault info for most recent iova fault: */
> +	struct msm_gpu_fault_info fault_info;
> +
> +	/* work for handling GPU ioval faults: */
> +	struct kthread_work fault_work;
> +
>  	/* work for handling GPU recovery: */
>  	struct kthread_work recover_work;
>  
> @@ -243,6 +258,8 @@ struct msm_gpu_state {
>  	char *comm;
>  	char *cmd;
>  
> +	struct msm_gpu_fault_info fault_info;
> +
>  	int nr_bos;
>  	struct msm_gpu_state_bo *bos;
>  };
> diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
> index 379496186c7f..f7d1945e0c9f 100644
> --- a/drivers/gpu/drm/msm/msm_gpummu.c
> +++ b/drivers/gpu/drm/msm/msm_gpummu.c
> @@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
>  	return 0;
>  }
>  
> +static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
> +{
> +}
> +
>  static void msm_gpummu_destroy(struct msm_mmu *mmu)
>  {
>  	struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
> @@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
>  		.map = msm_gpummu_map,
>  		.unmap = msm_gpummu_unmap,
>  		.destroy = msm_gpummu_destroy,
> +		.resume_translation = msm_gpummu_resume_translation,
>  };
>  
>  struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 6975b95c3c29..eed2a762e9dd 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
>  	 * the arm-smmu driver as a trigger to set up TTBR0
>  	 */
>  	if (atomic_inc_return(&iommu->pagetables) == 1) {
> +		/* Enable stall on iommu fault: */
> +		adreno_smmu->set_stall(adreno_smmu->cookie, true);
> +
>  		ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
>  		if (ret) {
>  			free_io_pgtable_ops(pagetable->pgtbl_ops);
> @@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
>  	return 0;
>  }
>  
> +static void msm_iommu_resume_translation(struct msm_mmu *mmu)
> +{
> +	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
> +
> +	adreno_smmu->resume_translation(adreno_smmu->cookie, true);
> +}
> +
>  static void msm_iommu_detach(struct msm_mmu *mmu)
>  {
>  	struct msm_iommu *iommu = to_msm_iommu(mmu);
> @@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
>  		.map = msm_iommu_map,
>  		.unmap = msm_iommu_unmap,
>  		.destroy = msm_iommu_destroy,
> +		.resume_translation = msm_iommu_resume_translation,
>  };
>  
>  struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
> diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> index a88f44c3268d..de158e1bf765 100644
> --- a/drivers/gpu/drm/msm/msm_mmu.h
> +++ b/drivers/gpu/drm/msm/msm_mmu.h
> @@ -15,6 +15,7 @@ struct msm_mmu_funcs {
>  			size_t len, int prot);
>  	int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
>  	void (*destroy)(struct msm_mmu *mmu);
> +	void (*resume_translation)(struct msm_mmu *mmu);
>  };
>  
>  enum msm_mmu_type {
> -- 
> 2.31.1
> 


* Re: [PATCH v4 5/6] drm/msm: Add crashdump support for stalled SMMU
  2021-06-08 15:12     ` Jordan Crouse
@ 2021-06-09 21:46       ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-09 21:46 UTC (permalink / raw)
  To: Rob Clark, dri-devel, Rob Clark, Sean Paul, David Airlie,
	Daniel Vetter, Iskren Chernev, Akhil P Oommen,
	AngeloGioacchino Del Regno, Konrad Dybcio,
	Kristian H. Kristensen, Marijn Suijten, Sai Prakash Ranjan,
	Sharat Masetty, Jonathan Marek, Zhenzhong Duan, Lee Jones,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

On Tue, Jun 8, 2021 at 8:12 AM Jordan Crouse <jordan@cosmicpenguin.net> wrote:
>
> On Tue, Jun 01, 2021 at 03:47:24PM -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > For collecting devcoredumps with the SMMU stalled after an iova fault,
> > we need to skip the parts of the GPU state which are normally collected
> > with the hw crashdumper, since with the SMMU stalled the hw would be
> > unable to write out the requested state to memory.
>
> On a5xx and a6xx you can query RBBM_STATUS3 bit 24 to see if the IOMMU is
> stalled.  That could be an alternative option to adding the "stalled"
> infrastructure across all targets.

Hmm, I suppose it is really only a5xx/a6xx that need to do something
different in this case, because of the crashdumper, so maybe this would
be a reasonable approach.
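
For comparison, a minimal sketch of what that alternative might look like:
the state-capture path derives "stalled" from the status register itself
instead of taking a bool from every caller. This is a user-space mock, not
the msm driver; mock_gpu, mock_state, capture_state() and the enum are
hypothetical names, and bit 24 is taken from Jordan's description above.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: RBBM_STATUS3 bit 24 is set while the SMMU is stalled. */
#define RBBM_STATUS3_SMMU_STALLED (UINT32_C(1) << 24)

enum capture_path {
	CAPTURE_VIA_CRASHDUMPER,	/* hw writes state out to memory */
	CAPTURE_VIA_CPU,		/* registers read directly by the CPU */
};

/* Hypothetical stand-in for the GPU; rbbm_status3 mocks a register read. */
struct mock_gpu {
	uint32_t rbbm_status3;
};

struct mock_state {
	enum capture_path path;
	int has_shaders;	/* shader/cluster dumps need the crashdumper */
};

/* No "stalled" parameter: the target checks the hardware itself. */
static struct mock_state capture_state(const struct mock_gpu *gpu)
{
	struct mock_state state = { CAPTURE_VIA_CRASHDUMPER, 1 };

	assert(gpu);
	if (gpu->rbbm_status3 & RBBM_STATUS3_SMMU_STALLED) {
		/* The crashdumper cannot reach memory, so fall back to CPU reads. */
		state.path = CAPTURE_VIA_CPU;
		state.has_shaders = 0;
	}
	return state;
}
```

The upside would be that a2xx/a3xx/a4xx keep their existing gpu_state_get()
signature, since only the targets with a crashdumper need the check.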

BR,
-R

> Jordan
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a2xx_gpu.c       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a3xx_gpu.c       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a4xx_gpu.c       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  5 ++-
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.h       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 43 ++++++++++++++++-----
> >  drivers/gpu/drm/msm/msm_debugfs.c           |  2 +-
> >  drivers/gpu/drm/msm/msm_gpu.c               |  7 ++--
> >  drivers/gpu/drm/msm/msm_gpu.h               |  2 +-
> >  9 files changed, 47 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> > index bdc989183c64..d2c31fae64fd 100644
> > --- a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> > @@ -434,7 +434,7 @@ static void a2xx_dump(struct msm_gpu *gpu)
> >       adreno_dump(gpu);
> >  }
> >
> > -static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> > index 4534633fe7cd..b1a6f87d74ef 100644
> > --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> > @@ -464,7 +464,7 @@ static void a3xx_dump(struct msm_gpu *gpu)
> >       adreno_dump(gpu);
> >  }
> >
> > -static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> > index 82bebb40234d..22780a594d6f 100644
> > --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> > @@ -549,7 +549,7 @@ static const unsigned int a405_registers[] = {
> >       ~0 /* sentinel */
> >  };
> >
> > -static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > index a0eef5d9b89b..2e7714b1a17f 100644
> > --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > @@ -1519,7 +1519,7 @@ static void a5xx_gpu_state_get_hlsq_regs(struct msm_gpu *gpu,
> >       msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> >  }
> >
> > -static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
> >                       GFP_KERNEL);
> > @@ -1536,7 +1536,8 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
> >       a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
> >
> >       /* Get the HLSQ regs with the help of the crashdumper */
> > -     a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> > +     if (!stalled)
> > +             a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> >
> >       a5xx_set_hwcg(gpu, true);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> > index ce0610c5256f..e0f06ce4e1a9 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> > @@ -86,7 +86,7 @@ unsigned long a6xx_gmu_get_freq(struct msm_gpu *gpu);
> >  void a6xx_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
> >               struct drm_printer *p);
> >
> > -struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu);
> > +struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled);
> >  int a6xx_gpu_state_put(struct msm_gpu_state *state);
> >
> >  #endif /* __A6XX_GPU_H__ */
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> > index c1699b4f9a89..d0af68a76c4f 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> > @@ -833,6 +833,21 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
> >                               a6xx_state, &a6xx_vbif_reglist,
> >                               &a6xx_state->registers[index++]);
> >
> > +     if (!dumper) {
> > +             /*
> > +              * We can't use the crashdumper when the SMMU is stalled,
> > +              * because the GPU has no memory access until we resume
> > +              * translation (but we don't want to do that until after
> > +              * we have captured as much useful GPU state as possible).
> > +              * So instead collect registers via the CPU:
> > +              */
> > +             for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> > +                     a6xx_get_ahb_gpu_registers(gpu,
> > +                             a6xx_state, &a6xx_reglist[i],
> > +                             &a6xx_state->registers[index++]);
> > +             return;
> > +     }
> > +
> >       for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> >               a6xx_get_crashdumper_registers(gpu,
> >                       a6xx_state, &a6xx_reglist[i],
> > @@ -903,9 +918,9 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
> >       a6xx_state->nr_indexed_regs = count;
> >  }
> >
> > -struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> > +struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> > -     struct a6xx_crashdumper dumper = { 0 };
> > +     struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
> >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> >       struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
> >       struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
> > @@ -928,14 +943,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> >       /* Get the banks of indexed registers */
> >       a6xx_get_indexed_registers(gpu, a6xx_state);
> >
> > -     /* Try to initialize the crashdumper */
> > -     if (!a6xx_crashdumper_init(gpu, &dumper)) {
> > -             a6xx_get_registers(gpu, a6xx_state, &dumper);
> > -             a6xx_get_shaders(gpu, a6xx_state, &dumper);
> > -             a6xx_get_clusters(gpu, a6xx_state, &dumper);
> > -             a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
> > +     /*
> > +      * Try to initialize the crashdumper, if we are not dumping state
> > +      * with the SMMU stalled.  The crashdumper needs memory access to
> > +      * write out GPU state, so we need to skip this when the SMMU is
> > +      * stalled in response to an iova fault
> > +      */
> > +     if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
> > +             dumper = &_dumper;
> > +     }
> > +
> > +     a6xx_get_registers(gpu, a6xx_state, dumper);
> > +
> > +     if (dumper) {
> > +             a6xx_get_shaders(gpu, a6xx_state, dumper);
> > +             a6xx_get_clusters(gpu, a6xx_state, dumper);
> > +             a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
> >
> > -             msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> > +             msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
> >       }
> >
> >       if (snapshot_debugbus)
> > diff --git a/drivers/gpu/drm/msm/msm_debugfs.c b/drivers/gpu/drm/msm/msm_debugfs.c
> > index 7a2b53d35e6b..90558e826934 100644
> > --- a/drivers/gpu/drm/msm/msm_debugfs.c
> > +++ b/drivers/gpu/drm/msm/msm_debugfs.c
> > @@ -77,7 +77,7 @@ static int msm_gpu_open(struct inode *inode, struct file *file)
> >               goto free_priv;
> >
> >       pm_runtime_get_sync(&gpu->pdev->dev);
> > -     show_priv->state = gpu->funcs->gpu_state_get(gpu);
> > +     show_priv->state = gpu->funcs->gpu_state_get(gpu, false);
> >       pm_runtime_put_sync(&gpu->pdev->dev);
> >
> >       mutex_unlock(&dev->struct_mutex);
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index fa7691cb4614..4d280bf446e6 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -381,7 +381,8 @@ static void msm_gpu_crashstate_get_bo(struct msm_gpu_state *state,
> >  }
> >
> >  static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> > -             struct msm_gem_submit *submit, char *comm, char *cmd)
> > +             struct msm_gem_submit *submit, char *comm, char *cmd,
> > +             bool stalled)
> >  {
> >       struct msm_gpu_state *state;
> >
> > @@ -393,7 +394,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> >       if (gpu->crashstate)
> >               return;
> >
> > -     state = gpu->funcs->gpu_state_get(gpu);
> > +     state = gpu->funcs->gpu_state_get(gpu, stalled);
> >       if (IS_ERR_OR_NULL(state))
> >               return;
> >
> > @@ -519,7 +520,7 @@ static void recover_worker(struct kthread_work *work)
> >
> >       /* Record the crash state */
> >       pm_runtime_get_sync(&gpu->pdev->dev);
> > -     msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
> > +     msm_gpu_crashstate_capture(gpu, submit, comm, cmd, false);
> >       pm_runtime_put_sync(&gpu->pdev->dev);
> >
> >       kfree(cmd);
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index 7a082a12d98f..c15e5fd675d2 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -60,7 +60,7 @@ struct msm_gpu_funcs {
> >       void (*debugfs_init)(struct msm_gpu *gpu, struct drm_minor *minor);
> >  #endif
> >       unsigned long (*gpu_busy)(struct msm_gpu *gpu);
> > -     struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu);
> > +     struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu, bool stalled);
> >       int (*gpu_state_put)(struct msm_gpu_state *state);
> >       unsigned long (*gpu_get_freq)(struct msm_gpu *gpu);
> >       void (*gpu_set_freq)(struct msm_gpu *gpu, struct dev_pm_opp *opp);
> > --
> > 2.31.1
> >


* Re: [PATCH v4 5/6] drm/msm: Add crashdump support for stalled SMMU
@ 2021-06-09 21:46       ` Rob Clark
  0 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-09 21:46 UTC (permalink / raw)
  To: Rob Clark, dri-devel, Rob Clark, Sean Paul, David Airlie,
	Daniel Vetter, Iskren Chernev, Akhil P Oommen,
	AngeloGioacchino Del Regno, Konrad Dybcio,
	Kristian H. Kristensen, Marijn Suijten, Sai Prakash Ranjan,
	Sharat Masetty, Jonathan Marek, Zhenzhong Duan, Lee Jones,
	open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

On Tue, Jun 8, 2021 at 8:12 AM Jordan Crouse <jordan@cosmicpenguin.net> wrote:
>
> On Tue, Jun 01, 2021 at 03:47:24PM -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > For collecting devcoredumps with the SMMU stalled after an iova fault,
> > we need to skip the parts of the GPU state which are normally collected
> > with the hw crashdumper, since with the SMMU stalled the hw would be
> > unable to write out the requested state to memory.
>
> On a5xx and a6xx you can query RBBM_STATUS3 bit 24 to see if the IOMMU is
> stalled.  That could be an alternative option to adding the "stalled"
> infrastructure across all targets.

Hmm, I suppose it is really only a5xx/a6xx that need to do something
different in this case, because of the crashdumper, so maybe this would
be a reasonable approach.

BR,
-R

> Jordan
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a2xx_gpu.c       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a3xx_gpu.c       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a4xx_gpu.c       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  5 ++-
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.h       |  2 +-
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 43 ++++++++++++++++-----
> >  drivers/gpu/drm/msm/msm_debugfs.c           |  2 +-
> >  drivers/gpu/drm/msm/msm_gpu.c               |  7 ++--
> >  drivers/gpu/drm/msm/msm_gpu.h               |  2 +-
> >  9 files changed, 47 insertions(+), 20 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> > index bdc989183c64..d2c31fae64fd 100644
> > --- a/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a2xx_gpu.c
> > @@ -434,7 +434,7 @@ static void a2xx_dump(struct msm_gpu *gpu)
> >       adreno_dump(gpu);
> >  }
> >
> > -static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a2xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> > index 4534633fe7cd..b1a6f87d74ef 100644
> > --- a/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a3xx_gpu.c
> > @@ -464,7 +464,7 @@ static void a3xx_dump(struct msm_gpu *gpu)
> >       adreno_dump(gpu);
> >  }
> >
> > -static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a3xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> > index 82bebb40234d..22780a594d6f 100644
> > --- a/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a4xx_gpu.c
> > @@ -549,7 +549,7 @@ static const unsigned int a405_registers[] = {
> >       ~0 /* sentinel */
> >  };
> >
> > -static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a4xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct msm_gpu_state *state = kzalloc(sizeof(*state), GFP_KERNEL);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > index a0eef5d9b89b..2e7714b1a17f 100644
> > --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> > @@ -1519,7 +1519,7 @@ static void a5xx_gpu_state_get_hlsq_regs(struct msm_gpu *gpu,
> >       msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> >  }
> >
> > -static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
> > +static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> >       struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
> >                       GFP_KERNEL);
> > @@ -1536,7 +1536,8 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
> >       a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
> >
> >       /* Get the HLSQ regs with the help of the crashdumper */
> > -     a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> > +     if (!stalled)
> > +             a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> >
> >       a5xx_set_hwcg(gpu, true);
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> > index ce0610c5256f..e0f06ce4e1a9 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.h
> > @@ -86,7 +86,7 @@ unsigned long a6xx_gmu_get_freq(struct msm_gpu *gpu);
> >  void a6xx_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
> >               struct drm_printer *p);
> >
> > -struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu);
> > +struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled);
> >  int a6xx_gpu_state_put(struct msm_gpu_state *state);
> >
> >  #endif /* __A6XX_GPU_H__ */
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> > index c1699b4f9a89..d0af68a76c4f 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> > @@ -833,6 +833,21 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
> >                               a6xx_state, &a6xx_vbif_reglist,
> >                               &a6xx_state->registers[index++]);
> >
> > +     if (!dumper) {
> > +             /*
> > +              * We can't use the crashdumper when the SMMU is stalled,
> > +              * because the GPU has no memory access until we resume
> > +              * translation (but we don't want to do that until after
> > +              * we have captured as much useful GPU state as possible).
> > +              * So instead collect registers via the CPU:
> > +              */
> > +             for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> > +                     a6xx_get_ahb_gpu_registers(gpu,
> > +                             a6xx_state, &a6xx_reglist[i],
> > +                             &a6xx_state->registers[index++]);
> > +             return;
> > +     }
> > +
> >       for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> >               a6xx_get_crashdumper_registers(gpu,
> >                       a6xx_state, &a6xx_reglist[i],
> > @@ -903,9 +918,9 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
> >       a6xx_state->nr_indexed_regs = count;
> >  }
> >
> > -struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> > +struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu, bool stalled)
> >  {
> > -     struct a6xx_crashdumper dumper = { 0 };
> > +     struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
> >       struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
> >       struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
> >       struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
> > @@ -928,14 +943,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
> >       /* Get the banks of indexed registers */
> >       a6xx_get_indexed_registers(gpu, a6xx_state);
> >
> > -     /* Try to initialize the crashdumper */
> > -     if (!a6xx_crashdumper_init(gpu, &dumper)) {
> > -             a6xx_get_registers(gpu, a6xx_state, &dumper);
> > -             a6xx_get_shaders(gpu, a6xx_state, &dumper);
> > -             a6xx_get_clusters(gpu, a6xx_state, &dumper);
> > -             a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
> > +     /*
> > +      * Try to initialize the crashdumper, if we are not dumping state
> > +      * with the SMMU stalled.  The crashdumper needs memory access to
> > +      * write out GPU state, so we need to skip this when the SMMU is
> > +      * stalled in response to an iova fault
> > +      */
> > +     if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
> > +             dumper = &_dumper;
> > +     }
> > +
> > +     a6xx_get_registers(gpu, a6xx_state, dumper);
> > +
> > +     if (dumper) {
> > +             a6xx_get_shaders(gpu, a6xx_state, dumper);
> > +             a6xx_get_clusters(gpu, a6xx_state, dumper);
> > +             a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
> >
> > -             msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> > +             msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
> >       }
> >
> >       if (snapshot_debugbus)
> > diff --git a/drivers/gpu/drm/msm/msm_debugfs.c b/drivers/gpu/drm/msm/msm_debugfs.c
> > index 7a2b53d35e6b..90558e826934 100644
> > --- a/drivers/gpu/drm/msm/msm_debugfs.c
> > +++ b/drivers/gpu/drm/msm/msm_debugfs.c
> > @@ -77,7 +77,7 @@ static int msm_gpu_open(struct inode *inode, struct file *file)
> >               goto free_priv;
> >
> >       pm_runtime_get_sync(&gpu->pdev->dev);
> > -     show_priv->state = gpu->funcs->gpu_state_get(gpu);
> > +     show_priv->state = gpu->funcs->gpu_state_get(gpu, false);
> >       pm_runtime_put_sync(&gpu->pdev->dev);
> >
> >       mutex_unlock(&dev->struct_mutex);
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index fa7691cb4614..4d280bf446e6 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -381,7 +381,8 @@ static void msm_gpu_crashstate_get_bo(struct msm_gpu_state *state,
> >  }
> >
> >  static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> > -             struct msm_gem_submit *submit, char *comm, char *cmd)
> > +             struct msm_gem_submit *submit, char *comm, char *cmd,
> > +             bool stalled)
> >  {
> >       struct msm_gpu_state *state;
> >
> > @@ -393,7 +394,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> >       if (gpu->crashstate)
> >               return;
> >
> > -     state = gpu->funcs->gpu_state_get(gpu);
> > +     state = gpu->funcs->gpu_state_get(gpu, stalled);
> >       if (IS_ERR_OR_NULL(state))
> >               return;
> >
> > @@ -519,7 +520,7 @@ static void recover_worker(struct kthread_work *work)
> >
> >       /* Record the crash state */
> >       pm_runtime_get_sync(&gpu->pdev->dev);
> > -     msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
> > +     msm_gpu_crashstate_capture(gpu, submit, comm, cmd, false);
> >       pm_runtime_put_sync(&gpu->pdev->dev);
> >
> >       kfree(cmd);
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index 7a082a12d98f..c15e5fd675d2 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -60,7 +60,7 @@ struct msm_gpu_funcs {
> >       void (*debugfs_init)(struct msm_gpu *gpu, struct drm_minor *minor);
> >  #endif
> >       unsigned long (*gpu_busy)(struct msm_gpu *gpu);
> > -     struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu);
> > +     struct msm_gpu_state *(*gpu_state_get)(struct msm_gpu *gpu, bool stalled);
> >       int (*gpu_state_put)(struct msm_gpu_state *state);
> >       unsigned long (*gpu_get_freq)(struct msm_gpu *gpu);
> >       void (*gpu_set_freq)(struct msm_gpu *gpu, struct dev_pm_opp *opp);
> > --
> > 2.31.1
> >


* Re: [PATCH v4 6/6] drm/msm: devcoredump iommu fault support
  2021-06-08 15:20     ` Jordan Crouse
@ 2021-06-09 21:50       ` Rob Clark
  -1 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-09 21:50 UTC (permalink / raw)
  To: Rob Clark, dri-devel, Rob Clark, Sean Paul, David Airlie,
	Daniel Vetter, Sai Prakash Ranjan, Jonathan Marek,
	Akhil P Oommen, Eric Anholt, Sharat Masetty, Douglas Anderson,
	Bjorn Andersson, open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

On Tue, Jun 8, 2021 at 8:20 AM Jordan Crouse <jordan@cosmicpenguin.net> wrote:
>
> On Tue, Jun 01, 2021 at 03:47:25PM -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Wire up support to stall the SMMU on iova fault, and collect a devcore-
> > dump snapshot for easier debugging of faults.
> >
> > Currently this is a6xx-only, but mostly only because so far it is the
> > only one using adreno-smmu-priv.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 29 +++++++++++++--
> >  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 ++++++++
> >  drivers/gpu/drm/msm/msm_gem.h           |  1 +
> >  drivers/gpu/drm/msm/msm_gem_submit.c    |  1 +
> >  drivers/gpu/drm/msm/msm_gpu.c           | 48 +++++++++++++++++++++++++
> >  drivers/gpu/drm/msm/msm_gpu.h           | 17 +++++++++
> >  drivers/gpu/drm/msm/msm_gpummu.c        |  5 +++
> >  drivers/gpu/drm/msm/msm_iommu.c         | 11 ++++++
> >  drivers/gpu/drm/msm/msm_mmu.h           |  1 +
> >  9 files changed, 126 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 094dc17fd20f..0dcde917e575 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1008,6 +1008,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
> >       struct msm_gpu *gpu = arg;
> >       struct adreno_smmu_fault_info *info = data;
> >       const char *type = "UNKNOWN";
> > +     const char *block;
> > +     bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
> > +
> > +     /*
> > +      * If we aren't going to be resuming later from fault_worker, then do
> > +      * it now.
> > +      */
> > +     if (!do_devcoredump) {
> > +             gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> > +     }
> >
> >       /*
> >        * Print a default message if we couldn't get the data from the
> > @@ -1031,15 +1041,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
> >       else if (info->fsr & ARM_SMMU_FSR_EF)
> >               type = "EXTERNAL";
> >
> > +     block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
> > +
> >       pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
> >                       info->ttbr0, iova,
> > -                     flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> > -                     a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> > +                     flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
> > +                     type, block,
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
> >
> > +     if (do_devcoredump) {
> > +             /* Turn off the hangcheck timer to keep it from bothering us */
> > +             del_timer(&gpu->hangcheck_timer);
> > +
> > +             gpu->fault_info.ttbr0 = info->ttbr0;
> > +             gpu->fault_info.iova  = iova;
> > +             gpu->fault_info.flags = flags;
> > +             gpu->fault_info.type  = type;
> > +             gpu->fault_info.block = block;
> > +
> > +             kthread_queue_work(gpu->worker, &gpu->fault_work);
> > +     }
> > +
> >       return 0;
> >  }
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > index cf897297656f..4e88d4407667 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > @@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
> >                       adreno_gpu->info->revn, adreno_gpu->rev.core,
> >                       adreno_gpu->rev.major, adreno_gpu->rev.minor,
> >                       adreno_gpu->rev.patchid);
> > +     /*
> > +      * If this is state collected due to iova fault, show fault related info
> > +      *
> > +      * TTBR0 would not be zero, so this is a good way to distinguish
> > +      */
> > +     if (state->fault_info.ttbr0) {
> > +             const struct msm_gpu_fault_info *info = &state->fault_info;
> > +
> > +             drm_puts(p, "fault-info:\n");
> > +             drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
> > +             drm_printf(p, "  - iova=%.16lx\n", info->iova);
> > +             drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
> > +             drm_printf(p, "  - type=%s\n", info->type);
> > +             drm_printf(p, "  - source=%s\n", info->block);
> > +     }
> >
> >       drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> > index 03e2cc2a2ce1..405f8411e395 100644
> > --- a/drivers/gpu/drm/msm/msm_gem.h
> > +++ b/drivers/gpu/drm/msm/msm_gem.h
> > @@ -328,6 +328,7 @@ struct msm_gem_submit {
> >       struct dma_fence *fence;
> >       struct msm_gpu_submitqueue *queue;
> >       struct pid *pid;    /* submitting process */
> > +     bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
> >       bool valid;         /* true if no cmdstream patching needed */
> >       bool in_rb;         /* "sudo" mode, copy cmds into RB */
> >       struct msm_ringbuffer *ring;
> > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > index 5480852bdeda..44f84bfd0c0e 100644
> > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > @@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >       submit->cmd = (void *)&submit->bos[nr_bos];
> >       submit->queue = queue;
> >       submit->ring = gpu->rb[queue->prio];
> > +     submit->fault_dumped = false;
> >
> >       /* initially, until copy_from_user() and bo lookup succeeds: */
> >       submit->nr_bos = 0;
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index 4d280bf446e6..4da2053c1ffb 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -401,6 +401,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> >       /* Fill in the additional crash state information */
> >       state->comm = kstrdup(comm, GFP_KERNEL);
> >       state->cmd = kstrdup(cmd, GFP_KERNEL);
> > +     state->fault_info = gpu->fault_info;
> >
> >       if (submit) {
> >               int i, nr = 0;
> > @@ -573,6 +574,52 @@ static void recover_worker(struct kthread_work *work)
> >       msm_gpu_retire(gpu);
> >  }
> >
> > +static void fault_worker(struct kthread_work *work)
> > +{
> > +     struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
> > +     struct drm_device *dev = gpu->dev;
> > +     struct msm_gem_submit *submit;
> > +     struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
> > +     char *comm = NULL, *cmd = NULL;
> > +
> > +     mutex_lock(&dev->struct_mutex);
> > +
> > +     submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
> > +     if (submit && submit->fault_dumped)
> > +             goto resume_smmu;
> > +
> > +     if (submit) {
> > +             struct task_struct *task;
> > +
> > +             task = get_pid_task(submit->pid, PIDTYPE_PID);
> > +             if (task) {
> > +                     comm = kstrdup(task->comm, GFP_KERNEL);
> > +                     cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
> > +                     put_task_struct(task);
> > +             }
> > +
> > +             /*
> > +              * When we get GPU iova faults, we can get 1000s of them,
> > +              * but we really only want to log the first one.
> > +              */
> > +             submit->fault_dumped = true;
> > +     }
> > +
> > +     /* Record the crash state */
> > +     pm_runtime_get_sync(&gpu->pdev->dev);
> > +     msm_gpu_crashstate_capture(gpu, submit, comm, cmd, true);
>
> You are going to run the risk of a race here. Once the IOMMU stalls then the
> various bits of the GPU pipeline are going to stop and as soon as one of them
> hits the hang cycles threshold it's going to push the big red HANG! button.
>
> It is fine to keep this infrastructure in place, but there needs to be an
> escape valve in the hang infrastructure to keep you from double dumping and
> also to keep from resetting the GPU if that isn't your intention.
>
> This can be as simple as adding a RBBM_STATUS3 check in the hang function and
> returning early or you could skip the capture state call here and rely on the
> hang to be the single entry point into the crashstate capture (with the
> appropriate protections to keep from accidentally recovering, of course).

I guess it isn't really the hw hitting the reset button, but just
raising an irq to the driver which hits the reset button?

If this is the case, I think the pragmatic thing is just to check the
stall bit in RBBM_STATUS3.

Even without that we are serializing recover and fault work on a
single worker, so I *think* we are good.. but I suppose the
RBBM_STATUS3 check would be useful to avoid dmesg spam about hangs
when the real issue is a fault
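
For illustration, the early return in the hang path could be shaped
roughly like this. It is a standalone sketch, not the driver's
hangcheck code: the enum, the function name and the mask macro are
made up for the example, and the stall bit is assumed to be
RBBM_STATUS3 bit 24 as mentioned earlier in the thread:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the SMMU stall bit in RBBM_STATUS3. */
#define SMMU_STALLED_MASK (UINT32_C(1) << 24)

enum hang_action {
	HANG_IGNORE,		/* fence advanced, GPU is making progress */
	HANG_DEFER_TO_FAULT,	/* SMMU stalled: fault path owns the dump */
	HANG_RECOVER,		/* genuine hang: capture state and reset */
};

/*
 * Classify a hangcheck tick.  A stalled SMMU is reported separately so
 * that the caller can skip both the "hang detected" message and the
 * recovery, leaving the devcoredump to the fault worker.
 */
static enum hang_action classify_hang(uint32_t last_fence,
				      uint32_t cur_fence,
				      uint32_t rbbm_status3)
{
	if (cur_fence != last_fence)
		return HANG_IGNORE;
	if (rbbm_status3 & SMMU_STALLED_MASK)
		return HANG_DEFER_TO_FAULT;
	return HANG_RECOVER;
}
```

The stall check comes after the progress check so that a GPU that is
still retiring fences is never treated as faulted or hung.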

BR,
-R


> Jordan
>
> > +     pm_runtime_put_sync(&gpu->pdev->dev);
> > +
> > +     kfree(cmd);
> > +     kfree(comm);
> > +
> > +resume_smmu:
> > +     memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
> > +     gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> > +
> > +     mutex_unlock(&dev->struct_mutex);
> > +}
> > +
> >  static void hangcheck_timer_reset(struct msm_gpu *gpu)
> >  {
> >       mod_timer(&gpu->hangcheck_timer,
> > @@ -949,6 +996,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> >       INIT_LIST_HEAD(&gpu->active_list);
> >       kthread_init_work(&gpu->retire_work, retire_worker);
> >       kthread_init_work(&gpu->recover_work, recover_worker);
> > +     kthread_init_work(&gpu->fault_work, fault_worker);
> >
> >       timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index c15e5fd675d2..8dae601085ee 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -71,6 +71,15 @@ struct msm_gpu_funcs {
> >       uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
> >  };
> >
> > +/* Additional state for iommu faults: */
> > +struct msm_gpu_fault_info {
> > +     u64 ttbr0;
> > +     unsigned long iova;
> > +     int flags;
> > +     const char *type;
> > +     const char *block;
> > +};
> > +
> >  struct msm_gpu {
> >       const char *name;
> >       struct drm_device *dev;
> > @@ -135,6 +144,12 @@ struct msm_gpu {
> >  #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
> >       struct timer_list hangcheck_timer;
> >
> > +     /* Fault info for most recent iova fault: */
> > +     struct msm_gpu_fault_info fault_info;
> > +
> > +     /* work for handling GPU iova faults: */
> > +     struct kthread_work fault_work;
> > +
> >       /* work for handling GPU recovery: */
> >       struct kthread_work recover_work;
> >
> > @@ -243,6 +258,8 @@ struct msm_gpu_state {
> >       char *comm;
> >       char *cmd;
> >
> > +     struct msm_gpu_fault_info fault_info;
> > +
> >       int nr_bos;
> >       struct msm_gpu_state_bo *bos;
> >  };
> > diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
> > index 379496186c7f..f7d1945e0c9f 100644
> > --- a/drivers/gpu/drm/msm/msm_gpummu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpummu.c
> > @@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
> >       return 0;
> >  }
> >
> > +static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
> > +{
> > +}
> > +
> >  static void msm_gpummu_destroy(struct msm_mmu *mmu)
> >  {
> >       struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
> > @@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
> >               .map = msm_gpummu_map,
> >               .unmap = msm_gpummu_unmap,
> >               .destroy = msm_gpummu_destroy,
> > +             .resume_translation = msm_gpummu_resume_translation,
> >  };
> >
> >  struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> > index 6975b95c3c29..eed2a762e9dd 100644
> > --- a/drivers/gpu/drm/msm/msm_iommu.c
> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
> > @@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
> >        * the arm-smmu driver as a trigger to set up TTBR0
> >        */
> >       if (atomic_inc_return(&iommu->pagetables) == 1) {
> > +             /* Enable stall on iommu fault: */
> > +             adreno_smmu->set_stall(adreno_smmu->cookie, true);
> > +
> >               ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
> >               if (ret) {
> >                       free_io_pgtable_ops(pagetable->pgtbl_ops);
> > @@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> >       return 0;
> >  }
> >
> > +static void msm_iommu_resume_translation(struct msm_mmu *mmu)
> > +{
> > +     struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
> > +
> > +     adreno_smmu->resume_translation(adreno_smmu->cookie, true);
> > +}
> > +
> >  static void msm_iommu_detach(struct msm_mmu *mmu)
> >  {
> >       struct msm_iommu *iommu = to_msm_iommu(mmu);
> > @@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
> >               .map = msm_iommu_map,
> >               .unmap = msm_iommu_unmap,
> >               .destroy = msm_iommu_destroy,
> > +             .resume_translation = msm_iommu_resume_translation,
> >  };
> >
> >  struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
> > diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> > index a88f44c3268d..de158e1bf765 100644
> > --- a/drivers/gpu/drm/msm/msm_mmu.h
> > +++ b/drivers/gpu/drm/msm/msm_mmu.h
> > @@ -15,6 +15,7 @@ struct msm_mmu_funcs {
> >                       size_t len, int prot);
> >       int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
> >       void (*destroy)(struct msm_mmu *mmu);
> > +     void (*resume_translation)(struct msm_mmu *mmu);
> >  };
> >
> >  enum msm_mmu_type {
> > --
> > 2.31.1
> >


* Re: [PATCH v4 6/6] drm/msm: devcoredump iommu fault support
@ 2021-06-09 21:50       ` Rob Clark
  0 siblings, 0 replies; 32+ messages in thread
From: Rob Clark @ 2021-06-09 21:50 UTC (permalink / raw)
  To: Rob Clark, dri-devel, Rob Clark, Sean Paul, David Airlie,
	Daniel Vetter, Sai Prakash Ranjan, Jonathan Marek,
	Akhil P Oommen, Eric Anholt, Sharat Masetty, Douglas Anderson,
	Bjorn Andersson, open list:DRM DRIVER FOR MSM ADRENO GPU,
	open list:DRM DRIVER FOR MSM ADRENO GPU, open list

On Tue, Jun 8, 2021 at 8:20 AM Jordan Crouse <jordan@cosmicpenguin.net> wrote:
>
> On Tue, Jun 01, 2021 at 03:47:25PM -0700, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > Wire up support to stall the SMMU on iova fault, and collect a devcore-
> > dump snapshot for easier debugging of faults.
> >
> > Currently this is a6xx-only, but mostly only because so far it is the
> > only one using adreno-smmu-priv.
> >
> > Signed-off-by: Rob Clark <robdclark@chromium.org>
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c   | 29 +++++++++++++--
> >  drivers/gpu/drm/msm/adreno/adreno_gpu.c | 15 ++++++++
> >  drivers/gpu/drm/msm/msm_gem.h           |  1 +
> >  drivers/gpu/drm/msm/msm_gem_submit.c    |  1 +
> >  drivers/gpu/drm/msm/msm_gpu.c           | 48 +++++++++++++++++++++++++
> >  drivers/gpu/drm/msm/msm_gpu.h           | 17 +++++++++
> >  drivers/gpu/drm/msm/msm_gpummu.c        |  5 +++
> >  drivers/gpu/drm/msm/msm_iommu.c         | 11 ++++++
> >  drivers/gpu/drm/msm/msm_mmu.h           |  1 +
> >  9 files changed, 126 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 094dc17fd20f..0dcde917e575 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1008,6 +1008,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
> >       struct msm_gpu *gpu = arg;
> >       struct adreno_smmu_fault_info *info = data;
> >       const char *type = "UNKNOWN";
> > +     const char *block;
> > +     bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
> > +
> > +     /*
> > +      * If we aren't going to be resuming later from fault_worker, then do
> > +      * it now.
> > +      */
> > +     if (!do_devcoredump) {
> > +             gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> > +     }
> >
> >       /*
> >        * Print a default message if we couldn't get the data from the
> > @@ -1031,15 +1041,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
> >       else if (info->fsr & ARM_SMMU_FSR_EF)
> >               type = "EXTERNAL";
> >
> > +     block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
> > +
> >       pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
> >                       info->ttbr0, iova,
> > -                     flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> > -                     a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> > +                     flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
> > +                     type, block,
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> >                       gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
> >
> > +     if (do_devcoredump) {
> > +             /* Turn off the hangcheck timer to keep it from bothering us */
> > +             del_timer(&gpu->hangcheck_timer);
> > +
> > +             gpu->fault_info.ttbr0 = info->ttbr0;
> > +             gpu->fault_info.iova  = iova;
> > +             gpu->fault_info.flags = flags;
> > +             gpu->fault_info.type  = type;
> > +             gpu->fault_info.block = block;
> > +
> > +             kthread_queue_work(gpu->worker, &gpu->fault_work);
> > +     }
> > +
> >       return 0;
> >  }
> >
> > diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > index cf897297656f..4e88d4407667 100644
> > --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> > @@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
> >                       adreno_gpu->info->revn, adreno_gpu->rev.core,
> >                       adreno_gpu->rev.major, adreno_gpu->rev.minor,
> >                       adreno_gpu->rev.patchid);
> > +     /*
> > +      * If this is state collected due to iova fault, show fault related info
> > +      *
> > +      * TTBR0 would not be zero, so this is a good way to distinguish
> > +      */
> > +     if (state->fault_info.ttbr0) {
> > +             const struct msm_gpu_fault_info *info = &state->fault_info;
> > +
> > +             drm_puts(p, "fault-info:\n");
> > +             drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
> > +             drm_printf(p, "  - iova=%.16lx\n", info->iova);
> > +             drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
> > +             drm_printf(p, "  - type=%s\n", info->type);
> > +             drm_printf(p, "  - source=%s\n", info->block);
> > +     }
> >
> >       drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> > index 03e2cc2a2ce1..405f8411e395 100644
> > --- a/drivers/gpu/drm/msm/msm_gem.h
> > +++ b/drivers/gpu/drm/msm/msm_gem.h
> > @@ -328,6 +328,7 @@ struct msm_gem_submit {
> >       struct dma_fence *fence;
> >       struct msm_gpu_submitqueue *queue;
> >       struct pid *pid;    /* submitting process */
> > +     bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
> >       bool valid;         /* true if no cmdstream patching needed */
> >       bool in_rb;         /* "sudo" mode, copy cmds into RB */
> >       struct msm_ringbuffer *ring;
> > diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> > index 5480852bdeda..44f84bfd0c0e 100644
> > --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> > +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> > @@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
> >       submit->cmd = (void *)&submit->bos[nr_bos];
> >       submit->queue = queue;
> >       submit->ring = gpu->rb[queue->prio];
> > +     submit->fault_dumped = false;
> >
> >       /* initially, until copy_from_user() and bo lookup succeeds: */
> >       submit->nr_bos = 0;
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> > index 4d280bf446e6..4da2053c1ffb 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpu.c
> > @@ -401,6 +401,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
> >       /* Fill in the additional crash state information */
> >       state->comm = kstrdup(comm, GFP_KERNEL);
> >       state->cmd = kstrdup(cmd, GFP_KERNEL);
> > +     state->fault_info = gpu->fault_info;
> >
> >       if (submit) {
> >               int i, nr = 0;
> > @@ -573,6 +574,52 @@ static void recover_worker(struct kthread_work *work)
> >       msm_gpu_retire(gpu);
> >  }
> >
> > +static void fault_worker(struct kthread_work *work)
> > +{
> > +     struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
> > +     struct drm_device *dev = gpu->dev;
> > +     struct msm_gem_submit *submit;
> > +     struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
> > +     char *comm = NULL, *cmd = NULL;
> > +
> > +     mutex_lock(&dev->struct_mutex);
> > +
> > +     submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
> > +     if (submit && submit->fault_dumped)
> > +             goto resume_smmu;
> > +
> > +     if (submit) {
> > +             struct task_struct *task;
> > +
> > +             task = get_pid_task(submit->pid, PIDTYPE_PID);
> > +             if (task) {
> > +                     comm = kstrdup(task->comm, GFP_KERNEL);
> > +                     cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
> > +                     put_task_struct(task);
> > +             }
> > +
> > +             /*
> > +              * When we get GPU iova faults, we can get 1000s of them,
> > +              * but we really only want to log the first one.
> > +              */
> > +             submit->fault_dumped = true;
> > +     }
> > +
> > +     /* Record the crash state */
> > +     pm_runtime_get_sync(&gpu->pdev->dev);
> > +     msm_gpu_crashstate_capture(gpu, submit, comm, cmd, true);
>
> You are going to run the risk of a race here. Once the IOMMU stalls then the
> various bits of the GPU pipeline are going to stop, and as soon as one of
> them hits the hang cycles threshold it's going to push the big red HANG!
> button.
>
> It is fine to keep this infrastructure in place, but there needs to be an
> escape valve in the hang infrastructure to keep you from double dumping and
> also to keep from resetting the GPU if that isn't your intention.
>
> This can be as simple as adding a RBBM_STATUS3 check in the hang function and
> returning early or you could skip the capture state call here and rely on the
> hang to be the single entry point into the crashstate capture (with the
> appropriate protections to keep from accidentally recovering, of course).

I guess it isn't really the hw hitting the reset button, but just
raising an irq to the driver which hits the reset button?

If this is the case, I think the pragmatic thing is just to check the
stall bit in RBBM_STATUS3.
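
Roughly what that early return could look like, as a standalone sketch
rather than driver code. Everything named sketch_* / SKETCH_* is made up
for illustration: the bit position is an assumption (the thread does not
give the RBBM_STATUS3 layout), and the struct field stands in for a real
register read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical bit position, for illustration only; the real
 * RBBM_STATUS3 layout is not given in this thread.
 */
#define SKETCH_STATUS3_SMMU_STALLED (1u << 24)

struct sketch_gpu {
	uint32_t rbbm_status3;      /* stand-in for reading the register */
	unsigned int hang_reports;  /* "GPU hang" messages that would be logged */
	unsigned int recoveries;    /* times recovery would have been queued */
};

/* Returns true if the hang was acted on, false if it was ignored. */
static bool sketch_hang_detect(struct sketch_gpu *gpu)
{
	assert(gpu != NULL);

	/*
	 * The SMMU is stalled on a fault, so the pipeline stopping is
	 * expected. Leave the snapshot and resume to the fault path and
	 * skip both the log message and the reset.
	 */
	if (gpu->rbbm_status3 & SKETCH_STATUS3_SMMU_STALLED)
		return false;

	gpu->hang_reports++;
	gpu->recoveries++;
	return true;
}
```

With the stall bit set the function returns without touching either
counter, which is the "no double dump, no accidental reset" behaviour
asked for above.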

Even without that we are serializing recover and fault work on a
single worker, so I *think* we are good, but I suppose the
RBBM_STATUS3 check would be useful to avoid dmesg spam about hangs
when the real issue is a fault.
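
A toy model of the serialization argument, assuming nothing more than
"one worker runs queued items one at a time, in order". It is not the
kthread_worker API; the sketch_* names are hypothetical and the state
struct only tracks what matters for the ordering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_WORK 8

struct sketch_state {
	bool smmu_stalled;        /* set when the fault irq fires */
	bool fault_dumped;        /* one dump per submit, as in the patch */
	bool reset_while_stalled; /* would indicate the race discussed above */
	int dumps;
	int resets;
};

typedef void (*sketch_work_fn)(struct sketch_state *);

/* Single worker: a plain FIFO drained by one caller. */
struct sketch_worker {
	sketch_work_fn queue[SKETCH_MAX_WORK];
	size_t head, tail;
};

static void sketch_queue_work(struct sketch_worker *w, sketch_work_fn fn)
{
	assert(w->tail < SKETCH_MAX_WORK);
	w->queue[w->tail++] = fn;
}

/* Run every queued item to completion before starting the next. */
static void sketch_run(struct sketch_worker *w, struct sketch_state *s)
{
	while (w->head < w->tail)
		w->queue[w->head++](s);
}

static void sketch_fault_work(struct sketch_state *s)
{
	/* Only the first fault for a submit produces a dump. */
	if (!s->fault_dumped) {
		s->dumps++;
		s->fault_dumped = true;
	}
	s->smmu_stalled = false; /* models resume_translation() */
}

static void sketch_recover_work(struct sketch_state *s)
{
	/* Note whether recovery ran while the SMMU was still stalled. */
	if (s->smmu_stalled)
		s->reset_while_stalled = true;
	s->resets++;
}
```

If the fault work is queued ahead of the recover work, the snapshot and
resume finish before the reset runs. The reset itself still happens in
this model, which is why the RBBM_STATUS3 check is still worth having.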

BR,
-R


> Jordan
>
> > +     pm_runtime_put_sync(&gpu->pdev->dev);
> > +
> > +     kfree(cmd);
> > +     kfree(comm);
> > +
> > +resume_smmu:
> > +     memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
> > +     gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> > +
> > +     mutex_unlock(&dev->struct_mutex);
> > +}
> > +
> >  static void hangcheck_timer_reset(struct msm_gpu *gpu)
> >  {
> >       mod_timer(&gpu->hangcheck_timer,
> > @@ -949,6 +996,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
> >       INIT_LIST_HEAD(&gpu->active_list);
> >       kthread_init_work(&gpu->retire_work, retire_worker);
> >       kthread_init_work(&gpu->recover_work, recover_worker);
> > +     kthread_init_work(&gpu->fault_work, fault_worker);
> >
> >       timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
> >
> > diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> > index c15e5fd675d2..8dae601085ee 100644
> > --- a/drivers/gpu/drm/msm/msm_gpu.h
> > +++ b/drivers/gpu/drm/msm/msm_gpu.h
> > @@ -71,6 +71,15 @@ struct msm_gpu_funcs {
> >       uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
> >  };
> >
> > +/* Additional state for iommu faults: */
> > +struct msm_gpu_fault_info {
> > +     u64 ttbr0;
> > +     unsigned long iova;
> > +     int flags;
> > +     const char *type;
> > +     const char *block;
> > +};
> > +
> >  struct msm_gpu {
> >       const char *name;
> >       struct drm_device *dev;
> > @@ -135,6 +144,12 @@ struct msm_gpu {
> >  #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
> >       struct timer_list hangcheck_timer;
> >
> > +     /* Fault info for most recent iova fault: */
> > +     struct msm_gpu_fault_info fault_info;
> > +
> > +     /* work for handling GPU iova faults: */
> > +     struct kthread_work fault_work;
> > +
> >       /* work for handling GPU recovery: */
> >       struct kthread_work recover_work;
> >
> > @@ -243,6 +258,8 @@ struct msm_gpu_state {
> >       char *comm;
> >       char *cmd;
> >
> > +     struct msm_gpu_fault_info fault_info;
> > +
> >       int nr_bos;
> >       struct msm_gpu_state_bo *bos;
> >  };
> > diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
> > index 379496186c7f..f7d1945e0c9f 100644
> > --- a/drivers/gpu/drm/msm/msm_gpummu.c
> > +++ b/drivers/gpu/drm/msm/msm_gpummu.c
> > @@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
> >       return 0;
> >  }
> >
> > +static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
> > +{
> > +}
> > +
> >  static void msm_gpummu_destroy(struct msm_mmu *mmu)
> >  {
> >       struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
> > @@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
> >               .map = msm_gpummu_map,
> >               .unmap = msm_gpummu_unmap,
> >               .destroy = msm_gpummu_destroy,
> > +             .resume_translation = msm_gpummu_resume_translation,
> >  };
> >
> >  struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> > index 6975b95c3c29..eed2a762e9dd 100644
> > --- a/drivers/gpu/drm/msm/msm_iommu.c
> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
> > @@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
> >        * the arm-smmu driver as a trigger to set up TTBR0
> >        */
> >       if (atomic_inc_return(&iommu->pagetables) == 1) {
> > +             /* Enable stall on iommu fault: */
> > +             adreno_smmu->set_stall(adreno_smmu->cookie, true);
> > +
> >               ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
> >               if (ret) {
> >                       free_io_pgtable_ops(pagetable->pgtbl_ops);
> > @@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> >       return 0;
> >  }
> >
> > +static void msm_iommu_resume_translation(struct msm_mmu *mmu)
> > +{
> > +     struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
> > +
> > +     adreno_smmu->resume_translation(adreno_smmu->cookie, true);
> > +}
> > +
> >  static void msm_iommu_detach(struct msm_mmu *mmu)
> >  {
> >       struct msm_iommu *iommu = to_msm_iommu(mmu);
> > @@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
> >               .map = msm_iommu_map,
> >               .unmap = msm_iommu_unmap,
> >               .destroy = msm_iommu_destroy,
> > +             .resume_translation = msm_iommu_resume_translation,
> >  };
> >
> >  struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
> > diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> > index a88f44c3268d..de158e1bf765 100644
> > --- a/drivers/gpu/drm/msm/msm_mmu.h
> > +++ b/drivers/gpu/drm/msm/msm_mmu.h
> > @@ -15,6 +15,7 @@ struct msm_mmu_funcs {
> >                       size_t len, int prot);
> >       int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
> >       void (*destroy)(struct msm_mmu *mmu);
> > +     void (*resume_translation)(struct msm_mmu *mmu);
> >  };
> >
> >  enum msm_mmu_type {
> > --
> > 2.31.1
> >

Thread overview: 32+ messages
2021-06-01 22:47 [PATCH v4 0/6] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
2021-06-01 22:47 ` [PATCH] drm/msm/dpu: Delete bonkers code Rob Clark
2021-06-01 22:47 ` [PATCH v4 1/6] iommu/arm-smmu: Add support for driver IOMMU fault handlers Rob Clark
2021-06-01 22:47 ` [PATCH v4 2/6] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info Rob Clark
2021-06-01 22:47 ` [PATCH v4 3/6] drm/msm: Improve the a6xx page fault handler Rob Clark
2021-06-01 22:47 ` [PATCH v4 4/6] iommu/arm-smmu-qcom: Add stall support Rob Clark
2021-06-01 22:47 ` [PATCH v4 5/6] drm/msm: Add crashdump support for stalled SMMU Rob Clark
2021-06-08 15:12   ` Jordan Crouse
2021-06-09 21:46     ` Rob Clark
2021-06-01 22:47 ` [PATCH v4 6/6] drm/msm: devcoredump iommu fault support Rob Clark
2021-06-08 15:20   ` Jordan Crouse
2021-06-09 21:50     ` Rob Clark
