linux-arm-msm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling
@ 2021-06-10 21:44 Rob Clark
  2021-06-10 21:44 ` [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers Rob Clark
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Rob Clark @ 2021-06-10 21:44 UTC (permalink / raw)
  To: dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Rob Clark,
	Akhil P Oommen, AngeloGioacchino Del Regno, Bjorn Andersson,
	Douglas Anderson, Eric Anholt, Isaac J. Manjarres, Joerg Roedel,
	John Stultz, Jonathan Marek, Konrad Dybcio, Krishna Reddy,
	Kristian H. Kristensen, Lee Jones,
	moderated list:ARM SMMU DRIVERS, open list, Marijn Suijten,
	Robin Murphy, Sai Prakash Ranjan, Sharat Masetty, Will Deacon,
	Zhenzhong Duan

From: Rob Clark <robdclark@chromium.org>

This picks up an earlier series[1] from Jordan, and adds additional
support needed to generate GPU devcore dumps on iova faults.  Original
description:

This is a stack to add an Adreno GPU specific handler for pagefaults. The first
patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
a adreno-smmu-priv function hook to capture a handful of important debugging
registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
third patch to print more detailed information on page fault such as the TTBR0
for the pagetable that caused the fault and the source of the fault as
determined by a combination of the FSYNR1 register and an internal GPU
register.

This code provides a solid base that we can expand on later for even more
extensive GPU side page fault debugging capabilities.

v5: [Rob] Use RBBM_STATUS3.SMMU_STALLED_ON_FAULT to detect case where
    GPU snapshotting needs to avoid crashdumper, and check the
    RBBM_STATUS3.SMMU_STALLED_ON_FAULT in GPU hang irq paths
v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
    resume translation after it has had a chance to snapshot the GPUs
    state
v3: Always clear FSR even if the target driver is going to handle resume
v2: Fix comment wording and function pointer check per Rob Clark

[1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcrouse@codeaurora.org/

Jordan Crouse (3):
  iommu/arm-smmu: Add support for driver IOMMU fault handlers
  iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
    info
  drm/msm: Improve the a6xx page fault handler

Rob Clark (2):
  iommu/arm-smmu-qcom: Add stall support
  drm/msm: devcoredump iommu fault support

 drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  23 +++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 110 +++++++++++++++++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  42 ++++++--
 drivers/gpu/drm/msm/adreno/adreno_gpu.c     |  15 +++
 drivers/gpu/drm/msm/msm_gem.h               |   1 +
 drivers/gpu/drm/msm/msm_gem_submit.c        |   1 +
 drivers/gpu/drm/msm/msm_gpu.c               |  48 +++++++++
 drivers/gpu/drm/msm/msm_gpu.h               |  17 +++
 drivers/gpu/drm/msm/msm_gpummu.c            |   5 +
 drivers/gpu/drm/msm/msm_iommu.c             |  22 +++-
 drivers/gpu/drm/msm/msm_mmu.h               |   5 +-
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 +++++++++
 drivers/iommu/arm/arm-smmu/arm-smmu.c       |   9 +-
 drivers/iommu/arm/arm-smmu/arm-smmu.h       |   2 +
 include/linux/adreno-smmu-priv.h            |  38 ++++++-
 15 files changed, 367 insertions(+), 21 deletions(-)

-- 
2.31.1


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers
  2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
@ 2021-06-10 21:44 ` Rob Clark
  2021-06-14 17:26   ` Bjorn Andersson
  2021-06-10 21:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info Rob Clark
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Rob Clark @ 2021-06-10 21:44 UTC (permalink / raw)
  To: dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Jordan Crouse,
	Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel,
	Krishna Reddy, Sai Prakash Ranjan,
	moderated list:ARM SMMU DRIVERS, open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Call report_iommu_fault() to allow upper-level drivers to register their
own fault handlers.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
Acked-by: Will Deacon <will@kernel.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++++++--
 1 file changed, 7 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
index 6f72c4d208ca..b4b32d31fc06 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
@@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	int idx = smmu_domain->cfg.cbndx;
+	int ret;
 
 	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
 	if (!(fsr & ARM_SMMU_FSR_FAULT))
@@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
 	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
 
-	dev_err_ratelimited(smmu->dev,
-	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
+	ret = report_iommu_fault(domain, NULL, iova,
+		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
+
+	if (ret == -ENOSYS)
+		dev_err_ratelimited(smmu->dev,
+		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
 			    fsr, iova, fsynr, cbfrsynra, idx);
 
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
  2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
  2021-06-10 21:44 ` [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers Rob Clark
@ 2021-06-10 21:44 ` Rob Clark
  2021-06-14 17:30   ` Bjorn Andersson
  2021-06-10 21:44 ` [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler Rob Clark
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: Rob Clark @ 2021-06-10 21:44 UTC (permalink / raw)
  To: dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Jordan Crouse,
	Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel,
	Bjorn Andersson, Isaac J. Manjarres, John Stultz, Krishna Reddy,
	Sai Prakash Ranjan, moderated list:ARM SMMU DRIVERS, open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Add a callback in adreno-smmu-priv to read interesting SMMU
registers to provide an opportunity for a richer debug experience
in the GPU driver.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 ++++++++++++
 drivers/iommu/arm/arm-smmu/arm-smmu.h      |  2 ++
 include/linux/adreno-smmu-priv.h           | 31 +++++++++++++++++++++-
 3 files changed, 49 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index 98b3a1c2a181..b2e31ea84128 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
+static void qcom_adreno_smmu_get_fault_info(const void *cookie,
+		struct adreno_smmu_fault_info *info)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+
+	info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
+	info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
+	info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
+	info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
+	info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
+	info->ttbr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
+	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->cookie = smmu_domain;
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
+	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
 
 	return 0;
 }
diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
index c31a59d35c64..84c21c4b0691 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
@@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
 #define ARM_SMMU_CB_FSYNR0		0x68
 #define ARM_SMMU_FSYNR0_WNR		BIT(4)
 
+#define ARM_SMMU_CB_FSYNR1		0x6c
+
 #define ARM_SMMU_CB_S1_TLBIVA		0x600
 #define ARM_SMMU_CB_S1_TLBIASID		0x610
 #define ARM_SMMU_CB_S1_TLBIVAL		0x620
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index a889f28afb42..53fe32fb9214 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -8,6 +8,32 @@
 
 #include <linux/io-pgtable.h>
 
+/**
+ * struct adreno_smmu_fault_info - container for key fault information
+ *
+ * @far: The faulting IOVA from ARM_SMMU_CB_FAR
+ * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
+ * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
+ * @fsr: The fault status from ARM_SMMU_CB_FSR
+ * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
+ * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR0
+ * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
+ *
+ * This struct passes back key page fault information to the GPU driver
+ * through the get_fault_info function pointer.
+ * The GPU driver can use this information to print informative
+ * log messages and provide deeper GPU specific insight into the fault.
+ */
+struct adreno_smmu_fault_info {
+	u64 far;
+	u64 ttbr0;
+	u32 contextidr;
+	u32 fsr;
+	u32 fsynr0;
+	u32 fsynr1;
+	u32 cbfrsynra;
+};
+
 /**
  * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
  *
@@ -17,6 +43,8 @@
  * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank.  A
  *                 NULL config disables TTBR0 translation, otherwise
  *                 TTBR0 translation is enabled with the specified cfg
+ * @get_fault_info: Called by the GPU fault handler to get information about
+ *                  the fault
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -31,6 +59,7 @@ struct adreno_smmu_priv {
     const void *cookie;
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
+    void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
 };
 
-#endif /* __ADRENO_SMMU_PRIV_H */
\ No newline at end of file
+#endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler
  2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
  2021-06-10 21:44 ` [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers Rob Clark
  2021-06-10 21:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info Rob Clark
@ 2021-06-10 21:44 ` Rob Clark
  2021-06-14 17:46   ` Bjorn Andersson
  2021-06-25  3:39   ` Bjorn Andersson
  2021-06-10 21:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support Rob Clark
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 19+ messages in thread
From: Rob Clark @ 2021-06-10 21:44 UTC (permalink / raw)
  To: dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Jordan Crouse,
	Rob Clark, Rob Clark, Sean Paul, David Airlie, Daniel Vetter,
	AngeloGioacchino Del Regno, Konrad Dybcio,
	Kristian H. Kristensen, Marijn Suijten, Jonathan Marek,
	Akhil P Oommen, Sai Prakash Ranjan, Eric Anholt, Sharat Masetty,
	Douglas Anderson, open list

From: Jordan Crouse <jcrouse@codeaurora.org>

Use the new adreno-smmu-priv fault info function to get more SMMU
debug registers and print the current TTBR0 to debug per-instance
pagetables and figure out which GPU block generated the request.

Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c |  4 +-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
 drivers/gpu/drm/msm/msm_iommu.c       | 11 +++-
 drivers/gpu/drm/msm/msm_mmu.h         |  4 +-
 4 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index f46562c12022..eb030b00bff4 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
 	return true;
 }
 
-static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
+static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
 {
 	struct msm_gpu *gpu = arg;
 	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
@@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
 			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));
 
-	return -EFAULT;
+	return 0;
 }
 
 static void a5xx_cp_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index c7f0ddb12d8f..fc19db10bff1 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1032,18 +1032,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
 	msm_gpu_hw_init(gpu);
 }
 
-static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
+static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
+{
+	static const char *uche_clients[7] = {
+		"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ",
+	};
+	u32 val;
+
+	if (mid < 1 || mid > 3)
+		return "UNKNOWN";
+
+	/*
+	 * The source of the data depends on the mid ID read from FSYNR1.
+	 * and the client ID read from the UCHE block
+	 */
+	val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
+
+	/* mid = 3 is most precise and refers to only one block per client */
+	if (mid == 3)
+		return uche_clients[val & 7];
+
+	/* For mid=2 the source is TP or VFD except when the client id is 0 */
+	if (mid == 2)
+		return ((val & 7) == 0) ? "TP" : "TP|VFD";
+
+	/* For mid=1 just return "UCHE" as a catchall for everything else */
+	return "UCHE";
+}
+
+static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
+{
+	if (id == 0)
+		return "CP";
+	else if (id == 4)
+		return "CCU";
+	else if (id == 6)
+		return "CDP Prefetch";
+
+	return a6xx_uche_fault_block(gpu, id);
+}
+
+#define ARM_SMMU_FSR_TF                 BIT(1)
+#define ARM_SMMU_FSR_PF			BIT(3)
+#define ARM_SMMU_FSR_EF			BIT(4)
+
+static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
 {
 	struct msm_gpu *gpu = arg;
+	struct adreno_smmu_fault_info *info = data;
+	const char *type = "UNKNOWN";
 
-	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
+	/*
+	 * Print a default message if we couldn't get the data from the
+	 * adreno-smmu-priv
+	 */
+	if (!info) {
+		pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
 			iova, flags,
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
 
-	return -EFAULT;
+		return 0;
+	}
+
+	if (info->fsr & ARM_SMMU_FSR_TF)
+		type = "TRANSLATION";
+	else if (info->fsr & ARM_SMMU_FSR_PF)
+		type = "PERMISSION";
+	else if (info->fsr & ARM_SMMU_FSR_EF)
+		type = "EXTERNAL";
+
+	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
+			info->ttbr0, iova,
+			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
+			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
+			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
+
+	return 0;
 }
 
 static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 50d881794758..6975b95c3c29 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
 		unsigned long iova, int flags, void *arg)
 {
 	struct msm_iommu *iommu = arg;
+	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
+	struct adreno_smmu_fault_info info, *ptr = NULL;
+
+	if (adreno_smmu->get_fault_info) {
+		adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
+		ptr = &info;
+	}
+
 	if (iommu->base.handler)
-		return iommu->base.handler(iommu->base.arg, iova, flags);
+		return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
+
 	pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
 	return 0;
 }
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index 61ade89d9e48..a88f44c3268d 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -26,7 +26,7 @@ enum msm_mmu_type {
 struct msm_mmu {
 	const struct msm_mmu_funcs *funcs;
 	struct device *dev;
-	int (*handler)(void *arg, unsigned long iova, int flags);
+	int (*handler)(void *arg, unsigned long iova, int flags, void *data);
 	void *arg;
 	enum msm_mmu_type type;
 };
@@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
 struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);
 
 static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
-		int (*handler)(void *arg, unsigned long iova, int flags))
+		int (*handler)(void *arg, unsigned long iova, int flags, void *data))
 {
 	mmu->arg = arg;
 	mmu->handler = handler;
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support
  2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
                   ` (2 preceding siblings ...)
  2021-06-10 21:44 ` [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler Rob Clark
@ 2021-06-10 21:44 ` Rob Clark
  2021-06-11 13:49   ` Jordan Crouse
  2021-06-14 17:54   ` Bjorn Andersson
  2021-06-10 21:44 ` [PATCH v5 5/5] drm/msm: devcoredump iommu fault support Rob Clark
  2021-07-04 12:53 ` [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Dmitry Baryshkov
  5 siblings, 2 replies; 19+ messages in thread
From: Rob Clark @ 2021-06-10 21:44 UTC (permalink / raw)
  To: dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Rob Clark, Will Deacon,
	Robin Murphy, Joerg Roedel, Bjorn Andersson, Isaac J. Manjarres,
	moderated list:ARM SMMU DRIVERS, open list

From: Rob Clark <robdclark@chromium.org>

Add, via the adreno-smmu-priv interface, a way for the GPU to request
the SMMU to stall translation on faults, and then later resume the
translation, either retrying or terminating the current translation.

This will be used on the GPU side to "freeze" the GPU while we snapshot
useful state for devcoredump.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
 include/linux/adreno-smmu-priv.h           |  7 +++++
 2 files changed, 40 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
index b2e31ea84128..61fc645c1325 100644
--- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
+++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
@@ -13,6 +13,7 @@ struct qcom_smmu {
 	struct arm_smmu_device smmu;
 	bool bypass_quirk;
 	u8 bypass_cbndx;
+	u32 stall_enabled;
 };
 
 static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
@@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
 static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
 		u32 reg)
 {
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
+
 	/*
 	 * On the GPU device we want to process subsequent transactions after a
 	 * fault to keep the GPU from hanging
 	 */
 	reg |= ARM_SMMU_SCTLR_HUPCF;
 
+	if (qsmmu->stall_enabled & BIT(idx))
+		reg |= ARM_SMMU_SCTLR_CFCFG;
+
 	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
 }
 
@@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
 	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
 }
 
+static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
+
+	if (enabled)
+		qsmmu->stall_enabled |= BIT(cfg->cbndx);
+	else
+		qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
+}
+
+static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
+{
+	struct arm_smmu_domain *smmu_domain = (void *)cookie;
+	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
+	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	u32 reg = 0;
+
+	if (terminate)
+		reg |= ARM_SMMU_RESUME_TERMINATE;
+
+	arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
+}
+
 #define QCOM_ADRENO_SMMU_GPU_SID 0
 
 static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
@@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
 	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
 	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
 	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
+	priv->set_stall = qcom_adreno_smmu_set_stall;
+	priv->resume_translation = qcom_adreno_smmu_resume_translation;
 
 	return 0;
 }
diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
index 53fe32fb9214..c637e0997f6d 100644
--- a/include/linux/adreno-smmu-priv.h
+++ b/include/linux/adreno-smmu-priv.h
@@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
  *                 TTBR0 translation is enabled with the specified cfg
  * @get_fault_info: Called by the GPU fault handler to get information about
  *                  the fault
+ * @set_stall:     Configure whether stall on fault (CFCFG) is enabled.  Call
+ *                 before set_ttbr0_cfg().  If stalling on fault is enabled,
+ *                 the GPU driver must call resume_translation()
+ * @resume_translation: Resume translation after a fault
+ *
  *
  * The GPU driver (drm/msm) and adreno-smmu work together for controlling
  * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
@@ -60,6 +65,8 @@ struct adreno_smmu_priv {
     const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
     int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
     void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
+    void (*set_stall)(const void *cookie, bool enabled);
+    void (*resume_translation)(const void *cookie, bool terminate);
 };
 
 #endif /* __ADRENO_SMMU_PRIV_H */
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v5 5/5] drm/msm: devcoredump iommu fault support
  2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
                   ` (3 preceding siblings ...)
  2021-06-10 21:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support Rob Clark
@ 2021-06-10 21:44 ` Rob Clark
  2021-06-11 13:49   ` Jordan Crouse
  2021-07-04 12:53 ` [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Dmitry Baryshkov
  5 siblings, 1 reply; 19+ messages in thread
From: Rob Clark @ 2021-06-10 21:44 UTC (permalink / raw)
  To: dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Rob Clark, Rob Clark,
	Sean Paul, David Airlie, Daniel Vetter,
	AngeloGioacchino Del Regno, Konrad Dybcio,
	Kristian H. Kristensen, Marijn Suijten, Jonathan Marek,
	Akhil P Oommen, Sai Prakash Ranjan, Eric Anholt, Sharat Masetty,
	Douglas Anderson, Lee Jones, Zhenzhong Duan, Dave Airlie,
	open list

From: Rob Clark <robdclark@chromium.org>

Wire up support to stall the SMMU on iova fault, and collect a devcore-
dump snapshot for easier debugging of faults.

Currently this is a6xx-only, but mostly only because so far it is the
only one using adreno-smmu-priv.

Signed-off-by: Rob Clark <robdclark@chromium.org>
---
 drivers/gpu/drm/msm/adreno/a5xx_gpu.c       | 19 +++++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 38 +++++++++++++++-
 drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 42 ++++++++++++++----
 drivers/gpu/drm/msm/adreno/adreno_gpu.c     | 15 +++++++
 drivers/gpu/drm/msm/msm_gem.h               |  1 +
 drivers/gpu/drm/msm/msm_gem_submit.c        |  1 +
 drivers/gpu/drm/msm/msm_gpu.c               | 48 +++++++++++++++++++++
 drivers/gpu/drm/msm/msm_gpu.h               | 17 ++++++++
 drivers/gpu/drm/msm/msm_gpummu.c            |  5 +++
 drivers/gpu/drm/msm/msm_iommu.c             | 11 +++++
 drivers/gpu/drm/msm/msm_mmu.h               |  1 +
 11 files changed, 186 insertions(+), 12 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
index eb030b00bff4..7a271de9a212 100644
--- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
@@ -1200,6 +1200,15 @@ static void a5xx_fault_detect_irq(struct msm_gpu *gpu)
 	struct drm_device *dev = gpu->dev;
 	struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
 
+	/*
+	 * If stalled on SMMU fault, we could trip the GPU's hang detection,
+	 * but the fault handler will trigger the devcore dump, and we want
+	 * to otherwise resume normally rather than killing the submit, so
+	 * just bail.
+	 */
+	if (gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24))
+		return;
+
 	DRM_DEV_ERROR(dev->dev, "gpu fault ring %d fence %x status %8.8X rb %4.4x/%4.4x ib1 %16.16llX/%4.4x ib2 %16.16llX/%4.4x\n",
 		ring ? ring->id : -1, ring ? ring->seqno : 0,
 		gpu_read(gpu, REG_A5XX_RBBM_STATUS),
@@ -1523,6 +1532,7 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
 {
 	struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
 			GFP_KERNEL);
+	bool stalled = !!(gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24));
 
 	if (!a5xx_state)
 		return ERR_PTR(-ENOMEM);
@@ -1535,8 +1545,13 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
 
 	a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
 
-	/* Get the HLSQ regs with the help of the crashdumper */
-	a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
+	/*
+	 * Get the HLSQ regs with the help of the crashdumper, but only if
+	 * we are not stalled in an iommu fault (in which case the crashdumper
+	 * would not have access to memory)
+	 */
+	if (!stalled)
+		a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
 
 	a5xx_set_hwcg(gpu, true);
 
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index fc19db10bff1..c3699408bd1f 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1081,6 +1081,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
 	struct msm_gpu *gpu = arg;
 	struct adreno_smmu_fault_info *info = data;
 	const char *type = "UNKNOWN";
+	const char *block;
+	bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
+
+	/*
+	 * If we aren't going to be resuming later from fault_worker, then do
+	 * it now.
+	 */
+	if (!do_devcoredump) {
+		gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+	}
 
 	/*
 	 * Print a default message if we couldn't get the data from the
@@ -1104,15 +1114,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
 	else if (info->fsr & ARM_SMMU_FSR_EF)
 		type = "EXTERNAL";
 
+	block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
+
 	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
 			info->ttbr0, iova,
-			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
-			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
+			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
+			type, block,
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
 			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
 
+	if (do_devcoredump) {
+		/* Turn off the hangcheck timer to keep it from bothering us */
+		del_timer(&gpu->hangcheck_timer);
+
+		gpu->fault_info.ttbr0 = info->ttbr0;
+		gpu->fault_info.iova  = iova;
+		gpu->fault_info.flags = flags;
+		gpu->fault_info.type  = type;
+		gpu->fault_info.block = block;
+
+		kthread_queue_work(gpu->worker, &gpu->fault_work);
+	}
+
 	return 0;
 }
 
@@ -1164,6 +1189,15 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
 	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
 	struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
 
+	/*
+	 * If stalled on SMMU fault, we could trip the GPU's hang detection,
+	 * but the fault handler will trigger the devcore dump, and we want
+	 * to otherwise resume normally rather than killing the submit, so
+	 * just bail.
+	 */
+	if (gpu_read(gpu, REG_A6XX_RBBM_STATUS3) & A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT)
+		return;
+
 	/*
 	 * Force the GPU to stay on until after we finish
 	 * collecting information
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
index 21c49c5b4519..ad4ea0ed5d99 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
@@ -832,6 +832,20 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
 		a6xx_get_ahb_gpu_registers(gpu,
 				a6xx_state, &a6xx_vbif_reglist,
 				&a6xx_state->registers[index++]);
+	if (!dumper) {
+		/*
+		 * We can't use the crashdumper when the SMMU is stalled,
+		 * because the GPU has no memory access until we resume
+		 * translation (but we don't want to do that until after
+		 * we have captured as much useful GPU state as possible).
+		 * So instead collect registers via the CPU:
+		 */
+		for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
+			a6xx_get_ahb_gpu_registers(gpu,
+				a6xx_state, &a6xx_reglist[i],
+				&a6xx_state->registers[index++]);
+		return;
+	}
 
 	for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
 		a6xx_get_crashdumper_registers(gpu,
@@ -905,11 +919,13 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
 
 struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
 {
-	struct a6xx_crashdumper dumper = { 0 };
+	struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
 	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
 	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
 	struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
 		GFP_KERNEL);
+	bool stalled = !!(gpu_read(gpu, REG_A6XX_RBBM_STATUS3) &
+			A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT);
 
 	if (!a6xx_state)
 		return ERR_PTR(-ENOMEM);
@@ -928,14 +944,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
 	/* Get the banks of indexed registers */
 	a6xx_get_indexed_registers(gpu, a6xx_state);
 
-	/* Try to initialize the crashdumper */
-	if (!a6xx_crashdumper_init(gpu, &dumper)) {
-		a6xx_get_registers(gpu, a6xx_state, &dumper);
-		a6xx_get_shaders(gpu, a6xx_state, &dumper);
-		a6xx_get_clusters(gpu, a6xx_state, &dumper);
-		a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
+	/*
+	 * Try to initialize the crashdumper, if we are not dumping state
+	 * with the SMMU stalled.  The crashdumper needs memory access to
+	 * write out GPU state, so we need to skip this when the SMMU is
+	 * stalled in response to an iova fault
+	 */
+	if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
+		dumper = &_dumper;
+	}
+
+	a6xx_get_registers(gpu, a6xx_state, dumper);
+
+	if (dumper) {
+		a6xx_get_shaders(gpu, a6xx_state, dumper);
+		a6xx_get_clusters(gpu, a6xx_state, dumper);
+		a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
 
-		msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
+		msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
 	}
 
 	if (snapshot_debugbus)
diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
index c1b02f790804..2bfe014995c7 100644
--- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
@@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
 			adreno_gpu->info->revn, adreno_gpu->rev.core,
 			adreno_gpu->rev.major, adreno_gpu->rev.minor,
 			adreno_gpu->rev.patchid);
+	/*
+	 * If this is state collected due to iova fault, so fault related info
+	 *
+	 * TTBR0 would not be zero, so this is a good way to distinguish
+	 */
+	if (state->fault_info.ttbr0) {
+		const struct msm_gpu_fault_info *info = &state->fault_info;
+
+		drm_puts(p, "fault-info:\n");
+		drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
+		drm_printf(p, "  - iova=%.16lx\n", info->iova);
+		drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
+		drm_printf(p, "  - type=%s\n", info->type);
+		drm_printf(p, "  - source=%s\n", info->block);
+	}
 
 	drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
 
diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
index 03e2cc2a2ce1..405f8411e395 100644
--- a/drivers/gpu/drm/msm/msm_gem.h
+++ b/drivers/gpu/drm/msm/msm_gem.h
@@ -328,6 +328,7 @@ struct msm_gem_submit {
 	struct dma_fence *fence;
 	struct msm_gpu_submitqueue *queue;
 	struct pid *pid;    /* submitting process */
+	bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
 	bool valid;         /* true if no cmdstream patching needed */
 	bool in_rb;         /* "sudo" mode, copy cmds into RB */
 	struct msm_ringbuffer *ring;
diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
index 5480852bdeda..44f84bfd0c0e 100644
--- a/drivers/gpu/drm/msm/msm_gem_submit.c
+++ b/drivers/gpu/drm/msm/msm_gem_submit.c
@@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
 	submit->cmd = (void *)&submit->bos[nr_bos];
 	submit->queue = queue;
 	submit->ring = gpu->rb[queue->prio];
+	submit->fault_dumped = false;
 
 	/* initially, until copy_from_user() and bo lookup succeeds: */
 	submit->nr_bos = 0;
diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
index fa7691cb4614..414ba2dd34e5 100644
--- a/drivers/gpu/drm/msm/msm_gpu.c
+++ b/drivers/gpu/drm/msm/msm_gpu.c
@@ -400,6 +400,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
 	/* Fill in the additional crash state information */
 	state->comm = kstrdup(comm, GFP_KERNEL);
 	state->cmd = kstrdup(cmd, GFP_KERNEL);
+	state->fault_info = gpu->fault_info;
 
 	if (submit) {
 		int i, nr = 0;
@@ -572,6 +573,52 @@ static void recover_worker(struct kthread_work *work)
 	msm_gpu_retire(gpu);
 }
 
+static void fault_worker(struct kthread_work *work)
+{
+	struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
+	struct drm_device *dev = gpu->dev;
+	struct msm_gem_submit *submit;
+	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
+	char *comm = NULL, *cmd = NULL;
+
+	mutex_lock(&dev->struct_mutex);
+
+	submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
+	if (submit && submit->fault_dumped)
+		goto resume_smmu;
+
+	if (submit) {
+		struct task_struct *task;
+
+		task = get_pid_task(submit->pid, PIDTYPE_PID);
+		if (task) {
+			comm = kstrdup(task->comm, GFP_KERNEL);
+			cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
+			put_task_struct(task);
+		}
+
+		/*
+		 * When we get GPU iova faults, we can get 1000s of them,
+		 * but we really only want to log the first one.
+		 */
+		submit->fault_dumped = true;
+	}
+
+	/* Record the crash state */
+	pm_runtime_get_sync(&gpu->pdev->dev);
+	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
+	pm_runtime_put_sync(&gpu->pdev->dev);
+
+	kfree(cmd);
+	kfree(comm);
+
+resume_smmu:
+	memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
+	gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
+
+	mutex_unlock(&dev->struct_mutex);
+}
+
 static void hangcheck_timer_reset(struct msm_gpu *gpu)
 {
 	mod_timer(&gpu->hangcheck_timer,
@@ -948,6 +995,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
 	INIT_LIST_HEAD(&gpu->active_list);
 	kthread_init_work(&gpu->retire_work, retire_worker);
 	kthread_init_work(&gpu->recover_work, recover_worker);
+	kthread_init_work(&gpu->fault_work, fault_worker);
 
 	timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
 
diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
index 7a082a12d98f..8eefb3aeca10 100644
--- a/drivers/gpu/drm/msm/msm_gpu.h
+++ b/drivers/gpu/drm/msm/msm_gpu.h
@@ -71,6 +71,15 @@ struct msm_gpu_funcs {
 	uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
 };
 
+/* Additional state for iommu faults: */
+struct msm_gpu_fault_info {
+	u64 ttbr0;
+	unsigned long iova;
+	int flags;
+	const char *type;
+	const char *block;
+};
+
 struct msm_gpu {
 	const char *name;
 	struct drm_device *dev;
@@ -135,6 +144,12 @@ struct msm_gpu {
 #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
 	struct timer_list hangcheck_timer;
 
+	/* Fault info for most recent iova fault: */
+	struct msm_gpu_fault_info fault_info;
+
+	/* work for handling GPU ioval faults: */
+	struct kthread_work fault_work;
+
 	/* work for handling GPU recovery: */
 	struct kthread_work recover_work;
 
@@ -243,6 +258,8 @@ struct msm_gpu_state {
 	char *comm;
 	char *cmd;
 
+	struct msm_gpu_fault_info fault_info;
+
 	int nr_bos;
 	struct msm_gpu_state_bo *bos;
 };
diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
index 379496186c7f..f7d1945e0c9f 100644
--- a/drivers/gpu/drm/msm/msm_gpummu.c
+++ b/drivers/gpu/drm/msm/msm_gpummu.c
@@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
 	return 0;
 }
 
+static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
+{
+}
+
 static void msm_gpummu_destroy(struct msm_mmu *mmu)
 {
 	struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
@@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
 		.map = msm_gpummu_map,
 		.unmap = msm_gpummu_unmap,
 		.destroy = msm_gpummu_destroy,
+		.resume_translation = msm_gpummu_resume_translation,
 };
 
 struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index 6975b95c3c29..eed2a762e9dd 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
 	 * the arm-smmu driver as a trigger to set up TTBR0
 	 */
 	if (atomic_inc_return(&iommu->pagetables) == 1) {
+		/* Enable stall on iommu fault: */
+		adreno_smmu->set_stall(adreno_smmu->cookie, true);
+
 		ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
 		if (ret) {
 			free_io_pgtable_ops(pagetable->pgtbl_ops);
@@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
 	return 0;
 }
 
+static void msm_iommu_resume_translation(struct msm_mmu *mmu)
+{
+	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
+
+	adreno_smmu->resume_translation(adreno_smmu->cookie, true);
+}
+
 static void msm_iommu_detach(struct msm_mmu *mmu)
 {
 	struct msm_iommu *iommu = to_msm_iommu(mmu);
@@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
 		.map = msm_iommu_map,
 		.unmap = msm_iommu_unmap,
 		.destroy = msm_iommu_destroy,
+		.resume_translation = msm_iommu_resume_translation,
 };
 
 struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
index a88f44c3268d..de158e1bf765 100644
--- a/drivers/gpu/drm/msm/msm_mmu.h
+++ b/drivers/gpu/drm/msm/msm_mmu.h
@@ -15,6 +15,7 @@ struct msm_mmu_funcs {
 			size_t len, int prot);
 	int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
 	void (*destroy)(struct msm_mmu *mmu);
+	void (*resume_translation)(struct msm_mmu *mmu);
 };
 
 enum msm_mmu_type {
-- 
2.31.1


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support
  2021-06-10 21:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support Rob Clark
@ 2021-06-11 13:49   ` Jordan Crouse
  2021-06-14 17:54   ` Bjorn Andersson
  1 sibling, 0 replies; 19+ messages in thread
From: Jordan Crouse @ 2021-06-11 13:49 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, freedreno, linux-arm-msm, Rob Clark,
	Will Deacon, Robin Murphy, Joerg Roedel, Bjorn Andersson,
	Isaac J. Manjarres, moderated list:ARM SMMU DRIVERS, open list

On Thu, Jun 10, 2021 at 02:44:12PM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Add, via the adreno-smmu-priv interface, a way for the GPU to request
> the SMMU to stall translation on faults, and then later resume the
> translation, either retrying or terminating the current translation.
> 
> This will be used on the GPU side to "freeze" the GPU while we snapshot
> useful state for devcoredump.
> 

Acked-by: Jordan Crouse <jordan@cosmicpenguin.net>

> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
>  include/linux/adreno-smmu-priv.h           |  7 +++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index b2e31ea84128..61fc645c1325 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -13,6 +13,7 @@ struct qcom_smmu {
>  	struct arm_smmu_device smmu;
>  	bool bypass_quirk;
>  	u8 bypass_cbndx;
> +	u32 stall_enabled;
>  };
>  
>  static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
> @@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
>  static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
>  		u32 reg)
>  {
> +	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
> +
>  	/*
>  	 * On the GPU device we want to process subsequent transactions after a
>  	 * fault to keep the GPU from hanging
>  	 */
>  	reg |= ARM_SMMU_SCTLR_HUPCF;
>  
> +	if (qsmmu->stall_enabled & BIT(idx))
> +		reg |= ARM_SMMU_SCTLR_CFCFG;
> +
>  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
>  }
>  
> @@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
>  	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
>  }
>  
> +static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
> +{
> +	struct arm_smmu_domain *smmu_domain = (void *)cookie;
> +	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> +	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
> +
> +	if (enabled)
> +		qsmmu->stall_enabled |= BIT(cfg->cbndx);
> +	else
> +		qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
> +}
> +
> +static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
> +{
> +	struct arm_smmu_domain *smmu_domain = (void *)cookie;
> +	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	u32 reg = 0;
> +
> +	if (terminate)
> +		reg |= ARM_SMMU_RESUME_TERMINATE;
> +
> +	arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> +}
> +
>  #define QCOM_ADRENO_SMMU_GPU_SID 0
>  
>  static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
> @@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
>  	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
>  	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
>  	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> +	priv->set_stall = qcom_adreno_smmu_set_stall;
> +	priv->resume_translation = qcom_adreno_smmu_resume_translation;
>  
>  	return 0;
>  }
> diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
> index 53fe32fb9214..c637e0997f6d 100644
> --- a/include/linux/adreno-smmu-priv.h
> +++ b/include/linux/adreno-smmu-priv.h
> @@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
>   *                 TTBR0 translation is enabled with the specified cfg
>   * @get_fault_info: Called by the GPU fault handler to get information about
>   *                  the fault
> + * @set_stall:     Configure whether stall on fault (CFCFG) is enabled.  Call
> + *                 before set_ttbr0_cfg().  If stalling on fault is enabled,
> + *                 the GPU driver must call resume_translation()
> + * @resume_translation: Resume translation after a fault
> + *
>   *
>   * The GPU driver (drm/msm) and adreno-smmu work together for controlling
>   * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
> @@ -60,6 +65,8 @@ struct adreno_smmu_priv {
>      const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
>      int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
>      void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
> +    void (*set_stall)(const void *cookie, bool enabled);
> +    void (*resume_translation)(const void *cookie, bool terminate);
>  };
>  
>  #endif /* __ADRENO_SMMU_PRIV_H */
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 5/5] drm/msm: devcoredump iommu fault support
  2021-06-10 21:44 ` [PATCH v5 5/5] drm/msm: devcoredump iommu fault support Rob Clark
@ 2021-06-11 13:49   ` Jordan Crouse
  0 siblings, 0 replies; 19+ messages in thread
From: Jordan Crouse @ 2021-06-11 13:49 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, freedreno, linux-arm-msm, Rob Clark, Sean Paul,
	David Airlie, Daniel Vetter, AngeloGioacchino Del Regno,
	Konrad Dybcio, Kristian H. Kristensen, Marijn Suijten,
	Jonathan Marek, Akhil P Oommen, Sai Prakash Ranjan, Eric Anholt,
	Sharat Masetty, Douglas Anderson, Lee Jones, Zhenzhong Duan,
	Dave Airlie, open list

On Thu, Jun 10, 2021 at 02:44:13PM -0700, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> Wire up support to stall the SMMU on iova fault, and collect a devcore-
> dump snapshot for easier debugging of faults.
> 
> Currently this is a6xx-only, but mostly only because so far it is the
> only one using adreno-smmu-priv.

Acked-by: Jordan Crouse <jordan@cosmicpenguin.net>
 
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c       | 19 +++++++-
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 38 +++++++++++++++-
>  drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | 42 ++++++++++++++----
>  drivers/gpu/drm/msm/adreno/adreno_gpu.c     | 15 +++++++
>  drivers/gpu/drm/msm/msm_gem.h               |  1 +
>  drivers/gpu/drm/msm/msm_gem_submit.c        |  1 +
>  drivers/gpu/drm/msm/msm_gpu.c               | 48 +++++++++++++++++++++
>  drivers/gpu/drm/msm/msm_gpu.h               | 17 ++++++++
>  drivers/gpu/drm/msm/msm_gpummu.c            |  5 +++
>  drivers/gpu/drm/msm/msm_iommu.c             | 11 +++++
>  drivers/gpu/drm/msm/msm_mmu.h               |  1 +
>  11 files changed, 186 insertions(+), 12 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index eb030b00bff4..7a271de9a212 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1200,6 +1200,15 @@ static void a5xx_fault_detect_irq(struct msm_gpu *gpu)
>  	struct drm_device *dev = gpu->dev;
>  	struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
>  
> +	/*
> +	 * If stalled on SMMU fault, we could trip the GPU's hang detection,
> +	 * but the fault handler will trigger the devcore dump, and we want
> +	 * to otherwise resume normally rather than killing the submit, so
> +	 * just bail.
> +	 */
> +	if (gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24))
> +		return;
> +
>  	DRM_DEV_ERROR(dev->dev, "gpu fault ring %d fence %x status %8.8X rb %4.4x/%4.4x ib1 %16.16llX/%4.4x ib2 %16.16llX/%4.4x\n",
>  		ring ? ring->id : -1, ring ? ring->seqno : 0,
>  		gpu_read(gpu, REG_A5XX_RBBM_STATUS),
> @@ -1523,6 +1532,7 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
>  {
>  	struct a5xx_gpu_state *a5xx_state = kzalloc(sizeof(*a5xx_state),
>  			GFP_KERNEL);
> +	bool stalled = !!(gpu_read(gpu, REG_A5XX_RBBM_STATUS3) & BIT(24));
>  
>  	if (!a5xx_state)
>  		return ERR_PTR(-ENOMEM);
> @@ -1535,8 +1545,13 @@ static struct msm_gpu_state *a5xx_gpu_state_get(struct msm_gpu *gpu)
>  
>  	a5xx_state->base.rbbm_status = gpu_read(gpu, REG_A5XX_RBBM_STATUS);
>  
> -	/* Get the HLSQ regs with the help of the crashdumper */
> -	a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
> +	/*
> +	 * Get the HLSQ regs with the help of the crashdumper, but only if
> +	 * we are not stalled in an iommu fault (in which case the crashdumper
> +	 * would not have access to memory)
> +	 */
> +	if (!stalled)
> +		a5xx_gpu_state_get_hlsq_regs(gpu, a5xx_state);
>  
>  	a5xx_set_hwcg(gpu, true);
>  
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index fc19db10bff1..c3699408bd1f 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1081,6 +1081,16 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
>  	struct msm_gpu *gpu = arg;
>  	struct adreno_smmu_fault_info *info = data;
>  	const char *type = "UNKNOWN";
> +	const char *block;
> +	bool do_devcoredump = info && !READ_ONCE(gpu->crashstate);
> +
> +	/*
> +	 * If we aren't going to be resuming later from fault_worker, then do
> +	 * it now.
> +	 */
> +	if (!do_devcoredump) {
> +		gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> +	}
>  
>  	/*
>  	 * Print a default message if we couldn't get the data from the
> @@ -1104,15 +1114,30 @@ static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *da
>  	else if (info->fsr & ARM_SMMU_FSR_EF)
>  		type = "EXTERNAL";
>  
> +	block = a6xx_fault_block(gpu, info->fsynr1 & 0xff);
> +
>  	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
>  			info->ttbr0, iova,
> -			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> -			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> +			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ",
> +			type, block,
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
>  
> +	if (do_devcoredump) {
> +		/* Turn off the hangcheck timer to keep it from bothering us */
> +		del_timer(&gpu->hangcheck_timer);
> +
> +		gpu->fault_info.ttbr0 = info->ttbr0;
> +		gpu->fault_info.iova  = iova;
> +		gpu->fault_info.flags = flags;
> +		gpu->fault_info.type  = type;
> +		gpu->fault_info.block = block;
> +
> +		kthread_queue_work(gpu->worker, &gpu->fault_work);
> +	}
> +
>  	return 0;
>  }
>  
> @@ -1164,6 +1189,15 @@ static void a6xx_fault_detect_irq(struct msm_gpu *gpu)
>  	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
>  	struct msm_ringbuffer *ring = gpu->funcs->active_ring(gpu);
>  
> +	/*
> +	 * If stalled on SMMU fault, we could trip the GPU's hang detection,
> +	 * but the fault handler will trigger the devcore dump, and we want
> +	 * to otherwise resume normally rather than killing the submit, so
> +	 * just bail.
> +	 */
> +	if (gpu_read(gpu, REG_A6XX_RBBM_STATUS3) & A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT)
> +		return;
> +
>  	/*
>  	 * Force the GPU to stay on until after we finish
>  	 * collecting information
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> index 21c49c5b4519..ad4ea0ed5d99 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c
> @@ -832,6 +832,20 @@ static void a6xx_get_registers(struct msm_gpu *gpu,
>  		a6xx_get_ahb_gpu_registers(gpu,
>  				a6xx_state, &a6xx_vbif_reglist,
>  				&a6xx_state->registers[index++]);
> +	if (!dumper) {
> +		/*
> +		 * We can't use the crashdumper when the SMMU is stalled,
> +		 * because the GPU has no memory access until we resume
> +		 * translation (but we don't want to do that until after
> +		 * we have captured as much useful GPU state as possible).
> +		 * So instead collect registers via the CPU:
> +		 */
> +		for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
> +			a6xx_get_ahb_gpu_registers(gpu,
> +				a6xx_state, &a6xx_reglist[i],
> +				&a6xx_state->registers[index++]);
> +		return;
> +	}
>  
>  	for (i = 0; i < ARRAY_SIZE(a6xx_reglist); i++)
>  		a6xx_get_crashdumper_registers(gpu,
> @@ -905,11 +919,13 @@ static void a6xx_get_indexed_registers(struct msm_gpu *gpu,
>  
>  struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
>  {
> -	struct a6xx_crashdumper dumper = { 0 };
> +	struct a6xx_crashdumper _dumper = { 0 }, *dumper = NULL;
>  	struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu);
>  	struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu);
>  	struct a6xx_gpu_state *a6xx_state = kzalloc(sizeof(*a6xx_state),
>  		GFP_KERNEL);
> +	bool stalled = !!(gpu_read(gpu, REG_A6XX_RBBM_STATUS3) &
> +			A6XX_RBBM_STATUS3_SMMU_STALLED_ON_FAULT);
>  
>  	if (!a6xx_state)
>  		return ERR_PTR(-ENOMEM);
> @@ -928,14 +944,24 @@ struct msm_gpu_state *a6xx_gpu_state_get(struct msm_gpu *gpu)
>  	/* Get the banks of indexed registers */
>  	a6xx_get_indexed_registers(gpu, a6xx_state);
>  
> -	/* Try to initialize the crashdumper */
> -	if (!a6xx_crashdumper_init(gpu, &dumper)) {
> -		a6xx_get_registers(gpu, a6xx_state, &dumper);
> -		a6xx_get_shaders(gpu, a6xx_state, &dumper);
> -		a6xx_get_clusters(gpu, a6xx_state, &dumper);
> -		a6xx_get_dbgahb_clusters(gpu, a6xx_state, &dumper);
> +	/*
> +	 * Try to initialize the crashdumper, if we are not dumping state
> +	 * with the SMMU stalled.  The crashdumper needs memory access to
> +	 * write out GPU state, so we need to skip this when the SMMU is
> +	 * stalled in response to an iova fault
> +	 */
> +	if (!stalled && !a6xx_crashdumper_init(gpu, &_dumper)) {
> +		dumper = &_dumper;
> +	}
> +
> +	a6xx_get_registers(gpu, a6xx_state, dumper);
> +
> +	if (dumper) {
> +		a6xx_get_shaders(gpu, a6xx_state, dumper);
> +		a6xx_get_clusters(gpu, a6xx_state, dumper);
> +		a6xx_get_dbgahb_clusters(gpu, a6xx_state, dumper);
>  
> -		msm_gem_kernel_put(dumper.bo, gpu->aspace, true);
> +		msm_gem_kernel_put(dumper->bo, gpu->aspace, true);
>  	}
>  
>  	if (snapshot_debugbus)
> diff --git a/drivers/gpu/drm/msm/adreno/adreno_gpu.c b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> index c1b02f790804..2bfe014995c7 100644
> --- a/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/adreno_gpu.c
> @@ -684,6 +684,21 @@ void adreno_show(struct msm_gpu *gpu, struct msm_gpu_state *state,
>  			adreno_gpu->info->revn, adreno_gpu->rev.core,
>  			adreno_gpu->rev.major, adreno_gpu->rev.minor,
>  			adreno_gpu->rev.patchid);
> +	/*
> +	 * If this is state collected due to iova fault, so fault related info
> +	 *
> +	 * TTBR0 would not be zero, so this is a good way to distinguish
> +	 */
> +	if (state->fault_info.ttbr0) {
> +		const struct msm_gpu_fault_info *info = &state->fault_info;
> +
> +		drm_puts(p, "fault-info:\n");
> +		drm_printf(p, "  - ttbr0=%.16llx\n", info->ttbr0);
> +		drm_printf(p, "  - iova=%.16lx\n", info->iova);
> +		drm_printf(p, "  - dir=%s\n", info->flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ");
> +		drm_printf(p, "  - type=%s\n", info->type);
> +		drm_printf(p, "  - source=%s\n", info->block);
> +	}
>  
>  	drm_printf(p, "rbbm-status: 0x%08x\n", state->rbbm_status);
>  
> diff --git a/drivers/gpu/drm/msm/msm_gem.h b/drivers/gpu/drm/msm/msm_gem.h
> index 03e2cc2a2ce1..405f8411e395 100644
> --- a/drivers/gpu/drm/msm/msm_gem.h
> +++ b/drivers/gpu/drm/msm/msm_gem.h
> @@ -328,6 +328,7 @@ struct msm_gem_submit {
>  	struct dma_fence *fence;
>  	struct msm_gpu_submitqueue *queue;
>  	struct pid *pid;    /* submitting process */
> +	bool fault_dumped;  /* Limit devcoredump dumping to one per submit */
>  	bool valid;         /* true if no cmdstream patching needed */
>  	bool in_rb;         /* "sudo" mode, copy cmds into RB */
>  	struct msm_ringbuffer *ring;
> diff --git a/drivers/gpu/drm/msm/msm_gem_submit.c b/drivers/gpu/drm/msm/msm_gem_submit.c
> index 5480852bdeda..44f84bfd0c0e 100644
> --- a/drivers/gpu/drm/msm/msm_gem_submit.c
> +++ b/drivers/gpu/drm/msm/msm_gem_submit.c
> @@ -50,6 +50,7 @@ static struct msm_gem_submit *submit_create(struct drm_device *dev,
>  	submit->cmd = (void *)&submit->bos[nr_bos];
>  	submit->queue = queue;
>  	submit->ring = gpu->rb[queue->prio];
> +	submit->fault_dumped = false;
>  
>  	/* initially, until copy_from_user() and bo lookup succeeds: */
>  	submit->nr_bos = 0;
> diff --git a/drivers/gpu/drm/msm/msm_gpu.c b/drivers/gpu/drm/msm/msm_gpu.c
> index fa7691cb4614..414ba2dd34e5 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.c
> +++ b/drivers/gpu/drm/msm/msm_gpu.c
> @@ -400,6 +400,7 @@ static void msm_gpu_crashstate_capture(struct msm_gpu *gpu,
>  	/* Fill in the additional crash state information */
>  	state->comm = kstrdup(comm, GFP_KERNEL);
>  	state->cmd = kstrdup(cmd, GFP_KERNEL);
> +	state->fault_info = gpu->fault_info;
>  
>  	if (submit) {
>  		int i, nr = 0;
> @@ -572,6 +573,52 @@ static void recover_worker(struct kthread_work *work)
>  	msm_gpu_retire(gpu);
>  }
>  
> +static void fault_worker(struct kthread_work *work)
> +{
> +	struct msm_gpu *gpu = container_of(work, struct msm_gpu, fault_work);
> +	struct drm_device *dev = gpu->dev;
> +	struct msm_gem_submit *submit;
> +	struct msm_ringbuffer *cur_ring = gpu->funcs->active_ring(gpu);
> +	char *comm = NULL, *cmd = NULL;
> +
> +	mutex_lock(&dev->struct_mutex);
> +
> +	submit = find_submit(cur_ring, cur_ring->memptrs->fence + 1);
> +	if (submit && submit->fault_dumped)
> +		goto resume_smmu;
> +
> +	if (submit) {
> +		struct task_struct *task;
> +
> +		task = get_pid_task(submit->pid, PIDTYPE_PID);
> +		if (task) {
> +			comm = kstrdup(task->comm, GFP_KERNEL);
> +			cmd = kstrdup_quotable_cmdline(task, GFP_KERNEL);
> +			put_task_struct(task);
> +		}
> +
> +		/*
> +		 * When we get GPU iova faults, we can get 1000s of them,
> +		 * but we really only want to log the first one.
> +		 */
> +		submit->fault_dumped = true;
> +	}
> +
> +	/* Record the crash state */
> +	pm_runtime_get_sync(&gpu->pdev->dev);
> +	msm_gpu_crashstate_capture(gpu, submit, comm, cmd);
> +	pm_runtime_put_sync(&gpu->pdev->dev);
> +
> +	kfree(cmd);
> +	kfree(comm);
> +
> +resume_smmu:
> +	memset(&gpu->fault_info, 0, sizeof(gpu->fault_info));
> +	gpu->aspace->mmu->funcs->resume_translation(gpu->aspace->mmu);
> +
> +	mutex_unlock(&dev->struct_mutex);
> +}
> +
>  static void hangcheck_timer_reset(struct msm_gpu *gpu)
>  {
>  	mod_timer(&gpu->hangcheck_timer,
> @@ -948,6 +995,7 @@ int msm_gpu_init(struct drm_device *drm, struct platform_device *pdev,
>  	INIT_LIST_HEAD(&gpu->active_list);
>  	kthread_init_work(&gpu->retire_work, retire_worker);
>  	kthread_init_work(&gpu->recover_work, recover_worker);
> +	kthread_init_work(&gpu->fault_work, fault_worker);
>  
>  	timer_setup(&gpu->hangcheck_timer, hangcheck_handler, 0);
>  
> diff --git a/drivers/gpu/drm/msm/msm_gpu.h b/drivers/gpu/drm/msm/msm_gpu.h
> index 7a082a12d98f..8eefb3aeca10 100644
> --- a/drivers/gpu/drm/msm/msm_gpu.h
> +++ b/drivers/gpu/drm/msm/msm_gpu.h
> @@ -71,6 +71,15 @@ struct msm_gpu_funcs {
>  	uint32_t (*get_rptr)(struct msm_gpu *gpu, struct msm_ringbuffer *ring);
>  };
>  
> +/* Additional state for iommu faults: */
> +struct msm_gpu_fault_info {
> +	u64 ttbr0;
> +	unsigned long iova;
> +	int flags;
> +	const char *type;
> +	const char *block;
> +};
> +
>  struct msm_gpu {
>  	const char *name;
>  	struct drm_device *dev;
> @@ -135,6 +144,12 @@ struct msm_gpu {
>  #define DRM_MSM_HANGCHECK_JIFFIES msecs_to_jiffies(DRM_MSM_HANGCHECK_PERIOD)
>  	struct timer_list hangcheck_timer;
>  
> +	/* Fault info for most recent iova fault: */
> +	struct msm_gpu_fault_info fault_info;
> +
> +	/* work for handling GPU ioval faults: */
> +	struct kthread_work fault_work;
> +
>  	/* work for handling GPU recovery: */
>  	struct kthread_work recover_work;
>  
> @@ -243,6 +258,8 @@ struct msm_gpu_state {
>  	char *comm;
>  	char *cmd;
>  
> +	struct msm_gpu_fault_info fault_info;
> +
>  	int nr_bos;
>  	struct msm_gpu_state_bo *bos;
>  };
> diff --git a/drivers/gpu/drm/msm/msm_gpummu.c b/drivers/gpu/drm/msm/msm_gpummu.c
> index 379496186c7f..f7d1945e0c9f 100644
> --- a/drivers/gpu/drm/msm/msm_gpummu.c
> +++ b/drivers/gpu/drm/msm/msm_gpummu.c
> @@ -68,6 +68,10 @@ static int msm_gpummu_unmap(struct msm_mmu *mmu, uint64_t iova, size_t len)
>  	return 0;
>  }
>  
> +static void msm_gpummu_resume_translation(struct msm_mmu *mmu)
> +{
> +}
> +
>  static void msm_gpummu_destroy(struct msm_mmu *mmu)
>  {
>  	struct msm_gpummu *gpummu = to_msm_gpummu(mmu);
> @@ -83,6 +87,7 @@ static const struct msm_mmu_funcs funcs = {
>  		.map = msm_gpummu_map,
>  		.unmap = msm_gpummu_unmap,
>  		.destroy = msm_gpummu_destroy,
> +		.resume_translation = msm_gpummu_resume_translation,
>  };
>  
>  struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 6975b95c3c29..eed2a762e9dd 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -184,6 +184,9 @@ struct msm_mmu *msm_iommu_pagetable_create(struct msm_mmu *parent)
>  	 * the arm-smmu driver as a trigger to set up TTBR0
>  	 */
>  	if (atomic_inc_return(&iommu->pagetables) == 1) {
> +		/* Enable stall on iommu fault: */
> +		adreno_smmu->set_stall(adreno_smmu->cookie, true);
> +
>  		ret = adreno_smmu->set_ttbr0_cfg(adreno_smmu->cookie, &ttbr0_cfg);
>  		if (ret) {
>  			free_io_pgtable_ops(pagetable->pgtbl_ops);
> @@ -226,6 +229,13 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
>  	return 0;
>  }
>  
> +static void msm_iommu_resume_translation(struct msm_mmu *mmu)
> +{
> +	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(mmu->dev);
> +
> +	adreno_smmu->resume_translation(adreno_smmu->cookie, true);
> +}
> +
>  static void msm_iommu_detach(struct msm_mmu *mmu)
>  {
>  	struct msm_iommu *iommu = to_msm_iommu(mmu);
> @@ -273,6 +283,7 @@ static const struct msm_mmu_funcs funcs = {
>  		.map = msm_iommu_map,
>  		.unmap = msm_iommu_unmap,
>  		.destroy = msm_iommu_destroy,
> +		.resume_translation = msm_iommu_resume_translation,
>  };
>  
>  struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain)
> diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> index a88f44c3268d..de158e1bf765 100644
> --- a/drivers/gpu/drm/msm/msm_mmu.h
> +++ b/drivers/gpu/drm/msm/msm_mmu.h
> @@ -15,6 +15,7 @@ struct msm_mmu_funcs {
>  			size_t len, int prot);
>  	int (*unmap)(struct msm_mmu *mmu, uint64_t iova, size_t len);
>  	void (*destroy)(struct msm_mmu *mmu);
> +	void (*resume_translation)(struct msm_mmu *mmu);
>  };
>  
>  enum msm_mmu_type {
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers
  2021-06-10 21:44 ` [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers Rob Clark
@ 2021-06-14 17:26   ` Bjorn Andersson
  0 siblings, 0 replies; 19+ messages in thread
From: Bjorn Andersson @ 2021-06-14 17:26 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, Rob Clark, open list, Will Deacon,
	linux-arm-msm, Robin Murphy, Jordan Crouse, freedreno,
	moderated list:ARM SMMU DRIVERS

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <jcrouse@codeaurora.org>
> 
> Call report_iommu_fault() to allow upper-level drivers to register their
> own fault handlers.
> 
> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> Acked-by: Will Deacon <will@kernel.org>

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu.c | 9 +++++++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> index 6f72c4d208ca..b4b32d31fc06 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c
> @@ -408,6 +408,7 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>  	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>  	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	int idx = smmu_domain->cfg.cbndx;
> +	int ret;
>  
>  	fsr = arm_smmu_cb_read(smmu, idx, ARM_SMMU_CB_FSR);
>  	if (!(fsr & ARM_SMMU_FSR_FAULT))
> @@ -417,8 +418,12 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>  	iova = arm_smmu_cb_readq(smmu, idx, ARM_SMMU_CB_FAR);
>  	cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(idx));
>  
> -	dev_err_ratelimited(smmu->dev,
> -	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
> +	ret = report_iommu_fault(domain, NULL, iova,
> +		fsynr & ARM_SMMU_FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ);
> +
> +	if (ret == -ENOSYS)
> +		dev_err_ratelimited(smmu->dev,
> +		"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cbfrsynra=0x%x, cb=%d\n",
>  			    fsr, iova, fsynr, cbfrsynra, idx);
>  
>  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
> -- 
> 2.31.1
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info
  2021-06-10 21:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info Rob Clark
@ 2021-06-14 17:30   ` Bjorn Andersson
  0 siblings, 0 replies; 19+ messages in thread
From: Bjorn Andersson @ 2021-06-14 17:30 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, freedreno, linux-arm-msm, Jordan Crouse,
	Jordan Crouse, Rob Clark, Will Deacon, Robin Murphy,
	Joerg Roedel, Isaac J. Manjarres, John Stultz, Krishna Reddy,
	Sai Prakash Ranjan, moderated list:ARM SMMU DRIVERS, open list

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <jcrouse@codeaurora.org>
> 
> Add a callback in adreno-smmu-priv to read interesting SMMU
> registers to provide an opportunity for a richer debug experience
> in the GPU driver.
> 
> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
> Signed-off-by: Rob Clark <robdclark@chromium.org>

I presume this implies that more generic options has been discussed.
Regardless, if further conclusions are made in that regard I expect that
this could serve as a base for such efforts.

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 17 ++++++++++++
>  drivers/iommu/arm/arm-smmu/arm-smmu.h      |  2 ++
>  include/linux/adreno-smmu-priv.h           | 31 +++++++++++++++++++++-
>  3 files changed, 49 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index 98b3a1c2a181..b2e31ea84128 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -32,6 +32,22 @@ static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
>  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
>  }
>  
> +static void qcom_adreno_smmu_get_fault_info(const void *cookie,
> +		struct adreno_smmu_fault_info *info)
> +{
> +	struct arm_smmu_domain *smmu_domain = (void *)cookie;
> +	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +
> +	info->fsr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSR);
> +	info->fsynr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR0);
> +	info->fsynr1 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_FSYNR1);
> +	info->far = arm_smmu_cb_readq(smmu, cfg->cbndx, ARM_SMMU_CB_FAR);
> +	info->cbfrsynra = arm_smmu_gr1_read(smmu, ARM_SMMU_GR1_CBFRSYNRA(cfg->cbndx));
> +	info->ttbr0 = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_TTBR0);
> +	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
> +}
> +
>  #define QCOM_ADRENO_SMMU_GPU_SID 0
>  
>  static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
> @@ -156,6 +172,7 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
>  	priv->cookie = smmu_domain;
>  	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
>  	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
> +	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
>  
>  	return 0;
>  }
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.h b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> index c31a59d35c64..84c21c4b0691 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu.h
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.h
> @@ -224,6 +224,8 @@ enum arm_smmu_cbar_type {
>  #define ARM_SMMU_CB_FSYNR0		0x68
>  #define ARM_SMMU_FSYNR0_WNR		BIT(4)
>  
> +#define ARM_SMMU_CB_FSYNR1		0x6c
> +
>  #define ARM_SMMU_CB_S1_TLBIVA		0x600
>  #define ARM_SMMU_CB_S1_TLBIASID		0x610
>  #define ARM_SMMU_CB_S1_TLBIVAL		0x620
> diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
> index a889f28afb42..53fe32fb9214 100644
> --- a/include/linux/adreno-smmu-priv.h
> +++ b/include/linux/adreno-smmu-priv.h
> @@ -8,6 +8,32 @@
>  
>  #include <linux/io-pgtable.h>
>  
> +/**
> + * struct adreno_smmu_fault_info - container for key fault information
> + *
> + * @far: The faulting IOVA from ARM_SMMU_CB_FAR
> + * @ttbr0: The current TTBR0 pagetable from ARM_SMMU_CB_TTBR0
> + * @contextidr: The value of ARM_SMMU_CB_CONTEXTIDR
> + * @fsr: The fault status from ARM_SMMU_CB_FSR
> + * @fsynr0: The value of FSYNR0 from ARM_SMMU_CB_FSYNR0
> + * @fsynr1: The value of FSYNR1 from ARM_SMMU_CB_FSYNR0
> + * @cbfrsynra: The value of CBFRSYNRA from ARM_SMMU_GR1_CBFRSYNRA(idx)
> + *
> + * This struct passes back key page fault information to the GPU driver
> + * through the get_fault_info function pointer.
> + * The GPU driver can use this information to print informative
> + * log messages and provide deeper GPU specific insight into the fault.
> + */
> +struct adreno_smmu_fault_info {
> +	u64 far;
> +	u64 ttbr0;
> +	u32 contextidr;
> +	u32 fsr;
> +	u32 fsynr0;
> +	u32 fsynr1;
> +	u32 cbfrsynra;
> +};
> +
>  /**
>   * struct adreno_smmu_priv - private interface between adreno-smmu and GPU
>   *
> @@ -17,6 +43,8 @@
>   * @set_ttbr0_cfg: Set the TTBR0 config for the GPUs context bank.  A
>   *                 NULL config disables TTBR0 translation, otherwise
>   *                 TTBR0 translation is enabled with the specified cfg
> + * @get_fault_info: Called by the GPU fault handler to get information about
> + *                  the fault
>   *
>   * The GPU driver (drm/msm) and adreno-smmu work together for controlling
>   * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
> @@ -31,6 +59,7 @@ struct adreno_smmu_priv {
>      const void *cookie;
>      const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
>      int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
> +    void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
>  };
>  
> -#endif /* __ADRENO_SMMU_PRIV_H */
> \ No newline at end of file
> +#endif /* __ADRENO_SMMU_PRIV_H */
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler
  2021-06-10 21:44 ` [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler Rob Clark
@ 2021-06-14 17:46   ` Bjorn Andersson
  2021-06-25  3:39   ` Bjorn Andersson
  1 sibling, 0 replies; 19+ messages in thread
From: Bjorn Andersson @ 2021-06-14 17:46 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, Rob Clark, Douglas Anderson, Akhil P Oommen,
	Jonathan Marek, Eric Anholt, David Airlie, linux-arm-msm,
	Sharat Masetty, Konrad Dybcio, Sean Paul, Jordan Crouse,
	Kristian H. Kristensen, Daniel Vetter,
	AngeloGioacchino Del Regno, Marijn Suijten, freedreno, open list

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Jordan Crouse <jcrouse@codeaurora.org>
> 
> Use the new adreno-smmu-priv fault info function to get more SMMU
> debug registers and print the current TTBR0 to debug per-instance
> pagetables and figure out which GPU block generated the request.
> 

Acked-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> Signed-off-by: Jordan Crouse <jcrouse@codeaurora.org>
> Signed-off-by: Rob Clark <robdclark@chromium.org>
> ---
>  drivers/gpu/drm/msm/adreno/a5xx_gpu.c |  4 +-
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 76 +++++++++++++++++++++++++--
>  drivers/gpu/drm/msm/msm_iommu.c       | 11 +++-
>  drivers/gpu/drm/msm/msm_mmu.h         |  4 +-
>  4 files changed, 87 insertions(+), 8 deletions(-)
> 
> diff --git a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> index f46562c12022..eb030b00bff4 100644
> --- a/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a5xx_gpu.c
> @@ -1075,7 +1075,7 @@ bool a5xx_idle(struct msm_gpu *gpu, struct msm_ringbuffer *ring)
>  	return true;
>  }
>  
> -static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
> +static int a5xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
>  {
>  	struct msm_gpu *gpu = arg;
>  	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
> @@ -1085,7 +1085,7 @@ static int a5xx_fault_handler(void *arg, unsigned long iova, int flags)
>  			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(6)),
>  			gpu_read(gpu, REG_A5XX_CP_SCRATCH_REG(7)));
>  
> -	return -EFAULT;
> +	return 0;
>  }
>  
>  static void a5xx_cp_err_irq(struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index c7f0ddb12d8f..fc19db10bff1 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1032,18 +1032,88 @@ static void a6xx_recover(struct msm_gpu *gpu)
>  	msm_gpu_hw_init(gpu);
>  }
>  
> -static int a6xx_fault_handler(void *arg, unsigned long iova, int flags)
> +static const char *a6xx_uche_fault_block(struct msm_gpu *gpu, u32 mid)
> +{
> +	static const char *uche_clients[7] = {
> +		"VFD", "SP", "VSC", "VPC", "HLSQ", "PC", "LRZ",
> +	};
> +	u32 val;
> +
> +	if (mid < 1 || mid > 3)
> +		return "UNKNOWN";
> +
> +	/*
> +	 * The source of the data depends on the mid ID read from FSYNR1.
> +	 * and the client ID read from the UCHE block
> +	 */
> +	val = gpu_read(gpu, REG_A6XX_UCHE_CLIENT_PF);
> +
> +	/* mid = 3 is most precise and refers to only one block per client */
> +	if (mid == 3)
> +		return uche_clients[val & 7];
> +
> +	/* For mid=2 the source is TP or VFD except when the client id is 0 */
> +	if (mid == 2)
> +		return ((val & 7) == 0) ? "TP" : "TP|VFD";
> +
> +	/* For mid=1 just return "UCHE" as a catchall for everything else */
> +	return "UCHE";
> +}
> +
> +static const char *a6xx_fault_block(struct msm_gpu *gpu, u32 id)
> +{
> +	if (id == 0)
> +		return "CP";
> +	else if (id == 4)
> +		return "CCU";
> +	else if (id == 6)
> +		return "CDP Prefetch";
> +
> +	return a6xx_uche_fault_block(gpu, id);
> +}
> +
> +#define ARM_SMMU_FSR_TF                 BIT(1)
> +#define ARM_SMMU_FSR_PF			BIT(3)
> +#define ARM_SMMU_FSR_EF			BIT(4)
> +
> +static int a6xx_fault_handler(void *arg, unsigned long iova, int flags, void *data)
>  {
>  	struct msm_gpu *gpu = arg;
> +	struct adreno_smmu_fault_info *info = data;
> +	const char *type = "UNKNOWN";
>  
> -	pr_warn_ratelimited("*** gpu fault: iova=%08lx, flags=%d (%u,%u,%u,%u)\n",
> +	/*
> +	 * Print a default message if we couldn't get the data from the
> +	 * adreno-smmu-priv
> +	 */
> +	if (!info) {
> +		pr_warn_ratelimited("*** gpu fault: iova=%.16lx flags=%d (%u,%u,%u,%u)\n",
>  			iova, flags,
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
>  			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
>  
> -	return -EFAULT;
> +		return 0;
> +	}
> +
> +	if (info->fsr & ARM_SMMU_FSR_TF)
> +		type = "TRANSLATION";
> +	else if (info->fsr & ARM_SMMU_FSR_PF)
> +		type = "PERMISSION";
> +	else if (info->fsr & ARM_SMMU_FSR_EF)
> +		type = "EXTERNAL";
> +
> +	pr_warn_ratelimited("*** gpu fault: ttbr0=%.16llx iova=%.16lx dir=%s type=%s source=%s (%u,%u,%u,%u)\n",
> +			info->ttbr0, iova,
> +			flags & IOMMU_FAULT_WRITE ? "WRITE" : "READ", type,
> +			a6xx_fault_block(gpu, info->fsynr1 & 0xff),
> +			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(4)),
> +			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(5)),
> +			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(6)),
> +			gpu_read(gpu, REG_A6XX_CP_SCRATCH_REG(7)));
> +
> +	return 0;
>  }
>  
>  static void a6xx_cp_hw_err_irq(struct msm_gpu *gpu)
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 50d881794758..6975b95c3c29 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
>  		unsigned long iova, int flags, void *arg)
>  {
>  	struct msm_iommu *iommu = arg;
> +	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> +	struct adreno_smmu_fault_info info, *ptr = NULL;
> +
> +	if (adreno_smmu->get_fault_info) {
> +		adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
> +		ptr = &info;
> +	}
> +
>  	if (iommu->base.handler)
> -		return iommu->base.handler(iommu->base.arg, iova, flags);
> +		return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
> +
>  	pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
>  	return 0;
>  }
> diff --git a/drivers/gpu/drm/msm/msm_mmu.h b/drivers/gpu/drm/msm/msm_mmu.h
> index 61ade89d9e48..a88f44c3268d 100644
> --- a/drivers/gpu/drm/msm/msm_mmu.h
> +++ b/drivers/gpu/drm/msm/msm_mmu.h
> @@ -26,7 +26,7 @@ enum msm_mmu_type {
>  struct msm_mmu {
>  	const struct msm_mmu_funcs *funcs;
>  	struct device *dev;
> -	int (*handler)(void *arg, unsigned long iova, int flags);
> +	int (*handler)(void *arg, unsigned long iova, int flags, void *data);
>  	void *arg;
>  	enum msm_mmu_type type;
>  };
> @@ -43,7 +43,7 @@ struct msm_mmu *msm_iommu_new(struct device *dev, struct iommu_domain *domain);
>  struct msm_mmu *msm_gpummu_new(struct device *dev, struct msm_gpu *gpu);
>  
>  static inline void msm_mmu_set_fault_handler(struct msm_mmu *mmu, void *arg,
> -		int (*handler)(void *arg, unsigned long iova, int flags))
> +		int (*handler)(void *arg, unsigned long iova, int flags, void *data))
>  {
>  	mmu->arg = arg;
>  	mmu->handler = handler;
> -- 
> 2.31.1
> 
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support
  2021-06-10 21:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support Rob Clark
  2021-06-11 13:49   ` Jordan Crouse
@ 2021-06-14 17:54   ` Bjorn Andersson
  1 sibling, 0 replies; 19+ messages in thread
From: Bjorn Andersson @ 2021-06-14 17:54 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, freedreno, linux-arm-msm, Jordan Crouse,
	Rob Clark, Will Deacon, Robin Murphy, Joerg Roedel,
	Isaac J. Manjarres, moderated list:ARM SMMU DRIVERS, open list

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:

> From: Rob Clark <robdclark@chromium.org>
> 
> Add, via the adreno-smmu-priv interface, a way for the GPU to request
> the SMMU to stall translation on faults, and then later resume the
> translation, either retrying or terminating the current translation.
> 
> This will be used on the GPU side to "freeze" the GPU while we snapshot
> useful state for devcoredump.
> 
> Signed-off-by: Rob Clark <robdclark@chromium.org>

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Regards,
Bjorn

> ---
>  drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 33 ++++++++++++++++++++++
>  include/linux/adreno-smmu-priv.h           |  7 +++++
>  2 files changed, 40 insertions(+)
> 
> diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> index b2e31ea84128..61fc645c1325 100644
> --- a/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c
> @@ -13,6 +13,7 @@ struct qcom_smmu {
>  	struct arm_smmu_device smmu;
>  	bool bypass_quirk;
>  	u8 bypass_cbndx;
> +	u32 stall_enabled;
>  };
>  
>  static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
> @@ -23,12 +24,17 @@ static struct qcom_smmu *to_qcom_smmu(struct arm_smmu_device *smmu)
>  static void qcom_adreno_smmu_write_sctlr(struct arm_smmu_device *smmu, int idx,
>  		u32 reg)
>  {
> +	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu);
> +
>  	/*
>  	 * On the GPU device we want to process subsequent transactions after a
>  	 * fault to keep the GPU from hanging
>  	 */
>  	reg |= ARM_SMMU_SCTLR_HUPCF;
>  
> +	if (qsmmu->stall_enabled & BIT(idx))
> +		reg |= ARM_SMMU_SCTLR_CFCFG;
> +
>  	arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
>  }
>  
> @@ -48,6 +54,31 @@ static void qcom_adreno_smmu_get_fault_info(const void *cookie,
>  	info->contextidr = arm_smmu_cb_read(smmu, cfg->cbndx, ARM_SMMU_CB_CONTEXTIDR);
>  }
>  
> +static void qcom_adreno_smmu_set_stall(const void *cookie, bool enabled)
> +{
> +	struct arm_smmu_domain *smmu_domain = (void *)cookie;
> +	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> +	struct qcom_smmu *qsmmu = to_qcom_smmu(smmu_domain->smmu);
> +
> +	if (enabled)
> +		qsmmu->stall_enabled |= BIT(cfg->cbndx);
> +	else
> +		qsmmu->stall_enabled &= ~BIT(cfg->cbndx);
> +}
> +
> +static void qcom_adreno_smmu_resume_translation(const void *cookie, bool terminate)
> +{
> +	struct arm_smmu_domain *smmu_domain = (void *)cookie;
> +	struct arm_smmu_cfg *cfg = &smmu_domain->cfg;
> +	struct arm_smmu_device *smmu = smmu_domain->smmu;
> +	u32 reg = 0;
> +
> +	if (terminate)
> +		reg |= ARM_SMMU_RESUME_TERMINATE;
> +
> +	arm_smmu_cb_write(smmu, cfg->cbndx, ARM_SMMU_CB_RESUME, reg);
> +}
> +
>  #define QCOM_ADRENO_SMMU_GPU_SID 0
>  
>  static bool qcom_adreno_smmu_is_gpu_device(struct device *dev)
> @@ -173,6 +204,8 @@ static int qcom_adreno_smmu_init_context(struct arm_smmu_domain *smmu_domain,
>  	priv->get_ttbr1_cfg = qcom_adreno_smmu_get_ttbr1_cfg;
>  	priv->set_ttbr0_cfg = qcom_adreno_smmu_set_ttbr0_cfg;
>  	priv->get_fault_info = qcom_adreno_smmu_get_fault_info;
> +	priv->set_stall = qcom_adreno_smmu_set_stall;
> +	priv->resume_translation = qcom_adreno_smmu_resume_translation;
>  
>  	return 0;
>  }
> diff --git a/include/linux/adreno-smmu-priv.h b/include/linux/adreno-smmu-priv.h
> index 53fe32fb9214..c637e0997f6d 100644
> --- a/include/linux/adreno-smmu-priv.h
> +++ b/include/linux/adreno-smmu-priv.h
> @@ -45,6 +45,11 @@ struct adreno_smmu_fault_info {
>   *                 TTBR0 translation is enabled with the specified cfg
>   * @get_fault_info: Called by the GPU fault handler to get information about
>   *                  the fault
> + * @set_stall:     Configure whether stall on fault (CFCFG) is enabled.  Call
> + *                 before set_ttbr0_cfg().  If stalling on fault is enabled,
> + *                 the GPU driver must call resume_translation()
> + * @resume_translation: Resume translation after a fault
> + *
>   *
>   * The GPU driver (drm/msm) and adreno-smmu work together for controlling
>   * the GPU's SMMU instance.  This is by necessity, as the GPU is directly
> @@ -60,6 +65,8 @@ struct adreno_smmu_priv {
>      const struct io_pgtable_cfg *(*get_ttbr1_cfg)(const void *cookie);
>      int (*set_ttbr0_cfg)(const void *cookie, const struct io_pgtable_cfg *cfg);
>      void (*get_fault_info)(const void *cookie, struct adreno_smmu_fault_info *info);
> +    void (*set_stall)(const void *cookie, bool enabled);
> +    void (*resume_translation)(const void *cookie, bool terminate);
>  };
>  
>  #endif /* __ADRENO_SMMU_PRIV_H */
> -- 
> 2.31.1
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler
  2021-06-10 21:44 ` [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler Rob Clark
  2021-06-14 17:46   ` Bjorn Andersson
@ 2021-06-25  3:39   ` Bjorn Andersson
  2021-06-25 15:42     ` Rob Clark
  1 sibling, 1 reply; 19+ messages in thread
From: Bjorn Andersson @ 2021-06-25  3:39 UTC (permalink / raw)
  To: Rob Clark
  Cc: dri-devel, iommu, Rob Clark, Douglas Anderson, Akhil P Oommen,
	Jonathan Marek, Eric Anholt, David Airlie, linux-arm-msm,
	Sharat Masetty, Konrad Dybcio, Sean Paul, Jordan Crouse,
	Kristian H. Kristensen, Daniel Vetter,
	AngeloGioacchino Del Regno, Marijn Suijten, freedreno, open list

On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:
[..]
> diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> index 50d881794758..6975b95c3c29 100644
> --- a/drivers/gpu/drm/msm/msm_iommu.c
> +++ b/drivers/gpu/drm/msm/msm_iommu.c
> @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
>  		unsigned long iova, int flags, void *arg)
>  {
>  	struct msm_iommu *iommu = arg;
> +	struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> +	struct adreno_smmu_fault_info info, *ptr = NULL;
> +
> +	if (adreno_smmu->get_fault_info) {

This seemed reasonable when I read it last time, but I didn't realize
that the msm_fault_handler() is installed for all msm_iommu instances.

So while we're trying to recover from the boot splash and setup the new
framebuffer we end up here with iommu->base.dev being the mdss device.
Naturally drvdata of mdss is not a struct adreno_smmu_priv.

> +		adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);

So here we just jump straight out into hyperspace, never to return.

Not sure how to wire this up to avoid the problem, but right now I don't
think we can boot any device with a boot splash.

Regards,
Bjorn

> +		ptr = &info;
> +	}
> +
>  	if (iommu->base.handler)
> -		return iommu->base.handler(iommu->base.arg, iova, flags);
> +		return iommu->base.handler(iommu->base.arg, iova, flags, ptr);
> +
>  	pr_warn_ratelimited("*** fault: iova=%16lx, flags=%d\n", iova, flags);
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler
  2021-06-25  3:39   ` Bjorn Andersson
@ 2021-06-25 15:42     ` Rob Clark
  0 siblings, 0 replies; 19+ messages in thread
From: Rob Clark @ 2021-06-25 15:42 UTC (permalink / raw)
  To: Bjorn Andersson
  Cc: dri-devel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	Rob Clark, Douglas Anderson, Akhil P Oommen, Jonathan Marek,
	Eric Anholt, David Airlie, linux-arm-msm, Sharat Masetty,
	Konrad Dybcio, Sean Paul, Jordan Crouse, Kristian H. Kristensen,
	Daniel Vetter, AngeloGioacchino Del Regno, Marijn Suijten,
	freedreno, open list

On Thu, Jun 24, 2021 at 8:39 PM Bjorn Andersson
<bjorn.andersson@linaro.org> wrote:
>
> On Thu 10 Jun 16:44 CDT 2021, Rob Clark wrote:
> [..]
> > diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
> > index 50d881794758..6975b95c3c29 100644
> > --- a/drivers/gpu/drm/msm/msm_iommu.c
> > +++ b/drivers/gpu/drm/msm/msm_iommu.c
> > @@ -211,8 +211,17 @@ static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
> >               unsigned long iova, int flags, void *arg)
> >  {
> >       struct msm_iommu *iommu = arg;
> > +     struct adreno_smmu_priv *adreno_smmu = dev_get_drvdata(iommu->base.dev);
> > +     struct adreno_smmu_fault_info info, *ptr = NULL;
> > +
> > +     if (adreno_smmu->get_fault_info) {
>
> This seemed reasonable when I read it last time, but I didn't realize
> that the msm_fault_handler() is installed for all msm_iommu instances.
>
> So while we're trying to recover from the boot splash and setup the new
> framebuffer we end up here with iommu->base.dev being the mdss device.
> Naturally drvdata of mdss is not a struct adreno_smmu_priv.
>
> > +             adreno_smmu->get_fault_info(adreno_smmu->cookie, &info);
>
> So here we just jump straight out into hyperspace, never to return.
>
> Not sure how to wire this up to avoid the problem, but right now I don't
> think we can boot any device with a boot splash.
>

I think we could do:

------------------------
diff --git a/drivers/gpu/drm/msm/msm_iommu.c b/drivers/gpu/drm/msm/msm_iommu.c
index eed2a762e9dd..30ee8866154e 100644
--- a/drivers/gpu/drm/msm/msm_iommu.c
+++ b/drivers/gpu/drm/msm/msm_iommu.c
@@ -29,6 +29,9 @@ static struct msm_iommu_pagetable
*to_pagetable(struct msm_mmu *mmu)
  return container_of(mmu, struct msm_iommu_pagetable, base);
 }

+static int msm_fault_handler(struct iommu_domain *domain, struct device *dev,
+ unsigned long iova, int flags, void *arg);
+
 static int msm_iommu_pagetable_unmap(struct msm_mmu *mmu, u64 iova,
  size_t size)
 {
@@ -151,6 +154,8 @@ struct msm_mmu *msm_iommu_pagetable_create(struct
msm_mmu *parent)
  struct io_pgtable_cfg ttbr0_cfg;
  int ret;

+ iommu_set_fault_handler(iommu->domain, msm_fault_handler, iommu);
+
  /* Get the pagetable configuration from the domain */
  if (adreno_smmu->cookie)
  ttbr1_cfg = adreno_smmu->get_ttbr1_cfg(adreno_smmu->cookie);
@@ -300,7 +305,6 @@ struct msm_mmu *msm_iommu_new(struct device *dev,
struct iommu_domain *domain)

  iommu->domain = domain;
  msm_mmu_init(&iommu->base, dev, &funcs, MSM_MMU_IOMMU);
- iommu_set_fault_handler(domain, msm_fault_handler, iommu);

  atomic_set(&iommu->pagetables, 0);

------------------------

That would have the result of setting the same fault handler multiple
times, but that looks harmless.  Mostly the fault handling stuff is to
make it easier to debug userspace issues, the fallback dmesg spam from
arm-smmu should be sufficient for any kernel side issues.

BR,
-R

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling
  2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
                   ` (4 preceding siblings ...)
  2021-06-10 21:44 ` [PATCH v5 5/5] drm/msm: devcoredump iommu fault support Rob Clark
@ 2021-07-04 12:53 ` Dmitry Baryshkov
  2021-07-04 18:20   ` Rob Clark
  5 siblings, 1 reply; 19+ messages in thread
From: Dmitry Baryshkov @ 2021-07-04 12:53 UTC (permalink / raw)
  To: Rob Clark, dri-devel, iommu
  Cc: freedreno, linux-arm-msm, Jordan Crouse, Rob Clark,
	Akhil P Oommen, AngeloGioacchino Del Regno, Bjorn Andersson,
	Douglas Anderson, Eric Anholt, Isaac J. Manjarres, Joerg Roedel,
	John Stultz, Jonathan Marek, Konrad Dybcio, Krishna Reddy,
	Kristian H. Kristensen, Lee Jones,
	moderated list:ARM SMMU DRIVERS, open list, Marijn Suijten,
	Robin Murphy, Sai Prakash Ranjan, Sharat Masetty, Will Deacon,
	Zhenzhong Duan

[-- Attachment #1: Type: text/plain, Size: 3072 bytes --]

Hi,

I've had splash screen disabled on my RB3. However once I've enabled it, 
I've got the attached crash during the boot on the msm/msm-next. It 
looks like it is related to this particular set of changes.

On 11/06/2021 00:44, Rob Clark wrote:
> From: Rob Clark <robdclark@chromium.org>
> 
> This picks up an earlier series[1] from Jordan, and adds additional
> support needed to generate GPU devcore dumps on iova faults.  Original
> description:
> 
> This is a stack to add an Adreno GPU specific handler for pagefaults. The first
> patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
> a adreno-smmu-priv function hook to capture a handful of important debugging
> registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
> third patch to print more detailed information on page fault such as the TTBR0
> for the pagetable that caused the fault and the source of the fault as
> determined by a combination of the FSYNR1 register and an internal GPU
> register.
> 
> This code provides a solid base that we can expand on later for even more
> extensive GPU side page fault debugging capabilities.
> 
> v5: [Rob] Use RBBM_STATUS3.SMMU_STALLED_ON_FAULT to detect case where
>      GPU snapshotting needs to avoid crashdumper, and check the
>      RBBM_STATUS3.SMMU_STALLED_ON_FAULT in GPU hang irq paths
> v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
>      resume translation after it has had a chance to snapshot the GPUs
>      state
> v3: Always clear FSR even if the target driver is going to handle resume
> v2: Fix comment wording and function pointer check per Rob Clark
> 
> [1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcrouse@codeaurora.org/
> 
> Jordan Crouse (3):
>    iommu/arm-smmu: Add support for driver IOMMU fault handlers
>    iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
>      info
>    drm/msm: Improve the a6xx page fault handler
> 
> Rob Clark (2):
>    iommu/arm-smmu-qcom: Add stall support
>    drm/msm: devcoredump iommu fault support
> 
>   drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  23 +++-
>   drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 110 +++++++++++++++++++-
>   drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  42 ++++++--
>   drivers/gpu/drm/msm/adreno/adreno_gpu.c     |  15 +++
>   drivers/gpu/drm/msm/msm_gem.h               |   1 +
>   drivers/gpu/drm/msm/msm_gem_submit.c        |   1 +
>   drivers/gpu/drm/msm/msm_gpu.c               |  48 +++++++++
>   drivers/gpu/drm/msm/msm_gpu.h               |  17 +++
>   drivers/gpu/drm/msm/msm_gpummu.c            |   5 +
>   drivers/gpu/drm/msm/msm_iommu.c             |  22 +++-
>   drivers/gpu/drm/msm/msm_mmu.h               |   5 +-
>   drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 +++++++++
>   drivers/iommu/arm/arm-smmu/arm-smmu.c       |   9 +-
>   drivers/iommu/arm/arm-smmu/arm-smmu.h       |   2 +
>   include/linux/adreno-smmu-priv.h            |  38 ++++++-
>   15 files changed, 367 insertions(+), 21 deletions(-)
> 


-- 
With best wishes
Dmitry

[-- Attachment #2: log-rb3-crash.txt --]
[-- Type: text/plain, Size: 29485 bytes --]

\0\0Handling Cmd: set_active:a
SetActiveSlot: _a already active slot
Handling Cmd: download:061b8800
Download Finished
Handling Cmd: boot
A/B retry count NOT decremented
Booting Into Mission Mode
No dtbo partition is found, Skip dtbo
Exit key detection timer
GetVmData: making ScmCall to get HypInfo
GetVmData: No Vm data present! Status = (0x3)
No Ffbm cookie found, ignore: Not Found
Memory Base Address: 0x80000000
Decompressing kernel image start: 32313 ms
Decompressing kernel image done: 35477 ms
BootLinux: failed to get dtbo image
DTB offset is incorrect, kernel image does not have appended DTB
Cmdline: ignore_loglevel console=ttyMSM0,115200n8 clk_ignore_unused pd_ignore_unused earlycon pcie_pme=nomsi root=PARTLABEL=userdata rootwait androidboot.bootdevice=1d84000.ufshc androidboot.serialno=8f186bb6 androidboot.baseband=msm msm_drm.dsi_display0=
RAM Partitions
Add Base: 0x0000000080000000 Available Length: 0x00000000FDFA0000 
WARNING: Unsupported EFI_RAMPARTITION_PROTOCOL
ERROR: Could not get splash memory region node
kaslr-Seed is added to chosen node

Shutting Down UEFI Boot Services: 36987 ms
BDS: LogFs sync skipped, Unsupported
App Log Flush : 0 ms
Exit BS        [37141] UEFI End
[    0.000000] Booting Linux on physical CPU 0x0000000000 [0x517f803c]
[    0.000000] Linux version 5.13.0-rc3-00115-g9b6193ea776c-dirty (lumag@eriador) (aarch64-linux-gnu-gcc (Debian 11.1.0-1) 11.1.0, GNU ld (GNU Binutils for Debian) 2.35.2) #142 SMP PREEMPT Sun Jul 4 15:25:15 MSK 2021
[    0.000000] Machine model: Thundercomm Dragonboard 845c
[    0.000000] printk: debug: ignoring loglevel setting.
[    0.000000] earlycon: qcom_geni0 at MMIO 0x0000000000a84000 (options '115200n8')
[    0.000000] printk: bootconsole [qcom_geni0] enabled
[    0.000000] efi: UEFI not found.
[    0.000000] NUMA: No NUMA configuration found
[    0.000000] NUMA: Faking a node at [mem 0x0000000080000000-0x000000017df9ffff]
[    0.000000] NUMA: NODE_DATA [mem 0x17d78c200-0x17d78dfff]
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000080000000-0x00000000ffffffff]
[    0.000000]   DMA32    empty
[    0.000000]   Normal   [mem 0x0000000100000000-0x000000017df9ffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000080000000-0x00000000856fffff]
[    0.000000]   node   0: [mem 0x0000000085700000-0x0000000085cfffff]
[    0.000000]   node   0: [mem 0x0000000085d00000-0x0000000085dfffff]
[    0.000000]   node   0: [mem 0x0000000085e00000-0x0000000085efffff]
[    0.000000]   node   0: [mem 0x0000000085f00000-0x0000000085fbffff]
[    0.000000]   node   0: [mem 0x0000000085fc0000-0x00000000890fffff]
[    0.000000]   node   0: [mem 0x0000000089100000-0x000000008aafffff]
[    0.000000]   node   0: [mem 0x000000008ab00000-0x000000008c416fff]
[    0.000000]   node   0: [mem 0x000000008c417000-0x000000008c4fffff]
[    0.000000]   node   0: [mem 0x000000008c500000-0x0000000097bfffff]
[    0.000000]   node   0: [mem 0x0000000097c00000-0x000000017df9ffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000080000000-0x000000017df9ffff]
[    0.000000] On node 0 totalpages: 1040288
[    0.000000]   DMA zone: 8192 pages used for memmap
[    0.000000]   DMA zone: 0 pages reserved
[    0.000000]   DMA zone: 524288 pages, LIFO batch:63
[    0.000000]   Normal zone: 8063 pages used for memmap
[    0.000000]   Normal zone: 516000 pages, LIFO batch:63
[    0.000000]   Normal zone: 96 pages in unavailable ranges
[    0.000000] cma: Reserved 32 MiB at 0x00000000fe000000
[    0.000000] psci: probing for conduit method from DT.
[    0.000000] psci: PSCIv1.1 detected in firmware.
[    0.000000] psci: Using standard PSCI v0.2 function IDs
[    0.000000] psci: MIGRATE_INFO_TYPE not supported.
[    0.000000] psci: SMC Calling Convention v1.2
[    0.000000] psci: OSI mode supported.
[    0.000000] percpu: Embedded 23 pages/cpu s56152 r8192 d29864 u94208
[    0.000000] pcpu-alloc: s56152 r8192 d29864 u94208 alloc=23*4096
[    0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 [0] 4 [0] 5 [0] 6 [0] 7 
[    0.000000] Detected VIPT I-cache on CPU0
[    0.000000] CPU features: detected: GIC system register CPU interface
[    0.000000] CPU features: detected: Hardware dirty bit management
[    0.000000] CPU features: kernel page table isolation forced ON by KASLR
[    0.000000] CPU features: detected: Kernel page table isolation (KPTI)
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 1024033
[    0.000000] Policy zone: Normal
[    0.000000] Kernel command line: ignore_loglevel console=ttyMSM0,115200n8 clk_ignore_unused pd_ignore_unused earlycon pcie_pme=nomsi root=PARTLABEL=userdata rootwait androidboot.bootdevice=1d84000.ufshc androidboot.serialno=8f186bb6 androidboot.baseband=msm msm_drm.dsi_display0=dsi_lt9611_1080_video_display:
[    0.000000] Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes, linear)
[    0.000000] Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] software IO TLB: mapped [mem 0x00000000fa000000-0x00000000fe000000] (64MB)
[    0.000000] Memory: 3593368K/4161152K available (14400K kernel code, 2062K rwdata, 7628K rodata, 3328K init, 418K bss, 535016K reserved, 32768K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=8, Nodes=1
[    0.000000] rcu: Preemptible hierarchical RCU implementation.
[    0.000000] rcu: 	RCU event tracing is enabled.
[    0.000000] rcu: 	RCU restricting CPUs from NR_CPUS=256 to nr_cpu_ids=8.
[    0.000000] 	Trampoline variant of Tasks RCU enabled.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] rcu: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=8
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] GICv3: 768 SPIs implemented
[    0.000000] GICv3: 0 Extended SPIs implemented
[    0.000000] GICv3: Distributor has no Range Selector support
[    0.000000] Root IRQ handler: gic_handle_irq
[    0.000000] GICv3: 16 PPIs implemented
[    0.000000] GICv3: CPU0: found redistributor 0 region 0:0x0000000017a60000
[    0.000000] ITS: No ITS available, not enabling LPIs
[    0.000000] random: get_random_bytes called from start_kernel+0x348/0x52c with crng_init=0
[    0.000000] arch_timer: cp15 and mmio timer(s) running at 19.20MHz (virt/virt).
[    0.000000] clocksource: arch_sys_counter: mask: 0xffffffffffffff max_cycles: 0x46d987e47, max_idle_ns: 440795202767 ns
[    0.000001] sched_clock: 56 bits at 19MHz, resolution 52ns, wraps every 4398046511078ns
[    0.008313] Console: colour dummy device 80x25
[    0.012882] Calibrating delay loop (skipped), value calculated using timer frequency.. 38.40 BogoMIPS (lpj=76800)
[    0.023232] pid_max: default: 32768 minimum: 301
[    0.027957] LSM: Security Framework initializing
[    0.032677] Mount-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.040145] Mountpoint-cache hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.049625] rcu: Hierarchical SRCU implementation.
[    0.055932] EFI services will not be available.
[    0.060866] smp: Bringing up secondary CPUs ...
[    0.067019] Detected VIPT I-cache on CPU1
[    0.067073] GICv3: CPU1: found redistributor 100 region 0:0x0000000017a80000
[    0.067166] CPU1: Booted secondary processor 0x0000000100 [0x517f803c]
[    0.068727] Detected VIPT I-cache on CPU2
[    0.068761] GICv3: CPU2: found redistributor 200 region 0:0x0000000017aa0000
[    0.068827] CPU2: Booted secondary processor 0x0000000200 [0x517f803c]
[    0.070471] Detected VIPT I-cache on CPU3
[    0.070496] GICv3: CPU3: found redistributor 300 region 0:0x0000000017ac0000
[    0.070555] CPU3: Booted secondary processor 0x0000000300 [0x517f803c]
[    0.072816] CPU features: detected: Spectre-v2
[    0.072830] Detected VIPT I-cache on CPU4
[    0.072856] GICv3: CPU4: found redistributor 400 region 0:0x0000000017ae0000
[    0.072918] CPU4: Booted secondary processor 0x0000000400 [0x516f802d]
[    0.074954] Detected VIPT I-cache on CPU5
[    0.074984] GICv3: CPU5: found redistributor 500 region 0:0x0000000017b00000
[    0.075047] CPU5: Booted secondary processor 0x0000000500 [0x516f802d]
[    0.077204] Detected VIPT I-cache on CPU6
[    0.077235] GICv3: CPU6: found redistributor 600 region 0:0x0000000017b20000
[    0.077297] CPU6: Booted secondary processor 0x0000000600 [0x516f802d]
[    0.079653] Detected VIPT I-cache on CPU7
[    0.079686] GICv3: CPU7: found redistributor 700 region 0:0x0000000017b40000
[    0.079749] CPU7: Booted secondary processor 0x0000000700 [0x516f802d]
[    0.079831] smp: Brought up 1 node, 8 CPUs
[    0.212583] SMP: Total of 8 processors activated.
[    0.217331] CPU features: detected: 32-bit EL0 Support
[    0.222532] CPU features: detected: 32-bit EL1 Support
[    0.227722] CPU features: detected: Data cache clean to the PoU not required for I/D coherence
[    0.236401] CPU features: detected: Common not Private translations
[    0.242723] CPU features: detected: CRC32 instructions
[    0.247912] CPU features: detected: RCpc load-acquire (LDAPR)
[    0.253711] CPU features: detected: LSE atomic instructions
[    0.259338] CPU features: detected: Privileged Access Never
[    0.264966] CPU features: detected: RAS Extension Support
[    0.294165] CPU: All CPU(s) started at EL1
[    0.298362] alternatives: patching kernel code
[    0.304965] devtmpfs: initialized
[    0.319923] KASLR enabled
[    0.322745] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.332582] futex hash table entries: 2048 (order: 5, 131072 bytes, linear)
[    0.340366] pinctrl core: initialized pinctrl subsystem
[    0.346196] DMI not present or invalid.
[    0.350404] NET: Registered protocol family 16
[    0.356280] DMA: preallocated 512 KiB GFP_KERNEL pool for atomic allocations
[    0.363529] DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA pool for atomic allocations
[    0.371545] DMA: preallocated 512 KiB GFP_KERNEL|GFP_DMA32 pool for atomic allocations
[    0.379581] audit: initializing netlink subsys (disabled)
[    0.385161] audit: type=2000 audit(0.292:1): state=initialized audit_enabled=0 res=1
[    0.385517] thermal_sys: Registered thermal governor 'step_wise'
[    0.392973] thermal_sys: Registered thermal governor 'power_allocator'
[    0.401318] cpuidle: using governor menu
[    0.411961] NET: Registered protocol family 42
[    0.416616] hw-breakpoint: found 6 breakpoint and 4 watchpoint registers.
[    0.423672] ASID allocator initialised with 32768 entries
[    0.430661] Serial: AMBA PL011 UART driver
[    0.447842] platform 1c06000.phy: Fixing up cyclic dependency with 100000.clock-controller
[    0.456596] platform 1c0a000.phy: Fixing up cyclic dependency with 100000.clock-controller
[    0.493694] HugeTLB registered 1.00 GiB page size, pre-allocated 0 pages
[    0.500473] HugeTLB registered 32.0 MiB page size, pre-allocated 0 pages
[    0.507245] HugeTLB registered 2.00 MiB page size, pre-allocated 0 pages
[    0.514004] HugeTLB registered 64.0 KiB page size, pre-allocated 0 pages
[    0.521872] cryptd: max_cpu_qlen set to 1000
[    0.529643] VBAT: supplied by DC12V
[    0.533365] VBAT_SOM: supplied by DC12V
[    0.537418] VDC_3V3: supplied by DC12V
[    0.541371] VDC_5V: supplied by DC12V
[    0.545342] vph_pwr: supplied by VBAT_SOM
[    0.549974] iommu: Default domain type: Translated 
[    0.555030] vgaarb: loaded
[    0.557977] SCSI subsystem initialized
[    0.561867] libata version 3.00 loaded.
[    0.565887] usbcore: registered new interface driver usbfs
[    0.571455] usbcore: registered new interface driver hub
[    0.576838] usbcore: registered new device driver usb
[    0.582196] mc: Linux media interface: v0.10
[    0.586535] videodev: Linux video capture interface: v2.00
[    0.592128] pps_core: LinuxPPS API ver. 1 registered
[    0.597147] pps_core: Software ver. 5.3.6 - Copyright 2005-2007 Rodolfo Giometti <giometti@linux.it>
[    0.606357] PTP clock support registered
[    0.610437] EDAC MC: Ver: 3.0.0
[    0.614065] qcom_scm: convention: smc arm 64
[    0.618941] FPGA manager framework
[    0.622446] Advanced Linux Sound Architecture Driver Initialized.
[    0.628970] Bluetooth: Core ver 2.22
[    0.632605] NET: Registered protocol family 31
[    0.637098] Bluetooth: HCI device and connection manager initialized
[    0.643517] Bluetooth: HCI socket layer initialized
[    0.648444] Bluetooth: L2CAP socket layer initialized
[    0.653551] Bluetooth: SCO socket layer initialized
[    0.658862] clocksource: Switched to clocksource arch_sys_counter
[    0.665127] VFS: Disk quotas dquot_6.6.0
[    0.669140] VFS: Dquot-cache hash table entries: 512 (order 0, 4096 bytes)
[    0.681221] NET: Registered protocol family 2
[    0.685747] IP idents hash table entries: 65536 (order: 7, 524288 bytes, linear)
[    0.695536] tcp_listen_portaddr_hash hash table entries: 2048 (order: 3, 32768 bytes, linear)
[    0.704215] TCP established hash table entries: 32768 (order: 6, 262144 bytes, linear)
[    0.712366] TCP bind hash table entries: 32768 (order: 7, 524288 bytes, linear)
[    0.720117] TCP: Hash tables configured (established 32768 bind 32768)
[    0.726796] UDP hash table entries: 2048 (order: 4, 65536 bytes, linear)
[    0.733633] UDP-Lite hash table entries: 2048 (order: 4, 65536 bytes, linear)
[    0.740999] NET: Registered protocol family 1
[    0.745779] RPC: Registered named UNIX socket transport module.
[    0.751771] RPC: Registered udp transport module.
[    0.756530] RPC: Registered tcp transport module.
[    0.761287] RPC: Registered tcp NFSv4.1 backchannel transport module.
[    0.768282] PCI: CLS 0 bytes, default 64
[    0.772442] Unpacking initramfs...
[    0.776152] hw perfevents: enabled with armv8_pmuv3 PMU driver, 7 counters available
[    0.784475] kvm [1]: HYP mode not available
[    0.792629] Initialise system trusted keyrings
[    0.797251] workingset: timestamp_bits=44 max_order=20 bucket_order=0
[    0.808337] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[    0.814681] NFS: Registering the id_resolver key type
[    0.819805] Key type id_resolver registered
[    0.824038] Key type id_legacy registered
[    0.828149] nfs4filelayout_init: NFSv4 File Layout Driver Registering...
[    0.834912] nfs4flexfilelayout_init: NFSv4 Flexfile Layout Driver Registering...
[    0.842521] 9p: Installing v9fs 9p2000 file system support
[    0.881103] Key type asymmetric registered
[    0.885248] Asymmetric key parser 'x509' registered
[    0.890203] Block layer SCSI generic (bsg) driver version 0.4 loaded (major 244)
[    0.897664] io scheduler mq-deadline registered
[    0.902244] io scheduler kyber registered
[    0.918213] bob: Setting 3312000-3600000uV
[    0.922549] bob: supplied by vph_pwr
[    0.929266] Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
[    0.936983] msm_serial: driver initialized
[    0.942224] arm-smmu 15000000.iommu: probing hardware configuration...
[    0.948831] arm-smmu 15000000.iommu: SMMUv2 with:
[    0.953607] arm-smmu 15000000.iommu: 	stage 1 translation
[    0.959060] arm-smmu 15000000.iommu: 	non-coherent table walk
[    0.964865] arm-smmu 15000000.iommu: 	(IDR0.CTTW overridden by FW configuration)
[    0.972331] arm-smmu 15000000.iommu: 	stream matching with 76 register groups
[    0.979550] arm-smmu 15000000.iommu: 	45 context banks (0 stage-2 only)
[    0.986694] arm-smmu 15000000.iommu: 	Supported page sizes: 0x61311000
[    0.993287] arm-smmu 15000000.iommu: 	Stage-1: 48-bit VA -> 48-bit IPA
[    1.012286] loop: module loaded
[    1.016058] megasas: 07.714.04.00-rc1
[    1.022118] spmi spmi-0: PMIC arbiter version v5 (0x50000000)
[    1.036242] libphy: Fixed MDIO Bus: probed
[    1.041077] tun: Universal TUN/TAP device driver, 1.6
[    1.046346] CAN device driver interface
[    1.050486] usbcore: registered new interface driver asix
[    1.055970] usbcore: registered new interface driver ax88179_178a
[    1.062236] VFIO - User Level meta-driver version: 0.3
[    1.068820] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
[    1.075420] ehci-pci: EHCI PCI platform driver
[    1.079933] ehci-platform: EHCI generic platform driver
[    1.085345] ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
[    1.091604] ohci-pci: OHCI PCI platform driver
[    1.096121] ohci-platform: OHCI generic platform driver
[    1.101857] usbcore: registered new interface driver usb-storage
[    1.109186] udc-core: couldn't find an available UDC - added [zero] to list of pending drivers
[    1.119295] rtc-pm8xxx c440000.spmi:pmic@0:rtc@6000: registered as rtc0
[    1.126041] rtc-pm8xxx c440000.spmi:pmic@0:rtc@6000: setting system clock to 1970-01-01T00:00:37 UTC (37)
[    1.135823] i2c /dev entries driver
[    1.146180] Bluetooth: HCI UART driver ver 2.3
[    1.150692] Bluetooth: HCI UART protocol H4 registered
[    1.155881] Bluetooth: HCI UART protocol BCSP registered
[    1.161264] Bluetooth: HCI UART protocol LL registered
[    1.166467] Bluetooth: HCI UART protocol Three-wire (H5) registered
[    1.172875] Bluetooth: HCI UART protocol Broadcom registered
[    1.178601] Bluetooth: HCI UART protocol QCA registered
[    1.185508] sdhci: Secure Digital Host Controller Interface driver
[    1.191750] sdhci: Copyright(c) Pierre Ossman
[    1.196249] Synopsys Designware Multimedia Card Interface Driver
[    1.202775] sdhci-pltfm: SDHCI platform and OF driver helper
[    1.210549] ledtrig-cpu: registered to indicate activity on CPUs
[    1.216870] SMCCC: SOC_ID: ARCH_FEATURES(ARCH_SOC_ID) returned error: fffffffffffffffe
[    1.225556] usbcore: registered new interface driver usbhid
[    1.231194] usbhid: USB HID core driver
[    1.236471] qcom_q6v5_pas remoteproc-adsp: supply cx not found, using dummy regulator
[    1.244499] qcom_q6v5_pas remoteproc-adsp: supply px not found, using dummy regulator
[    1.253485] remoteproc remoteproc0: remoteproc-adsp is available
[    1.260008] qcom_q6v5_pas remoteproc-cdsp: supply cx not found, using dummy regulator
[    1.268045] qcom_q6v5_pas remoteproc-cdsp: supply px not found, using dummy regulator
[    1.277141] remoteproc remoteproc1: remoteproc-cdsp is available
[    1.313294] NET: Registered protocol family 17
[    1.317807] can: controller area network core
[    1.322237] NET: Registered protocol family 29
[    1.326722] Bluetooth: HIDP (Human Interface Emulation) ver 1.2
[    1.332699] Bluetooth: HIDP socket layer initialized
[    1.337824] 9pnet: Installing 9P2000 support
[    1.342172] Key type dns_resolver registered
[    1.346923] registered taskstats version 1
[    1.351069] Loading compiled-in X.509 certificates
[    1.358036] amba 6041000.funnel: Fixing up cyclic dependency with 6002000.stm
[    1.366636] amba 6045000.funnel: Fixing up cyclic dependency with 6043000.funnel
[    1.374140] amba 6045000.funnel: Fixing up cyclic dependency with 6041000.funnel
[    1.382975] amba 6047000.etf: Fixing up cyclic dependency with 6045000.funnel
[    1.390208] amba 6047000.etf: Fixing up cyclic dependency with 6046000.replicator
[    1.398500] amba 6048000.etr: Fixing up cyclic dependency with 6046000.replicator
[    1.413307] amba 7800000.funnel: Fixing up cyclic dependency with 7740000.etm
[    1.420572] amba 7800000.funnel: Fixing up cyclic dependency with 7640000.etm
[    1.427803] amba 7800000.funnel: Fixing up cyclic dependency with 7540000.etm
[    1.435031] amba 7800000.funnel: Fixing up cyclic dependency with 7440000.etm
[    1.442253] amba 7800000.funnel: Fixing up cyclic dependency with 7340000.etm
[    1.449473] amba 7800000.funnel: Fixing up cyclic dependency with 7240000.etm
[    1.456704] amba 7800000.funnel: Fixing up cyclic dependency with 7140000.etm
[    1.463925] amba 7800000.funnel: Fixing up cyclic dependency with 7040000.etm
[    1.471914] amba 7810000.funnel: Fixing up cyclic dependency with 7800000.funnel
[    1.479405] amba 7810000.funnel: Fixing up cyclic dependency with 6043000.funnel
[    1.508341] LT9611_1V8: supplied by VDC_5V
[    1.512917] LT9611_3V3: supplied by VDC_3V3
[    1.517507] PCIE0_1.05V: supplied by VBAT
[    1.522532] CAM0_DVDD_1V2: supplied by VBAT
[    1.527434] CAM0_AVDD_2V8: supplied by VBAT
[    1.531941] CAM3_AVDD_2V8: supplied by VBAT
[    1.536597] VLDO_3V3: supplied by VBAT
[    1.540681] V5P0_HDMIOUT: supplied by VDC_5V
[    1.547168] bam-dma-engine 1dc4000.dma: Adding to iommu group 0
[    1.553801] bam-dma-engine 1dc4000.dma: num-channels unspecified in dt
[    1.560399] bam-dma-engine 1dc4000.dma: num-ees unspecified in dt
[    1.568653] bam-dma-engine 17184000.dma-controller: Adding to iommu group 1
[    1.577559] geni_se_qup 8c0000.geniqup: Adding to iommu group 2
[    1.592398] 898000.serial: ttyHS0 at MMIO 0x898000 (irq = 166, base_baud = 0) is a MSM
[    1.602930] serial serial0: tty port ttyHS0 registered
[    1.603587] Bluetooth: Failed to init regulators:-517
[    1.609497] geni_se_qup ac0000.geniqup: Adding to iommu group 3
[    1.622549] a84000.serial: ttyMSM0 at MMIO 0xa84000 (irq = 167, base_baud = 0) is a MSM
[    1.638796] printk: console [ttyMSM0] enabled
[    1.638796] printk: console [ttyMSM0] enabled
[    1.647593] printk: bootconsole [qcom_geni0] disabled
[    1.647593] printk: bootconsole [qcom_geni0] disabled
[    1.666234] random: fast init done
[    1.670684] i2c 10-003b: Fixing up cyclic dependency with hdmi-out
[    1.679756] geni_i2c a8c000.i2c: Bus frequency not specified, default to 100kHz.
[    1.693843] geni_i2c a98000.i2c: Bus frequency not specified, default to 100kHz.
[    1.705196] smps3: Setting 1352000-1352000uV
[    1.710256] smps3: supplied by vph_pwr
[    1.714220] smps5: Setting 1904000-2040000uV
[    1.718740] smps5: supplied by vph_pwr
[    1.722627] smps7: Setting 900000-1028000uV
[    1.726971] smps7: supplied by vph_pwr
[    1.730836] ldo1: Setting 880000-880000uV
[    1.735343] ldo1: supplied by smps7
[    1.738960] ldo5: Setting 800000-800000uV
[    1.743455] ldo5: supplied by smps7
[    1.747068] ldo12: Setting 1800000-1800000uV
[    1.752255] ldo12: supplied by smps5
[    1.756033] ldo7: Setting 1800000-1800000uV
[    1.761124] ldo7: supplied by smps5
[    1.764804] ldo13: Setting 1800000-2960000uV
[    1.770425] ldo13: supplied by bob
[    1.774003] ldo17: Setting 1304000-1304000uV
[    1.778728] ldo17: supplied by smps3
[    1.782437] ldo20: Setting 2960000-2968000uV
[    1.787539] ldo20: supplied by bob
[    1.791093] ldo21: Setting 2960000-2968000uV
[    1.796245] ldo21: supplied by bob
[    1.799766] ldo24: Setting 3088000-3088000uV
[    1.804833] ldo24: supplied by bob
[    1.808347] ldo25: Setting 3300000-3312000uV
[    1.813408] ldo25: supplied by bob
[    1.816918] ldo26: Setting 1200000-1200000uV
[    1.822026] ldo26: supplied by smps3
[    1.826922] lvs1: supplied by vreg_s4a_1p8
[    1.832370] lvs2: supplied by vreg_s4a_1p8
[    1.838021] arm-smmu 5040000.iommu: probing hardware configuration...
[    1.844532] arm-smmu 5040000.iommu: SMMUv2 with:
[    1.849238] arm-smmu 5040000.iommu: 	stage 1 translation
[    1.854602] arm-smmu 5040000.iommu: 	address translation ops
[    1.860317] arm-smmu 5040000.iommu: 	non-coherent table walk
[    1.866029] arm-smmu 5040000.iommu: 	(IDR0.CTTW overridden by FW configuration)
[    1.873444] arm-smmu 5040000.iommu: 	stream matching with 5 register groups
[    1.880485] arm-smmu 5040000.iommu: 	5 context banks (0 stage-2 only)
[    1.887007] arm-smmu 5040000.iommu: 	Supported page sizes: 0x63315000
[    1.893511] arm-smmu 5040000.iommu: 	Stage-1: 48-bit VA -> 36-bit IPA
[    1.906923] adreno 5000000.gpu: Adding to iommu group 4
[    1.914037] msm ae00000.mdss: Adding to iommu group 5
[    1.920258] platform ae94000.dsi: Fixing up cyclic dependency with 10-003b
[    1.927259] platform ae94000.dsi: Fixing up cyclic dependency with ae01000.mdp
[    1.958684] qcrypto 1dfa000.crypto: Adding to iommu group 0
[    1.959006] sdhci_msm 8804000.sdhci: Adding to iommu group 6
[    1.970817] sdhci_msm 8804000.sdhci: Got CD GPIO
[    1.973397] remoteproc remoteproc2: 4080000.remoteproc is available
[    1.976097] sdhci_msm 8804000.sdhci: TCXO clk not present (-2)
[    1.991236] cpu cpu0: EM: created perf domain
[    1.999243] cpu cpu4: EM: created perf domain
[    2.006161] qcom-qmp-phy 1c06000.phy: Registered Qcom-QMP phy
[    2.012325] qcom-qmp-phy 1c0a000.phy: Registered Qcom-QMP phy
[    2.018368] qcom-qmp-phy 1d87000.phy: Registered Qcom-QMP phy
[    2.021541] mmc0: SDHCI controller on 8804000.sdhci [8804000.sdhci] using ADMA 64-bit
[    2.024451] qcom-qmp-phy 88e9000.phy: Registered Qcom-QMP phy
[    2.038244] qcom-qmp-phy 88eb000.phy: Registered Qcom-QMP phy
[    2.044284] qcom-qusb2-phy 88e2000.phy: Registered Qcom-QUSB2 phy
[    2.050633] qcom-qusb2-phy 88e3000.phy: Registered Qcom-QUSB2 phy
[    2.057089] dw-pcie 1c00000.pci: Adding to iommu group 7
[    2.062619] dw-pcie 1c00000.pci: IRQ index 1 not found
[    2.068312] dw-pcie 1c08000.pci: Adding to iommu group 8
[    2.073856] dw-pcie 1c08000.pci: IRQ index 1 not found
[    2.198106] Bluetooth: hci0: setting up wcn399x
[    2.261308] lt9611 10-003b: LT9611 revision: 0xe1
[    2.279914] msm ae00000.mdss: bound ae01000.mdp (ops dpu_ops)
[    2.285798] msm_dsi ae94000.dsi: supply gdsc not found, using dummy regulator
[    2.293518] msm ae00000.mdss: bound ae94000.dsi (ops dsi_ops)
[    2.300219] adreno 5000000.gpu: supply vdd not found, using dummy regulator
[    2.307354] adreno 5000000.gpu: supply vddcx not found, using dummy regulator
[    2.314635] adreno 5000000.gpu: [drm:msm_gpu_init] *ERROR* Couldn't register GPU cooling device
[    2.324199] platform 506a000.gmu: Adding to iommu group 9
[    2.330186] msm ae00000.mdss: bound 5000000.gpu (ops a3xx_ops)
[    2.336864] [drm:dpu_kms_hw_init:1005] dpu hardware revision:0x40000000
[    2.343740] Unable to handle kernel execute from non-executable memory at virtual address ffff3f8a45696c00
[    2.353512] Mem abort info:
[    2.356371]   ESR = 0x8600000f
[    2.359545]   EC = 0x21: IABT (current EL), IL = 32 bits
[    2.364935]   SET = 0, FnV = 0
[    2.368035]   EA = 0, S1PTW = 0
[    2.371297] swapper pgtable: 4k pages, 48-bit VAs, pgdp=00000000a1598000
[    2.378134] [ffff3f8a45696c00] pgd=180000017df97003, p4d=180000017df97003, pud=180000017dc1e003, pmd=180000017dbf2003, pte=0068000105696f07
[    2.390835] Internal error: Oops: 8600000f [#1] PREEMPT SMP
[    2.396549] Modules linked in:
[    2.399711] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.13.0-rc3-00115-g9b6193ea776c-dirty #142
[    2.408576] Hardware name: Thundercomm Dragonboard 845c (DT)
[    2.414368] pstate: 404000c5 (nZcv daIF +PAN -UAO -TCO BTYPE=--)
[    2.420486] pc : 0xffff3f8a45696c00
[    2.424209] lr : msm_fault_handler+0x50/0xd0
[    2.428797] sp : ffff800010003e20
[    2.432225] x29: ffff800010003e20 x28: ffffc0d80c0f2c40 x27: 0000000099f8f000
[    2.439505] x26: ffffc0d80bbb0e78 x25: ffffc0d800000880 x24: 000000009d969800
[    2.446760] x23: 00000000001a0020 x22: ffff3f8a4ba04e58 x21: 0000000000000000
[    2.454037] x20: 000000009d969800 x19: ffff3f8a4ba42b80 x18: ffffffffffffffff
[    2.461304] x17: 000c0400bb44ffff x16: 004000b2b5503510 x15: 0000000000000000
[    2.468575] x14: ffffc0d80c0f2c40 x13: ffff7eb2b1897000 x12: 0000000000000040
[    2.475868] x11: ffff3f8a405426c8 x10: ffff3f8a405426ca x9 : ffffc0d80c1720e0
[    2.483128] x8 : ffff3f8a40542938 x7 : 0000000000000000 x6 : ffff3f8a405429b0
[    2.490387] x5 : ffffc0d80af089f0 x4 : ffff3f8a4ba42b80 x3 : 0000000000000000
[    2.497669] x2 : ffff3f8a45696c00 x1 : ffff800010003e50 x0 : 0000000100000000
[    2.504948] Call trace:
[    2.507507]  0xffff3f8a45696c00
[    2.510744]  report_iommu_fault+0x20/0x3c
[    2.514942]  arm_smmu_context_fault+0x120/0x250
[    2.519584]  __handle_irq_event_percpu+0x54/0x170
[    2.524435]  handle_irq_event+0x64/0x140
[    2.528483]  handle_fasteoi_irq+0xa4/0x1a0
[    2.532665]  __handle_domain_irq+0x7c/0xe0
[    2.536878]  gic_handle_irq+0xbc/0x140
[    2.540848]  el1_irq+0xbc/0x154
[    2.544099]  cpuidle_enter_state+0x12c/0x2f0
[    2.548599]  cpuidle_enter+0x38/0x50
[    2.552240]  do_idle+0x21c/0x2ac
[    2.555567]  cpu_startup_entry+0x24/0x70
[    2.559541]  rest_init+0xd8/0xe8
[    2.562930]  arch_call_rest_init+0x10/0x1c
[    2.567214]  start_kernel+0x4f4/0x52c
[    2.571065] Code: ffffffff ffffffff ffffffff ffffffff (4699c010) 
[    2.577373] ---[ end trace df19a8ce39b5e100 ]---
[    2.582175] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[    2.589200] SMP: stopping secondary CPUs
[    2.594616] Kernel Offset: 0x40d7fa800000 from 0xffff800010000000
[    2.600755] PHYS_OFFSET: 0xffffc076c0000000
[    2.604968] CPU features: 0x00000211,23300e46
[    2.609369] Memory Limit: none
[    2.612471] ---[ end Kernel panic - not syncing: Oops: Fatal exception in interrupt ]---
Waiting for ssh to finish

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling
  2021-07-04 12:53 ` [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Dmitry Baryshkov
@ 2021-07-04 18:20   ` Rob Clark
  2021-07-06 21:36     ` Bjorn Andersson
  2021-07-07  5:12     ` John Stultz
  0 siblings, 2 replies; 19+ messages in thread
From: Rob Clark @ 2021-07-04 18:20 UTC (permalink / raw)
  To: Dmitry Baryshkov
  Cc: dri-devel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	freedreno, linux-arm-msm, Jordan Crouse, Rob Clark,
	Akhil P Oommen, AngeloGioacchino Del Regno, Bjorn Andersson,
	Douglas Anderson, Eric Anholt, Isaac J. Manjarres, Joerg Roedel,
	John Stultz, Jonathan Marek, Konrad Dybcio, Krishna Reddy,
	Kristian H. Kristensen, Lee Jones,
	moderated list:ARM SMMU DRIVERS, open list, Marijn Suijten,
	Robin Murphy, Sai Prakash Ranjan, Sharat Masetty, Will Deacon,
	Zhenzhong Duan

I suspect you are getting a dpu fault, and need:

https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/

I suppose Bjorn was expecting me to send that patch

BR,
-R

On Sun, Jul 4, 2021 at 5:53 AM Dmitry Baryshkov
<dmitry.baryshkov@linaro.org> wrote:
>
> Hi,
>
> I've had splash screen disabled on my RB3. However once I've enabled it,
> I've got the attached crash during the boot on the msm/msm-next. It
> looks like it is related to this particular set of changes.
>
> On 11/06/2021 00:44, Rob Clark wrote:
> > From: Rob Clark <robdclark@chromium.org>
> >
> > This picks up an earlier series[1] from Jordan, and adds additional
> > support needed to generate GPU devcore dumps on iova faults.  Original
> > description:
> >
> > This is a stack to add an Adreno GPU specific handler for pagefaults. The first
> > patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
> > a adreno-smmu-priv function hook to capture a handful of important debugging
> > registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
> > third patch to print more detailed information on page fault such as the TTBR0
> > for the pagetable that caused the fault and the source of the fault as
> > determined by a combination of the FSYNR1 register and an internal GPU
> > register.
> >
> > This code provides a solid base that we can expand on later for even more
> > extensive GPU side page fault debugging capabilities.
> >
> > v5: [Rob] Use RBBM_STATUS3.SMMU_STALLED_ON_FAULT to detect case where
> >      GPU snapshotting needs to avoid crashdumper, and check the
> >      RBBM_STATUS3.SMMU_STALLED_ON_FAULT in GPU hang irq paths
> > v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
> >      resume translation after it has had a chance to snapshot the GPUs
> >      state
> > v3: Always clear FSR even if the target driver is going to handle resume
> > v2: Fix comment wording and function pointer check per Rob Clark
> >
> > [1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcrouse@codeaurora.org/
> >
> > Jordan Crouse (3):
> >    iommu/arm-smmu: Add support for driver IOMMU fault handlers
> >    iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
> >      info
> >    drm/msm: Improve the a6xx page fault handler
> >
> > Rob Clark (2):
> >    iommu/arm-smmu-qcom: Add stall support
> >    drm/msm: devcoredump iommu fault support
> >
> >   drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  23 +++-
> >   drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 110 +++++++++++++++++++-
> >   drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  42 ++++++--
> >   drivers/gpu/drm/msm/adreno/adreno_gpu.c     |  15 +++
> >   drivers/gpu/drm/msm/msm_gem.h               |   1 +
> >   drivers/gpu/drm/msm/msm_gem_submit.c        |   1 +
> >   drivers/gpu/drm/msm/msm_gpu.c               |  48 +++++++++
> >   drivers/gpu/drm/msm/msm_gpu.h               |  17 +++
> >   drivers/gpu/drm/msm/msm_gpummu.c            |   5 +
> >   drivers/gpu/drm/msm/msm_iommu.c             |  22 +++-
> >   drivers/gpu/drm/msm/msm_mmu.h               |   5 +-
> >   drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 +++++++++
> >   drivers/iommu/arm/arm-smmu/arm-smmu.c       |   9 +-
> >   drivers/iommu/arm/arm-smmu/arm-smmu.h       |   2 +
> >   include/linux/adreno-smmu-priv.h            |  38 ++++++-
> >   15 files changed, 367 insertions(+), 21 deletions(-)
> >
>
>
> --
> With best wishes
> Dmitry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling
  2021-07-04 18:20   ` Rob Clark
@ 2021-07-06 21:36     ` Bjorn Andersson
  2021-07-07  5:12     ` John Stultz
  1 sibling, 0 replies; 19+ messages in thread
From: Bjorn Andersson @ 2021-07-06 21:36 UTC (permalink / raw)
  To: Rob Clark
  Cc: Dmitry Baryshkov, dri-devel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	freedreno, linux-arm-msm, Jordan Crouse, Rob Clark,
	Akhil P Oommen, AngeloGioacchino Del Regno, Douglas Anderson,
	Eric Anholt, Isaac J. Manjarres, Joerg Roedel, John Stultz,
	Jonathan Marek, Konrad Dybcio, Krishna Reddy,
	Kristian H. Kristensen, Lee Jones,
	moderated list:ARM SMMU DRIVERS, open list, Marijn Suijten,
	Robin Murphy, Sai Prakash Ranjan, Sharat Masetty, Will Deacon,
	Zhenzhong Duan

On Sun 04 Jul 13:20 CDT 2021, Rob Clark wrote:

> I suspect you are getting a dpu fault, and need:
> 
> https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/
> 
> I suppose Bjorn was expecting me to send that patch
> 

No, I left that discussion with the same understanding as you... But I
ended up side tracked by some other craziness.

Did you post this somewhere or would you still like me to test it and
spin a patch?

Regards,
Bjorn

> BR,
> -R
> 
> On Sun, Jul 4, 2021 at 5:53 AM Dmitry Baryshkov
> <dmitry.baryshkov@linaro.org> wrote:
> >
> > Hi,
> >
> > I've had splash screen disabled on my RB3. However once I've enabled it,
> > I've got the attached crash during the boot on the msm/msm-next. It
> > looks like it is related to this particular set of changes.
> >
> > On 11/06/2021 00:44, Rob Clark wrote:
> > > From: Rob Clark <robdclark@chromium.org>
> > >
> > > This picks up an earlier series[1] from Jordan, and adds additional
> > > support needed to generate GPU devcore dumps on iova faults.  Original
> > > description:
> > >
> > > This is a stack to add an Adreno GPU specific handler for pagefaults. The first
> > > patch starts by wiring up report_iommu_fault for arm-smmu. The next patch adds
> > > a adreno-smmu-priv function hook to capture a handful of important debugging
> > > registers such as TTBR0, CONTEXTIDR, FSYNR0 and others. This is used by the
> > > third patch to print more detailed information on page fault such as the TTBR0
> > > for the pagetable that caused the fault and the source of the fault as
> > > determined by a combination of the FSYNR1 register and an internal GPU
> > > register.
> > >
> > > This code provides a solid base that we can expand on later for even more
> > > extensive GPU side page fault debugging capabilities.
> > >
> > > v5: [Rob] Use RBBM_STATUS3.SMMU_STALLED_ON_FAULT to detect case where
> > >      GPU snapshotting needs to avoid crashdumper, and check the
> > >      RBBM_STATUS3.SMMU_STALLED_ON_FAULT in GPU hang irq paths
> > > v4: [Rob] Add support to stall SMMU on fault, and let the GPU driver
> > >      resume translation after it has had a chance to snapshot the GPUs
> > >      state
> > > v3: Always clear FSR even if the target driver is going to handle resume
> > > v2: Fix comment wording and function pointer check per Rob Clark
> > >
> > > [1] https://lore.kernel.org/dri-devel/20210225175135.91922-1-jcrouse@codeaurora.org/
> > >
> > > Jordan Crouse (3):
> > >    iommu/arm-smmu: Add support for driver IOMMU fault handlers
> > >    iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault
> > >      info
> > >    drm/msm: Improve the a6xx page fault handler
> > >
> > > Rob Clark (2):
> > >    iommu/arm-smmu-qcom: Add stall support
> > >    drm/msm: devcoredump iommu fault support
> > >
> > >   drivers/gpu/drm/msm/adreno/a5xx_gpu.c       |  23 +++-
> > >   drivers/gpu/drm/msm/adreno/a6xx_gpu.c       | 110 +++++++++++++++++++-
> > >   drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c |  42 ++++++--
> > >   drivers/gpu/drm/msm/adreno/adreno_gpu.c     |  15 +++
> > >   drivers/gpu/drm/msm/msm_gem.h               |   1 +
> > >   drivers/gpu/drm/msm/msm_gem_submit.c        |   1 +
> > >   drivers/gpu/drm/msm/msm_gpu.c               |  48 +++++++++
> > >   drivers/gpu/drm/msm/msm_gpu.h               |  17 +++
> > >   drivers/gpu/drm/msm/msm_gpummu.c            |   5 +
> > >   drivers/gpu/drm/msm/msm_iommu.c             |  22 +++-
> > >   drivers/gpu/drm/msm/msm_mmu.h               |   5 +-
> > >   drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c  |  50 +++++++++
> > >   drivers/iommu/arm/arm-smmu/arm-smmu.c       |   9 +-
> > >   drivers/iommu/arm/arm-smmu/arm-smmu.h       |   2 +
> > >   include/linux/adreno-smmu-priv.h            |  38 ++++++-
> > >   15 files changed, 367 insertions(+), 21 deletions(-)
> > >
> >
> >
> > --
> > With best wishes
> > Dmitry

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling
  2021-07-04 18:20   ` Rob Clark
  2021-07-06 21:36     ` Bjorn Andersson
@ 2021-07-07  5:12     ` John Stultz
  2021-07-07 17:38       ` Rob Clark
  1 sibling, 1 reply; 19+ messages in thread
From: John Stultz @ 2021-07-07  5:12 UTC (permalink / raw)
  To: Rob Clark
  Cc: Dmitry Baryshkov, dri-devel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	freedreno, linux-arm-msm, Jordan Crouse, Rob Clark,
	Akhil P Oommen, AngeloGioacchino Del Regno, Bjorn Andersson,
	Douglas Anderson, Eric Anholt, Isaac J. Manjarres, Joerg Roedel,
	Jonathan Marek, Konrad Dybcio, Krishna Reddy,
	Kristian H. Kristensen, Lee Jones,
	moderated list:ARM SMMU DRIVERS, open list, Marijn Suijten,
	Robin Murphy, Sai Prakash Ranjan, Sharat Masetty, Will Deacon,
	Zhenzhong Duan

On Sun, Jul 4, 2021 at 11:16 AM Rob Clark <robdclark@gmail.com> wrote:
>
> I suspect you are getting a dpu fault, and need:
>
> https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/
>
> I suppose Bjorn was expecting me to send that patch

If it's helpful, I applied that and it got the db845c booting mainline
again for me (along with some reverts for a separate ext4 shrinker
crash).
Tested-by: John Stultz <john.stultz@linaro.org>

thanks
-john

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling
  2021-07-07  5:12     ` John Stultz
@ 2021-07-07 17:38       ` Rob Clark
  0 siblings, 0 replies; 19+ messages in thread
From: Rob Clark @ 2021-07-07 17:38 UTC (permalink / raw)
  To: John Stultz
  Cc: Rob Clark, Dmitry Baryshkov, dri-devel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	freedreno, linux-arm-msm, Jordan Crouse, Akhil P Oommen,
	AngeloGioacchino Del Regno, Bjorn Andersson, Douglas Anderson,
	Eric Anholt, Isaac J. Manjarres, Joerg Roedel, Jonathan Marek,
	Konrad Dybcio, Krishna Reddy, Kristian H. Kristensen, Lee Jones,
	moderated list:ARM SMMU DRIVERS, open list, Marijn Suijten,
	Robin Murphy, Sai Prakash Ranjan, Sharat Masetty, Will Deacon,
	Zhenzhong Duan

On Tue, Jul 6, 2021 at 10:12 PM John Stultz <john.stultz@linaro.org> wrote:
>
> On Sun, Jul 4, 2021 at 11:16 AM Rob Clark <robdclark@gmail.com> wrote:
> >
> > I suspect you are getting a dpu fault, and need:
> >
> > https://lore.kernel.org/linux-arm-msm/CAF6AEGvTjTUQXqom-xhdh456tdLscbVFPQ+iud1H1gHc8A2=hA@mail.gmail.com/
> >
> > I suppose Bjorn was expecting me to send that patch
>
> If it's helpful, I applied that and it got the db845c booting mainline
> again for me (along with some reverts for a separate ext4 shrinker
> crash).
> Tested-by: John Stultz <john.stultz@linaro.org>
>

Thanks, I'll send a patch shortly

BR,
-R

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2021-07-07 17:34 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-10 21:44 [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Rob Clark
2021-06-10 21:44 ` [PATCH v5 1/5] iommu/arm-smmu: Add support for driver IOMMU fault handlers Rob Clark
2021-06-14 17:26   ` Bjorn Andersson
2021-06-10 21:44 ` [PATCH v5 2/5] iommu/arm-smmu-qcom: Add an adreno-smmu-priv callback to get pagefault info Rob Clark
2021-06-14 17:30   ` Bjorn Andersson
2021-06-10 21:44 ` [PATCH v5 3/5] drm/msm: Improve the a6xx page fault handler Rob Clark
2021-06-14 17:46   ` Bjorn Andersson
2021-06-25  3:39   ` Bjorn Andersson
2021-06-25 15:42     ` Rob Clark
2021-06-10 21:44 ` [PATCH v5 4/5] iommu/arm-smmu-qcom: Add stall support Rob Clark
2021-06-11 13:49   ` Jordan Crouse
2021-06-14 17:54   ` Bjorn Andersson
2021-06-10 21:44 ` [PATCH v5 5/5] drm/msm: devcoredump iommu fault support Rob Clark
2021-06-11 13:49   ` Jordan Crouse
2021-07-04 12:53 ` [PATCH v5 0/5] iommu/arm-smmu: adreno-smmu page fault handling Dmitry Baryshkov
2021-07-04 18:20   ` Rob Clark
2021-07-06 21:36     ` Bjorn Andersson
2021-07-07  5:12     ` John Stultz
2021-07-07 17:38       ` Rob Clark

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).