All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/4] iommu/arm-smmu: Fixes for 4.8
@ 2016-08-18 13:05 ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi all,

Here's the current pile of ARM SMMU fixes I have queued for 4.8. I plan
to send a pull to Joerg tomorrow afternoon.

Cheers,

Will

--->8

Robin Murphy (1):
  iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks

Will Deacon (3):
  iommu/arm-smmu: Fix CMDQ error handling
  iommu/arm-smmu: Disable stalling faults for all endpoints
  iommu/arm-smmu: Don't BUG() if we find aborting STEs with
    disable_bypass

 drivers/iommu/arm-smmu-v3.c        |  7 +++++--
 drivers/iommu/arm-smmu.c           | 34 +++++++---------------------------
 drivers/iommu/io-pgtable-arm-v7s.c |  4 +++-
 3 files changed, 15 insertions(+), 30 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 0/4] iommu/arm-smmu: Fixes for 4.8
@ 2016-08-18 13:05 ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

Hi all,

Here's the current pile of ARM SMMU fixes I have queued for 4.8. I plan
to send a pull to Joerg tomorrow afternoon.

Cheers,

Will

--->8

Robin Murphy (1):
  iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks

Will Deacon (3):
  iommu/arm-smmu: Fix CMDQ error handling
  iommu/arm-smmu: Disable stalling faults for all endpoints
  iommu/arm-smmu: Don't BUG() if we find aborting STEs with
    disable_bypass

 drivers/iommu/arm-smmu-v3.c        |  7 +++++--
 drivers/iommu/arm-smmu.c           | 34 +++++++---------------------------
 drivers/iommu/io-pgtable-arm-v7s.c |  4 +++-
 3 files changed, 15 insertions(+), 30 deletions(-)

-- 
2.1.4

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 1/4] iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks
  2016-08-18 13:05 ` Will Deacon
  (?)
@ 2016-08-18 13:05   ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu; +Cc: linux-arm-kernel, robin.murphy, joro, stable, Will Deacon

From: Robin Murphy <robin.murphy@arm.com>

Due to the attribute bits being all over the place in the different
types of short-descriptor PTEs, when remapping an existing entry, e.g.
splitting a section into pages, we take the approach of decomposing
the PTE attributes back to the IOMMU API flags to start from scratch.

On inspection, though, the existing code seems to have got the read-only
bit backwards and ignored the XN bit. How embarrassing...

Fortunately the primary user so far, the Mediatek IOMMU, both never
splits blocks (because it only serves non-overlapping DMA API calls) and
also ignores permissions anyway, but let's put things right before any
future users trip up.

Cc: <stable@vger.kernel.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 8c6139986d7d..def8ca1c982d 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -286,12 +286,14 @@ static int arm_v7s_pte_to_prot(arm_v7s_iopte pte, int lvl)
 	int prot = IOMMU_READ;
 	arm_v7s_iopte attr = pte >> ARM_V7S_ATTR_SHIFT(lvl);
 
-	if (attr & ARM_V7S_PTE_AP_RDONLY)
+	if (!(attr & ARM_V7S_PTE_AP_RDONLY))
 		prot |= IOMMU_WRITE;
 	if ((attr & (ARM_V7S_TEX_MASK << ARM_V7S_TEX_SHIFT)) == 0)
 		prot |= IOMMU_MMIO;
 	else if (pte & ARM_V7S_ATTR_C)
 		prot |= IOMMU_CACHE;
+	if (pte & ARM_V7S_ATTR_XN(lvl))
+		prot |= IOMMU_NOEXEC;
 
 	return prot;
 }
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 1/4] iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks
@ 2016-08-18 13:05   ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu; +Cc: linux-arm-kernel, robin.murphy, joro, stable, Will Deacon

From: Robin Murphy <robin.murphy@arm.com>

Due to the attribute bits being all over the place in the different
types of short-descriptor PTEs, when remapping an existing entry, e.g.
splitting a section into pages, we take the approach of decomposing
the PTE attributes back to the IOMMU API flags to start from scratch.

On inspection, though, the existing code seems to have got the read-only
bit backwards and ignored the XN bit. How embarrassing...

Fortunately the primary user so far, the Mediatek IOMMU, both never
splits blocks (because it only serves non-overlapping DMA API calls) and
also ignores permissions anyway, but let's put things right before any
future users trip up.

Cc: <stable@vger.kernel.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 8c6139986d7d..def8ca1c982d 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -286,12 +286,14 @@ static int arm_v7s_pte_to_prot(arm_v7s_iopte pte, int lvl)
 	int prot = IOMMU_READ;
 	arm_v7s_iopte attr = pte >> ARM_V7S_ATTR_SHIFT(lvl);
 
-	if (attr & ARM_V7S_PTE_AP_RDONLY)
+	if (!(attr & ARM_V7S_PTE_AP_RDONLY))
 		prot |= IOMMU_WRITE;
 	if ((attr & (ARM_V7S_TEX_MASK << ARM_V7S_TEX_SHIFT)) == 0)
 		prot |= IOMMU_MMIO;
 	else if (pte & ARM_V7S_ATTR_C)
 		prot |= IOMMU_CACHE;
+	if (pte & ARM_V7S_ATTR_XN(lvl))
+		prot |= IOMMU_NOEXEC;
 
 	return prot;
 }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 1/4] iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks
@ 2016-08-18 13:05   ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

From: Robin Murphy <robin.murphy@arm.com>

Due to the attribute bits being all over the place in the different
types of short-descriptor PTEs, when remapping an existing entry, e.g.
splitting a section into pages, we take the approach of decomposing
the PTE attributes back to the IOMMU API flags to start from scratch.

On inspection, though, the existing code seems to have got the read-only
bit backwards and ignored the XN bit. How embarrassing...

Fortunately the primary user so far, the Mediatek IOMMU, both never
splits blocks (because it only serves non-overlapping DMA API calls) and
also ignores permissions anyway, but let's put things right before any
future users trip up.

Cc: <stable@vger.kernel.org>
Signed-off-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/io-pgtable-arm-v7s.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/io-pgtable-arm-v7s.c b/drivers/iommu/io-pgtable-arm-v7s.c
index 8c6139986d7d..def8ca1c982d 100644
--- a/drivers/iommu/io-pgtable-arm-v7s.c
+++ b/drivers/iommu/io-pgtable-arm-v7s.c
@@ -286,12 +286,14 @@ static int arm_v7s_pte_to_prot(arm_v7s_iopte pte, int lvl)
 	int prot = IOMMU_READ;
 	arm_v7s_iopte attr = pte >> ARM_V7S_ATTR_SHIFT(lvl);
 
-	if (attr & ARM_V7S_PTE_AP_RDONLY)
+	if (!(attr & ARM_V7S_PTE_AP_RDONLY))
 		prot |= IOMMU_WRITE;
 	if ((attr & (ARM_V7S_TEX_MASK << ARM_V7S_TEX_SHIFT)) == 0)
 		prot |= IOMMU_MMIO;
 	else if (pte & ARM_V7S_ATTR_C)
 		prot |= IOMMU_CACHE;
+	if (pte & ARM_V7S_ATTR_XN(lvl))
+		prot |= IOMMU_NOEXEC;
 
 	return prot;
 }
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/4] iommu/arm-smmu: Fix CMDQ error handling
  2016-08-18 13:05 ` Will Deacon
  (?)
@ 2016-08-18 13:05   ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu; +Cc: linux-arm-kernel, robin.murphy, joro, Will Deacon, stable

In the unlikely event of a global command queue error, the ARM SMMUv3
driver attempts to convert the problematic command into a CMD_SYNC and
resume the command queue. Unfortunately, this code is pretty badly
broken:

  1. It uses the index into the error string table as the CMDQ index,
     so we probably read the wrong entry out of the queue

  2. The arguments to queue_write are the wrong way round, so we end up
     writing from the queue onto the stack.

These happily cancel out, so the kernel is likely to stay alive, but
the command queue will probably fault again when we resume.

This patch fixes the error handling code to use the correct queue index
and write back the CMD_SYNC to the faulting entry.

Cc: <stable@vger.kernel.org>
Reported-by: Diwakar Subraveti <Diwakar.Subraveti@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ce801170d5f2..330623f8e344 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -879,7 +879,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 	 * We may have concurrent producers, so we need to be careful
 	 * not to touch any of the shadow cmdq state.
 	 */
-	queue_read(cmd, Q_ENT(q, idx), q->ent_dwords);
+	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
 	dev_err(smmu->dev, "skipping command in error state:\n");
 	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
 		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
@@ -890,7 +890,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 		return;
 	}
 
-	queue_write(cmd, Q_ENT(q, idx), q->ent_dwords);
+	queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
 }
 
 static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/4] iommu/arm-smmu: Fix CMDQ error handling
@ 2016-08-18 13:05   ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu; +Cc: linux-arm-kernel, robin.murphy, joro, Will Deacon, stable

In the unlikely event of a global command queue error, the ARM SMMUv3
driver attempts to convert the problematic command into a CMD_SYNC and
resume the command queue. Unfortunately, this code is pretty badly
broken:

  1. It uses the index into the error string table as the CMDQ index,
     so we probably read the wrong entry out of the queue

  2. The arguments to queue_write are the wrong way round, so we end up
     writing from the queue onto the stack.

These happily cancel out, so the kernel is likely to stay alive, but
the command queue will probably fault again when we resume.

This patch fixes the error handling code to use the correct queue index
and write back the CMD_SYNC to the faulting entry.

Cc: <stable@vger.kernel.org>
Reported-by: Diwakar Subraveti <Diwakar.Subraveti@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ce801170d5f2..330623f8e344 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -879,7 +879,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 	 * We may have concurrent producers, so we need to be careful
 	 * not to touch any of the shadow cmdq state.
 	 */
-	queue_read(cmd, Q_ENT(q, idx), q->ent_dwords);
+	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
 	dev_err(smmu->dev, "skipping command in error state:\n");
 	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
 		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
@@ -890,7 +890,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 		return;
 	}
 
-	queue_write(cmd, Q_ENT(q, idx), q->ent_dwords);
+	queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
 }
 
 static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 2/4] iommu/arm-smmu: Fix CMDQ error handling
@ 2016-08-18 13:05   ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

In the unlikely event of a global command queue error, the ARM SMMUv3
driver attempts to convert the problematic command into a CMD_SYNC and
resume the command queue. Unfortunately, this code is pretty badly
broken:

  1. It uses the index into the error string table as the CMDQ index,
     so we probably read the wrong entry out of the queue

  2. The arguments to queue_write are the wrong way round, so we end up
     writing from the queue onto the stack.

These happily cancel out, so the kernel is likely to stay alive, but
the command queue will probably fault again when we resume.

This patch fixes the error handling code to use the correct queue index
and write back the CMD_SYNC to the faulting entry.

Cc: <stable@vger.kernel.org>
Reported-by: Diwakar Subraveti <Diwakar.Subraveti@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index ce801170d5f2..330623f8e344 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -879,7 +879,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 	 * We may have concurrent producers, so we need to be careful
 	 * not to touch any of the shadow cmdq state.
 	 */
-	queue_read(cmd, Q_ENT(q, idx), q->ent_dwords);
+	queue_read(cmd, Q_ENT(q, cons), q->ent_dwords);
 	dev_err(smmu->dev, "skipping command in error state:\n");
 	for (i = 0; i < ARRAY_SIZE(cmd); ++i)
 		dev_err(smmu->dev, "\t0x%016llx\n", (unsigned long long)cmd[i]);
@@ -890,7 +890,7 @@ static void arm_smmu_cmdq_skip_err(struct arm_smmu_device *smmu)
 		return;
 	}
 
-	queue_write(cmd, Q_ENT(q, idx), q->ent_dwords);
+	queue_write(Q_ENT(q, cons), cmd, q->ent_dwords);
 }
 
 static void arm_smmu_cmdq_issue_cmd(struct arm_smmu_device *smmu,
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-08-18 13:05 ` Will Deacon
  (?)
@ 2016-08-18 13:05   ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu; +Cc: linux-arm-kernel, robin.murphy, joro, Will Deacon, stable

Enabling stalling faults can result in hardware deadlock on poorly
designed systems, particularly those with a PCI root complex upstream of
the SMMU.

Although it's not really Linux's job to save hardware integrators from
their own misfortune, it *is* our job to stop userspace (e.g. VFIO
clients) from hosing the system for everybody else, even if they might
already be required to have elevated privileges.

Given that the fault handling code currently executes entirely in IRQ
context, there is nothing that can sensibly be done to recover from
things like page faults anyway, so let's rip this code out for now and
avoid the potential for deadlock.

Cc: <stable@vger.kernel.org>
Reported-by: Matt Evans <matt.evans@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 4f49fe29f202..2db74ebc3240 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
 
 static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 {
-	int flags, ret;
-	u32 fsr, fsynr, resume;
+	u32 fsr, fsynr;
 	unsigned long iova;
 	struct iommu_domain *domain = dev;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	if (!(fsr & FSR_FAULT))
 		return IRQ_NONE;
 
-	if (fsr & FSR_IGN)
-		dev_err_ratelimited(smmu->dev,
-				    "Unexpected context fault (fsr 0x%x)\n",
-				    fsr);
-
 	fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
-	flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
-
 	iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
-	if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
-		ret = IRQ_HANDLED;
-		resume = RESUME_RETRY;
-	} else {
-		dev_err_ratelimited(smmu->dev,
-		    "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
-		    iova, fsynr, cfg->cbndx);
-		ret = IRQ_NONE;
-		resume = RESUME_TERMINATE;
-	}
-
-	/* Clear the faulting FSR */
-	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
 
-	/* Retry or terminate any stalled transactions */
-	if (fsr & FSR_SS)
-		writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
+	dev_err_ratelimited(smmu->dev,
+	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
+			    fsr, iova, fsynr, cfg->cbndx);
 
-	return ret;
+	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
@@ -837,7 +817,7 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
 	}
 
 	/* SCTLR */
-	reg = SCTLR_CFCFG | SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
+	reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
 	if (stage1)
 		reg |= SCTLR_S1_ASIDPNE;
 #ifdef __BIG_ENDIAN
-- 
2.1.4


^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-08-18 13:05   ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu; +Cc: linux-arm-kernel, robin.murphy, joro, Will Deacon, stable

Enabling stalling faults can result in hardware deadlock on poorly
designed systems, particularly those with a PCI root complex upstream of
the SMMU.

Although it's not really Linux's job to save hardware integrators from
their own misfortune, it *is* our job to stop userspace (e.g. VFIO
clients) from hosing the system for everybody else, even if they might
already be required to have elevated privileges.

Given that the fault handling code currently executes entirely in IRQ
context, there is nothing that can sensibly be done to recover from
things like page faults anyway, so let's rip this code out for now and
avoid the potential for deadlock.

Cc: <stable@vger.kernel.org>
Reported-by: Matt Evans <matt.evans@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 4f49fe29f202..2db74ebc3240 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
 
 static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 {
-	int flags, ret;
-	u32 fsr, fsynr, resume;
+	u32 fsr, fsynr;
 	unsigned long iova;
 	struct iommu_domain *domain = dev;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	if (!(fsr & FSR_FAULT))
 		return IRQ_NONE;
 
-	if (fsr & FSR_IGN)
-		dev_err_ratelimited(smmu->dev,
-				    "Unexpected context fault (fsr 0x%x)\n",
-				    fsr);
-
 	fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
-	flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
-
 	iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
-	if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
-		ret = IRQ_HANDLED;
-		resume = RESUME_RETRY;
-	} else {
-		dev_err_ratelimited(smmu->dev,
-		    "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
-		    iova, fsynr, cfg->cbndx);
-		ret = IRQ_NONE;
-		resume = RESUME_TERMINATE;
-	}
-
-	/* Clear the faulting FSR */
-	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
 
-	/* Retry or terminate any stalled transactions */
-	if (fsr & FSR_SS)
-		writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
+	dev_err_ratelimited(smmu->dev,
+	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
+			    fsr, iova, fsynr, cfg->cbndx);
 
-	return ret;
+	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
@@ -837,7 +817,7 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
 	}
 
 	/* SCTLR */
-	reg = SCTLR_CFCFG | SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
+	reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
 	if (stage1)
 		reg |= SCTLR_S1_ASIDPNE;
 #ifdef __BIG_ENDIAN
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-08-18 13:05   ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

Enabling stalling faults can result in hardware deadlock on poorly
designed systems, particularly those with a PCI root complex upstream of
the SMMU.

Although it's not really Linux's job to save hardware integrators from
their own misfortune, it *is* our job to stop userspace (e.g. VFIO
clients) from hosing the system for everybody else, even if they might
already be required to have elevated privileges.

Given that the fault handling code currently executes entirely in IRQ
context, there is nothing that can sensibly be done to recover from
things like page faults anyway, so let's rip this code out for now and
avoid the potential for deadlock.

Cc: <stable@vger.kernel.org>
Reported-by: Matt Evans <matt.evans@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
 1 file changed, 7 insertions(+), 27 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 4f49fe29f202..2db74ebc3240 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
 
 static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 {
-	int flags, ret;
-	u32 fsr, fsynr, resume;
+	u32 fsr, fsynr;
 	unsigned long iova;
 	struct iommu_domain *domain = dev;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
 	if (!(fsr & FSR_FAULT))
 		return IRQ_NONE;
 
-	if (fsr & FSR_IGN)
-		dev_err_ratelimited(smmu->dev,
-				    "Unexpected context fault (fsr 0x%x)\n",
-				    fsr);
-
 	fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
-	flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
-
 	iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
-	if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
-		ret = IRQ_HANDLED;
-		resume = RESUME_RETRY;
-	} else {
-		dev_err_ratelimited(smmu->dev,
-		    "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
-		    iova, fsynr, cfg->cbndx);
-		ret = IRQ_NONE;
-		resume = RESUME_TERMINATE;
-	}
-
-	/* Clear the faulting FSR */
-	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
 
-	/* Retry or terminate any stalled transactions */
-	if (fsr & FSR_SS)
-		writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
+	dev_err_ratelimited(smmu->dev,
+	"Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
+			    fsr, iova, fsynr, cfg->cbndx);
 
-	return ret;
+	writel(fsr, cb_base + ARM_SMMU_CB_FSR);
+	return IRQ_HANDLED;
 }
 
 static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
@@ -837,7 +817,7 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
 	}
 
 	/* SCTLR */
-	reg = SCTLR_CFCFG | SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
+	reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
 	if (stage1)
 		reg |= SCTLR_S1_ASIDPNE;
 #ifdef __BIG_ENDIAN
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 4/4] iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass
  2016-08-18 13:05 ` Will Deacon
@ 2016-08-18 13:05     ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA
  Cc: Will Deacon, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

The disable_bypass cmdline option changes the SMMUv3 driver to put down
faulting stream table entries by default, as opposed to bypassing
transactions from unconfigured devices.

In this mode of operation, it is entirely expected to see aborting
entries in the stream table if and when we come to installing a valid
translation, so don't trigger a BUG() as a result of misdiagnosing these
entries as stream table corruption.

Tested-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
Reported-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
Reviewed-by: Robin Murphy <robin.murphy-5wv7dgnIgG8@public.gmane.org>
Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
---
 drivers/iommu/arm-smmu-v3.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 330623f8e344..641e88761319 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1034,6 +1034,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
 		case STRTAB_STE_0_CFG_S2_TRANS:
 			ste_live = true;
 			break;
+		case STRTAB_STE_0_CFG_ABORT:
+			if (disable_bypass)
+				break;
 		default:
 			BUG(); /* STE corruption */
 		}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* [PATCH 4/4] iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass
@ 2016-08-18 13:05     ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-08-18 13:05 UTC (permalink / raw)
  To: linux-arm-kernel

The disable_bypass cmdline option changes the SMMUv3 driver to put down
faulting stream table entries by default, as opposed to bypassing
transactions from unconfigured devices.

In this mode of operation, it is entirely expected to see aborting
entries in the stream table if and when we come to installing a valid
translation, so don't trigger a BUG() as a result of misdiagnosing these
entries as stream table corruption.

Tested-by: Robin Murphy <robin.murphy@arm.com>
Reported-by: Robin Murphy <robin.murphy@arm.com>
Reviewed-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 drivers/iommu/arm-smmu-v3.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
index 330623f8e344..641e88761319 100644
--- a/drivers/iommu/arm-smmu-v3.c
+++ b/drivers/iommu/arm-smmu-v3.c
@@ -1034,6 +1034,9 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_device *smmu, u32 sid,
 		case STRTAB_STE_0_CFG_S2_TRANS:
 			ste_live = true;
 			break;
+		case STRTAB_STE_0_CFG_ABORT:
+			if (disable_bypass)
+				break;
 		default:
 			BUG(); /* STE corruption */
 		}
-- 
2.1.4

^ permalink raw reply related	[flat|nested] 33+ messages in thread

* Re: [PATCH 0/4] iommu/arm-smmu: Fixes for 4.8
  2016-08-18 13:05 ` Will Deacon
@ 2016-08-18 16:52     ` Joerg Roedel
  -1 siblings, 0 replies; 33+ messages in thread
From: Joerg Roedel @ 2016-08-18 16:52 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

On Thu, Aug 18, 2016 at 02:05:38PM +0100, Will Deacon wrote:
> Here's the current pile of ARM SMMU fixes I have queued for 4.8. I plan
> to send a pull to Joerg tomorrow afternoon.

Please also add 'Fixes:' tags to the patches before sending the
pull-request.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 0/4] iommu/arm-smmu: Fixes for 4.8
@ 2016-08-18 16:52     ` Joerg Roedel
  0 siblings, 0 replies; 33+ messages in thread
From: Joerg Roedel @ 2016-08-18 16:52 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

On Thu, Aug 18, 2016 at 02:05:38PM +0100, Will Deacon wrote:
> Here's the current pile of ARM SMMU fixes I have queued for 4.8. I plan
> to send a pull to Joerg tomorrow afternoon.

Please also add 'Fixes:' tags to the patches before sending the
pull-request.

Thanks,

	Joerg

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-08-18 13:05   ` Will Deacon
@ 2016-12-06 23:30       ` Rob Clark
  -1 siblings, 0 replies; 33+ messages in thread
From: Rob Clark @ 2016-12-06 23:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-msm, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jordan Crouse, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
> Enabling stalling faults can result in hardware deadlock on poorly
> designed systems, particularly those with a PCI root complex upstream of
> the SMMU.
>
> Although it's not really Linux's job to save hardware integrators from
> their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> clients) from hosing the system for everybody else, even if they might
> already be required to have elevated privileges.
>
> Given that the fault handling code currently executes entirely in IRQ
> context, there is nothing that can sensibly be done to recover from
> things like page faults anyway, so let's rip this code out for now and
> avoid the potential for deadlock.

Hi Will,

so, I'd like to re-introduce this feature, I *guess* as some sort of
opt-in quirk (ie. disabled by default unless something in DT tells you
otherwise??  But I'm open to suggestions.  I'm not entirely sure what
hw was having problems due to this feature.)

On newer snapdragon devices we are using arm-smmu for the GPU, and
halting the GPU so the driver's fault handler can dump some GPU state
on faults is enormously helpful for debugging and tracking down where
in the gpu cmdstream the fault was triggered.  In addition, we will
eventually want the ability to update pagetables from fault handler
and resuming the faulting transition.

Some additional comments below..

> Cc: <stable-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
> Reported-by: Matt Evans <matt.evans-5wv7dgnIgG8@public.gmane.org>
> Signed-off-by: Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org>
> ---
>  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
>  1 file changed, 7 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 4f49fe29f202..2db74ebc3240 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
>
>  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>  {
> -       int flags, ret;
> -       u32 fsr, fsynr, resume;
> +       u32 fsr, fsynr;
>         unsigned long iova;
>         struct iommu_domain *domain = dev;
>         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>         if (!(fsr & FSR_FAULT))
>                 return IRQ_NONE;
>
> -       if (fsr & FSR_IGN)
> -               dev_err_ratelimited(smmu->dev,
> -                                   "Unexpected context fault (fsr 0x%x)\n",
> -                                   fsr);
> -
>         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> -
>         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> -               ret = IRQ_HANDLED;
> -               resume = RESUME_RETRY;
> -       } else {
> -               dev_err_ratelimited(smmu->dev,
> -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> -                   iova, fsynr, cfg->cbndx);

I would like to decouple this dev_err_ratelimit() print from the
RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
indicate by return from my fault handler whether to resume or
terminate.  But I already have my own ratelimted prints and would
prefer not to spam dmesg twice.

I'm thinking about report_iommu_fault() returning:

  0 => RESUME_RETRY
  -EFAULT => RESUME_TERMINATE but don't print
  anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print

thoughts?

> -               ret = IRQ_NONE;
> -               resume = RESUME_TERMINATE;
> -       }
> -
> -       /* Clear the faulting FSR */
> -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
>
> -       /* Retry or terminate any stalled transactions */
> -       if (fsr & FSR_SS)
> -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);

This might be a bug in qcom's implementation of the smmu spec, but
seems like we don't have SS bit set, yet we still require RESUME reg
to be written, otherwise gpu is perma-wedged.  Maybe topic for a
separate quirk?  I'm not sure if writing RESUME reg on other hw when
SS bit is not set is likely to cause problems?  If not I suppose we
could just unconditionally write it.

Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
in between debugging freedreno I'll try to put together some patches.

BR,
-R

> +       dev_err_ratelimited(smmu->dev,
> +       "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> +                           fsr, iova, fsynr, cfg->cbndx);
>
> -       return ret;
> +       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> +       return IRQ_HANDLED;
>  }
>
>  static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
> @@ -837,7 +817,7 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
>         }
>
>         /* SCTLR */
> -       reg = SCTLR_CFCFG | SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
> +       reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
>         if (stage1)
>                 reg |= SCTLR_S1_ASIDPNE;
>  #ifdef __BIG_ENDIAN
> --
> 2.1.4
>
> _______________________________________________
> iommu mailing list
> iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA@public.gmane.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-12-06 23:30       ` Rob Clark
  0 siblings, 0 replies; 33+ messages in thread
From: Rob Clark @ 2016-12-06 23:30 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> Enabling stalling faults can result in hardware deadlock on poorly
> designed systems, particularly those with a PCI root complex upstream of
> the SMMU.
>
> Although it's not really Linux's job to save hardware integrators from
> their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> clients) from hosing the system for everybody else, even if they might
> already be required to have elevated privileges.
>
> Given that the fault handling code currently executes entirely in IRQ
> context, there is nothing that can sensibly be done to recover from
> things like page faults anyway, so let's rip this code out for now and
> avoid the potential for deadlock.

Hi Will,

so, I'd like to re-introduce this feature, I *guess* as some sort of
opt-in quirk (ie. disabled by default unless something in DT tells you
otherwise??  But I'm open to suggestions.  I'm not entirely sure what
hw was having problems due to this feature.)

On newer snapdragon devices we are using arm-smmu for the GPU, and
halting the GPU so the driver's fault handler can dump some GPU state
on faults is enormously helpful for debugging and tracking down where
in the gpu cmdstream the fault was triggered.  In addition, we will
eventually want the ability to update pagetables from fault handler
and resuming the faulting transition.

Some additional comments below..

> Cc: <stable@vger.kernel.org>
> Reported-by: Matt Evans <matt.evans@arm.com>
> Signed-off-by: Will Deacon <will.deacon@arm.com>
> ---
>  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
>  1 file changed, 7 insertions(+), 27 deletions(-)
>
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 4f49fe29f202..2db74ebc3240 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
>
>  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>  {
> -       int flags, ret;
> -       u32 fsr, fsynr, resume;
> +       u32 fsr, fsynr;
>         unsigned long iova;
>         struct iommu_domain *domain = dev;
>         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>         if (!(fsr & FSR_FAULT))
>                 return IRQ_NONE;
>
> -       if (fsr & FSR_IGN)
> -               dev_err_ratelimited(smmu->dev,
> -                                   "Unexpected context fault (fsr 0x%x)\n",
> -                                   fsr);
> -
>         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> -
>         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> -               ret = IRQ_HANDLED;
> -               resume = RESUME_RETRY;
> -       } else {
> -               dev_err_ratelimited(smmu->dev,
> -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> -                   iova, fsynr, cfg->cbndx);

I would like to decouple this dev_err_ratelimit() print from the
RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
indicate by return from my fault handler whether to resume or
terminate.  But I already have my own ratelimted prints and would
prefer not to spam dmesg twice.

I'm thinking about report_iommu_fault() returning:

  0 => RESUME_RETRY
  -EFAULT => RESUME_TERMINATE but don't print
  anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print

thoughts?

> -               ret = IRQ_NONE;
> -               resume = RESUME_TERMINATE;
> -       }
> -
> -       /* Clear the faulting FSR */
> -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
>
> -       /* Retry or terminate any stalled transactions */
> -       if (fsr & FSR_SS)
> -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);

This might be a bug in qcom's implementation of the smmu spec, but
seems like we don't have SS bit set, yet we still require RESUME reg
to be written, otherwise gpu is perma-wedged.  Maybe topic for a
separate quirk?  I'm not sure if writing RESUME reg on other hw when
SS bit is not set is likely to cause problems?  If not I suppose we
could just unconditionally write it.

Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
in between debugging freedreno I'll try to put together some patches.

BR,
-R

> +       dev_err_ratelimited(smmu->dev,
> +       "Unhandled context fault: fsr=0x%x, iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> +                           fsr, iova, fsynr, cfg->cbndx);
>
> -       return ret;
> +       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> +       return IRQ_HANDLED;
>  }
>
>  static irqreturn_t arm_smmu_global_fault(int irq, void *dev)
> @@ -837,7 +817,7 @@ static void arm_smmu_init_context_bank(struct arm_smmu_domain *smmu_domain,
>         }
>
>         /* SCTLR */
> -       reg = SCTLR_CFCFG | SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
> +       reg = SCTLR_CFIE | SCTLR_CFRE | SCTLR_M | SCTLR_EAE_SBOP;
>         if (stage1)
>                 reg |= SCTLR_S1_ASIDPNE;
>  #ifdef __BIG_ENDIAN
> --
> 2.1.4
>
> _______________________________________________
> iommu mailing list
> iommu at lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-12-06 23:30       ` Rob Clark
@ 2016-12-07  0:00         ` Jordan Crouse
  -1 siblings, 0 replies; 33+ messages in thread
From: Jordan Crouse @ 2016-12-07  0:00 UTC (permalink / raw)
  To: Rob Clark
  Cc: Will Deacon, iommu, linux-arm-kernel, linux-arm-msm, Sricharan R

On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> Hi Will,
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.
> 
> Some additional comments below..
> 
> > Cc: <stable@vger.kernel.org>
> > Reported-by: Matt Evans <matt.evans@arm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
> >  1 file changed, 7 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index 4f49fe29f202..2db74ebc3240 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
> >
> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >  {
> > -       int flags, ret;
> > -       u32 fsr, fsynr, resume;
> > +       u32 fsr, fsynr;
> >         unsigned long iova;
> >         struct iommu_domain *domain = dev;
> >         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >         if (!(fsr & FSR_FAULT))
> >                 return IRQ_NONE;
> >
> > -       if (fsr & FSR_IGN)
> > -               dev_err_ratelimited(smmu->dev,
> > -                                   "Unexpected context fault (fsr 0x%x)\n",
> > -                                   fsr);
> > -
> >         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> > -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> > -
> >         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> > -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> > -               ret = IRQ_HANDLED;
> > -               resume = RESUME_RETRY;
> > -       } else {
> > -               dev_err_ratelimited(smmu->dev,
> > -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> > -                   iova, fsynr, cfg->cbndx);
> 
> I would like to decouple this dev_err_ratelimit() print from the
> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
> indicate by return from my fault handler whether to resume or
> terminate.  But I already have my own ratelimted prints and would
> prefer not to spam dmesg twice.
> 
> I'm thinking about report_iommu_fault() returning:
> 
>   0 => RESUME_RETRY
>   -EFAULT => RESUME_TERMINATE but don't print
>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
> 
> thoughts?
> 
> > -               ret = IRQ_NONE;
> > -               resume = RESUME_TERMINATE;
> > -       }
> > -
> > -       /* Clear the faulting FSR */
> > -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> >
> > -       /* Retry or terminate any stalled transactions */
> > -       if (fsr & FSR_SS)
> > -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
> 
> This might be a bug in qcom's implementation of the smmu spec, but
> seems like we don't have SS bit set, yet we still require RESUME reg
> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
> separate quirk?  I'm not sure if writing RESUME reg on other hw when
> SS bit is not set is likely to cause problems?  If not I suppose we
> could just unconditionally write it.
> 
> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
> in between debugging freedreno I'll try to put together some patches.

>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
the operation just gets terminated immediately and *then* we get notification
but by then the system keeps going.

I think SCTLR_HUPCF helps control that behavior (i.e. we don't go off faulting
through eternity) but I don't know how it works.

>From my very unlearned understanding I think we do want to set CFCFG and then
stall and let the interrupt handler decide to retry/terminate.

Jordan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-12-07  0:00         ` Jordan Crouse
  0 siblings, 0 replies; 33+ messages in thread
From: Jordan Crouse @ 2016-12-07  0:00 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> Hi Will,
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.
> 
> Some additional comments below..
> 
> > Cc: <stable@vger.kernel.org>
> > Reported-by: Matt Evans <matt.evans@arm.com>
> > Signed-off-by: Will Deacon <will.deacon@arm.com>
> > ---
> >  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
> >  1 file changed, 7 insertions(+), 27 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index 4f49fe29f202..2db74ebc3240 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
> >
> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >  {
> > -       int flags, ret;
> > -       u32 fsr, fsynr, resume;
> > +       u32 fsr, fsynr;
> >         unsigned long iova;
> >         struct iommu_domain *domain = dev;
> >         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
> >         if (!(fsr & FSR_FAULT))
> >                 return IRQ_NONE;
> >
> > -       if (fsr & FSR_IGN)
> > -               dev_err_ratelimited(smmu->dev,
> > -                                   "Unexpected context fault (fsr 0x%x)\n",
> > -                                   fsr);
> > -
> >         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
> > -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
> > -
> >         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
> > -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
> > -               ret = IRQ_HANDLED;
> > -               resume = RESUME_RETRY;
> > -       } else {
> > -               dev_err_ratelimited(smmu->dev,
> > -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
> > -                   iova, fsynr, cfg->cbndx);
> 
> I would like to decouple this dev_err_ratelimit() print from the
> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
> indicate by return from my fault handler whether to resume or
> terminate.  But I already have my own ratelimted prints and would
> prefer not to spam dmesg twice.
> 
> I'm thinking about report_iommu_fault() returning:
> 
>   0 => RESUME_RETRY
>   -EFAULT => RESUME_TERMINATE but don't print
>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
> 
> thoughts?
> 
> > -               ret = IRQ_NONE;
> > -               resume = RESUME_TERMINATE;
> > -       }
> > -
> > -       /* Clear the faulting FSR */
> > -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
> >
> > -       /* Retry or terminate any stalled transactions */
> > -       if (fsr & FSR_SS)
> > -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
> 
> This might be a bug in qcom's implementation of the smmu spec, but
> seems like we don't have SS bit set, yet we still require RESUME reg
> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
> separate quirk?  I'm not sure if writing RESUME reg on other hw when
> SS bit is not set is likely to cause problems?  If not I suppose we
> could just unconditionally write it.
> 
> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
> in between debugging freedreno I'll try to put together some patches.

>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
the operation just gets terminated immediately and *then* we get notification
but by then the system keeps going.

I think SCTLR_HUPCF helps control that behavior (i.e. we don't go off faulting
through eternity) but I don't know how it works.

>From my very unlearned understanding I think we do want to set CFCFG and then
stall and let the interrupt handler decide to retry/terminate.

Jordan

-- 
The Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-12-07  0:00         ` Jordan Crouse
@ 2016-12-10 15:44           ` Sricharan
  -1 siblings, 0 replies; 33+ messages in thread
From: Sricharan @ 2016-12-10 15:44 UTC (permalink / raw)
  To: 'Jordan Crouse', 'Rob Clark'
  Cc: 'linux-arm-msm', iommu, 'Will Deacon', linux-arm-kernel

Hi Rob,

>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> Hi Will,
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>>
>> Some additional comments below..
>>
>> > Cc: <stable@vger.kernel.org>
>> > Reported-by: Matt Evans <matt.evans@arm.com>
>> > Signed-off-by: Will Deacon <will.deacon@arm.com>
>> > ---
>> >  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
>> >  1 file changed, 7 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
>> > index 4f49fe29f202..2db74ebc3240 100644
>> > --- a/drivers/iommu/arm-smmu.c
>> > +++ b/drivers/iommu/arm-smmu.c
>> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
>> >
>> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>> >  {
>> > -       int flags, ret;
>> > -       u32 fsr, fsynr, resume;
>> > +       u32 fsr, fsynr;
>> >         unsigned long iova;
>> >         struct iommu_domain *domain = dev;
>> >         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>> >         if (!(fsr & FSR_FAULT))
>> >                 return IRQ_NONE;
>> >
>> > -       if (fsr & FSR_IGN)
>> > -               dev_err_ratelimited(smmu->dev,
>> > -                                   "Unexpected context fault (fsr 0x%x)\n",
>> > -                                   fsr);
>> > -
>> >         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
>> > -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
>> > -
>> >         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
>> > -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
>> > -               ret = IRQ_HANDLED;
>> > -               resume = RESUME_RETRY;
>> > -       } else {
>> > -               dev_err_ratelimited(smmu->dev,
>> > -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
>> > -                   iova, fsynr, cfg->cbndx);
>>
>> I would like to decouple this dev_err_ratelimit() print from the
>> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
>> indicate by return from my fault handler whether to resume or
>> terminate.  But I already have my own ratelimted prints and would
>> prefer not to spam dmesg twice.
>>
>> I'm thinking about report_iommu_fault() returning:
>>
>>   0 => RESUME_RETRY
>>   -EFAULT => RESUME_TERMINATE but don't print
>>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
>>
>> thoughts?
>>
>> > -               ret = IRQ_NONE;
>> > -               resume = RESUME_TERMINATE;
>> > -       }
>> > -
>> > -       /* Clear the faulting FSR */
>> > -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
>> >
>> > -       /* Retry or terminate any stalled transactions */
>> > -       if (fsr & FSR_SS)
>> > -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
>>
>> This might be a bug in qcom's implementation of the smmu spec, but
>> seems like we don't have SS bit set, yet we still require RESUME reg
>> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
>> separate quirk?  I'm not sure if writing RESUME reg on other hw when
>> SS bit is not set is likely to cause problems?  If not I suppose we
>> could just unconditionally write it.
>>
>> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
>> in between debugging freedreno I'll try to put together some patches.
>
>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
>the operation just gets terminated immediately and *then* we get notification
>but by then the system keeps going.
>

Yes thats right, SCTLR_CFCFG was removed as a part of this patch.

>I think SCTLR_HUPCF helps control that behavior (i.e. we don't go off faulting
>through eternity) but I don't know how it works.
>
>From my very unlearned understanding I think we do want to set CFCFG and then
>stall and let the interrupt handler decide to retry/terminate.

Yes, infact i was thinking of adding this as a new patch, but now that this was
a reverted one. As you said, it would be good to know  the hw which was having
problem with this and probably having this has a quirk otherwise.

Also i see that FSR_SS is implemented by the qcom  and the 
downstream code uses it.

Regards,
  Sricharan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-12-10 15:44           ` Sricharan
  0 siblings, 0 replies; 33+ messages in thread
From: Sricharan @ 2016-12-10 15:44 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Rob,

>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> Hi Will,
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>>
>> Some additional comments below..
>>
>> > Cc: <stable@vger.kernel.org>
>> > Reported-by: Matt Evans <matt.evans@arm.com>
>> > Signed-off-by: Will Deacon <will.deacon@arm.com>
>> > ---
>> >  drivers/iommu/arm-smmu.c | 34 +++++++---------------------------
>> >  1 file changed, 7 insertions(+), 27 deletions(-)
>> >
>> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
>> > index 4f49fe29f202..2db74ebc3240 100644
>> > --- a/drivers/iommu/arm-smmu.c
>> > +++ b/drivers/iommu/arm-smmu.c
>> > @@ -686,8 +686,7 @@ static struct iommu_gather_ops arm_smmu_gather_ops = {
>> >
>> >  static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>> >  {
>> > -       int flags, ret;
>> > -       u32 fsr, fsynr, resume;
>> > +       u32 fsr, fsynr;
>> >         unsigned long iova;
>> >         struct iommu_domain *domain = dev;
>> >         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>> > @@ -701,34 +700,15 @@ static irqreturn_t arm_smmu_context_fault(int irq, void *dev)
>> >         if (!(fsr & FSR_FAULT))
>> >                 return IRQ_NONE;
>> >
>> > -       if (fsr & FSR_IGN)
>> > -               dev_err_ratelimited(smmu->dev,
>> > -                                   "Unexpected context fault (fsr 0x%x)\n",
>> > -                                   fsr);
>> > -
>> >         fsynr = readl_relaxed(cb_base + ARM_SMMU_CB_FSYNR0);
>> > -       flags = fsynr & FSYNR0_WNR ? IOMMU_FAULT_WRITE : IOMMU_FAULT_READ;
>> > -
>> >         iova = readq_relaxed(cb_base + ARM_SMMU_CB_FAR);
>> > -       if (!report_iommu_fault(domain, smmu->dev, iova, flags)) {
>> > -               ret = IRQ_HANDLED;
>> > -               resume = RESUME_RETRY;
>> > -       } else {
>> > -               dev_err_ratelimited(smmu->dev,
>> > -                   "Unhandled context fault: iova=0x%08lx, fsynr=0x%x, cb=%d\n",
>> > -                   iova, fsynr, cfg->cbndx);
>>
>> I would like to decouple this dev_err_ratelimit() print from the
>> RESUME_RETRY vs RESUME_TERMINATE behaviour.  I need the ability to
>> indicate by return from my fault handler whether to resume or
>> terminate.  But I already have my own ratelimted prints and would
>> prefer not to spam dmesg twice.
>>
>> I'm thinking about report_iommu_fault() returning:
>>
>>   0 => RESUME_RETRY
>>   -EFAULT => RESUME_TERMINATE but don't print
>>   anything else (or specifically -ENOSYS?) => RESUME_TERMINATE and print
>>
>> thoughts?
>>
>> > -               ret = IRQ_NONE;
>> > -               resume = RESUME_TERMINATE;
>> > -       }
>> > -
>> > -       /* Clear the faulting FSR */
>> > -       writel(fsr, cb_base + ARM_SMMU_CB_FSR);
>> >
>> > -       /* Retry or terminate any stalled transactions */
>> > -       if (fsr & FSR_SS)
>> > -               writel_relaxed(resume, cb_base + ARM_SMMU_CB_RESUME);
>>
>> This might be a bug in qcom's implementation of the smmu spec, but
>> seems like we don't have SS bit set, yet we still require RESUME reg
>> to be written, otherwise gpu is perma-wedged.  Maybe topic for a
>> separate quirk?  I'm not sure if writing RESUME reg on other hw when
>> SS bit is not set is likely to cause problems?  If not I suppose we
>> could just unconditionally write it.
>>
>> Anyways, I'm not super-familiar w/ arm-smmu so suggestions welcome..
>> in between debugging freedreno I'll try to put together some patches.
>
>From what I can tell we need SCTLR_CFCFG to make the stall happen otherwise
>the operation just gets terminated immediately and *then* we get notification
>but by then the system keeps going.
>

Yes thats right, SCTLR_CFCFG was removed as a part of this patch.

>I think SCTLR_HUPCF helps control that behavior (i.e. we don't go off faulting
>through eternity) but I don't know how it works.
>
>From my very unlearned understanding I think we do want to set CFCFG and then
>stall and let the interrupt handler decide to retry/terminate.

Yes, infact i was thinking of adding this as a new patch, but now that this was
a reverted one. As you said, it would be good to know  the hw which was having
problem with this and probably having this has a quirk otherwise.

Also i see that FSR_SS is implemented by the qcom  and the 
downstream code uses it.

Regards,
  Sricharan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-12-06 23:30       ` Rob Clark
@ 2016-12-16 11:54           ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-12-16 11:54 UTC (permalink / raw)
  To: Rob Clark
  Cc: linux-arm-msm, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	Jordan Crouse, linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Rob,

On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.

I'm not against reintroducing this, but it would certainly need to be
opt-in, as you suggest. If we want to try to use stall faults to enable
demand paging for DMA, then that means running core mm code to resolve
the fault and we can't do that in irq context. Consequently, we have to
hand this off to a thread, which means the hardware must allow the SS
bit to remain set without immediately reasserting the interrupt line.
Furthermore, we can't handle multiple faults on a context-bank, so we'd
need to restrict ourselves to one device (i.e. faulting stream) per
domain (CB).

I think that means we want both specific compatible strings to identify
the SS bit behaviour, but also a way to opt-in for the stall model as a
separate property to indicate that the SoC integration can support this
without e.g. deadlocking.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-12-16 11:54           ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-12-16 11:54 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Rob,

On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> > Enabling stalling faults can result in hardware deadlock on poorly
> > designed systems, particularly those with a PCI root complex upstream of
> > the SMMU.
> >
> > Although it's not really Linux's job to save hardware integrators from
> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> > clients) from hosing the system for everybody else, even if they might
> > already be required to have elevated privileges.
> >
> > Given that the fault handling code currently executes entirely in IRQ
> > context, there is nothing that can sensibly be done to recover from
> > things like page faults anyway, so let's rip this code out for now and
> > avoid the potential for deadlock.
> 
> so, I'd like to re-introduce this feature, I *guess* as some sort of
> opt-in quirk (ie. disabled by default unless something in DT tells you
> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> hw was having problems due to this feature.)
> 
> On newer snapdragon devices we are using arm-smmu for the GPU, and
> halting the GPU so the driver's fault handler can dump some GPU state
> on faults is enormously helpful for debugging and tracking down where
> in the gpu cmdstream the fault was triggered.  In addition, we will
> eventually want the ability to update pagetables from fault handler
> and resuming the faulting transition.

I'm not against reintroducing this, but it would certainly need to be
opt-in, as you suggest. If we want to try to use stall faults to enable
demand paging for DMA, then that means running core mm code to resolve
the fault and we can't do that in irq context. Consequently, we have to
hand this off to a thread, which means the hardware must allow the SS
bit to remain set without immediately reasserting the interrupt line.
Furthermore, we can't handle multiple faults on a context-bank, so we'd
need to restrict ourselves to one device (i.e. faulting stream) per
domain (CB).

I think that means we want both specific compatible strings to identify
the SS bit behaviour, but also a way to opt-in for the stall model as a
separate property to indicate that the SoC integration can support this
without e.g. deadlocking.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* RE: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-12-16 11:54           ` Will Deacon
@ 2016-12-19  9:03               ` Sricharan
  -1 siblings, 0 replies; 33+ messages in thread
From: Sricharan @ 2016-12-19  9:03 UTC (permalink / raw)
  To: 'Will Deacon', 'Rob Clark'
  Cc: 'linux-arm-msm',
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	'Jordan Crouse',
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

Hi Will,

>On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
>I'm not against reintroducing this, but it would certainly need to be
>opt-in, as you suggest. If we want to try to use stall faults to enable
>demand paging for DMA, then that means running core mm code to resolve
>the fault and we can't do that in irq context. Consequently, we have to
>hand this off to a thread, which means the hardware must allow the SS
>bit to remain set without immediately reasserting the interrupt line.
>Furthermore, we can't handle multiple faults on a context-bank, so we'd
>need to restrict ourselves to one device (i.e. faulting stream) per
>domain (CB).
>
>I think that means we want both specific compatible strings to identify
>the SS bit behaviour, but also a way to opt-in for the stall model as a
>separate property to indicate that the SoC integration can support this
>without e.g. deadlocking.
>

To understand the reason on the need for the quirk based on SS bit behavior,
if the platform supports stall model and enabled, then SS bit should be implemented
and remain set until the RESUME register is written back, means same behavior
always ?

Regards,
 Sricharan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-12-19  9:03               ` Sricharan
  0 siblings, 0 replies; 33+ messages in thread
From: Sricharan @ 2016-12-19  9:03 UTC (permalink / raw)
  To: linux-arm-kernel

Hi Will,

>On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
>I'm not against reintroducing this, but it would certainly need to be
>opt-in, as you suggest. If we want to try to use stall faults to enable
>demand paging for DMA, then that means running core mm code to resolve
>the fault and we can't do that in irq context. Consequently, we have to
>hand this off to a thread, which means the hardware must allow the SS
>bit to remain set without immediately reasserting the interrupt line.
>Furthermore, we can't handle multiple faults on a context-bank, so we'd
>need to restrict ourselves to one device (i.e. faulting stream) per
>domain (CB).
>
>I think that means we want both specific compatible strings to identify
>the SS bit behaviour, but also a way to opt-in for the stall model as a
>separate property to indicate that the SoC integration can support this
>without e.g. deadlocking.
>

To understand the reason on the need for the quirk based on SS bit behavior,
if the platform supports stall model and enabled, then SS bit should be implemented
and remain set until the RESUME register is written back, means same behavior
always ?

Regards,
 Sricharan

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-12-19  9:03               ` Sricharan
@ 2016-12-20 16:17                 ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-12-20 16:17 UTC (permalink / raw)
  To: Sricharan
  Cc: 'linux-arm-msm', 'Jordan Crouse',
	iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Mon, Dec 19, 2016 at 02:33:36PM +0530, Sricharan wrote:
> >On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> >I'm not against reintroducing this, but it would certainly need to be
> >opt-in, as you suggest. If we want to try to use stall faults to enable
> >demand paging for DMA, then that means running core mm code to resolve
> >the fault and we can't do that in irq context. Consequently, we have to
> >hand this off to a thread, which means the hardware must allow the SS
> >bit to remain set without immediately reasserting the interrupt line.
> >Furthermore, we can't handle multiple faults on a context-bank, so we'd
> >need to restrict ourselves to one device (i.e. faulting stream) per
> >domain (CB).
> >
> >I think that means we want both specific compatible strings to identify
> >the SS bit behaviour, but also a way to opt-in for the stall model as a
> >separate property to indicate that the SoC integration can support this
> >without e.g. deadlocking.
> >
> 
> To understand the reason on the need for the quirk based on SS bit behavior,
> if the platform supports stall model and enabled, then SS bit should be implemented
> and remain set until the RESUME register is written back, means same behavior
> always ?

The behaviour of the SS bit is IMPLEMENTATION DEFINED per the architecture,
so we need to know which way the given implementation chose to go. If we
want to support paging, then we absolutely need a way to return from the
interrupt handler without having handled the stall (i.e. without having
written to the RESUME register). That means that we mustn't take the same
interrupt immediately, otherwise we'll end up getting stuck in an infinite
fault. One hacky option would be to mask the interrupt at the GIC, but
that adds an additional requirement of one interrupt per context bank,
which isn't typically implemented in my experience.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2016-12-20 16:17                 ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2016-12-20 16:17 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Dec 19, 2016 at 02:33:36PM +0530, Sricharan wrote:
> >On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> >I'm not against reintroducing this, but it would certainly need to be
> >opt-in, as you suggest. If we want to try to use stall faults to enable
> >demand paging for DMA, then that means running core mm code to resolve
> >the fault and we can't do that in irq context. Consequently, we have to
> >hand this off to a thread, which means the hardware must allow the SS
> >bit to remain set without immediately reasserting the interrupt line.
> >Furthermore, we can't handle multiple faults on a context-bank, so we'd
> >need to restrict ourselves to one device (i.e. faulting stream) per
> >domain (CB).
> >
> >I think that means we want both specific compatible strings to identify
> >the SS bit behaviour, but also a way to opt-in for the stall model as a
> >separate property to indicate that the SoC integration can support this
> >without e.g. deadlocking.
> >
> 
> To understand the reason on the need for the quirk based on SS bit behavior,
> if the platform supports stall model and enabled, then SS bit should be implemented
> and remain set until the RESUME register is written back, means same behavior
> always ?

The behaviour of the SS bit is IMPLEMENTATION DEFINED per the architecture,
so we need to know which way the given implementation chose to go. If we
want to support paging, then we absolutely need a way to return from the
interrupt handler without having handled the stall (i.e. without having
written to the RESUME register). That means that we mustn't take the same
interrupt immediately, otherwise we'll end up getting stuck in an infinite
fault. One hacky option would be to mask the interrupt at the GIC, but
that adds an additional requirement of one interrupt per context bank,
which isn't typically implemented in my experience.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2016-12-16 11:54           ` Will Deacon
@ 2017-09-13 19:31               ` Rob Clark
  -1 siblings, 0 replies; 33+ messages in thread
From: Rob Clark @ 2017-09-13 19:31 UTC (permalink / raw)
  To: Will Deacon
  Cc: linux-arm-msm, iommu-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r

On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
> Hi Rob,
>
> On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon-5wv7dgnIgG8@public.gmane.org> wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
> I'm not against reintroducing this, but it would certainly need to be
> opt-in, as you suggest. If we want to try to use stall faults to enable
> demand paging for DMA, then that means running core mm code to resolve
> the fault and we can't do that in irq context. Consequently, we have to
> hand this off to a thread, which means the hardware must allow the SS
> bit to remain set without immediately reasserting the interrupt line.
> Furthermore, we can't handle multiple faults on a context-bank, so we'd
> need to restrict ourselves to one device (i.e. faulting stream) per
> domain (CB).
>
> I think that means we want both specific compatible strings to identify
> the SS bit behaviour, but also a way to opt-in for the stall model as a
> separate property to indicate that the SoC integration can support this
> without e.g. deadlocking.
>

How do you feel about a short-term step to keep things in irq context,
but enable stall mode?  I'm debugging an issue, where it appears that
the gpu cannot handle a non-stalled fault (triggers some fairly
bizarre follow-on problems).  So I think even if we don't add a fault
handler callback, we still want to set the CFCFG bit and
RESUME_TERMINATE in the fault handler on this hardware.

BR,
-R

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2017-09-13 19:31               ` Rob Clark
  0 siblings, 0 replies; 33+ messages in thread
From: Rob Clark @ 2017-09-13 19:31 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@arm.com> wrote:
> Hi Rob,
>
> On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > Enabling stalling faults can result in hardware deadlock on poorly
>> > designed systems, particularly those with a PCI root complex upstream of
>> > the SMMU.
>> >
>> > Although it's not really Linux's job to save hardware integrators from
>> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> > clients) from hosing the system for everybody else, even if they might
>> > already be required to have elevated privileges.
>> >
>> > Given that the fault handling code currently executes entirely in IRQ
>> > context, there is nothing that can sensibly be done to recover from
>> > things like page faults anyway, so let's rip this code out for now and
>> > avoid the potential for deadlock.
>>
>> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> opt-in quirk (ie. disabled by default unless something in DT tells you
>> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> hw was having problems due to this feature.)
>>
>> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> halting the GPU so the driver's fault handler can dump some GPU state
>> on faults is enormously helpful for debugging and tracking down where
>> in the gpu cmdstream the fault was triggered.  In addition, we will
>> eventually want the ability to update pagetables from fault handler
>> and resuming the faulting transition.
>
> I'm not against reintroducing this, but it would certainly need to be
> opt-in, as you suggest. If we want to try to use stall faults to enable
> demand paging for DMA, then that means running core mm code to resolve
> the fault and we can't do that in irq context. Consequently, we have to
> hand this off to a thread, which means the hardware must allow the SS
> bit to remain set without immediately reasserting the interrupt line.
> Furthermore, we can't handle multiple faults on a context-bank, so we'd
> need to restrict ourselves to one device (i.e. faulting stream) per
> domain (CB).
>
> I think that means we want both specific compatible strings to identify
> the SS bit behaviour, but also a way to opt-in for the stall model as a
> separate property to indicate that the SoC integration can support this
> without e.g. deadlocking.
>

How do you feel about a short-term step to keep things in irq context,
but enable stall mode?  I'm debugging an issue, where it appears that
the gpu cannot handle a non-stalled fault (triggers some fairly
bizarre follow-on problems).  So I think even if we don't add a fault
handler callback, we still want to set the CFCFG bit and
RESUME_TERMINATE in the fault handler on this hardware.

BR,
-R

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2017-09-13 19:31               ` Rob Clark
@ 2017-09-18 17:33                 ` Will Deacon
  -1 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-09-18 17:33 UTC (permalink / raw)
  To: Rob Clark
  Cc: iommu, linux-arm-kernel, linux-arm-msm, Sricharan R, Jordan Crouse

On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@arm.com> wrote:
> > Hi Rob,
> >
> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> > I'm not against reintroducing this, but it would certainly need to be
> > opt-in, as you suggest. If we want to try to use stall faults to enable
> > demand paging for DMA, then that means running core mm code to resolve
> > the fault and we can't do that in irq context. Consequently, we have to
> > hand this off to a thread, which means the hardware must allow the SS
> > bit to remain set without immediately reasserting the interrupt line.
> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
> > need to restrict ourselves to one device (i.e. faulting stream) per
> > domain (CB).
> >
> > I think that means we want both specific compatible strings to identify
> > the SS bit behaviour, but also a way to opt-in for the stall model as a
> > separate property to indicate that the SoC integration can support this
> > without e.g. deadlocking.
> >
> 
> How do you feel about a short-term step to keep things in irq context,
> but enable stall mode?  I'm debugging an issue, where it appears that
> the gpu cannot handle a non-stalled fault (triggers some fairly
> bizarre follow-on problems).  So I think even if we don't add a fault
> handler callback, we still want to set the CFCFG bit and
> RESUME_TERMINATE in the fault handler on this hardware.

Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
I'd rather we bite the bullet and implement things properly on top of
a workqueue so that you can build on the same basic infrastructure as
the SVM work that Jean-Philippe is looking at, particular as you also
have a use-case for running fault code in non-atomic context.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2017-09-18 17:33                 ` Will Deacon
  0 siblings, 0 replies; 33+ messages in thread
From: Will Deacon @ 2017-09-18 17:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@arm.com> wrote:
> > Hi Rob,
> >
> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
> >> > Enabling stalling faults can result in hardware deadlock on poorly
> >> > designed systems, particularly those with a PCI root complex upstream of
> >> > the SMMU.
> >> >
> >> > Although it's not really Linux's job to save hardware integrators from
> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
> >> > clients) from hosing the system for everybody else, even if they might
> >> > already be required to have elevated privileges.
> >> >
> >> > Given that the fault handling code currently executes entirely in IRQ
> >> > context, there is nothing that can sensibly be done to recover from
> >> > things like page faults anyway, so let's rip this code out for now and
> >> > avoid the potential for deadlock.
> >>
> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
> >> opt-in quirk (ie. disabled by default unless something in DT tells you
> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
> >> hw was having problems due to this feature.)
> >>
> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
> >> halting the GPU so the driver's fault handler can dump some GPU state
> >> on faults is enormously helpful for debugging and tracking down where
> >> in the gpu cmdstream the fault was triggered.  In addition, we will
> >> eventually want the ability to update pagetables from fault handler
> >> and resuming the faulting transition.
> >
> > I'm not against reintroducing this, but it would certainly need to be
> > opt-in, as you suggest. If we want to try to use stall faults to enable
> > demand paging for DMA, then that means running core mm code to resolve
> > the fault and we can't do that in irq context. Consequently, we have to
> > hand this off to a thread, which means the hardware must allow the SS
> > bit to remain set without immediately reasserting the interrupt line.
> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
> > need to restrict ourselves to one device (i.e. faulting stream) per
> > domain (CB).
> >
> > I think that means we want both specific compatible strings to identify
> > the SS bit behaviour, but also a way to opt-in for the stall model as a
> > separate property to indicate that the SoC integration can support this
> > without e.g. deadlocking.
> >
> 
> How do you feel about a short-term step to keep things in irq context,
> but enable stall mode?  I'm debugging an issue, where it appears that
> the gpu cannot handle a non-stalled fault (triggers some fairly
> bizarre follow-on problems).  So I think even if we don't add a fault
> handler callback, we still want to set the CFCFG bit and
> RESUME_TERMINATE in the fault handler on this hardware.

Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
I'd rather we bite the bullet and implement things properly on top of
a workqueue so that you can build on the same basic infrastructure as
the SVM work that Jean-Philippe is looking at, particular as you also
have a use-case for running fault code in non-atomic context.

Will

^ permalink raw reply	[flat|nested] 33+ messages in thread

* Re: [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
  2017-09-18 17:33                 ` Will Deacon
@ 2017-09-18 18:45                   ` Rob Clark
  -1 siblings, 0 replies; 33+ messages in thread
From: Rob Clark @ 2017-09-18 18:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu, linux-arm-kernel, linux-arm-msm, Sricharan R, Jordan Crouse

On Mon, Sep 18, 2017 at 1:33 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
>> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > Hi Rob,
>> >
>> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
>> >> > Enabling stalling faults can result in hardware deadlock on poorly
>> >> > designed systems, particularly those with a PCI root complex upstream of
>> >> > the SMMU.
>> >> >
>> >> > Although it's not really Linux's job to save hardware integrators from
>> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> >> > clients) from hosing the system for everybody else, even if they might
>> >> > already be required to have elevated privileges.
>> >> >
>> >> > Given that the fault handling code currently executes entirely in IRQ
>> >> > context, there is nothing that can sensibly be done to recover from
>> >> > things like page faults anyway, so let's rip this code out for now and
>> >> > avoid the potential for deadlock.
>> >>
>> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> >> opt-in quirk (ie. disabled by default unless something in DT tells you
>> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> >> hw was having problems due to this feature.)
>> >>
>> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> >> halting the GPU so the driver's fault handler can dump some GPU state
>> >> on faults is enormously helpful for debugging and tracking down where
>> >> in the gpu cmdstream the fault was triggered.  In addition, we will
>> >> eventually want the ability to update pagetables from fault handler
>> >> and resuming the faulting transition.
>> >
>> > I'm not against reintroducing this, but it would certainly need to be
>> > opt-in, as you suggest. If we want to try to use stall faults to enable
>> > demand paging for DMA, then that means running core mm code to resolve
>> > the fault and we can't do that in irq context. Consequently, we have to
>> > hand this off to a thread, which means the hardware must allow the SS
>> > bit to remain set without immediately reasserting the interrupt line.
>> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
>> > need to restrict ourselves to one device (i.e. faulting stream) per
>> > domain (CB).
>> >
>> > I think that means we want both specific compatible strings to identify
>> > the SS bit behaviour, but also a way to opt-in for the stall model as a
>> > separate property to indicate that the SoC integration can support this
>> > without e.g. deadlocking.
>> >
>>
>> How do you feel about a short-term step to keep things in irq context,
>> but enable stall mode?  I'm debugging an issue, where it appears that
>> the gpu cannot handle a non-stalled fault (triggers some fairly
>> bizarre follow-on problems).  So I think even if we don't add a fault
>> handler callback, we still want to set the CFCFG bit and
>> RESUME_TERMINATE in the fault handler on this hardware.
>
> Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
> I'd rather we bite the bullet and implement things properly on top of
> a workqueue so that you can build on the same basic infrastructure as
> the SVM work that Jean-Philippe is looking at, particular as you also
> have a use-case for running fault code in non-atomic context.
>

So, it seems like setting either CFCFG or HUPCF (which is what
downstream android kernel apparently does) avoids this issue.

It seems like some sort of hw bug, but not sure if in the iommu or the
gpu.. you'd probably have to ask someone @qcom about the details, but
without setting either of these bits, it seems that other concurrent
memory transactions to the one triggering fault and up returning bogus
values to the GPU.  So the CP (which is reading cmdstream somewhat far
ahead of the DRAW or COMPUTE shader that triggers a fault) ends up
reading bogus cmdstream values and triggers all sorts of spectacular
fireworks.

BR,
-R

^ permalink raw reply	[flat|nested] 33+ messages in thread

* [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints
@ 2017-09-18 18:45                   ` Rob Clark
  0 siblings, 0 replies; 33+ messages in thread
From: Rob Clark @ 2017-09-18 18:45 UTC (permalink / raw)
  To: linux-arm-kernel

On Mon, Sep 18, 2017 at 1:33 PM, Will Deacon <will.deacon@arm.com> wrote:
> On Wed, Sep 13, 2017 at 03:31:20PM -0400, Rob Clark wrote:
>> On Fri, Dec 16, 2016 at 6:54 AM, Will Deacon <will.deacon@arm.com> wrote:
>> > Hi Rob,
>> >
>> > On Tue, Dec 06, 2016 at 06:30:21PM -0500, Rob Clark wrote:
>> >> On Thu, Aug 18, 2016 at 9:05 AM, Will Deacon <will.deacon@arm.com> wrote:
>> >> > Enabling stalling faults can result in hardware deadlock on poorly
>> >> > designed systems, particularly those with a PCI root complex upstream of
>> >> > the SMMU.
>> >> >
>> >> > Although it's not really Linux's job to save hardware integrators from
>> >> > their own misfortune, it *is* our job to stop userspace (e.g. VFIO
>> >> > clients) from hosing the system for everybody else, even if they might
>> >> > already be required to have elevated privileges.
>> >> >
>> >> > Given that the fault handling code currently executes entirely in IRQ
>> >> > context, there is nothing that can sensibly be done to recover from
>> >> > things like page faults anyway, so let's rip this code out for now and
>> >> > avoid the potential for deadlock.
>> >>
>> >> so, I'd like to re-introduce this feature, I *guess* as some sort of
>> >> opt-in quirk (ie. disabled by default unless something in DT tells you
>> >> otherwise??  But I'm open to suggestions.  I'm not entirely sure what
>> >> hw was having problems due to this feature.)
>> >>
>> >> On newer snapdragon devices we are using arm-smmu for the GPU, and
>> >> halting the GPU so the driver's fault handler can dump some GPU state
>> >> on faults is enormously helpful for debugging and tracking down where
>> >> in the gpu cmdstream the fault was triggered.  In addition, we will
>> >> eventually want the ability to update pagetables from fault handler
>> >> and resuming the faulting transition.
>> >
>> > I'm not against reintroducing this, but it would certainly need to be
>> > opt-in, as you suggest. If we want to try to use stall faults to enable
>> > demand paging for DMA, then that means running core mm code to resolve
>> > the fault and we can't do that in irq context. Consequently, we have to
>> > hand this off to a thread, which means the hardware must allow the SS
>> > bit to remain set without immediately reasserting the interrupt line.
>> > Furthermore, we can't handle multiple faults on a context-bank, so we'd
>> > need to restrict ourselves to one device (i.e. faulting stream) per
>> > domain (CB).
>> >
>> > I think that means we want both specific compatible strings to identify
>> > the SS bit behaviour, but also a way to opt-in for the stall model as a
>> > separate property to indicate that the SoC integration can support this
>> > without e.g. deadlocking.
>> >
>>
>> How do you feel about a short-term step to keep things in irq context,
>> but enable stall mode?  I'm debugging an issue, where it appears that
>> the gpu cannot handle a non-stalled fault (triggers some fairly
>> bizarre follow-on problems).  So I think even if we don't add a fault
>> handler callback, we still want to set the CFCFG bit and
>> RESUME_TERMINATE in the fault handler on this hardware.
>
> Hmm, colour me unconvinced. Why does enabling stalls fix this problem?
> I'd rather we bite the bullet and implement things properly on top of
> a workqueue so that you can build on the same basic infrastructure as
> the SVM work that Jean-Philippe is looking at, particular as you also
> have a use-case for running fault code in non-atomic context.
>

So, it seems like setting either CFCFG or HUPCF (which is what
downstream android kernel apparently does) avoids this issue.

It seems like some sort of hw bug, but not sure if in the iommu or the
gpu.. you'd probably have to ask someone @qcom about the details, but
without setting either of these bits, it seems that other concurrent
memory transactions to the one triggering fault and up returning bogus
values to the GPU.  So the CP (which is reading cmdstream somewhat far
ahead of the DRAW or COMPUTE shader that triggers a fault) ends up
reading bogus cmdstream values and triggers all sorts of spectacular
fireworks.

BR,
-R

^ permalink raw reply	[flat|nested] 33+ messages in thread

end of thread, other threads:[~2017-09-18 18:45 UTC | newest]

Thread overview: 33+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-08-18 13:05 [PATCH 0/4] iommu/arm-smmu: Fixes for 4.8 Will Deacon
2016-08-18 13:05 ` Will Deacon
2016-08-18 13:05 ` [PATCH 1/4] iommu/io-pgtable-arm-v7s: Fix attributes when splitting blocks Will Deacon
2016-08-18 13:05   ` Will Deacon
2016-08-18 13:05   ` Will Deacon
2016-08-18 13:05 ` [PATCH 2/4] iommu/arm-smmu: Fix CMDQ error handling Will Deacon
2016-08-18 13:05   ` Will Deacon
2016-08-18 13:05   ` Will Deacon
2016-08-18 13:05 ` [PATCH 3/4] iommu/arm-smmu: Disable stalling faults for all endpoints Will Deacon
2016-08-18 13:05   ` Will Deacon
2016-08-18 13:05   ` Will Deacon
     [not found]   ` <1471525542-14969-4-git-send-email-will.deacon-5wv7dgnIgG8@public.gmane.org>
2016-12-06 23:30     ` Rob Clark
2016-12-06 23:30       ` Rob Clark
2016-12-07  0:00       ` Jordan Crouse
2016-12-07  0:00         ` Jordan Crouse
2016-12-10 15:44         ` Sricharan
2016-12-10 15:44           ` Sricharan
     [not found]       ` <CAF6AEGujUb37qagBZK1T13eAh=50zy0z5HpFF9FsuvkZZ7c5qw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
2016-12-16 11:54         ` Will Deacon
2016-12-16 11:54           ` Will Deacon
     [not found]           ` <20161216115413.GE13191-5wv7dgnIgG8@public.gmane.org>
2016-12-19  9:03             ` Sricharan
2016-12-19  9:03               ` Sricharan
2016-12-20 16:17               ` Will Deacon
2016-12-20 16:17                 ` Will Deacon
2017-09-13 19:31             ` Rob Clark
2017-09-13 19:31               ` Rob Clark
2017-09-18 17:33               ` Will Deacon
2017-09-18 17:33                 ` Will Deacon
2017-09-18 18:45                 ` Rob Clark
2017-09-18 18:45                   ` Rob Clark
     [not found] ` <1471525542-14969-1-git-send-email-will.deacon-5wv7dgnIgG8@public.gmane.org>
2016-08-18 13:05   ` [PATCH 4/4] iommu/arm-smmu: Don't BUG() if we find aborting STEs with disable_bypass Will Deacon
2016-08-18 13:05     ` Will Deacon
2016-08-18 16:52   ` [PATCH 0/4] iommu/arm-smmu: Fixes for 4.8 Joerg Roedel
2016-08-18 16:52     ` Joerg Roedel

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.