Re: [PATCH] iomm/arm-smmu: Add stall implementation hook

From: Robin Murphy <robin.murphy@arm.com>
To: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>,
	Will Deacon <will@kernel.org>, Joerg Roedel <joro@8bytes.org>,
	Rob Clark <robdclark@gmail.com>,
	Jordan Crouse <jcrouse@codeaurora.org>
Cc: linux-arm-msm@vger.kernel.org, iommu@lists.linux-foundation.org,
	linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [PATCH] iomm/arm-smmu: Add stall implementation hook
Date: Thu, 7 May 2020 11:55:54 +0100	[thread overview]
Message-ID: <1ced023b-157c-21a0-ac75-1adef7f029f0@arm.com> (raw)
In-Reply-To: <b491e02ad790a437115fdeab6b21bc48@codeaurora.org>

On 2020-05-07 11:14 am, Sai Prakash Ranjan wrote:
> Hi Will, Robin
> 
> On 2020-04-22 01:50, Sai Prakash Ranjan wrote:
>> Add stall implementation hook to enable stalling
>> faults on QCOM platforms which supports it without
>> causing any kind of hardware mishaps. Without this
>> on QCOM platforms, GPU faults can cause unrelated
>> GPU memory accesses to return zeroes. This has the
>> unfortunate result of command-stream reads from CP
>> getting invalid data, causing a cascade of fail.

I think this came up before, but something about this rationale doesn't 
add up - we're not *using* stalls at all, we're still terminating 
faulting transactions unconditionally; we're just using CFCFG to 
terminate them with a slight delay, rather than immediately. It's really 
not clear how or why that makes a difference. Is it a GPU bug? Or an 
SMMU bug? Is this reliable (or even a documented workaround for 
something), or might things start blowing up again if any other 
behaviour subtly changes? I'm not dead set against adding this, but I'd 
*really* like to have a lot more confidence in it.

>> Suggested-by: Rob Clark <robdclark@gmail.com>
>> Signed-off-by: Sai Prakash Ranjan <saiprakash.ranjan@codeaurora.org>
>> ---
>> This has been attempted previously by Rob Clark in 2017, 2018.
>> Hopefully we can get something concluded in 2020.
>>  * https://patchwork.kernel.org/patch/9953803/
>>  * https://patchwork.kernel.org/patch/10618713/
>> ---
>>  drivers/iommu/arm-smmu-qcom.c | 1 +
>>  drivers/iommu/arm-smmu.c      | 7 +++++++
>>  drivers/iommu/arm-smmu.h      | 1 +
>>  3 files changed, 9 insertions(+)
>>
>> diff --git a/drivers/iommu/arm-smmu-qcom.c 
>> b/drivers/iommu/arm-smmu-qcom.c
>> index 24c071c1d8b0..a13b229389d4 100644
>> --- a/drivers/iommu/arm-smmu-qcom.c
>> +++ b/drivers/iommu/arm-smmu-qcom.c
>> @@ -32,6 +32,7 @@ static int qcom_sdm845_smmu500_reset(struct
>> arm_smmu_device *smmu)
>>
>>  static const struct arm_smmu_impl qcom_smmu_impl = {
>>      .reset = qcom_sdm845_smmu500_reset,
>> +    .stall = true,
>>  };
>>
>>  struct arm_smmu_device *qcom_smmu_impl_init(struct arm_smmu_device 
>> *smmu)
>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
>> index e622f4e33379..16b03fca9966 100644
>> --- a/drivers/iommu/arm-smmu.c
>> +++ b/drivers/iommu/arm-smmu.c
>> @@ -488,6 +488,11 @@ static irqreturn_t arm_smmu_context_fault(int
>> irq, void *dev)
>>                  fsr, iova, fsynr, cbfrsynra, idx);
>>
>>      arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_FSR, fsr);
>> +
>> +    if (smmu->impl && smmu->impl->stall && (fsr & ARM_SMMU_FSR_SS))
>> +        arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_RESUME,
>> +                  ARM_SMMU_RESUME_TERMINATE);

Shouldn't this be *before* the write to FSR, in case the outstanding 
fault causes that to be immediately reasserted before we write CB_RESUME 
and we end up immediately taking the IRQ a second time?

(The overall enablement being in impl is sound, but you still don't get 
to play "works on my machine" in the architectural code :P)

Robin.

>> +
>>      return IRQ_HANDLED;
>>  }
>>
>> @@ -659,6 +664,8 @@ static void arm_smmu_write_context_bank(struct
>> arm_smmu_device *smmu, int idx)
>>          reg |= ARM_SMMU_SCTLR_S1_ASIDPNE;
>>      if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
>>          reg |= ARM_SMMU_SCTLR_E;
>> +    if (smmu->impl && smmu->impl->stall)
>> +        reg |= ARM_SMMU_SCTLR_CFCFG;
>>
>>      arm_smmu_cb_write(smmu, idx, ARM_SMMU_CB_SCTLR, reg);
>>  }
>> diff --git a/drivers/iommu/arm-smmu.h b/drivers/iommu/arm-smmu.h
>> index 8d1cd54d82a6..d5134e0d5cce 100644
>> --- a/drivers/iommu/arm-smmu.h
>> +++ b/drivers/iommu/arm-smmu.h
>> @@ -386,6 +386,7 @@ struct arm_smmu_impl {
>>      int (*init_context)(struct arm_smmu_domain *smmu_domain);
>>      void (*tlb_sync)(struct arm_smmu_device *smmu, int page, int sync,
>>               int status);
>> +    bool stall;
>>  };
>>
>>  static inline void __iomem *arm_smmu_page(struct arm_smmu_device 
>> *smmu, int n)
> 
> Any comments on this patch?
> 
> Thanks,
> Sai
> 
_______________________________________________
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu