linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
@ 2019-01-21  5:53 Vivek Gautam
  2019-01-21  5:53 ` [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes Vivek Gautam
                   ` (3 more replies)
  0 siblings, 4 replies; 19+ messages in thread
From: Vivek Gautam @ 2019-01-21  5:53 UTC (permalink / raw)
  To: will.deacon, robin.murphy, joro, iommu
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, tfiga, pratikp,
	pdaly, jcrouse, Vivek Gautam

Qualcomm SoCs have an additional level of cache called as
System cache, aka. Last level cache (LLC). This cache sits right
before the DDR, and is tightly coupled with the memory controller.
The clients using this cache request their slices from this
system cache, make it active, and can then start using it.
For these clients with smmu, to start using the system cache for
buffers and, related page tables [1], memory attributes need to be
set accordingly. This series add the required support.

This change is a realisation of following changes from downstream msm-4.9:
iommu: io-pgtable-arm: Support DOMAIN_ATTRIBUTE_USE_UPSTREAM_HINT[2]
iommu: io-pgtable-arm: Implement IOMMU_USE_UPSTREAM_HINT[3]

Changes since v2:
 - Split the patches into io-pgtable-arm driver and arm-smmu driver.
 - Converted smmu domain attributes to a bitmap, so multiple attributes
   can be managed easily.
 - With addition of non-coherent page table mapping support [4], this
   patch series now aligns with the understanding of upgrading the
   non-coherent devices to use some level of outer cache.
 - Updated the macros and comments to reflect the use of QCOM_SYS_CACHE.
 - QCOM_SYS_CACHE can still be used at stage 2, so that doens't depend on
   stage-1 mapping.
 - Added change to disable the attribute from arm_smmu_domain_set_attr()
   when needed.
 - Removed the page protection controls for QCOM_SYS_CACHE at the DMA API
   level.

Goes on top of the non-coherent page tables support patch series [4]

[1] https://patchwork.kernel.org/patch/10302791/
[2] https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9&id=bf762276796e79ca90014992f4d9da5593fa7d51
[3] https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9&id=d4c72c413ea27c43f60825193d4de9cb8ffd9602
[4] https://lore.kernel.org/patchwork/cover/1032938/

Vivek Gautam (3):
  iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes
  iommu/io-pgtable-arm: Add support to use system cache
  iommu/arm-smmu: Add support to use system cache

 drivers/iommu/arm-smmu.c       | 28 ++++++++++++++++++++++++----
 drivers/iommu/io-pgtable-arm.c | 15 +++++++++++++--
 drivers/iommu/io-pgtable.h     |  4 ++++
 include/linux/iommu.h          |  2 ++
 4 files changed, 43 insertions(+), 6 deletions(-)

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes
  2019-01-21  5:53 [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Vivek Gautam
@ 2019-01-21  5:53 ` Vivek Gautam
  2019-01-21 13:51   ` Robin Murphy
  2019-01-21  5:53 ` [PATCH 2/3] iommu/io-pgtable-arm: Add support to use system cache Vivek Gautam
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 19+ messages in thread
From: Vivek Gautam @ 2019-01-21  5:53 UTC (permalink / raw)
  To: will.deacon, robin.murphy, joro, iommu
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, tfiga, pratikp,
	pdaly, jcrouse, Vivek Gautam

A number of arm_smmu_domain's attributes can be assigned based
on the iommu domains's attributes. These local attributes better
be managed by a bitmap.
So remove boolean flags and move to a 32-bit bitmap, and enable
each bits separtely.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
---
 drivers/iommu/arm-smmu.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 7ebbcf1b2eb3..52b300dfc096 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -257,10 +257,11 @@ struct arm_smmu_domain {
 	const struct iommu_gather_ops	*tlb_ops;
 	struct arm_smmu_cfg		cfg;
 	enum arm_smmu_domain_stage	stage;
-	bool				non_strict;
 	struct mutex			init_mutex; /* Protects smmu pointer */
 	spinlock_t			cb_lock; /* Serialises ATS1* ops and TLB syncs */
 	struct iommu_domain		domain;
+#define ARM_SMMU_DOMAIN_ATTR_NON_STRICT	BIT(0)
+	unsigned int			attr;
 };
 
 struct arm_smmu_option_prop {
@@ -901,7 +902,7 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
 		pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
 
-	if (smmu_domain->non_strict)
+	if (smmu_domain->attr & ARM_SMMU_DOMAIN_ATTR_NON_STRICT)
 		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
 
 	/* Non coherent page table mappings only for Stage-1 */
@@ -1598,7 +1599,8 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 	case IOMMU_DOMAIN_DMA:
 		switch (attr) {
 		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-			*(int *)data = smmu_domain->non_strict;
+			*(int *)data = !!(smmu_domain->attr &
+					  ARM_SMMU_DOMAIN_ATTR_NON_STRICT);
 			return 0;
 		default:
 			return -ENODEV;
@@ -1638,7 +1640,7 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
 	case IOMMU_DOMAIN_DMA:
 		switch (attr) {
 		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
-			smmu_domain->non_strict = *(int *)data;
+			smmu_domain->attr |= ARM_SMMU_DOMAIN_ATTR_NON_STRICT;
 			break;
 		default:
 			ret = -ENODEV;
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 2/3] iommu/io-pgtable-arm: Add support to use system cache
  2019-01-21  5:53 [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Vivek Gautam
  2019-01-21  5:53 ` [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes Vivek Gautam
@ 2019-01-21  5:53 ` Vivek Gautam
  2019-01-21  5:53 ` [PATCH 3/3] iommu/arm-smmu: " Vivek Gautam
  2019-01-21  7:26 ` [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Ard Biesheuvel
  3 siblings, 0 replies; 19+ messages in thread
From: Vivek Gautam @ 2019-01-21  5:53 UTC (permalink / raw)
  To: will.deacon, robin.murphy, joro, iommu
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, tfiga, pratikp,
	pdaly, jcrouse, Vivek Gautam

Few Qualcomm platforms such as, sdm845 have an additional outer
cache called as System cache, aka. Last level cache (LLC) that
allows non-coherent devices to upgrade to using caching.

There is a fundamental assumption that non-coherent devices can't
access caches. This change adds an exception where they *can* use
some level of cache despite still being non-coherent overall.
The coherent devices that use cacheable memory, and CPU make use of
this system cache by default.

Looking at memory types, we have following -
a) Normal uncached :- MAIR 0x44, inner non-cacheable,
                      outer non-cacheable;
b) Normal cached :-   MAIR 0xff, inner read write-back non-transient,
                      outer read write-back non-transient;
                      attribute setting for coherenet I/O devices.
and, for non-coherent i/o devices that can allocate in system cache
another type gets added -
c) Normal sys-cached :- MAIR 0xf4, inner non-cacheable,
                        outer read write-back non-transient

Coherent I/O devices use system cache by marking the memory as
normal cached.
Non-coherent I/O devices should mark the memory as normal
sys-cached in page tables to use system cache.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
---
 drivers/iommu/io-pgtable-arm.c | 15 +++++++++++++--
 drivers/iommu/io-pgtable.h     |  4 ++++
 include/linux/iommu.h          |  2 ++
 3 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
index c76919c30f1a..0e55772702da 100644
--- a/drivers/iommu/io-pgtable-arm.c
+++ b/drivers/iommu/io-pgtable-arm.c
@@ -168,10 +168,12 @@
 #define ARM_LPAE_MAIR_ATTR_MASK		0xff
 #define ARM_LPAE_MAIR_ATTR_DEVICE	0x04
 #define ARM_LPAE_MAIR_ATTR_NC		0x44
+#define ARM_LPAE_MAIR_ATTR_QCOM_SYS_CACHE	0xf4
 #define ARM_LPAE_MAIR_ATTR_WBRWA	0xff
 #define ARM_LPAE_MAIR_ATTR_IDX_NC	0
 #define ARM_LPAE_MAIR_ATTR_IDX_CACHE	1
 #define ARM_LPAE_MAIR_ATTR_IDX_DEV	2
+#define ARM_LPAE_MAIR_ATTR_IDX_QCOM_SYS_CACHE	3
 
 /* IOPTE accessors */
 #define iopte_deref(pte,d) __va(iopte_to_paddr(pte, d))
@@ -443,6 +445,9 @@ static arm_lpae_iopte arm_lpae_prot_to_pte(struct arm_lpae_io_pgtable *data,
 		else if (prot & IOMMU_CACHE)
 			pte |= (ARM_LPAE_MAIR_ATTR_IDX_CACHE
 				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
+		else if (prot & IOMMU_QCOM_SYS_CACHE)
+			pte |= (ARM_LPAE_MAIR_ATTR_IDX_QCOM_SYS_CACHE
+				<< ARM_LPAE_PTE_ATTRINDX_SHIFT);
 	} else {
 		pte = ARM_LPAE_PTE_HAP_FAULT;
 		if (prot & IOMMU_READ)
@@ -781,7 +786,8 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 
 	if (cfg->quirks & ~(IO_PGTABLE_QUIRK_ARM_NS | IO_PGTABLE_QUIRK_NO_DMA |
 			    IO_PGTABLE_QUIRK_NON_STRICT |
-			    IO_PGTABLE_QUIRK_NON_COHERENT))
+			    IO_PGTABLE_QUIRK_NON_COHERENT |
+			    IO_PGTABLE_QUIRK_QCOM_SYS_CACHE))
 		return NULL;
 
 	data = arm_lpae_alloc_pgtable(cfg);
@@ -794,6 +800,9 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 	if (cfg->quirks & IO_PGTABLE_QUIRK_NON_COHERENT)
 		reg |= ARM_LPAE_TCR_RGN_NC << ARM_LPAE_TCR_IRGN0_SHIFT |
 		       ARM_LPAE_TCR_RGN_NC << ARM_LPAE_TCR_ORGN0_SHIFT;
+	else if (cfg->quirks & IO_PGTABLE_QUIRK_QCOM_SYS_CACHE)
+		reg |= ARM_LPAE_TCR_RGN_NC << ARM_LPAE_TCR_IRGN0_SHIFT |
+		      ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT;
 	else
 		reg |= ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_IRGN0_SHIFT |
 		       ARM_LPAE_TCR_RGN_WBWA << ARM_LPAE_TCR_ORGN0_SHIFT;
@@ -848,7 +857,9 @@ arm_64_lpae_alloc_pgtable_s1(struct io_pgtable_cfg *cfg, void *cookie)
 	      (ARM_LPAE_MAIR_ATTR_WBRWA
 	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_CACHE)) |
 	      (ARM_LPAE_MAIR_ATTR_DEVICE
-	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV));
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_DEV)) |
+	      (ARM_LPAE_MAIR_ATTR_QCOM_SYS_CACHE
+	       << ARM_LPAE_MAIR_ATTR_SHIFT(ARM_LPAE_MAIR_ATTR_IDX_QCOM_SYS_CACHE));
 
 	cfg->arm_lpae_s1_cfg.mair[0] = reg;
 	cfg->arm_lpae_s1_cfg.mair[1] = 0;
diff --git a/drivers/iommu/io-pgtable.h b/drivers/iommu/io-pgtable.h
index 46604cf7b017..fb237e8aa9f1 100644
--- a/drivers/iommu/io-pgtable.h
+++ b/drivers/iommu/io-pgtable.h
@@ -80,6 +80,9 @@ struct io_pgtable_cfg {
 	 *      pagetables even on a coherent SMMU for cases where reducing
 	 *      snoop traffic/latency on walks outweighs the cost of cache
 	 *      maintenance on PTE updates.
+	 *
+	 * IO_PGTABLE_QUIRK_QCOM_SYS_CACHE: Force using outer system cache
+	 *      for non-coherent devices on Qcom platforms.
 	 */
 	#define IO_PGTABLE_QUIRK_ARM_NS		BIT(0)
 	#define IO_PGTABLE_QUIRK_NO_PERMS	BIT(1)
@@ -88,6 +91,7 @@ struct io_pgtable_cfg {
 	#define IO_PGTABLE_QUIRK_NO_DMA		BIT(4)
 	#define IO_PGTABLE_QUIRK_NON_STRICT	BIT(5)
 	#define IO_PGTABLE_QUIRK_NON_COHERENT	BIT(6)
+	#define IO_PGTABLE_QUIRK_QCOM_SYS_CACHE BIT(7)
 	unsigned long			quirks;
 	unsigned long			pgsize_bitmap;
 	unsigned int			ias;
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index e90da6b6f3d1..08bb6befad5e 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -31,6 +31,7 @@
 #define IOMMU_CACHE	(1 << 2) /* DMA cache coherency */
 #define IOMMU_NOEXEC	(1 << 3)
 #define IOMMU_MMIO	(1 << 4) /* e.g. things like MSI doorbells */
+#define IOMMU_QCOM_SYS_CACHE	(1 << 6)
 /*
  * Where the bus hardware includes a privilege level as part of its access type
  * markings, and certain devices are capable of issuing transactions marked as
@@ -125,6 +126,7 @@ enum iommu_attr {
 	DOMAIN_ATTR_FSL_PAMUV1,
 	DOMAIN_ATTR_NESTING,	/* two stages of translation */
 	DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE,
+	DOMAIN_ATTR_QCOM_SYS_CACHE,
 	DOMAIN_ATTR_MAX,
 };
 
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH 3/3] iommu/arm-smmu: Add support to use system cache
  2019-01-21  5:53 [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Vivek Gautam
  2019-01-21  5:53 ` [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes Vivek Gautam
  2019-01-21  5:53 ` [PATCH 2/3] iommu/io-pgtable-arm: Add support to use system cache Vivek Gautam
@ 2019-01-21  5:53 ` Vivek Gautam
  2019-01-21  7:26 ` [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Ard Biesheuvel
  3 siblings, 0 replies; 19+ messages in thread
From: Vivek Gautam @ 2019-01-21  5:53 UTC (permalink / raw)
  To: will.deacon, robin.murphy, joro, iommu
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, tfiga, pratikp,
	pdaly, jcrouse, Vivek Gautam

Few Qualcomm platforms, such as sdm845 have an additional outer
cache called as System cache, aka. Last level cache (LLC) that
allows non-coherent devices to upgrade to using caching.
This last level cache sits right before the DDR, and is tightly
coupled with the memory controller.
The cache is available to a number of devices - coherent and
non-coherent, present in the SoC system, and to CPUs.
The devices request their slices from this system cache, make it
active, and can then start using it.

Devices can set iommu domain attributes and page protection
while mapping the buffers to set the required memory attributes
to use system cache for buffers and page tables.
This change adds the support for iommu domain attributes and the
interaction with io page table driver.

Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
---
 drivers/iommu/arm-smmu.c | 20 +++++++++++++++++++-
 1 file changed, 19 insertions(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index 52b300dfc096..324f3bb54c78 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -260,7 +260,8 @@ struct arm_smmu_domain {
 	struct mutex			init_mutex; /* Protects smmu pointer */
 	spinlock_t			cb_lock; /* Serialises ATS1* ops and TLB syncs */
 	struct iommu_domain		domain;
-#define ARM_SMMU_DOMAIN_ATTR_NON_STRICT	BIT(0)
+#define ARM_SMMU_DOMAIN_ATTR_QCOM_SYS_CACHE	BIT(1)
+#define ARM_SMMU_DOMAIN_ATTR_NON_STRICT		BIT(0)
 	unsigned int			attr;
 };
 
@@ -910,6 +911,9 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
 	    smmu_domain->stage == ARM_SMMU_DOMAIN_S1)
 		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_COHERENT;
 
+	if (smmu_domain->attr & ARM_SMMU_DOMAIN_ATTR_QCOM_SYS_CACHE)
+		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_QCOM_SYS_CACHE;
+
 	smmu_domain->smmu = smmu;
 	pgtbl_ops = alloc_io_pgtable_ops(fmt, &pgtbl_cfg, smmu_domain);
 	if (!pgtbl_ops) {
@@ -1592,6 +1596,10 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
 		case DOMAIN_ATTR_NESTING:
 			*(int *)data = (smmu_domain->stage == ARM_SMMU_DOMAIN_NESTED);
 			return 0;
+		case DOMAIN_ATTR_QCOM_SYS_CACHE:
+			*(int *)data = !!(smmu_domain->attr &
+					  ARM_SMMU_DOMAIN_ATTR_QCOM_SYS_CACHE);
+			return 0;
 		default:
 			return -ENODEV;
 		}
@@ -1633,6 +1641,16 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
 			else
 				smmu_domain->stage = ARM_SMMU_DOMAIN_S1;
 			break;
+		case DOMAIN_ATTR_QCOM_SYS_CACHE:
+			if (smmu_domain->smmu) {
+				ret = -EPERM;
+				goto out_unlock;
+			}
+			if (*(int *)data)
+				smmu_domain->attr |= ARM_SMMU_DOMAIN_ATTR_QCOM_SYS_CACHE;
+			else
+				smmu_domain->attr &= ~ARM_SMMU_DOMAIN_ATTR_QCOM_SYS_CACHE;
+			break;
 		default:
 			ret = -ENODEV;
 		}
-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21  5:53 [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Vivek Gautam
                   ` (2 preceding siblings ...)
  2019-01-21  5:53 ` [PATCH 3/3] iommu/arm-smmu: " Vivek Gautam
@ 2019-01-21  7:26 ` Ard Biesheuvel
  2019-01-21 10:17   ` Vivek Gautam
  3 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2019-01-21  7:26 UTC (permalink / raw)
  To: Vivek Gautam
  Cc: Will Deacon, Robin Murphy, Joerg Roedel, iommu, pdaly,
	linux-arm-msm, Linux Kernel Mailing List, tfiga, jcrouse,
	pratikp, linux-arm-kernel

On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>
> Qualcomm SoCs have an additional level of cache called as
> System cache, aka. Last level cache (LLC). This cache sits right
> before the DDR, and is tightly coupled with the memory controller.
> The clients using this cache request their slices from this
> system cache, make it active, and can then start using it.
> For these clients with smmu, to start using the system cache for
> buffers and, related page tables [1], memory attributes need to be
> set accordingly. This series add the required support.
>

Does this actually improve performance on reads from a device? The
non-cache coherent DMA routines perform an unconditional D-cache
invalidate by VA to the PoC before reading from the buffers filled by
the device, and I would expect the PoC to be defined as lying beyond
the LLC to still guarantee the architected behavior.



> This change is a realisation of following changes from downstream msm-4.9:
> iommu: io-pgtable-arm: Support DOMAIN_ATTRIBUTE_USE_UPSTREAM_HINT[2]
> iommu: io-pgtable-arm: Implement IOMMU_USE_UPSTREAM_HINT[3]
>
> Changes since v2:
>  - Split the patches into io-pgtable-arm driver and arm-smmu driver.
>  - Converted smmu domain attributes to a bitmap, so multiple attributes
>    can be managed easily.
>  - With addition of non-coherent page table mapping support [4], this
>    patch series now aligns with the understanding of upgrading the
>    non-coherent devices to use some level of outer cache.
>  - Updated the macros and comments to reflect the use of QCOM_SYS_CACHE.
>  - QCOM_SYS_CACHE can still be used at stage 2, so that doens't depend on
>    stage-1 mapping.
>  - Added change to disable the attribute from arm_smmu_domain_set_attr()
>    when needed.
>  - Removed the page protection controls for QCOM_SYS_CACHE at the DMA API
>    level.
>
> Goes on top of the non-coherent page tables support patch series [4]
>
> [1] https://patchwork.kernel.org/patch/10302791/
> [2] https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9&id=bf762276796e79ca90014992f4d9da5593fa7d51
> [3] https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9&id=d4c72c413ea27c43f60825193d4de9cb8ffd9602
> [4] https://lore.kernel.org/patchwork/cover/1032938/
>
> Vivek Gautam (3):
>   iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes
>   iommu/io-pgtable-arm: Add support to use system cache
>   iommu/arm-smmu: Add support to use system cache
>
>  drivers/iommu/arm-smmu.c       | 28 ++++++++++++++++++++++++----
>  drivers/iommu/io-pgtable-arm.c | 15 +++++++++++++--
>  drivers/iommu/io-pgtable.h     |  4 ++++
>  include/linux/iommu.h          |  2 ++
>  4 files changed, 43 insertions(+), 6 deletions(-)
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21  7:26 ` [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Ard Biesheuvel
@ 2019-01-21 10:17   ` Vivek Gautam
  2019-01-21 10:50     ` Ard Biesheuvel
  0 siblings, 1 reply; 19+ messages in thread
From: Vivek Gautam @ 2019-01-21 10:17 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Will Deacon, Robin Murphy, Joerg Roedel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	pdaly, linux-arm-msm, Linux Kernel Mailing List, Tomasz Figa,
	Jordan Crouse, pratikp, linux-arm-kernel

Hi,


On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >
> > Qualcomm SoCs have an additional level of cache called as
> > System cache, aka. Last level cache (LLC). This cache sits right
> > before the DDR, and is tightly coupled with the memory controller.
> > The clients using this cache request their slices from this
> > system cache, make it active, and can then start using it.
> > For these clients with smmu, to start using the system cache for
> > buffers and, related page tables [1], memory attributes need to be
> > set accordingly. This series add the required support.
> >
>
> Does this actually improve performance on reads from a device? The
> non-cache coherent DMA routines perform an unconditional D-cache
> invalidate by VA to the PoC before reading from the buffers filled by
> the device, and I would expect the PoC to be defined as lying beyond
> the LLC to still guarantee the architected behavior.

We have seen performance improvements when running Manhattan
GFXBench benchmarks.

As for the PoC, from my knowledge on sdm845 the system cache, aka
Last level cache (LLC) lies beyond the point of coherency.
Non-cache coherent buffers will not be cached to system cache also, and
no additional software cache maintenance ops are required for system cache.
Pratik can add more if I am missing something.

To take care of the memory attributes from DMA APIs side, we can add a
DMA_ATTR definition to take care of any dma non-coherent APIs calls.

Regards
Vivek
>
>
>
> > This change is a realisation of following changes from downstream msm-4.9:
> > iommu: io-pgtable-arm: Support DOMAIN_ATTRIBUTE_USE_UPSTREAM_HINT[2]
> > iommu: io-pgtable-arm: Implement IOMMU_USE_UPSTREAM_HINT[3]
> >
> > Changes since v2:
> >  - Split the patches into io-pgtable-arm driver and arm-smmu driver.
> >  - Converted smmu domain attributes to a bitmap, so multiple attributes
> >    can be managed easily.
> >  - With addition of non-coherent page table mapping support [4], this
> >    patch series now aligns with the understanding of upgrading the
> >    non-coherent devices to use some level of outer cache.
> >  - Updated the macros and comments to reflect the use of QCOM_SYS_CACHE.
> >  - QCOM_SYS_CACHE can still be used at stage 2, so that doens't depend on
> >    stage-1 mapping.
> >  - Added change to disable the attribute from arm_smmu_domain_set_attr()
> >    when needed.
> >  - Removed the page protection controls for QCOM_SYS_CACHE at the DMA API
> >    level.
> >
> > Goes on top of the non-coherent page tables support patch series [4]
> >
> > [1] https://patchwork.kernel.org/patch/10302791/
> > [2] https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9&id=bf762276796e79ca90014992f4d9da5593fa7d51
> > [3] https://source.codeaurora.org/quic/la/kernel/msm-4.9/commit/?h=msm-4.9&id=d4c72c413ea27c43f60825193d4de9cb8ffd9602
> > [4] https://lore.kernel.org/patchwork/cover/1032938/
> >
> > Vivek Gautam (3):
> >   iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes
> >   iommu/io-pgtable-arm: Add support to use system cache
> >   iommu/arm-smmu: Add support to use system cache
> >
> >  drivers/iommu/arm-smmu.c       | 28 ++++++++++++++++++++++++----
> >  drivers/iommu/io-pgtable-arm.c | 15 +++++++++++++--
> >  drivers/iommu/io-pgtable.h     |  4 ++++
> >  include/linux/iommu.h          |  2 ++
> >  4 files changed, 43 insertions(+), 6 deletions(-)
> >
> > --
> > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> > of Code Aurora Forum, hosted by The Linux Foundation
> >
> >
> > _______________________________________________
> > linux-arm-kernel mailing list
> > linux-arm-kernel@lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel



-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 10:17   ` Vivek Gautam
@ 2019-01-21 10:50     ` Ard Biesheuvel
  2019-01-21 13:25       ` Robin Murphy
  0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2019-01-21 10:50 UTC (permalink / raw)
  To: Vivek Gautam
  Cc: Will Deacon, Robin Murphy, Joerg Roedel,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	pdaly, linux-arm-msm, Linux Kernel Mailing List, Tomasz Figa,
	Jordan Crouse, pratikp, linux-arm-kernel

On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>
> Hi,
>
>
> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > >
> > > Qualcomm SoCs have an additional level of cache called as
> > > System cache, aka. Last level cache (LLC). This cache sits right
> > > before the DDR, and is tightly coupled with the memory controller.
> > > The clients using this cache request their slices from this
> > > system cache, make it active, and can then start using it.
> > > For these clients with smmu, to start using the system cache for
> > > buffers and, related page tables [1], memory attributes need to be
> > > set accordingly. This series add the required support.
> > >
> >
> > Does this actually improve performance on reads from a device? The
> > non-cache coherent DMA routines perform an unconditional D-cache
> > invalidate by VA to the PoC before reading from the buffers filled by
> > the device, and I would expect the PoC to be defined as lying beyond
> > the LLC to still guarantee the architected behavior.
>
> We have seen performance improvements when running Manhattan
> GFXBench benchmarks.
>

Ah ok, that makes sense, since in that case, the data flow is mostly
to the device, not from the device.

> As for the PoC, from my knowledge on sdm845 the system cache, aka
> Last level cache (LLC) lies beyond the point of coherency.
> Non-cache coherent buffers will not be cached to system cache also, and
> no additional software cache maintenance ops are required for system cache.
> Pratik can add more if I am missing something.
>
> To take care of the memory attributes from DMA APIs side, we can add a
> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
>

So does the device use the correct inner non-cacheable, outer
writeback cacheable attributes if the SMMU is in pass-through?

We have been looking into another use case where the fact that the
SMMU overrides memory attributes is causing issues (WC mappings used
by the radeon and amdgpu driver). So if the SMMU would honour the
existing attributes, would you still need the SMMU changes?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 10:50     ` Ard Biesheuvel
@ 2019-01-21 13:25       ` Robin Murphy
  2019-01-21 13:36         ` Ard Biesheuvel
  0 siblings, 1 reply; 19+ messages in thread
From: Robin Murphy @ 2019-01-21 13:25 UTC (permalink / raw)
  To: Ard Biesheuvel, Vivek Gautam
  Cc: Will Deacon, Joerg Roedel, list@263.net:IOMMU DRIVERS,
	Joerg Roedel, iommu, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On 21/01/2019 10:50, Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>>
>> Hi,
>>
>>
>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>>
>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>>>>
>>>> Qualcomm SoCs have an additional level of cache called as
>>>> System cache, aka. Last level cache (LLC). This cache sits right
>>>> before the DDR, and is tightly coupled with the memory controller.
>>>> The clients using this cache request their slices from this
>>>> system cache, make it active, and can then start using it.
>>>> For these clients with smmu, to start using the system cache for
>>>> buffers and, related page tables [1], memory attributes need to be
>>>> set accordingly. This series add the required support.
>>>>
>>>
>>> Does this actually improve performance on reads from a device? The
>>> non-cache coherent DMA routines perform an unconditional D-cache
>>> invalidate by VA to the PoC before reading from the buffers filled by
>>> the device, and I would expect the PoC to be defined as lying beyond
>>> the LLC to still guarantee the architected behavior.
>>
>> We have seen performance improvements when running Manhattan
>> GFXBench benchmarks.
>>
> 
> Ah ok, that makes sense, since in that case, the data flow is mostly
> to the device, not from the device.
> 
>> As for the PoC, from my knowledge on sdm845 the system cache, aka
>> Last level cache (LLC) lies beyond the point of coherency.
>> Non-cache coherent buffers will not be cached to system cache also, and
>> no additional software cache maintenance ops are required for system cache.
>> Pratik can add more if I am missing something.
>>
>> To take care of the memory attributes from DMA APIs side, we can add a
>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
>>
> 
> So does the device use the correct inner non-cacheable, outer
> writeback cacheable attributes if the SMMU is in pass-through?
> 
> We have been looking into another use case where the fact that the
> SMMU overrides memory attributes is causing issues (WC mappings used
> by the radeon and amdgpu driver). So if the SMMU would honour the
> existing attributes, would you still need the SMMU changes?

Even if we could force a stage 2 mapping with the weakest pagetable 
attributes (such that combining would work), there would still need to 
be a way to set the TCR attributes appropriately if this behaviour is 
wanted for the SMMU's own table walks as well.

Robin.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 13:25       ` Robin Murphy
@ 2019-01-21 13:36         ` Ard Biesheuvel
  2019-01-21 13:56           ` Robin Murphy
  0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2019-01-21 13:36 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Vivek Gautam, Will Deacon, Joerg Roedel,
	list@263.net:IOMMU DRIVERS, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >>
> >> Hi,
> >>
> >>
> >> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> >> <ard.biesheuvel@linaro.org> wrote:
> >>>
> >>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >>>>
> >>>> Qualcomm SoCs have an additional level of cache called as
> >>>> System cache, aka. Last level cache (LLC). This cache sits right
> >>>> before the DDR, and is tightly coupled with the memory controller.
> >>>> The clients using this cache request their slices from this
> >>>> system cache, make it active, and can then start using it.
> >>>> For these clients with smmu, to start using the system cache for
> >>>> buffers and, related page tables [1], memory attributes need to be
> >>>> set accordingly. This series add the required support.
> >>>>
> >>>
> >>> Does this actually improve performance on reads from a device? The
> >>> non-cache coherent DMA routines perform an unconditional D-cache
> >>> invalidate by VA to the PoC before reading from the buffers filled by
> >>> the device, and I would expect the PoC to be defined as lying beyond
> >>> the LLC to still guarantee the architected behavior.
> >>
> >> We have seen performance improvements when running Manhattan
> >> GFXBench benchmarks.
> >>
> >
> > Ah ok, that makes sense, since in that case, the data flow is mostly
> > to the device, not from the device.
> >
> >> As for the PoC, from my knowledge on sdm845 the system cache, aka
> >> Last level cache (LLC) lies beyond the point of coherency.
> >> Non-cache coherent buffers will not be cached to system cache also, and
> >> no additional software cache maintenance ops are required for system cache.
> >> Pratik can add more if I am missing something.
> >>
> >> To take care of the memory attributes from DMA APIs side, we can add a
> >> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> >>
> >
> > So does the device use the correct inner non-cacheable, outer
> > writeback cacheable attributes if the SMMU is in pass-through?
> >
> > We have been looking into another use case where the fact that the
> > SMMU overrides memory attributes is causing issues (WC mappings used
> > by the radeon and amdgpu driver). So if the SMMU would honour the
> > existing attributes, would you still need the SMMU changes?
>
> Even if we could force a stage 2 mapping with the weakest pagetable
> attributes (such that combining would work), there would still need to
> be a way to set the TCR attributes appropriately if this behaviour is
> wanted for the SMMU's own table walks as well.
>

Isn't that just a matter of implementing support for SMMUs that lack
the 'dma-coherent' attribute?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes
  2019-01-21  5:53 ` [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes Vivek Gautam
@ 2019-01-21 13:51   ` Robin Murphy
  2019-01-22 17:06     ` Vivek Gautam
  0 siblings, 1 reply; 19+ messages in thread
From: Robin Murphy @ 2019-01-21 13:51 UTC (permalink / raw)
  To: Vivek Gautam, will.deacon, joro, iommu
  Cc: linux-arm-kernel, linux-kernel, linux-arm-msm, tfiga, pratikp,
	pdaly, jcrouse

On 21/01/2019 05:53, Vivek Gautam wrote:
> A number of arm_smmu_domain's attributes can be assigned based
> on the iommu domains's attributes. These local attributes better
> be managed by a bitmap.
> So remove boolean flags and move to a 32-bit bitmap, and enable
> each bits separtely.
> 
> Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
> ---
>   drivers/iommu/arm-smmu.c | 10 ++++++----
>   1 file changed, 6 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index 7ebbcf1b2eb3..52b300dfc096 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -257,10 +257,11 @@ struct arm_smmu_domain {
>   	const struct iommu_gather_ops	*tlb_ops;
>   	struct arm_smmu_cfg		cfg;
>   	enum arm_smmu_domain_stage	stage;
> -	bool				non_strict;
>   	struct mutex			init_mutex; /* Protects smmu pointer */
>   	spinlock_t			cb_lock; /* Serialises ATS1* ops and TLB syncs */
>   	struct iommu_domain		domain;
> +#define ARM_SMMU_DOMAIN_ATTR_NON_STRICT	BIT(0)
> +	unsigned int			attr;
>   };
>   
>   struct arm_smmu_option_prop {
> @@ -901,7 +902,7 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
>   	if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
>   		pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
>   
> -	if (smmu_domain->non_strict)
> +	if (smmu_domain->attr & ARM_SMMU_DOMAIN_ATTR_NON_STRICT)
>   		pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
>   
>   	/* Non coherent page table mappings only for Stage-1 */
> @@ -1598,7 +1599,8 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
>   	case IOMMU_DOMAIN_DMA:
>   		switch (attr) {
>   		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
> -			*(int *)data = smmu_domain->non_strict;
> +			*(int *)data = !!(smmu_domain->attr &
> +					  ARM_SMMU_DOMAIN_ATTR_NON_STRICT);
>   			return 0;
>   		default:
>   			return -ENODEV;
> @@ -1638,7 +1640,7 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
>   	case IOMMU_DOMAIN_DMA:
>   		switch (attr) {
>   		case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
> -			smmu_domain->non_strict = *(int *)data;
> +			smmu_domain->attr |= ARM_SMMU_DOMAIN_ATTR_NON_STRICT;

But what if *data == 0?

Robin.

>   			break;
>   		default:
>   			ret = -ENODEV;
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 13:36         ` Ard Biesheuvel
@ 2019-01-21 13:56           ` Robin Murphy
  2019-01-21 14:24             ` Ard Biesheuvel
  0 siblings, 1 reply; 19+ messages in thread
From: Robin Murphy @ 2019-01-21 13:56 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Vivek Gautam, Will Deacon, Joerg Roedel,
	list@263.net:IOMMU DRIVERS, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On 21/01/2019 13:36, Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 21/01/2019 10:50, Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>>>>
>>>> Hi,
>>>>
>>>>
>>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
>>>> <ard.biesheuvel@linaro.org> wrote:
>>>>>
>>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>>>>>>
>>>>>> Qualcomm SoCs have an additional level of cache called as
>>>>>> System cache, aka. Last level cache (LLC). This cache sits right
>>>>>> before the DDR, and is tightly coupled with the memory controller.
>>>>>> The clients using this cache request their slices from this
>>>>>> system cache, make it active, and can then start using it.
>>>>>> For these clients with smmu, to start using the system cache for
>>>>>> buffers and, related page tables [1], memory attributes need to be
>>>>>> set accordingly. This series add the required support.
>>>>>>
>>>>>
>>>>> Does this actually improve performance on reads from a device? The
>>>>> non-cache coherent DMA routines perform an unconditional D-cache
>>>>> invalidate by VA to the PoC before reading from the buffers filled by
>>>>> the device, and I would expect the PoC to be defined as lying beyond
>>>>> the LLC to still guarantee the architected behavior.
>>>>
>>>> We have seen performance improvements when running Manhattan
>>>> GFXBench benchmarks.
>>>>
>>>
>>> Ah ok, that makes sense, since in that case, the data flow is mostly
>>> to the device, not from the device.
>>>
>>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
>>>> Last level cache (LLC) lies beyond the point of coherency.
>>>> Non-cache coherent buffers will not be cached to system cache also, and
>>>> no additional software cache maintenance ops are required for system cache.
>>>> Pratik can add more if I am missing something.
>>>>
>>>> To take care of the memory attributes from DMA APIs side, we can add a
>>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
>>>>
>>>
>>> So does the device use the correct inner non-cacheable, outer
>>> writeback cacheable attributes if the SMMU is in pass-through?
>>>
>>> We have been looking into another use case where the fact that the
>>> SMMU overrides memory attributes is causing issues (WC mappings used
>>> by the radeon and amdgpu driver). So if the SMMU would honour the
>>> existing attributes, would you still need the SMMU changes?
>>
>> Even if we could force a stage 2 mapping with the weakest pagetable
>> attributes (such that combining would work), there would still need to
>> be a way to set the TCR attributes appropriately if this behaviour is
>> wanted for the SMMU's own table walks as well.
>>
> 
> Isn't that just a matter of implementing support for SMMUs that lack
> the 'dma-coherent' attribute?

Not quite - in general they need INC-ONC attributes in case there 
actually is something in the architectural outer-cacheable domain. The 
case of the outer cacheablility being not that but a hint to control 
non-CPU traffic through some not-quite-transparent cache behind the PoC 
definitely stays wrapped up in qcom-specific magic ;)

Robin.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 13:56           ` Robin Murphy
@ 2019-01-21 14:24             ` Ard Biesheuvel
  2019-01-21 15:15               ` Robin Murphy
  2019-01-24  6:58               ` Vivek Gautam
  0 siblings, 2 replies; 19+ messages in thread
From: Ard Biesheuvel @ 2019-01-21 14:24 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Vivek Gautam, Will Deacon, Joerg Roedel,
	list@263.net:IOMMU DRIVERS, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2019 13:36, Ard Biesheuvel wrote:
> > On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
> >>
> >> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> >>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>>
> >>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> >>>> <ard.biesheuvel@linaro.org> wrote:
> >>>>>
> >>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >>>>>>
> >>>>>> Qualcomm SoCs have an additional level of cache called as
> >>>>>> System cache, aka. Last level cache (LLC). This cache sits right
> >>>>>> before the DDR, and is tightly coupled with the memory controller.
> >>>>>> The clients using this cache request their slices from this
> >>>>>> system cache, make it active, and can then start using it.
> >>>>>> For these clients with smmu, to start using the system cache for
> >>>>>> buffers and, related page tables [1], memory attributes need to be
> >>>>>> set accordingly. This series add the required support.
> >>>>>>
> >>>>>
> >>>>> Does this actually improve performance on reads from a device? The
> >>>>> non-cache coherent DMA routines perform an unconditional D-cache
> >>>>> invalidate by VA to the PoC before reading from the buffers filled by
> >>>>> the device, and I would expect the PoC to be defined as lying beyond
> >>>>> the LLC to still guarantee the architected behavior.
> >>>>
> >>>> We have seen performance improvements when running Manhattan
> >>>> GFXBench benchmarks.
> >>>>
> >>>
> >>> Ah ok, that makes sense, since in that case, the data flow is mostly
> >>> to the device, not from the device.
> >>>
> >>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
> >>>> Last level cache (LLC) lies beyond the point of coherency.
> >>>> Non-cache coherent buffers will not be cached to system cache also, and
> >>>> no additional software cache maintenance ops are required for system cache.
> >>>> Pratik can add more if I am missing something.
> >>>>
> >>>> To take care of the memory attributes from DMA APIs side, we can add a
> >>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> >>>>
> >>>
> >>> So does the device use the correct inner non-cacheable, outer
> >>> writeback cacheable attributes if the SMMU is in pass-through?
> >>>
> >>> We have been looking into another use case where the fact that the
> >>> SMMU overrides memory attributes is causing issues (WC mappings used
> >>> by the radeon and amdgpu driver). So if the SMMU would honour the
> >>> existing attributes, would you still need the SMMU changes?
> >>
> >> Even if we could force a stage 2 mapping with the weakest pagetable
> >> attributes (such that combining would work), there would still need to
> >> be a way to set the TCR attributes appropriately if this behaviour is
> >> wanted for the SMMU's own table walks as well.
> >>
> >
> > Isn't that just a matter of implementing support for SMMUs that lack
> > the 'dma-coherent' attribute?
>
> Not quite - in general they need INC-ONC attributes in case there
> actually is something in the architectural outer-cacheable domain.

But is it a problem to use INC-ONC attributes for the SMMU PTW on this
chip? AIUI, the reason for the SMMU changes is to avoid the
performance hit of snooping, which is more expensive than cache
maintenance of SMMU page tables. So are you saying the by-VA cache
maintenance is not relayed to this system cache, resulting in page
table updates to be invisible to masters using INC-ONC attributes?

> The
> case of the outer cacheablility being not that but a hint to control
> non-CPU traffic through some not-quite-transparent cache behind the PoC
> definitely stays wrapped up in qcom-specific magic ;)
>

I'm not surprised ...

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 14:24             ` Ard Biesheuvel
@ 2019-01-21 15:15               ` Robin Murphy
  2019-01-24  6:58               ` Vivek Gautam
  1 sibling, 0 replies; 19+ messages in thread
From: Robin Murphy @ 2019-01-21 15:15 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Vivek Gautam, Will Deacon, Joerg Roedel,
	list@263.net:IOMMU DRIVERS, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On 21/01/2019 14:24, Ard Biesheuvel wrote:
> On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
>>
>> On 21/01/2019 13:36, Ard Biesheuvel wrote:
>>> On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
>>>>
>>>> On 21/01/2019 10:50, Ard Biesheuvel wrote:
>>>>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>
>>>>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
>>>>>> <ard.biesheuvel@linaro.org> wrote:
>>>>>>>
>>>>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>>>>>>>>
>>>>>>>> Qualcomm SoCs have an additional level of cache called as
>>>>>>>> System cache, aka. Last level cache (LLC). This cache sits right
>>>>>>>> before the DDR, and is tightly coupled with the memory controller.
>>>>>>>> The clients using this cache request their slices from this
>>>>>>>> system cache, make it active, and can then start using it.
>>>>>>>> For these clients with smmu, to start using the system cache for
>>>>>>>> buffers and, related page tables [1], memory attributes need to be
>>>>>>>> set accordingly. This series add the required support.
>>>>>>>>
>>>>>>>
>>>>>>> Does this actually improve performance on reads from a device? The
>>>>>>> non-cache coherent DMA routines perform an unconditional D-cache
>>>>>>> invalidate by VA to the PoC before reading from the buffers filled by
>>>>>>> the device, and I would expect the PoC to be defined as lying beyond
>>>>>>> the LLC to still guarantee the architected behavior.
>>>>>>
>>>>>> We have seen performance improvements when running Manhattan
>>>>>> GFXBench benchmarks.
>>>>>>
>>>>>
>>>>> Ah ok, that makes sense, since in that case, the data flow is mostly
>>>>> to the device, not from the device.
>>>>>
>>>>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
>>>>>> Last level cache (LLC) lies beyond the point of coherency.
>>>>>> Non-cache coherent buffers will not be cached to system cache also, and
>>>>>> no additional software cache maintenance ops are required for system cache.
>>>>>> Pratik can add more if I am missing something.
>>>>>>
>>>>>> To take care of the memory attributes from DMA APIs side, we can add a
>>>>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
>>>>>>
>>>>>
>>>>> So does the device use the correct inner non-cacheable, outer
>>>>> writeback cacheable attributes if the SMMU is in pass-through?
>>>>>
>>>>> We have been looking into another use case where the fact that the
>>>>> SMMU overrides memory attributes is causing issues (WC mappings used
>>>>> by the radeon and amdgpu driver). So if the SMMU would honour the
>>>>> existing attributes, would you still need the SMMU changes?
>>>>
>>>> Even if we could force a stage 2 mapping with the weakest pagetable
>>>> attributes (such that combining would work), there would still need to
>>>> be a way to set the TCR attributes appropriately if this behaviour is
>>>> wanted for the SMMU's own table walks as well.
>>>>
>>>
>>> Isn't that just a matter of implementing support for SMMUs that lack
>>> the 'dma-coherent' attribute?
>>
>> Not quite - in general they need INC-ONC attributes in case there
>> actually is something in the architectural outer-cacheable domain.
> 
> But is it a problem to use INC-ONC attributes for the SMMU PTW on this
> chip? AIUI, the reason for the SMMU changes is to avoid the
> performance hit of snooping, which is more expensive than cache
> maintenance of SMMU page tables. So are you saying the by-VA cache
> maintenance is not relayed to this system cache, resulting in page
> table updates to be invisible to masters using INC-ONC attributes?

I only have a relatively vague impression of how this Qcom interconnect 
actually behaves, but AIUI the outer attribute has no correctness impact 
(it's effectively mismatched between CPU and devices already), only some 
degree of latency improvement which is effectively the opposite of 
no-snoop, in allowing certain non-coherent device traffic to still 
allocate in the LLC. I'm assuming that if that latency matters for the 
device accesses themselves, it might also matter for the associated 
table walks depending on the TLB miss rate.

Robin.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes
  2019-01-21 13:51   ` Robin Murphy
@ 2019-01-22 17:06     ` Vivek Gautam
  0 siblings, 0 replies; 19+ messages in thread
From: Vivek Gautam @ 2019-01-22 17:06 UTC (permalink / raw)
  To: Robin Murphy
  Cc: Will Deacon,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	list@263.net:IOMMU DRIVERS
	<iommu@lists.linux-foundation.org>,
	Joerg Roedel <joro@8bytes.org>,,
	pdaly, linux-arm-msm, open list, pratikp, Linux ARM

On Mon, Jan 21, 2019 at 7:23 PM Robin Murphy <robin.murphy@arm.com> wrote:
>
> On 21/01/2019 05:53, Vivek Gautam wrote:
> > A number of arm_smmu_domain's attributes can be assigned based
> > on the iommu domains's attributes. These local attributes better
> > be managed by a bitmap.
> > So remove boolean flags and move to a 32-bit bitmap, and enable
> > each bits separtely.
> >
> > Signed-off-by: Vivek Gautam <vivek.gautam@codeaurora.org>
> > ---
> >   drivers/iommu/arm-smmu.c | 10 ++++++----
> >   1 file changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> > index 7ebbcf1b2eb3..52b300dfc096 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -257,10 +257,11 @@ struct arm_smmu_domain {
> >       const struct iommu_gather_ops   *tlb_ops;
> >       struct arm_smmu_cfg             cfg;
> >       enum arm_smmu_domain_stage      stage;
> > -     bool                            non_strict;
> >       struct mutex                    init_mutex; /* Protects smmu pointer */
> >       spinlock_t                      cb_lock; /* Serialises ATS1* ops and TLB syncs */
> >       struct iommu_domain             domain;
> > +#define ARM_SMMU_DOMAIN_ATTR_NON_STRICT      BIT(0)
> > +     unsigned int                    attr;
> >   };
> >
> >   struct arm_smmu_option_prop {
> > @@ -901,7 +902,7 @@ static int arm_smmu_init_domain_context(struct iommu_domain *domain,
> >       if (smmu->features & ARM_SMMU_FEAT_COHERENT_WALK)
> >               pgtbl_cfg.quirks = IO_PGTABLE_QUIRK_NO_DMA;
> >
> > -     if (smmu_domain->non_strict)
> > +     if (smmu_domain->attr & ARM_SMMU_DOMAIN_ATTR_NON_STRICT)
> >               pgtbl_cfg.quirks |= IO_PGTABLE_QUIRK_NON_STRICT;
> >
> >       /* Non coherent page table mappings only for Stage-1 */
> > @@ -1598,7 +1599,8 @@ static int arm_smmu_domain_get_attr(struct iommu_domain *domain,
> >       case IOMMU_DOMAIN_DMA:
> >               switch (attr) {
> >               case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
> > -                     *(int *)data = smmu_domain->non_strict;
> > +                     *(int *)data = !!(smmu_domain->attr &
> > +                                       ARM_SMMU_DOMAIN_ATTR_NON_STRICT);
> >                       return 0;
> >               default:
> >                       return -ENODEV;
> > @@ -1638,7 +1640,7 @@ static int arm_smmu_domain_set_attr(struct iommu_domain *domain,
> >       case IOMMU_DOMAIN_DMA:
> >               switch (attr) {
> >               case DOMAIN_ATTR_DMA_USE_FLUSH_QUEUE:
> > -                     smmu_domain->non_strict = *(int *)data;
> > +                     smmu_domain->attr |= ARM_SMMU_DOMAIN_ATTR_NON_STRICT;
>
> But what if *data == 0?

Right, a check for data here also similar to what we are doing for
QCOM_SYS_CACHE [1].

[1] https://lore.kernel.org/patchwork/patch/1033796/

Regards
Vivek

>
> Robin.
>
> >                       break;
> >               default:
> >                       ret = -ENODEV;
> >
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu



-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-21 14:24             ` Ard Biesheuvel
  2019-01-21 15:15               ` Robin Murphy
@ 2019-01-24  6:58               ` Vivek Gautam
  2019-01-24  7:54                 ` Ard Biesheuvel
  1 sibling, 1 reply; 19+ messages in thread
From: Vivek Gautam @ 2019-01-24  6:58 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Robin Murphy, Will Deacon, Joerg Roedel,
	list@263.net:IOMMU DRIVERS, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On Mon, Jan 21, 2019 at 7:55 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
> >
> > On 21/01/2019 13:36, Ard Biesheuvel wrote:
> > > On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
> > >>
> > >> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> > >>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>>
> > >>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> > >>>> <ard.biesheuvel@linaro.org> wrote:
> > >>>>>
> > >>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > >>>>>>
> > >>>>>> Qualcomm SoCs have an additional level of cache called as
> > >>>>>> System cache, aka. Last level cache (LLC). This cache sits right
> > >>>>>> before the DDR, and is tightly coupled with the memory controller.
> > >>>>>> The clients using this cache request their slices from this
> > >>>>>> system cache, make it active, and can then start using it.
> > >>>>>> For these clients with smmu, to start using the system cache for
> > >>>>>> buffers and, related page tables [1], memory attributes need to be
> > >>>>>> set accordingly. This series add the required support.
> > >>>>>>
> > >>>>>
> > >>>>> Does this actually improve performance on reads from a device? The
> > >>>>> non-cache coherent DMA routines perform an unconditional D-cache
> > >>>>> invalidate by VA to the PoC before reading from the buffers filled by
> > >>>>> the device, and I would expect the PoC to be defined as lying beyond
> > >>>>> the LLC to still guarantee the architected behavior.
> > >>>>
> > >>>> We have seen performance improvements when running Manhattan
> > >>>> GFXBench benchmarks.
> > >>>>
> > >>>
> > >>> Ah ok, that makes sense, since in that case, the data flow is mostly
> > >>> to the device, not from the device.
> > >>>
> > >>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
> > >>>> Last level cache (LLC) lies beyond the point of coherency.
> > >>>> Non-cache coherent buffers will not be cached to system cache also, and
> > >>>> no additional software cache maintenance ops are required for system cache.
> > >>>> Pratik can add more if I am missing something.
> > >>>>
> > >>>> To take care of the memory attributes from DMA APIs side, we can add a
> > >>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> > >>>>
> > >>>
> > >>> So does the device use the correct inner non-cacheable, outer
> > >>> writeback cacheable attributes if the SMMU is in pass-through?
> > >>>
> > >>> We have been looking into another use case where the fact that the
> > >>> SMMU overrides memory attributes is causing issues (WC mappings used
> > >>> by the radeon and amdgpu driver). So if the SMMU would honour the
> > >>> existing attributes, would you still need the SMMU changes?
> > >>
> > >> Even if we could force a stage 2 mapping with the weakest pagetable
> > >> attributes (such that combining would work), there would still need to
> > >> be a way to set the TCR attributes appropriately if this behaviour is
> > >> wanted for the SMMU's own table walks as well.
> > >>
> > >
> > > Isn't that just a matter of implementing support for SMMUs that lack
> > > the 'dma-coherent' attribute?
> >
> > Not quite - in general they need INC-ONC attributes in case there
> > actually is something in the architectural outer-cacheable domain.
>
> But is it a problem to use INC-ONC attributes for the SMMU PTW on this
> chip? AIUI, the reason for the SMMU changes is to avoid the
> performance hit of snooping, which is more expensive than cache
> maintenance of SMMU page tables. So are you saying the by-VA cache
> maintenance is not relayed to this system cache, resulting in page
> table updates to be invisible to masters using INC-ONC attributes?

The reason for this SMMU changes is that the non-coherent devices
can't access the inner caches at all. But they have a way to allocate
and lookup in system cache.

CPU will by default make use of system cache when the inner-cacheable
and outer-cacheable memory attribute is set.

So for SMMU page tables to be visible to PTW,
-- For IO coherent clients, the CPU cache maintenance operations are not
required for buffers marked Normal Cached to achieve a coherent view of
memory. However, client-specific cache maintenance may still be
required for devices
with local caches (for example, compute DSP local L1 or L2).
-- For non-IO coherent clients, the CPU cache maintenance operations (cleans
and/or invalidates) are required at buffer handoff points for buffers marked as
Normal Cached in any CPU page table in order to observe the latest updates.


Regards
Vivek

>
> > The
> > case of the outer cacheablility being not that but a hint to control
> > non-CPU traffic through some not-quite-transparent cache behind the PoC
> > definitely stays wrapped up in qcom-specific magic ;)
> >
>
> I'm not surprised ...



-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-24  6:58               ` Vivek Gautam
@ 2019-01-24  7:54                 ` Ard Biesheuvel
  2019-01-28 11:27                   ` Vivek Gautam
  0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2019-01-24  7:54 UTC (permalink / raw)
  To: Vivek Gautam
  Cc: Robin Murphy, Will Deacon, Joerg Roedel,
	list@263.net:IOMMU DRIVERS, pdaly, linux-arm-msm,
	Linux Kernel Mailing List, Tomasz Figa, Jordan Crouse, pratikp,
	linux-arm-kernel

On Thu, 24 Jan 2019 at 07:58, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>
> On Mon, Jan 21, 2019 at 7:55 PM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
> > >
> > > On 21/01/2019 13:36, Ard Biesheuvel wrote:
> > > > On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
> > > >>
> > > >> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> > > >>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>>
> > > >>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> > > >>>> <ard.biesheuvel@linaro.org> wrote:
> > > >>>>>
> > > >>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > >>>>>>
> > > >>>>>> Qualcomm SoCs have an additional level of cache called as
> > > >>>>>> System cache, aka. Last level cache (LLC). This cache sits right
> > > >>>>>> before the DDR, and is tightly coupled with the memory controller.
> > > >>>>>> The clients using this cache request their slices from this
> > > >>>>>> system cache, make it active, and can then start using it.
> > > >>>>>> For these clients with smmu, to start using the system cache for
> > > >>>>>> buffers and, related page tables [1], memory attributes need to be
> > > >>>>>> set accordingly. This series add the required support.
> > > >>>>>>
> > > >>>>>
> > > >>>>> Does this actually improve performance on reads from a device? The
> > > >>>>> non-cache coherent DMA routines perform an unconditional D-cache
> > > >>>>> invalidate by VA to the PoC before reading from the buffers filled by
> > > >>>>> the device, and I would expect the PoC to be defined as lying beyond
> > > >>>>> the LLC to still guarantee the architected behavior.
> > > >>>>
> > > >>>> We have seen performance improvements when running Manhattan
> > > >>>> GFXBench benchmarks.
> > > >>>>
> > > >>>
> > > >>> Ah ok, that makes sense, since in that case, the data flow is mostly
> > > >>> to the device, not from the device.
> > > >>>
> > > >>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
> > > >>>> Last level cache (LLC) lies beyond the point of coherency.
> > > >>>> Non-cache coherent buffers will not be cached to system cache also, and
> > > >>>> no additional software cache maintenance ops are required for system cache.
> > > >>>> Pratik can add more if I am missing something.
> > > >>>>
> > > >>>> To take care of the memory attributes from DMA APIs side, we can add a
> > > >>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> > > >>>>
> > > >>>
> > > >>> So does the device use the correct inner non-cacheable, outer
> > > >>> writeback cacheable attributes if the SMMU is in pass-through?
> > > >>>
> > > >>> We have been looking into another use case where the fact that the
> > > >>> SMMU overrides memory attributes is causing issues (WC mappings used
> > > >>> by the radeon and amdgpu driver). So if the SMMU would honour the
> > > >>> existing attributes, would you still need the SMMU changes?
> > > >>
> > > >> Even if we could force a stage 2 mapping with the weakest pagetable
> > > >> attributes (such that combining would work), there would still need to
> > > >> be a way to set the TCR attributes appropriately if this behaviour is
> > > >> wanted for the SMMU's own table walks as well.
> > > >>
> > > >
> > > > Isn't that just a matter of implementing support for SMMUs that lack
> > > > the 'dma-coherent' attribute?
> > >
> > > Not quite - in general they need INC-ONC attributes in case there
> > > actually is something in the architectural outer-cacheable domain.
> >
> > But is it a problem to use INC-ONC attributes for the SMMU PTW on this
> > chip? AIUI, the reason for the SMMU changes is to avoid the
> > performance hit of snooping, which is more expensive than cache
> > maintenance of SMMU page tables. So are you saying the by-VA cache
> > maintenance is not relayed to this system cache, resulting in page
> > table updates to be invisible to masters using INC-ONC attributes?
>
> The reason for this SMMU changes is that the non-coherent devices
> can't access the inner caches at all. But they have a way to allocate
> and lookup in system cache.
>
> CPU will by default make use of system cache when the inner-cacheable
> and outer-cacheable memory attribute is set.
>
> So for SMMU page tables to be visible to PTW,
> -- For IO coherent clients, the CPU cache maintenance operations are not
> required for buffers marked Normal Cached to achieve a coherent view of
> memory. However, client-specific cache maintenance may still be
> required for devices
> with local caches (for example, compute DSP local L1 or L2).

Why would devices need to access the SMMU page tables?

> -- For non-IO coherent clients, the CPU cache maintenance operations (cleans
> and/or invalidates) are required at buffer handoff points for buffers marked as
> Normal Cached in any CPU page table in order to observe the latest updates.
>

Indeed, and this is what your non-coherent SMMU PTW requires, and what
you /should/ get when you omit the 'dma-coherent' property from its DT
node (and if you don't, it is a bug in the SMMU driver that should get
fixed)

The question is whether using inner-non-cached/outer-cacheable
attributes for the PTW is required for correctness, or whether it is
merely an optimization (since the point of this exercise was to avoid
snoop latency from the SMMU PTW). If it is an optimization, I would
like to understand whether the performance delta between SMMU page
tables in DRAM vs SMMU page tables in the LLC justifies these
intrusive changes to the SMMU driver.


> >
> > > The
> > > case of the outer cacheablility being not that but a hint to control
> > > non-CPU traffic through some not-quite-transparent cache behind the PoC
> > > definitely stays wrapped up in qcom-specific magic ;)
> > >
> >
> > I'm not surprised ...
>
>
>
> --
> QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-24  7:54                 ` Ard Biesheuvel
@ 2019-01-28 11:27                   ` Vivek Gautam
  2019-01-29 15:02                     ` Ard Biesheuvel
  0 siblings, 1 reply; 19+ messages in thread
From: Vivek Gautam @ 2019-01-28 11:27 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: pdaly, linux-arm-msm, Will Deacon, Linux Kernel Mailing List,
	list@263.net:IOMMU DRIVERS, Robin Murphy, linux-arm-kernel,
	pratikp

Hi Ard,

On Thu, Jan 24, 2019 at 1:25 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> On Thu, 24 Jan 2019 at 07:58, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >
> > On Mon, Jan 21, 2019 at 7:55 PM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
> > > >
> > > > On 21/01/2019 13:36, Ard Biesheuvel wrote:
> > > > > On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > >>
> > > > >> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> > > > >>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > > >>>>
> > > > >>>> Hi,
> > > > >>>>
> > > > >>>>
> > > > >>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> > > > >>>> <ard.biesheuvel@linaro.org> wrote:
> > > > >>>>>
> > > > >>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > > >>>>>>
> > > > >>>>>> Qualcomm SoCs have an additional level of cache called as
> > > > >>>>>> System cache, aka. Last level cache (LLC). This cache sits right
> > > > >>>>>> before the DDR, and is tightly coupled with the memory controller.
> > > > >>>>>> The clients using this cache request their slices from this
> > > > >>>>>> system cache, make it active, and can then start using it.
> > > > >>>>>> For these clients with smmu, to start using the system cache for
> > > > >>>>>> buffers and, related page tables [1], memory attributes need to be
> > > > >>>>>> set accordingly. This series add the required support.
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>> Does this actually improve performance on reads from a device? The
> > > > >>>>> non-cache coherent DMA routines perform an unconditional D-cache
> > > > >>>>> invalidate by VA to the PoC before reading from the buffers filled by
> > > > >>>>> the device, and I would expect the PoC to be defined as lying beyond
> > > > >>>>> the LLC to still guarantee the architected behavior.
> > > > >>>>
> > > > >>>> We have seen performance improvements when running Manhattan
> > > > >>>> GFXBench benchmarks.
> > > > >>>>
> > > > >>>
> > > > >>> Ah ok, that makes sense, since in that case, the data flow is mostly
> > > > >>> to the device, not from the device.
> > > > >>>
> > > > >>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
> > > > >>>> Last level cache (LLC) lies beyond the point of coherency.
> > > > >>>> Non-cache coherent buffers will not be cached to system cache also, and
> > > > >>>> no additional software cache maintenance ops are required for system cache.
> > > > >>>> Pratik can add more if I am missing something.
> > > > >>>>
> > > > >>>> To take care of the memory attributes from DMA APIs side, we can add a
> > > > >>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> > > > >>>>
> > > > >>>
> > > > >>> So does the device use the correct inner non-cacheable, outer
> > > > >>> writeback cacheable attributes if the SMMU is in pass-through?
> > > > >>>
> > > > >>> We have been looking into another use case where the fact that the
> > > > >>> SMMU overrides memory attributes is causing issues (WC mappings used
> > > > >>> by the radeon and amdgpu driver). So if the SMMU would honour the
> > > > >>> existing attributes, would you still need the SMMU changes?
> > > > >>
> > > > >> Even if we could force a stage 2 mapping with the weakest pagetable
> > > > >> attributes (such that combining would work), there would still need to
> > > > >> be a way to set the TCR attributes appropriately if this behaviour is
> > > > >> wanted for the SMMU's own table walks as well.
> > > > >>
> > > > >
> > > > > Isn't that just a matter of implementing support for SMMUs that lack
> > > > > the 'dma-coherent' attribute?
> > > >
> > > > Not quite - in general they need INC-ONC attributes in case there
> > > > actually is something in the architectural outer-cacheable domain.
> > >
> > > But is it a problem to use INC-ONC attributes for the SMMU PTW on this
> > > chip? AIUI, the reason for the SMMU changes is to avoid the
> > > performance hit of snooping, which is more expensive than cache
> > > maintenance of SMMU page tables. So are you saying the by-VA cache
> > > maintenance is not relayed to this system cache, resulting in page
> > > table updates to be invisible to masters using INC-ONC attributes?
> >
> > The reason for this SMMU changes is that the non-coherent devices
> > can't access the inner caches at all. But they have a way to allocate
> > and lookup in system cache.
> >
> > CPU will by default make use of system cache when the inner-cacheable
> > and outer-cacheable memory attribute is set.
> >
> > So for SMMU page tables to be visible to PTW,
> > -- For IO coherent clients, the CPU cache maintenance operations are not
> > required for buffers marked Normal Cached to achieve a coherent view of
> > memory. However, client-specific cache maintenance may still be
> > required for devices
> > with local caches (for example, compute DSP local L1 or L2).
>
> Why would devices need to access the SMMU page tables?

No, the devices don't need to access the page tables, rather the PTW does.
Sorry for mixing it up.

>
> > -- For non-IO coherent clients, the CPU cache maintenance operations (cleans
> > and/or invalidates) are required at buffer handoff points for buffers marked as
> > Normal Cached in any CPU page table in order to observe the latest updates.
> >
>
> Indeed, and this is what your non-coherent SMMU PTW requires, and what
> you /should/ get when you omit the 'dma-coherent' property from its DT
> node (and if you don't, it is a bug in the SMMU driver that should get
> fixed)
>
> The question is whether using inner-non-cached/outer-cacheable
> attributes for the PTW is required for correctness, or whether it is
> merely an optimization (since the point of this exercise was to avoid
> snoop latency from the SMMU PTW). If it is an optimization, I would
> like to understand whether the performance delta between SMMU page
> tables in DRAM vs SMMU page tables in the LLC justifies these
> intrusive changes to the SMMU driver.

IIUC, SMMU uses the TCR configurations to decide how PTW should access
the memory. TCR doesn't direct CPU whether to use cacheable or non -cacheable
memory to allocate page tables. Is that right?
Currently, these TCR configurations are set for inner-cacheable, and
outer-cacheable.
With this, is it assumed that PTW would snoop into the CPU caches for
any updates
of the page tables?

When we omit 'dma-coherent', CPU will allocate non-coherent memory
for these page tables, and software has to explicitly flush CPU caches to
make the changes visible to SMMU.
The CPU will still mark this memory as Normal Cached, i.e. inner cached,
outer cached, and the non-IO coherent SMMU PTW won't be able to snoop into
CPU caches. Does the following code in io-pgtable-arm.c ensures that SMMU
sees the latest page tables?

   } else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
                    !(pte & ARM_LPAE_PTE_SW_SYNC)) {
                 __arm_lpae_sync_pte(ptep, cfg);
   }

This change is mostly to get optimized PTW. As seen in the patch [1] for GPU,
there's a separate slice for page tables - "gpuhtw_llc_slice".
Let me try to get the numbers for this optimization.


[1] https://patchwork.kernel.org/patch/10302791/


Regards
Vivek

>
>
> > >
> > > > The
> > > > case of the outer cacheablility being not that but a hint to control
> > > > non-CPU traffic through some not-quite-transparent cache behind the PoC
> > > > definitely stays wrapped up in qcom-specific magic ;)
> > > >
> > >
> > > I'm not surprised ...
> >
> >
> >
> > --
> > QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
> > of Code Aurora Forum, hosted by The Linux Foundation
> _______________________________________________
> iommu mailing list
> iommu@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/iommu



--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-28 11:27                   ` Vivek Gautam
@ 2019-01-29 15:02                     ` Ard Biesheuvel
  2019-01-30  5:39                       ` Vivek Gautam
  0 siblings, 1 reply; 19+ messages in thread
From: Ard Biesheuvel @ 2019-01-29 15:02 UTC (permalink / raw)
  To: Vivek Gautam, Bjorn Andersson
  Cc: pdaly, linux-arm-msm, Will Deacon, Linux Kernel Mailing List,
	list@263.net:IOMMU DRIVERS, Robin Murphy, linux-arm-kernel,
	pratikp

(+ Bjorn)

On Mon, 28 Jan 2019 at 12:27, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
>
> Hi Ard,
>
> On Thu, Jan 24, 2019 at 1:25 PM Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
> >
> > On Thu, 24 Jan 2019 at 07:58, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > >
> > > On Mon, Jan 21, 2019 at 7:55 PM Ard Biesheuvel
> > > <ard.biesheuvel@linaro.org> wrote:
> > > >
> > > > On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > >
> > > > > On 21/01/2019 13:36, Ard Biesheuvel wrote:
> > > > > > On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > > >>
> > > > > >> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> > > > > >>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > > > >>>>
> > > > > >>>> Hi,
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> > > > > >>>> <ard.biesheuvel@linaro.org> wrote:
> > > > > >>>>>
> > > > > >>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > > > >>>>>>
> > > > > >>>>>> Qualcomm SoCs have an additional level of cache called as
> > > > > >>>>>> System cache, aka. Last level cache (LLC). This cache sits right
> > > > > >>>>>> before the DDR, and is tightly coupled with the memory controller.
> > > > > >>>>>> The clients using this cache request their slices from this
> > > > > >>>>>> system cache, make it active, and can then start using it.
> > > > > >>>>>> For these clients with smmu, to start using the system cache for
> > > > > >>>>>> buffers and, related page tables [1], memory attributes need to be
> > > > > >>>>>> set accordingly. This series add the required support.
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> Does this actually improve performance on reads from a device? The
> > > > > >>>>> non-cache coherent DMA routines perform an unconditional D-cache
> > > > > >>>>> invalidate by VA to the PoC before reading from the buffers filled by
> > > > > >>>>> the device, and I would expect the PoC to be defined as lying beyond
> > > > > >>>>> the LLC to still guarantee the architected behavior.
> > > > > >>>>
> > > > > >>>> We have seen performance improvements when running Manhattan
> > > > > >>>> GFXBench benchmarks.
> > > > > >>>>
> > > > > >>>
> > > > > >>> Ah ok, that makes sense, since in that case, the data flow is mostly
> > > > > >>> to the device, not from the device.
> > > > > >>>
> > > > > >>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
> > > > > >>>> Last level cache (LLC) lies beyond the point of coherency.
> > > > > >>>> Non-cache coherent buffers will not be cached to system cache also, and
> > > > > >>>> no additional software cache maintenance ops are required for system cache.
> > > > > >>>> Pratik can add more if I am missing something.
> > > > > >>>>
> > > > > >>>> To take care of the memory attributes from DMA APIs side, we can add a
> > > > > >>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> > > > > >>>>
> > > > > >>>
> > > > > >>> So does the device use the correct inner non-cacheable, outer
> > > > > >>> writeback cacheable attributes if the SMMU is in pass-through?
> > > > > >>>
> > > > > >>> We have been looking into another use case where the fact that the
> > > > > >>> SMMU overrides memory attributes is causing issues (WC mappings used
> > > > > >>> by the radeon and amdgpu driver). So if the SMMU would honour the
> > > > > >>> existing attributes, would you still need the SMMU changes?
> > > > > >>
> > > > > >> Even if we could force a stage 2 mapping with the weakest pagetable
> > > > > >> attributes (such that combining would work), there would still need to
> > > > > >> be a way to set the TCR attributes appropriately if this behaviour is
> > > > > >> wanted for the SMMU's own table walks as well.
> > > > > >>
> > > > > >
> > > > > > Isn't that just a matter of implementing support for SMMUs that lack
> > > > > > the 'dma-coherent' attribute?
> > > > >
> > > > > Not quite - in general they need INC-ONC attributes in case there
> > > > > actually is something in the architectural outer-cacheable domain.
> > > >
> > > > But is it a problem to use INC-ONC attributes for the SMMU PTW on this
> > > > chip? AIUI, the reason for the SMMU changes is to avoid the
> > > > performance hit of snooping, which is more expensive than cache
> > > > maintenance of SMMU page tables. So are you saying the by-VA cache
> > > > maintenance is not relayed to this system cache, resulting in page
> > > > table updates to be invisible to masters using INC-ONC attributes?
> > >
> > > The reason for this SMMU changes is that the non-coherent devices
> > > can't access the inner caches at all. But they have a way to allocate
> > > and lookup in system cache.
> > >
> > > CPU will by default make use of system cache when the inner-cacheable
> > > and outer-cacheable memory attribute is set.
> > >
> > > So for SMMU page tables to be visible to PTW,
> > > -- For IO coherent clients, the CPU cache maintenance operations are not
> > > required for buffers marked Normal Cached to achieve a coherent view of
> > > memory. However, client-specific cache maintenance may still be
> > > required for devices
> > > with local caches (for example, compute DSP local L1 or L2).
> >
> > Why would devices need to access the SMMU page tables?
>
> No, the devices don't need to access the page tables, rather the PTW does.
> Sorry for mixing it up.
>
> >
> > > -- For non-IO coherent clients, the CPU cache maintenance operations (cleans
> > > and/or invalidates) are required at buffer handoff points for buffers marked as
> > > Normal Cached in any CPU page table in order to observe the latest updates.
> > >
> >
> > Indeed, and this is what your non-coherent SMMU PTW requires, and what
> > you /should/ get when you omit the 'dma-coherent' property from its DT
> > node (and if you don't, it is a bug in the SMMU driver that should get
> > fixed)
> >
> > The question is whether using inner-non-cached/outer-cacheable
> > attributes for the PTW is required for correctness, or whether it is
> > merely an optimization (since the point of this exercise was to avoid
> > snoop latency from the SMMU PTW). If it is an optimization, I would
> > like to understand whether the performance delta between SMMU page
> > tables in DRAM vs SMMU page tables in the LLC justifies these
> > intrusive changes to the SMMU driver.
>
> IIUC, SMMU uses the TCR configurations to decide how PTW should access
> the memory. TCR doesn't direct CPU whether to use cacheable or non -cacheable
> memory to allocate page tables. Is that right?

Correct

> Currently, these TCR configurations are set for inner-cacheable, and
> outer-cacheable.
> With this, is it assumed that PTW would snoop into the CPU caches for
> any updates
> of the page tables?
>
`
Yes, and if I understand the issue correctly, this snooping is costly,
which is why you want to avoid it, right?

> When we omit 'dma-coherent', CPU will allocate non-coherent memory
> for these page tables, and software has to explicitly flush CPU caches to
> make the changes visible to SMMU.

Indeed. But I would expect the TCR configuration to reflect this as
well, and that doesn't appear the case.

> The CPU will still mark this memory as Normal Cached, i.e. inner cached,
> outer cached, and the non-IO coherent SMMU PTW won't be able to snoop into
> CPU caches. Does the following code in io-pgtable-arm.c ensures that SMMU
> sees the latest page tables?
>
>    } else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
>                     !(pte & ARM_LPAE_PTE_SW_SYNC)) {
>                  __arm_lpae_sync_pte(ptep, cfg);
>    }
>

I don't know the history of why NO_DMA is implemented as a quirk (and
why it is called like that in the first place).
But it indeed appears that this is where the cache maintenance occurs
for non-coherent PTWs.

> This change is mostly to get optimized PTW. As seen in the patch [1] for GPU,
> there's a separate slice for page tables - "gpuhtw_llc_slice".
> Let me try to get the numbers for this optimization.
>

Yes, please. We'd need to compare page tables in the LLC with page
tables in system RAM, and for completeness, it would be nice to
include the cache-coherent configuration as well.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache
  2019-01-29 15:02                     ` Ard Biesheuvel
@ 2019-01-30  5:39                       ` Vivek Gautam
  0 siblings, 0 replies; 19+ messages in thread
From: Vivek Gautam @ 2019-01-30  5:39 UTC (permalink / raw)
  To: Ard Biesheuvel
  Cc: Bjorn Andersson, pdaly, linux-arm-msm, Will Deacon,
	Linux Kernel Mailing List, list@263.net:IOMMU DRIVERS,
	Robin Murphy, linux-arm-kernel, pratikp

On Tue, Jan 29, 2019 at 8:34 PM Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
>
> (+ Bjorn)
>
> On Mon, 28 Jan 2019 at 12:27, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> >
> > Hi Ard,
> >
> > On Thu, Jan 24, 2019 at 1:25 PM Ard Biesheuvel
> > <ard.biesheuvel@linaro.org> wrote:
> > >
> > > On Thu, 24 Jan 2019 at 07:58, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > >
> > > > On Mon, Jan 21, 2019 at 7:55 PM Ard Biesheuvel
> > > > <ard.biesheuvel@linaro.org> wrote:
> > > > >
> > > > > On Mon, 21 Jan 2019 at 14:56, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > > >
> > > > > > On 21/01/2019 13:36, Ard Biesheuvel wrote:
> > > > > > > On Mon, 21 Jan 2019 at 14:25, Robin Murphy <robin.murphy@arm.com> wrote:
> > > > > > >>
> > > > > > >> On 21/01/2019 10:50, Ard Biesheuvel wrote:
> > > > > > >>> On Mon, 21 Jan 2019 at 11:17, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > > > > >>>>
> > > > > > >>>> Hi,
> > > > > > >>>>
> > > > > > >>>>
> > > > > > >>>> On Mon, Jan 21, 2019 at 12:56 PM Ard Biesheuvel
> > > > > > >>>> <ard.biesheuvel@linaro.org> wrote:
> > > > > > >>>>>
> > > > > > >>>>> On Mon, 21 Jan 2019 at 06:54, Vivek Gautam <vivek.gautam@codeaurora.org> wrote:
> > > > > > >>>>>>
> > > > > > >>>>>> Qualcomm SoCs have an additional level of cache called as
> > > > > > >>>>>> System cache, aka. Last level cache (LLC). This cache sits right
> > > > > > >>>>>> before the DDR, and is tightly coupled with the memory controller.
> > > > > > >>>>>> The clients using this cache request their slices from this
> > > > > > >>>>>> system cache, make it active, and can then start using it.
> > > > > > >>>>>> For these clients with smmu, to start using the system cache for
> > > > > > >>>>>> buffers and, related page tables [1], memory attributes need to be
> > > > > > >>>>>> set accordingly. This series add the required support.
> > > > > > >>>>>>
> > > > > > >>>>>
> > > > > > >>>>> Does this actually improve performance on reads from a device? The
> > > > > > >>>>> non-cache coherent DMA routines perform an unconditional D-cache
> > > > > > >>>>> invalidate by VA to the PoC before reading from the buffers filled by
> > > > > > >>>>> the device, and I would expect the PoC to be defined as lying beyond
> > > > > > >>>>> the LLC to still guarantee the architected behavior.
> > > > > > >>>>
> > > > > > >>>> We have seen performance improvements when running Manhattan
> > > > > > >>>> GFXBench benchmarks.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> Ah ok, that makes sense, since in that case, the data flow is mostly
> > > > > > >>> to the device, not from the device.
> > > > > > >>>
> > > > > > >>>> As for the PoC, from my knowledge on sdm845 the system cache, aka
> > > > > > >>>> Last level cache (LLC) lies beyond the point of coherency.
> > > > > > >>>> Non-cache coherent buffers will not be cached to system cache also, and
> > > > > > >>>> no additional software cache maintenance ops are required for system cache.
> > > > > > >>>> Pratik can add more if I am missing something.
> > > > > > >>>>
> > > > > > >>>> To take care of the memory attributes from DMA APIs side, we can add a
> > > > > > >>>> DMA_ATTR definition to take care of any dma non-coherent APIs calls.
> > > > > > >>>>
> > > > > > >>>
> > > > > > >>> So does the device use the correct inner non-cacheable, outer
> > > > > > >>> writeback cacheable attributes if the SMMU is in pass-through?
> > > > > > >>>
> > > > > > >>> We have been looking into another use case where the fact that the
> > > > > > >>> SMMU overrides memory attributes is causing issues (WC mappings used
> > > > > > >>> by the radeon and amdgpu driver). So if the SMMU would honour the
> > > > > > >>> existing attributes, would you still need the SMMU changes?
> > > > > > >>
> > > > > > >> Even if we could force a stage 2 mapping with the weakest pagetable
> > > > > > >> attributes (such that combining would work), there would still need to
> > > > > > >> be a way to set the TCR attributes appropriately if this behaviour is
> > > > > > >> wanted for the SMMU's own table walks as well.
> > > > > > >>
> > > > > > >
> > > > > > > Isn't that just a matter of implementing support for SMMUs that lack
> > > > > > > the 'dma-coherent' attribute?
> > > > > >
> > > > > > Not quite - in general they need INC-ONC attributes in case there
> > > > > > actually is something in the architectural outer-cacheable domain.
> > > > >
> > > > > But is it a problem to use INC-ONC attributes for the SMMU PTW on this
> > > > > chip? AIUI, the reason for the SMMU changes is to avoid the
> > > > > performance hit of snooping, which is more expensive than cache
> > > > > maintenance of SMMU page tables. So are you saying the by-VA cache
> > > > > maintenance is not relayed to this system cache, resulting in page
> > > > > table updates to be invisible to masters using INC-ONC attributes?
> > > >
> > > > The reason for this SMMU changes is that the non-coherent devices
> > > > can't access the inner caches at all. But they have a way to allocate
> > > > and lookup in system cache.
> > > >
> > > > CPU will by default make use of system cache when the inner-cacheable
> > > > and outer-cacheable memory attribute is set.
> > > >
> > > > So for SMMU page tables to be visible to PTW,
> > > > -- For IO coherent clients, the CPU cache maintenance operations are not
> > > > required for buffers marked Normal Cached to achieve a coherent view of
> > > > memory. However, client-specific cache maintenance may still be
> > > > required for devices
> > > > with local caches (for example, compute DSP local L1 or L2).
> > >
> > > Why would devices need to access the SMMU page tables?
> >
> > No, the devices don't need to access the page tables, rather the PTW does.
> > Sorry for mixing it up.
> >
> > >
> > > > -- For non-IO coherent clients, the CPU cache maintenance operations (cleans
> > > > and/or invalidates) are required at buffer handoff points for buffers marked as
> > > > Normal Cached in any CPU page table in order to observe the latest updates.
> > > >
> > >
> > > Indeed, and this is what your non-coherent SMMU PTW requires, and what
> > > you /should/ get when you omit the 'dma-coherent' property from its DT
> > > node (and if you don't, it is a bug in the SMMU driver that should get
> > > fixed)
> > >
> > > The question is whether using inner-non-cached/outer-cacheable
> > > attributes for the PTW is required for correctness, or whether it is
> > > merely an optimization (since the point of this exercise was to avoid
> > > snoop latency from the SMMU PTW). If it is an optimization, I would
> > > like to understand whether the performance delta between SMMU page
> > > tables in DRAM vs SMMU page tables in the LLC justifies these
> > > intrusive changes to the SMMU driver.
> >
> > IIUC, SMMU uses the TCR configurations to decide how PTW should access
> > the memory. TCR doesn't direct CPU whether to use cacheable or non -cacheable
> > memory to allocate page tables. Is that right?
>
> Correct
>
> > Currently, these TCR configurations are set for inner-cacheable, and
> > outer-cacheable.
> > With this, is it assumed that PTW would snoop into the CPU caches for
> > any updates
> > of the page tables?
> >
> `
> Yes, and if I understand the issue correctly, this snooping is costly,
> which is why you want to avoid it, right?
>
> > When we omit 'dma-coherent', CPU will allocate non-coherent memory
> > for these page tables, and software has to explicitly flush CPU caches to
> > make the changes visible to SMMU.
>
> Indeed. But I would expect the TCR configuration to reflect this as
> well, and that doesn't appear the case.

Yea, we are discussing this case of TCR configurations for non-coherent memory
in another thread [1].

[1] https://lore.kernel.org/patchwork/patch/1032939/

>
> > The CPU will still mark this memory as Normal Cached, i.e. inner cached,
> > outer cached, and the non-IO coherent SMMU PTW won't be able to snoop into
> > CPU caches. Does the following code in io-pgtable-arm.c ensures that SMMU
> > sees the latest page tables?
> >
> >    } else if (!(cfg->quirks & IO_PGTABLE_QUIRK_NO_DMA) &&
> >                     !(pte & ARM_LPAE_PTE_SW_SYNC)) {
> >                  __arm_lpae_sync_pte(ptep, cfg);
> >    }
> >
>
> I don't know the history of why NO_DMA is implemented as a quirk (and
> why it is called like that in the first place).
> But it indeed appears that this is where the cache maintenance occurs
> for non-coherent PTWs.
>
> > This change is mostly to get optimized PTW. As seen in the patch [1] for GPU,
> > there's a separate slice for page tables - "gpuhtw_llc_slice".
> > Let me try to get the numbers for this optimization.
> >
>
> Yes, please. We'd need to compare page tables in the LLC with page
> tables in system RAM, and for completeness, it would be nice to
> include the cache-coherent configuration as well.

Sure, let me ping Pratik and Patrick to check if they have already have
these numbers available, else I can check it.

Regards
Vivek

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member
of Code Aurora Forum, hosted by The Linux Foundation

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2019-01-30  5:39 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-21  5:53 [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Vivek Gautam
2019-01-21  5:53 ` [PATCH 1/3] iommu/arm-smmu: Move to bitmap for arm_smmu_domain atrributes Vivek Gautam
2019-01-21 13:51   ` Robin Murphy
2019-01-22 17:06     ` Vivek Gautam
2019-01-21  5:53 ` [PATCH 2/3] iommu/io-pgtable-arm: Add support to use system cache Vivek Gautam
2019-01-21  5:53 ` [PATCH 3/3] iommu/arm-smmu: " Vivek Gautam
2019-01-21  7:26 ` [PATCH 0/3] iommu/arm-smmu: Add support to use Last level cache Ard Biesheuvel
2019-01-21 10:17   ` Vivek Gautam
2019-01-21 10:50     ` Ard Biesheuvel
2019-01-21 13:25       ` Robin Murphy
2019-01-21 13:36         ` Ard Biesheuvel
2019-01-21 13:56           ` Robin Murphy
2019-01-21 14:24             ` Ard Biesheuvel
2019-01-21 15:15               ` Robin Murphy
2019-01-24  6:58               ` Vivek Gautam
2019-01-24  7:54                 ` Ard Biesheuvel
2019-01-28 11:27                   ` Vivek Gautam
2019-01-29 15:02                     ` Ard Biesheuvel
2019-01-30  5:39                       ` Vivek Gautam

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).