* [PATCH 00/19] Update SMMUv3 to the modern iommu API (part 1/2)
@ 2023-10-11  0:33 ` Jason Gunthorpe
  0 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

The SMMUv3 driver was originally written in 2015 when the iommu driver
facing API looked quite different. The API has evolved, especially lately,
and the driver has fallen behind.

This work aims to make the SMMUv3 driver the best IOMMU driver with the
most comprehensive implementation of the API. Once all parts are applied it
addresses:

 - Global static BLOCKED and IDENTITY domains with 'never fail' attach
   semantics. BLOCKED is desired for efficient VFIO.

 - Support map before attach for PAGING iommu_domains.

 - attach_dev failure does not change the HW configuration.

 - Fully hitless transitions between IDENTITY -> DMA -> IDENTITY.
   The API has IOMMU_RESV_DIRECT which is expected to be
   continuously translating.

 - Safe transitions between PAGING -> BLOCKED, do not ever temporarily
   do IDENTITY. This is required for iommufd security.

 - Full PASID API support including:
    - S1/SVA domains attached to PASIDs
    - IDENTITY/BLOCKED/S1 attached to RID
    - Change of the RID domain while PASIDs are attached

 - Streamlined SVA support using the core infrastructure

 - Hitless, whenever possible, change between two domains

Overall these things are going to become more accessible to iommufd, and
exposed to VMs, so it is important for the driver to have a robust
implementation of the API.

The work is split into two parts, with this part largely focusing on the
STE and building up to the BLOCKED & IDENTITY global static domains.

The second part largely focuses on the CD and builds up to having a common
PASID infrastructure that SVA and S1 domains equally use.

Overall this takes the approach of turning the STE/CD programming upside
down: the CD/STE value is computed right in the driver callback function
and then pushed down into the programming logic. The programming logic
hides the details of the required tear-free CD/STE update. This makes the
CD/STE functions independent of the arm_smmu_domain, which makes it fairly
straightforward to untangle all the different call chains.
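
As a rough sketch of that pattern (the names and signatures here are
simplified assumptions based on the patch titles, not necessarily the
exact driver API), an attach path ends up looking something like:

    struct arm_smmu_ste target = {};

    /* Compute the full target STE value for this kind of domain... */
    arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);

    /*
     * ...then hand it to the programming layer, which hides the
     * tear-free update sequencing against whatever STE is currently
     * installed for the master's stream IDs.
     */
    arm_smmu_install_ste_for_dev(master, &target);

The same split is what later allows the global static BLOCKED/IDENTITY
domains to reuse the writer without any arm_smmu_domain at all.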

Further, this frees the arm_smmu_domain related logic from having to keep
track of the current STE/CD state in order to carefully sequence the
correct update. There are many new update pairs that are subtly introduced
as the work progresses.

The locking to support BTM via arm_smmu_asid_lock is a bit subtle right
now and patches throughout this work adjust and tighten this so that it is
clearer and doesn't get broken.

Once the lower STE layers no longer need to touch arm_smmu_domain we can
isolate struct arm_smmu_domain to be only used for PAGING domains, audit
all the to_smmu_domain() calls to be only in PAGING domain ops, and
introduce the normal global static BLOCKED/IDENTITY domains using the new
STE infrastructure. Part 2 will ultimately migrate SVA over to use
arm_smmu_domain as well.

This relies on Michael's series to move the cd_table to the master:

 https://lore.kernel.org/linux-iommu/20230915132051.2646055-1-mshavit@google.com/

As well as the DART series:

 https://lore.kernel.org/linux-iommu/0-v2-bff223cf6409+282-dart_paging_jgg@nvidia.com

Both parts are on github:

 https://github.com/jgunthorpe/linux/commits/smmuv3_newapi

Jason Gunthorpe (19):
  iommu/arm-smmu-v3: Add a type for the STE
  iommu/arm-smmu-v3: Master cannot be NULL in
    arm_smmu_write_strtab_ent()
  iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
  iommu/arm-smmu-v3: Make STE programming independent of the callers
  iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste()
  iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into
    functions
  iommu/arm-smmu-v3: Build the whole STE in
    arm_smmu_make_s2_domain_ste()
  iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  iommu/arm-smmu-v3: Compute the STE only once for each master
  iommu/arm-smmu-v3: Do not change the STE twice during
    arm_smmu_attach_dev()
  iommu/arm-smmu-v3: Put writing the context descriptor in the right
    order
  iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats()
  iommu/arm-smmu-v3: Remove arm_smmu_master->domain
  iommu/arm-smmu-v3: Add a global static IDENTITY domain
  iommu/arm-smmu-v3: Add a global static BLOCKED domain
  iommu/arm-smmu-v3: Use the identity/blocked domain during release
  iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to
    finalize
  iommu/arm-smmu-v3: Convert to domain_alloc_paging()

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 695 +++++++++++++-------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  12 +-
 2 files changed, 447 insertions(+), 260 deletions(-)


base-commit: acbc9971cf0d3b9df17ce598fbbfaa2d5d0d7968
-- 
2.42.0


* [PATCH 01/19] iommu/arm-smmu-v3: Add a type for the STE
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Instead of passing a naked __le64 * around to represent a STE, wrap it in
a "struct arm_smmu_ste" with an array of the correct size. This makes it
much clearer which functions will comprise the "STE API".
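
Purely as an illustration of the effect (the real definition is in the
diff below; example_set_valid() is a made-up helper, not part of the
series), the wrapper is just a fixed-size array, so STE helpers can no
longer be handed an arbitrary __le64 pointer by mistake:

    struct arm_smmu_ste {
            __le64 data[STRTAB_STE_DWORDS];         /* 8 qwords per STE */
    };

    static void example_set_valid(struct arm_smmu_ste *ste)
    {
            /* Callers must now pass a whole STE, not a bare qword pointer */
            ste->data[0] |= cpu_to_le64(STRTAB_STE_0_V);
    }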

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 54 ++++++++++-----------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  7 ++-
 2 files changed, 32 insertions(+), 29 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 7445454c2af244..519749d15fbda0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1249,7 +1249,7 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      __le64 *dst)
+				      struct arm_smmu_ste *dst)
 {
 	/*
 	 * This is hideously complicated, but we only really care about
@@ -1267,7 +1267,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
 	 * 3. Update Config, sync
 	 */
-	u64 val = le64_to_cpu(dst[0]);
+	u64 val = le64_to_cpu(dst->data[0]);
 	bool ste_live = false;
 	struct arm_smmu_device *smmu = NULL;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
@@ -1325,10 +1325,10 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
-		dst[0] = cpu_to_le64(val);
-		dst[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+		dst->data[0] = cpu_to_le64(val);
+		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 						STRTAB_STE_1_SHCFG_INCOMING));
-		dst[2] = 0; /* Nuke the VMID */
+		dst->data[2] = 0; /* Nuke the VMID */
 		/*
 		 * The SMMU can perform negative caching, so we must sync
 		 * the STE regardless of whether the old value was live.
@@ -1343,7 +1343,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
 		BUG_ON(ste_live);
-		dst[1] = cpu_to_le64(
+		dst->data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1352,7 +1352,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1362,7 +1362,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 	if (s2_cfg) {
 		BUG_ON(ste_live);
-		dst[2] = cpu_to_le64(
+		dst->data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1371,18 +1371,18 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
 	arm_smmu_sync_ste_for_sid(smmu, sid);
 	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst[0], cpu_to_le64(val));
+	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
 	arm_smmu_sync_ste_for_sid(smmu, sid);
 
 	/* It's likely that we'll want to use the new STE soon */
@@ -1390,7 +1390,8 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
 }
 
-static void arm_smmu_init_bypass_stes(__le64 *strtab, unsigned int nent, bool force)
+static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
+				      unsigned int nent, bool force)
 {
 	unsigned int i;
 	u64 val = STRTAB_STE_0_V;
@@ -1401,11 +1402,11 @@ static void arm_smmu_init_bypass_stes(__le64 *strtab, unsigned int nent, bool fo
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
 	for (i = 0; i < nent; ++i) {
-		strtab[0] = cpu_to_le64(val);
-		strtab[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
-						   STRTAB_STE_1_SHCFG_INCOMING));
-		strtab[2] = 0;
-		strtab += STRTAB_STE_DWORDS;
+		strtab->data[0] = cpu_to_le64(val);
+		strtab->data[1] = cpu_to_le64(FIELD_PREP(
+			STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
+		strtab->data[2] = 0;
+		strtab++;
 	}
 }
 
@@ -2209,26 +2210,22 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 	return 0;
 }
 
-static __le64 *arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
+static struct arm_smmu_ste *
+arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
 {
-	__le64 *step;
 	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
 
 	if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
-		struct arm_smmu_strtab_l1_desc *l1_desc;
 		int idx;
 
 		/* Two-level walk */
 		idx = (sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS;
-		l1_desc = &cfg->l1_desc[idx];
-		idx = (sid & ((1 << STRTAB_SPLIT) - 1)) * STRTAB_STE_DWORDS;
-		step = &l1_desc->l2ptr[idx];
+		return &cfg->l1_desc[idx].l2ptr[sid & ((1 << STRTAB_SPLIT) - 1)];
 	} else {
 		/* Simple linear lookup */
-		step = &cfg->strtab[sid * STRTAB_STE_DWORDS];
+		return (struct arm_smmu_ste *)&cfg
+			       ->strtab[sid * STRTAB_STE_DWORDS];
 	}
-
-	return step;
 }
 
 static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
@@ -2238,7 +2235,8 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 
 	for (i = 0; i < master->num_streams; ++i) {
 		u32 sid = master->streams[i].id;
-		__le64 *step = arm_smmu_get_step_for_sid(smmu, sid);
+		struct arm_smmu_ste *step =
+			arm_smmu_get_step_for_sid(smmu, sid);
 
 		/* Bridged PCI devices may end up with duplicated IDs */
 		for (j = 0; j < i; j++)
@@ -3769,7 +3767,7 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
 	iort_get_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
 
 	list_for_each_entry(e, &rmr_list, list) {
-		__le64 *step;
+		struct arm_smmu_ste *step;
 		struct iommu_iort_rmr_data *rmr;
 		int ret, i;
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 961205ba86d25d..03f9e526cbd92f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -206,6 +206,11 @@
 #define STRTAB_L1_DESC_L2PTR_MASK	GENMASK_ULL(51, 6)
 
 #define STRTAB_STE_DWORDS		8
+
+struct arm_smmu_ste {
+	__le64 data[STRTAB_STE_DWORDS];
+};
+
 #define STRTAB_STE_0_V			(1UL << 0)
 #define STRTAB_STE_0_CFG		GENMASK_ULL(3, 1)
 #define STRTAB_STE_0_CFG_ABORT		0
@@ -571,7 +576,7 @@ struct arm_smmu_priq {
 struct arm_smmu_strtab_l1_desc {
 	u8				span;
 
-	__le64				*l2ptr;
+	struct arm_smmu_ste		*l2ptr;
 	dma_addr_t			l2ptr_dma;
 };
 
-- 
2.42.0


* [PATCH 02/19] iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent()
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

The only caller is arm_smmu_install_ste_for_dev() which never has a NULL
master. Remove the confusing if.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 9 ++-------
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 519749d15fbda0..9117e769a965e1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1269,10 +1269,10 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	 */
 	u64 val = le64_to_cpu(dst->data[0]);
 	bool ste_live = false;
-	struct arm_smmu_device *smmu = NULL;
+	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
-	struct arm_smmu_domain *smmu_domain = NULL;
+	struct arm_smmu_domain *smmu_domain = master->domain;
 	struct arm_smmu_cmdq_ent prefetch_cmd = {
 		.opcode		= CMDQ_OP_PREFETCH_CFG,
 		.prefetch	= {
@@ -1280,11 +1280,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		},
 	};
 
-	if (master) {
-		smmu_domain = master->domain;
-		smmu = master->smmu;
-	}
-
 	if (smmu_domain) {
 		switch (smmu_domain->stage) {
 		case ARM_SMMU_DOMAIN_S1:
-- 
2.42.0


* [PATCH 03/19] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Currently this is exactly the same as ARM_SMMU_DOMAIN_S2, so just remove
it. The ongoing work to add nesting support through iommufd will do
something a little different.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 4 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 1 -
 2 files changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9117e769a965e1..bf7218adbc2822 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1286,7 +1286,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			cd_table = &master->cd_table;
 			break;
 		case ARM_SMMU_DOMAIN_S2:
-		case ARM_SMMU_DOMAIN_NESTED:
 			s2_cfg = &smmu_domain->s2_cfg;
 			break;
 		default:
@@ -2167,7 +2166,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 		fmt = ARM_64_LPAE_S1;
 		finalise_stage_fn = arm_smmu_domain_finalise_s1;
 		break;
-	case ARM_SMMU_DOMAIN_NESTED:
 	case ARM_SMMU_DOMAIN_S2:
 		ias = smmu->ias;
 		oas = smmu->oas;
@@ -2735,7 +2733,7 @@ static int arm_smmu_enable_nesting(struct iommu_domain *domain)
 	if (smmu_domain->smmu)
 		ret = -EPERM;
 	else
-		smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED;
+		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
 	mutex_unlock(&smmu_domain->init_mutex);
 
 	return ret;
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 03f9e526cbd92f..27ddf1acd12cea 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -715,7 +715,6 @@ struct arm_smmu_master {
 enum arm_smmu_domain_stage {
 	ARM_SMMU_DOMAIN_S1 = 0,
 	ARM_SMMU_DOMAIN_S2,
-	ARM_SMMU_DOMAIN_NESTED,
 	ARM_SMMU_DOMAIN_BYPASS,
 };
 
-- 
2.42.0


* [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than currently.

Thus, consolidate all the complexity here. Callers do not have to care
about what STE transition they are doing, this function will handle
everything optimally.

Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
required programming sequence to avoid creating an incoherent 'torn' STE
in the HW caches. The update algorithm follows the same design that the
driver already uses: it is safe to change bits that the HW doesn't
currently use and then do a single 64 bit update, with syncs in between.

The basic idea is to express in a bitmask what bits the HW is actually
using based on the V and CFG bits. Based on that mask we know what STE
changes are safe and which are disruptive. We can count how many 64 bit
QWORDS need a disruptive update and know if a step with V=0 is required.

This gives two basic flows through the algorithm.

If only a single 64 bit quantity needs disruptive replacement:
 - Write the target value into all currently unused bits
 - Write the single 64 bit quantity
 - Zero the remaining different bits

If multiple 64 bit quantities need disruptive replacement then do:
 - Write V=0 to QWORD 0
 - Write the entire STE except QWORD 0
 - Write QWORD 0

HW flushes are issued at each step, and a flush can be skipped if the STE
did not change in that step.
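
As a sketch of how the mask drives that choice (a hypothetical helper for
illustration only, not the code added in the diff below), counting the
qwords whose in-use bits must change is what selects between the flows:

    /*
     * Count how many qwords differ from the target in bits the HW is
     * currently reading. 0 means the update is fully hitless, 1 means a
     * single 64 bit store finishes it, >1 means a V=0 step is needed first.
     */
    static unsigned int num_disruptive_qwords(const __le64 *cur,
                                              const __le64 *cur_used,
                                              const __le64 *target,
                                              unsigned int len)
    {
            unsigned int i, n = 0;

            for (i = 0; i != len; i++)
                    if ((cur[i] ^ target[i]) & cur_used[i])
                            n++;
            return n;
    }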

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT when a change cannot be made hitlessly. This seems more appropriate
as ABORT will fail DMAs without any logging, while a DMA dropped due to a
transient V=0 probably signals a bug, so the resulting C_BAD_STE event is
valuable.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 247 +++++++++++++++-----
 1 file changed, 183 insertions(+), 64 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bf7218adbc2822..6e6b1ebb5ac0ef 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,6 +971,69 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+/*
+ * Do one step along the coherent update algorithm. Each step either changes
+ * only bits that the HW isn't using or entirely changes 1 qword. It may take
+ * several iterations of this routine to make the full change.
+ */
+static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
+				      const __le64 *target,
+				      const __le64 *target_used, __le64 *step,
+				      __le64 v_bit,
+				      unsigned int len)
+{
+	u8 step_used_diff = 0;
+	u8 step_change = 0;
+	unsigned int i;
+
+	/*
+	 * Compute a step that has all the bits currently unused by HW set to
+	 * their target values.
+	 */
+	for (i = 0; i != len; i++) {
+		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
+		if (cur[i] != step[i])
+			step_change |= 1 << i;
+		/*
+		 * Each bit indicates if the step is incorrect compared to the
+		 * target, considering only the used bits in the target
+		 */
+		if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
+			step_used_diff |= 1 << i;
+	}
+
+	if (hweight8(step_used_diff) > 1) {
+		/*
+		 * More than 1 qword is mismatched, this cannot be done without
+		 * a break. Clear the V bit and go again.
+		 */
+		step[0] &= ~v_bit;
+	} else if (!step_change && step_used_diff) {
+		/*
+		 * Have exactly one critical qword, all the other qwords are set
+		 * correctly, so we can set this qword now.
+		 */
+		i = ffs(step_used_diff) - 1;
+		step[i] = target[i];
+	} else if (!step_change) {
+		/* cur == target, so all done */
+		if (memcmp(cur, target, sizeof(*cur)) == 0)
+			return true;
+
+		/*
+		 * All the used HW bits match, but unused bits are different.
+		 * Set them as well. Technically this isn't necessary but it
+		 * brings the entry to the full target state, so if there are
+		 * bugs in the mask calculation this will obscure them.
+		 */
+		memcpy(step, target, len * sizeof(*step));
+	}
+
+	for (i = 0; i != len; i++)
+		WRITE_ONCE(cur[i], step[i]);
+	return false;
+}
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 			     int ssid, bool leaf)
 {
@@ -1248,37 +1311,122 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+/*
+ * Based on the value of ent report which bits of the STE the HW will access. It
+ * would be nice if this was complete according to the spec, but minimally it
+ * has to capture the bits this driver uses.
+ */
+static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
+				  struct arm_smmu_ste *used_bits)
+{
+	memset(used_bits, 0, sizeof(*used_bits));
+
+	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
+	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
+		return;
+
+	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
+	case STRTAB_STE_0_CFG_ABORT:
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+		break;
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+						  STRTAB_STE_0_S1CTXPTR_MASK |
+						  STRTAB_STE_0_S1CDMAX);
+		used_bits->data[1] |=
+			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+
+		if (FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
+		    STRTAB_STE_1_S1DSS_BYPASS)
+			used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+		break;
+	case STRTAB_STE_0_CFG_S2_TRANS:
+		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+		used_bits->data[2] |=
+			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+		used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+		break;
+
+	default:
+		memset(used_bits, 0xFF, sizeof(*used_bits));
+	}
+}
+
+static bool arm_smmu_write_ste_step(struct arm_smmu_ste *cur,
+				    const struct arm_smmu_ste *target,
+				    const struct arm_smmu_ste *target_used)
+{
+	struct arm_smmu_ste cur_used;
+	struct arm_smmu_ste step;
+
+	arm_smmu_get_ste_used(cur, &cur_used);
+	return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
+					 target_used->data, step.data,
+					 cpu_to_le64(STRTAB_STE_0_V),
+					 ARRAY_SIZE(cur->data));
+}
+
+/*
+ * This algorithm updates any STE to any value without creating a situation
+ * where the HW can perceive a corrupted STE. HW is only required to have a 64
+ * bit atomicity with stores.
+ *
+ * In the most general case we can make any update by disrupting the STE (making
+ * it abort, or clearing the V bit) using a single qword store. Then all the
+ * other qwords can be written safely, and finally the full STE written.
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect.
+ */
+static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
+			       struct arm_smmu_ste *ste,
+			       const struct arm_smmu_ste *target)
+{
+	struct arm_smmu_ste target_used;
+	int i;
+
+	arm_smmu_get_ste_used(target, &target_used);
+	/* Masks in arm_smmu_get_ste_used() are up to date */
+	for (i = 0; i != ARRAY_SIZE(target->data); i++)
+		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
+
+	while (true) {
+		if (arm_smmu_write_ste_step(ste, target, &target_used))
+			break;
+		arm_smmu_sync_ste_for_sid(smmu, sid);
+	}
+
+	/* It's likely that we'll want to use the new STE soon */
+	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+		struct arm_smmu_cmdq_ent
+			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+					 .prefetch = {
+						 .sid = sid,
+					 } };
+
+		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	}
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
-	/*
-	 * This is hideously complicated, but we only really care about
-	 * three cases at the moment:
-	 *
-	 * 1. Invalid (all zero) -> bypass/fault (init)
-	 * 2. Bypass/fault -> translation/bypass (attach)
-	 * 3. Translation/bypass -> bypass/fault (detach)
-	 *
-	 * Given that we can't update the STE atomically and the SMMU
-	 * doesn't read the thing in a defined order, that leaves us
-	 * with the following maintenance requirements:
-	 *
-	 * 1. Update Config, return (init time STEs aren't live)
-	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
-	 * 3. Update Config, sync
-	 */
-	u64 val = le64_to_cpu(dst->data[0]);
-	bool ste_live = false;
+	u64 val;
 	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_cmdq_ent prefetch_cmd = {
-		.opcode		= CMDQ_OP_PREFETCH_CFG,
-		.prefetch	= {
-			.sid	= sid,
-		},
-	};
+	struct arm_smmu_ste target = {};
 
 	if (smmu_domain) {
 		switch (smmu_domain->stage) {
@@ -1293,22 +1441,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		}
 	}
 
-	if (val & STRTAB_STE_0_V) {
-		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-		case STRTAB_STE_0_CFG_BYPASS:
-			break;
-		case STRTAB_STE_0_CFG_S1_TRANS:
-		case STRTAB_STE_0_CFG_S2_TRANS:
-			ste_live = true;
-			break;
-		case STRTAB_STE_0_CFG_ABORT:
-			BUG_ON(!disable_bypass);
-			break;
-		default:
-			BUG(); /* STE corruption */
-		}
-	}
-
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
@@ -1319,16 +1451,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
-		dst->data[0] = cpu_to_le64(val);
-		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+		target.data[0] = cpu_to_le64(val);
+		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 						STRTAB_STE_1_SHCFG_INCOMING));
-		dst->data[2] = 0; /* Nuke the VMID */
-		/*
-		 * The SMMU can perform negative caching, so we must sync
-		 * the STE regardless of whether the old value was live.
-		 */
-		if (smmu)
-			arm_smmu_sync_ste_for_sid(smmu, sid);
+		target.data[2] = 0; /* Nuke the VMID */
+		arm_smmu_write_ste(smmu, sid, dst, &target);
 		return;
 	}
 
@@ -1336,8 +1463,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
-		BUG_ON(ste_live);
-		dst->data[1] = cpu_to_le64(
+		target.data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1472,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1481,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s2_cfg) {
-		BUG_ON(ste_live);
-		dst->data[2] = cpu_to_le64(
+		target.data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1490,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-
-	/* It's likely that we'll want to use the new STE soon */
-	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	target.data[0] = cpu_to_le64(val);
+	arm_smmu_write_ste(smmu, sid, dst, &target);
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
-- 
2.42.0


 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
-		BUG_ON(ste_live);
-		dst->data[1] = cpu_to_le64(
+		target.data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1472,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1481,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s2_cfg) {
-		BUG_ON(ste_live);
-		dst->data[2] = cpu_to_le64(
+		target.data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1490,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-
-	/* It's likely that we'll want to use the new STE soon */
-	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	target.data[0] = cpu_to_le64(val);
+	arm_smmu_write_ste(smmu, sid, dst, &target);
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
-- 
2.42.0
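
To make the "used bits" idea above more concrete, here is a small standalone
sketch of the concept, not the driver code: an update cannot disturb the HW
when every bit the current configuration actually reads keeps its value, so
the remaining, unused qwords can be rewritten freely. The real
arm_smmu_write_entry_step() also covers the cases where only a few used bits
change and the disruptive V=0/abort fallback, which this sketch ignores; all
names and values below are invented for the demo.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_QWORDS 8	/* an STE is eight 64-bit qwords */

/*
 * Sketch only: true when every bit the HW currently reads (per cur_used[])
 * has the same value in cur[] and target[], so the unused qwords could be
 * rewritten without the HW noticing.
 */
static bool update_leaves_used_bits_alone(const uint64_t *cur,
					  const uint64_t *cur_used,
					  const uint64_t *target)
{
	for (int i = 0; i != NUM_QWORDS; i++)
		if ((cur[i] & cur_used[i]) != (target[i] & cur_used[i]))
			return false;
	return true;
}

int main(void)
{
	uint64_t cur[NUM_QWORDS]      = { 0x1 };       /* V=1 style entry */
	uint64_t cur_used[NUM_QWORDS] = { 0x1 };       /* HW only reads qword 0, bit 0 */
	uint64_t target[NUM_QWORDS]   = { 0x1, 0xff }; /* only changes an unused qword */

	printf("hitless candidate: %d\n",
	       update_leaves_used_bits_alone(cur, cur_used, target));
	return 0;
}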


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 05/19] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

This allows writing the flow of arm_smmu_write_strtab_ent() around abort
and bypass domains more naturally.

Note that the core code no longer supplies NULL domains, though there is
still a flow in the driver that ends up in arm_smmu_write_strtab_ent() with
NULL. A later patch will remove it.

Remove the duplicate calculation of the STE in arm_smmu_init_bypass_stes()
and remove the force parameter. arm_smmu_rmr_install_bypass_ste() can now
simply invoke arm_smmu_make_bypass_ste() directly.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 89 +++++++++++----------
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6e6b1ebb5ac0ef..91e2bd1d8ed40b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1418,6 +1418,24 @@ static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
 	}
 }
 
+static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
+{
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
+}
+
+static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
+{
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS));
+	target->data[1] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
@@ -1428,37 +1446,31 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	struct arm_smmu_domain *smmu_domain = master->domain;
 	struct arm_smmu_ste target = {};
 
-	if (smmu_domain) {
-		switch (smmu_domain->stage) {
-		case ARM_SMMU_DOMAIN_S1:
-			cd_table = &master->cd_table;
-			break;
-		case ARM_SMMU_DOMAIN_S2:
-			s2_cfg = &smmu_domain->s2_cfg;
-			break;
-		default:
-			break;
-		}
+	if (!smmu_domain) {
+		if (disable_bypass)
+			arm_smmu_make_abort_ste(&target);
+		else
+			arm_smmu_make_bypass_ste(&target);
+		arm_smmu_write_ste(smmu, sid, dst, &target);
+		return;
+	}
+
+	switch (smmu_domain->stage) {
+	case ARM_SMMU_DOMAIN_S1:
+		cd_table = &master->cd_table;
+		break;
+	case ARM_SMMU_DOMAIN_S2:
+		s2_cfg = &smmu_domain->s2_cfg;
+		break;
+	case ARM_SMMU_DOMAIN_BYPASS:
+		arm_smmu_make_bypass_ste(&target);
+		arm_smmu_write_ste(smmu, sid, dst, &target);
+		return;
 	}
 
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
-	/* Bypass/fault */
-	if (!smmu_domain || !(cd_table || s2_cfg)) {
-		if (!smmu_domain && disable_bypass)
-			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
-		else
-			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
-
-		target.data[0] = cpu_to_le64(val);
-		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
-						STRTAB_STE_1_SHCFG_INCOMING));
-		target.data[2] = 0; /* Nuke the VMID */
-		arm_smmu_write_ste(smmu, sid, dst, &target);
-		return;
-	}
-
 	if (cd_table) {
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
@@ -1504,21 +1516,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
-				      unsigned int nent, bool force)
+				      unsigned int nent)
 {
 	unsigned int i;
-	u64 val = STRTAB_STE_0_V;
-
-	if (disable_bypass && !force)
-		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
-	else
-		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
 	for (i = 0; i < nent; ++i) {
-		strtab->data[0] = cpu_to_le64(val);
-		strtab->data[1] = cpu_to_le64(FIELD_PREP(
-			STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
-		strtab->data[2] = 0;
+		if (disable_bypass)
+			arm_smmu_make_abort_ste(strtab);
+		else
+			arm_smmu_make_bypass_ste(strtab);
 		strtab++;
 	}
 }
@@ -1546,7 +1552,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 		return -ENOMEM;
 	}
 
-	arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT, false);
+	arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT);
 	arm_smmu_write_strtab_l1_desc(strtab, desc);
 	return 0;
 }
@@ -3168,7 +3174,7 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 	reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
 	cfg->strtab_base_cfg = reg;
 
-	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents, false);
+	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
 	return 0;
 }
 
@@ -3879,7 +3885,6 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
 	iort_get_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
 
 	list_for_each_entry(e, &rmr_list, list) {
-		struct arm_smmu_ste *step;
 		struct iommu_iort_rmr_data *rmr;
 		int ret, i;
 
@@ -3892,8 +3897,8 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
 				continue;
 			}
 
-			step = arm_smmu_get_step_for_sid(smmu, rmr->sids[i]);
-			arm_smmu_init_bypass_stes(step, 1, true);
+			arm_smmu_make_bypass_ste(
+				arm_smmu_get_step_for_sid(smmu, rmr->sids[i]));
 		}
 	}
 
-- 
2.42.0
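
For reference, the flow the hunk above arrives at can be condensed to the
sketch below (same variables and helpers as in the diff; the surrounding
function, error handling and sync details are omitted):

	struct arm_smmu_ste target;

	if (disable_bypass)
		arm_smmu_make_abort_ste(&target);
	else
		arm_smmu_make_bypass_ste(&target);
	arm_smmu_write_ste(smmu, sid, dst, &target);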


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 06/19] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste()
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Logically arm_smmu_init_strtab_linear() is the function that allocates and
populates the stream table with the initial value of the STEs. After this
function returns the stream table should be fully ready.

arm_smmu_rmr_install_bypass_ste() adjusts the initial stream table to force
any SIDs that the FW says have IOMMU_RESV_DIRECT to use bypass. This
ensures there is no disruption to the identity mapping during boot.

Put arm_smmu_rmr_install_bypass_ste() into arm_smmu_init_strtab_linear();
it already executes immediately after arm_smmu_init_strtab_linear().

No functional change intended.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 91e2bd1d8ed40b..27eac8d4d86f03 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -86,6 +86,8 @@ static struct arm_smmu_option_prop arm_smmu_options[] = {
 	{ 0, NULL},
 };
 
+static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu);
+
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
 	int i = 0;
@@ -3175,6 +3177,9 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 	cfg->strtab_base_cfg = reg;
 
 	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
+
+	/* Check for RMRs and install bypass STEs if any */
+	arm_smmu_rmr_install_bypass_ste(smmu);
 	return 0;
 }
 
@@ -3988,9 +3993,6 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	/* Record our private device structure */
 	platform_set_drvdata(pdev, smmu);
 
-	/* Check for RMRs and install bypass STEs if any */
-	arm_smmu_rmr_install_bypass_ste(smmu);
-
 	/* Reset the device */
 	ret = arm_smmu_device_reset(smmu, bypass);
 	if (ret)
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 07/19] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

This is preparation for moving the STE calculation higher up into the call
chain and removing arm_smmu_write_strtab_ent(). These new functions will be
called directly from attach_dev.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 115 +++++++++++---------
 1 file changed, 63 insertions(+), 52 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 27eac8d4d86f03..00d960756823e3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1438,13 +1438,70 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
 }
 
+static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+				      struct arm_smmu_master *master,
+				      struct arm_smmu_ctx_desc_cfg *cd_table)
+{
+	struct arm_smmu_device *smmu = master->smmu;
+
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
+		FIELD_PREP(STRTAB_STE_0_S1FMT, cd_table->s1fmt) |
+		(cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
+		FIELD_PREP(STRTAB_STE_0_S1CDMAX, cd_table->s1cdmax));
+
+	target->data[1] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
+		FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+		FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+		FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
+		((smmu->features & ARM_SMMU_FEAT_STALLS &&
+		  !master->stall_enabled) ?
+			 STRTAB_STE_1_S1STALLD :
+			 0) |
+		FIELD_PREP(STRTAB_STE_1_EATS,
+			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0) |
+		FIELD_PREP(STRTAB_STE_1_STRW,
+			   (smmu->features & ARM_SMMU_FEAT_E2H) ?
+				   STRTAB_STE_1_STRW_EL2 :
+				   STRTAB_STE_1_STRW_NSEL1));
+}
+
+static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
+					struct arm_smmu_master *master,
+					struct arm_smmu_domain *smmu_domain)
+{
+	struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
+
+	memset(target, 0, sizeof(*target));
+
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS));
+
+	target->data[1] |= cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_1_EATS,
+			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
+
+	target->data[2] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
+		FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
+		STRTAB_STE_2_S2AA64 |
+#ifdef __BIG_ENDIAN
+		STRTAB_STE_2_S2ENDI |
+#endif
+		STRTAB_STE_2_S2PTW |
+		STRTAB_STE_2_S2R);
+
+	target->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
-	u64 val;
 	struct arm_smmu_device *smmu = master->smmu;
-	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
-	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 	struct arm_smmu_ste target = {};
 
@@ -1459,61 +1516,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 	switch (smmu_domain->stage) {
 	case ARM_SMMU_DOMAIN_S1:
-		cd_table = &master->cd_table;
+		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
 		break;
 	case ARM_SMMU_DOMAIN_S2:
-		s2_cfg = &smmu_domain->s2_cfg;
+		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
 		break;
 	case ARM_SMMU_DOMAIN_BYPASS:
 		arm_smmu_make_bypass_ste(&target);
-		arm_smmu_write_ste(smmu, sid, dst, &target);
-		return;
+		break;
 	}
-
-	/* Nuke the existing STE_0 value, as we're going to rewrite it */
-	val = STRTAB_STE_0_V;
-
-	if (cd_table) {
-		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
-			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
-
-		target.data[1] = cpu_to_le64(
-			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
-			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
-			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
-			 FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
-			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
-
-		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
-		    !master->stall_enabled)
-			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
-
-		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
-			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
-			FIELD_PREP(STRTAB_STE_0_S1CDMAX, cd_table->s1cdmax) |
-			FIELD_PREP(STRTAB_STE_0_S1FMT, cd_table->s1fmt);
-	}
-
-	if (s2_cfg) {
-		target.data[2] = cpu_to_le64(
-			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
-			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
-#ifdef __BIG_ENDIAN
-			 STRTAB_STE_2_S2ENDI |
-#endif
-			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
-			 STRTAB_STE_2_S2R);
-
-		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
-
-		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
-	}
-
-	if (master->ats_enabled)
-		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
-						 STRTAB_STE_1_EATS_TRANS));
-
-	target.data[0] = cpu_to_le64(val);
 	arm_smmu_write_ste(smmu, sid, dst, &target);
 }
 
-- 
2.42.0
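
The make-STE helpers above assemble each qword in a single expression built
from FIELD_PREP(). As a rough standalone model of what FIELD_PREP() and
FIELD_GET() do, here is a compilable demo; the mask value, the helper names
and main() are invented for illustration and are not the driver's STRTAB_*
definitions or the <linux/bitfield.h> implementation:

#include <stdint.h>
#include <stdio.h>

/* Illustrative 3-bit field at bits [7:5]; not a real STE field mask */
#define DEMO_CFG_MASK	0xe0ULL

static uint64_t demo_field_prep(uint64_t mask, uint64_t val)
{
	/* Shift the value up to the field's position and clamp to the mask */
	return (val << __builtin_ctzll(mask)) & mask;
}

static uint64_t demo_field_get(uint64_t mask, uint64_t reg)
{
	/* Mask out the field and shift it back down to bit 0 */
	return (reg & mask) >> __builtin_ctzll(mask);
}

int main(void)
{
	/* Pack a config value of 5 into the field, then read it back */
	uint64_t qword = demo_field_prep(DEMO_CFG_MASK, 5);

	printf("qword=0x%llx cfg=%llu\n",
	       (unsigned long long)qword,
	       (unsigned long long)demo_field_get(DEMO_CFG_MASK, qword));
	return 0;
}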


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 08/19] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste()
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Half the code was living in arm_smmu_domain_finalise_s2(); just move it
here and take the values directly from the pgtbl_ops instead of storing
copies.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 ++++++++++++---------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 --
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 00d960756823e3..2c06d3e3abe2b1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1474,6 +1474,11 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 					struct arm_smmu_domain *smmu_domain)
 {
 	struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
+	const struct io_pgtable_cfg *pgtbl_cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
+	typeof(&pgtbl_cfg->arm_lpae_s2_cfg.vtcr) vtcr =
+		&pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
+	u64 vtcr_val;
 
 	memset(target, 0, sizeof(*target));
 
@@ -1485,9 +1490,16 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 		FIELD_PREP(STRTAB_STE_1_EATS,
 			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
 
+	vtcr_val = FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, vtcr->tsz) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, vtcr->sl) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, vtcr->irgn) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, vtcr->orgn) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, vtcr->sh) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, vtcr->tg) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, vtcr->ps);
 	target->data[2] = cpu_to_le64(
 		FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
-		FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
+		FIELD_PREP(STRTAB_STE_2_VTCR, vtcr_val) |
 		STRTAB_STE_2_S2AA64 |
 #ifdef __BIG_ENDIAN
 		STRTAB_STE_2_S2ENDI |
@@ -1495,7 +1507,8 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 		STRTAB_STE_2_S2PTW |
 		STRTAB_STE_2_S2R);
 
-	target->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+	target->data[3] = cpu_to_le64(pgtbl_cfg->arm_lpae_s2_cfg.vttbr &
+				      STRTAB_STE_3_S2TTB_MASK);
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
@@ -2252,7 +2265,6 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 	int vmid;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg;
-	typeof(&pgtbl_cfg->arm_lpae_s2_cfg.vtcr) vtcr;
 
 	/* Reserve VMID 0 for stage-2 bypass STEs */
 	vmid = ida_alloc_range(&smmu->vmid_map, 1, (1 << smmu->vmid_bits) - 1,
@@ -2260,16 +2272,7 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 	if (vmid < 0)
 		return vmid;
 
-	vtcr = &pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
 	cfg->vmid	= (u16)vmid;
-	cfg->vttbr	= pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
-	cfg->vtcr	= FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, vtcr->tsz) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, vtcr->sl) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, vtcr->irgn) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, vtcr->orgn) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, vtcr->sh) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, vtcr->tg) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, vtcr->ps);
 	return 0;
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 27ddf1acd12cea..1be0c1151c50c3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -609,8 +609,6 @@ struct arm_smmu_ctx_desc_cfg {
 
 struct arm_smmu_s2_cfg {
 	u16				vmid;
-	u64				vttbr;
-	u64				vtcr;
 };
 
 struct arm_smmu_strtab_cfg {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 09/19] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

The BTM support wants to be able to change the ASID of any smmu_domain.
When it goes to do this it holds the arm_smmu_asid_lock and iterates over
the target domain's devices list.

During attach of an S1 domain we must ensure that the devices list and
CD are in sync, otherwise we could miss CD updates or a parallel CD update
could push an out-of-date CD.

This is pretty complicated, and works today because arm_smmu_detach_dev()
removes the CD table from the STE before working on the CD entries.

The next patch will allow the CD table to remain in the STE, so solve this
race by holding the lock for a longer period. The lock covers both the
changes to the device list and the CD table entries.

Move arm_smmu_detach_dev() until after we have initialized the domain so
the lock can be held for less time.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++---------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 2c06d3e3abe2b1..a29421f133a3c0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2535,8 +2535,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		return -EBUSY;
 	}
 
-	arm_smmu_detach_dev(master);
-
 	mutex_lock(&smmu_domain->init_mutex);
 
 	if (!smmu_domain->smmu) {
@@ -2549,7 +2547,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	mutex_unlock(&smmu_domain->init_mutex);
 	if (ret)
-		return ret;
+		goto out_unlock;
+
+	/*
+	 * Prevent arm_smmu_share_asid() from trying to change the ASID
+	 * of either the old or new domain while we are working on it.
+	 * This allows the STE and the smmu_domain->devices list to
+	 * be inconsistent during this routine.
+	 */
+	mutex_lock(&arm_smmu_asid_lock);
+
+	arm_smmu_detach_dev(master);
 
 	master->domain = smmu_domain;
 
@@ -2576,13 +2584,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			}
 		}
 
-		/*
-		 * Prevent SVA from concurrently modifying the CD or writing to
-		 * the CD entry
-		 */
-		mutex_lock(&arm_smmu_asid_lock);
 		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
-		mutex_unlock(&arm_smmu_asid_lock);
 		if (ret) {
 			master->domain = NULL;
 			goto out_list_del;
@@ -2592,13 +2594,15 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	arm_smmu_install_ste_for_dev(master);
 
 	arm_smmu_enable_ats(master);
-	return 0;
+	goto out_unlock;
 
 out_list_del:
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_del(&master->domain_head);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
+out_unlock:
+	mutex_unlock(&arm_smmu_asid_lock);
 	return ret;
 }
 
-- 
2.42.0
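
Condensed from the hunks above, the attach path after this patch roughly
follows the outline below; the error paths, the devices_lock handling and
the CD table allocation are omitted, so treat it as a sketch rather than
the exact code:

	mutex_lock(&arm_smmu_asid_lock);

	arm_smmu_detach_dev(master);
	master->domain = smmu_domain;
	/* link master into smmu_domain->devices under devices_lock */
	ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
	arm_smmu_install_ste_for_dev(master);
	arm_smmu_enable_ats(master);

	mutex_unlock(&arm_smmu_asid_lock);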


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 10/19] iommu/arm-smmu-v3: Compute the STE only once for each master
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Currently arm_smmu_install_ste_for_dev() iterates over every SID and
computes from scratch an identical STE. Every SID should have the same STE
contents. Turn this inside out so that the STE is supplied by the caller
and arm_smmu_install_ste_for_dev() simply installs it to every SID.

This is possible now that the STE generation no longer determines what sequence
should be used to program it.

This allows splitting the STE calculation up according to the call site,
which following patches will make use of, and removes the confusing NULL
domain special case that only supported arm_smmu_detach_dev().

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 58 ++++++++-------------
 1 file changed, 22 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a29421f133a3c0..ce1bbdba66c48a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1511,36 +1511,6 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 				      STRTAB_STE_3_S2TTB_MASK);
 }
 
-static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      struct arm_smmu_ste *dst)
-{
-	struct arm_smmu_device *smmu = master->smmu;
-	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_ste target = {};
-
-	if (!smmu_domain) {
-		if (disable_bypass)
-			arm_smmu_make_abort_ste(&target);
-		else
-			arm_smmu_make_bypass_ste(&target);
-		arm_smmu_write_ste(smmu, sid, dst, &target);
-		return;
-	}
-
-	switch (smmu_domain->stage) {
-	case ARM_SMMU_DOMAIN_S1:
-		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
-		break;
-	case ARM_SMMU_DOMAIN_S2:
-		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
-		break;
-	case ARM_SMMU_DOMAIN_BYPASS:
-		arm_smmu_make_bypass_ste(&target);
-		break;
-	}
-	arm_smmu_write_ste(smmu, sid, dst, &target);
-}
-
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
 				      unsigned int nent)
 {
@@ -2362,7 +2332,8 @@ arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	}
 }
 
-static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
+static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
+					 const struct arm_smmu_ste *target)
 {
 	int i, j;
 	struct arm_smmu_device *smmu = master->smmu;
@@ -2379,7 +2350,7 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 		if (j < i)
 			continue;
 
-		arm_smmu_write_strtab_ent(master, sid, step);
+		arm_smmu_write_ste(smmu, sid, step, target);
 	}
 }
 
@@ -2486,6 +2457,7 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
 	unsigned long flags;
+	struct arm_smmu_ste target;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	if (!smmu_domain)
@@ -2499,7 +2471,11 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	arm_smmu_install_ste_for_dev(master);
+	if (disable_bypass)
+		arm_smmu_make_abort_ste(&target);
+	else
+		arm_smmu_make_bypass_ste(&target);
+	arm_smmu_install_ste_for_dev(master, &target);
 	/*
 	 * Clearing the CD entry isn't strictly required to detach the domain
 	 * since the table is uninstalled anyway, but it helps avoid confusion
@@ -2514,6 +2490,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret = 0;
 	unsigned long flags;
+	struct arm_smmu_ste target;
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2575,7 +2552,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	list_add(&master->domain_head, &smmu_domain->devices);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
+	switch (smmu_domain->stage) {
+	case ARM_SMMU_DOMAIN_S1:
 		if (!master->cd_table.cdtab) {
 			ret = arm_smmu_alloc_cd_tables(master);
 			if (ret) {
@@ -2589,9 +2567,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			master->domain = NULL;
 			goto out_list_del;
 		}
-	}
 
-	arm_smmu_install_ste_for_dev(master);
+		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
+		break;
+	case ARM_SMMU_DOMAIN_S2:
+		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
+		break;
+	case ARM_SMMU_DOMAIN_BYPASS:
+		arm_smmu_make_bypass_ste(&target);
+		break;
+	}
+	arm_smmu_install_ste_for_dev(master, &target);
 
 	arm_smmu_enable_ats(master);
 	goto out_unlock;
-- 
2.42.0
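
With this change the detach and attach paths share the same
caller-computes-the-STE shape; condensed from the hunks above (S1 case
shown, the other stages are analogous):

	struct arm_smmu_ste target;

	/* detach: park the SIDs on abort or bypass */
	if (disable_bypass)
		arm_smmu_make_abort_ste(&target);
	else
		arm_smmu_make_bypass_ste(&target);
	arm_smmu_install_ste_for_dev(master, &target);

	/* attach, S1 case: point the SIDs at the CD table */
	arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
	arm_smmu_install_ste_for_dev(master, &target);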


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 10/19] iommu/arm-smmu-v3: Compute the STE only once for each master
@ 2023-10-11  0:33   ` Jason Gunthorpe
  0 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Currently arm_smmu_install_ste_for_dev() iterates over every SID and
computes from scratch an identical STE. Every SID should have the same STE
contents. Turn this inside out so that the STE is supplied by the caller
and arm_smmu_install_ste_for_dev() simply installs it to every SID.

This is possible now that the STE generation does not inform what sequence
should be used to program it.

This allows splitting the STE calculation up according to the call site,
which following patches will make use of, and removes the confusing NULL
domain special case that only supported arm_smmu_detach_dev().

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 58 ++++++++-------------
 1 file changed, 22 insertions(+), 36 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index a29421f133a3c0..ce1bbdba66c48a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1511,36 +1511,6 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 				      STRTAB_STE_3_S2TTB_MASK);
 }
 
-static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      struct arm_smmu_ste *dst)
-{
-	struct arm_smmu_device *smmu = master->smmu;
-	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_ste target = {};
-
-	if (!smmu_domain) {
-		if (disable_bypass)
-			arm_smmu_make_abort_ste(&target);
-		else
-			arm_smmu_make_bypass_ste(&target);
-		arm_smmu_write_ste(smmu, sid, dst, &target);
-		return;
-	}
-
-	switch (smmu_domain->stage) {
-	case ARM_SMMU_DOMAIN_S1:
-		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
-		break;
-	case ARM_SMMU_DOMAIN_S2:
-		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
-		break;
-	case ARM_SMMU_DOMAIN_BYPASS:
-		arm_smmu_make_bypass_ste(&target);
-		break;
-	}
-	arm_smmu_write_ste(smmu, sid, dst, &target);
-}
-
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
 				      unsigned int nent)
 {
@@ -2362,7 +2332,8 @@ arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	}
 }
 
-static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
+static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
+					 const struct arm_smmu_ste *target)
 {
 	int i, j;
 	struct arm_smmu_device *smmu = master->smmu;
@@ -2379,7 +2350,7 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 		if (j < i)
 			continue;
 
-		arm_smmu_write_strtab_ent(master, sid, step);
+		arm_smmu_write_ste(smmu, sid, step, target);
 	}
 }
 
@@ -2486,6 +2457,7 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
 	unsigned long flags;
+	struct arm_smmu_ste target;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	if (!smmu_domain)
@@ -2499,7 +2471,11 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	arm_smmu_install_ste_for_dev(master);
+	if (disable_bypass)
+		arm_smmu_make_abort_ste(&target);
+	else
+		arm_smmu_make_bypass_ste(&target);
+	arm_smmu_install_ste_for_dev(master, &target);
 	/*
 	 * Clearing the CD entry isn't strictly required to detach the domain
 	 * since the table is uninstalled anyway, but it helps avoid confusion
@@ -2514,6 +2490,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret = 0;
 	unsigned long flags;
+	struct arm_smmu_ste target;
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2575,7 +2552,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	list_add(&master->domain_head, &smmu_domain->devices);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
+	switch (smmu_domain->stage) {
+	case ARM_SMMU_DOMAIN_S1:
 		if (!master->cd_table.cdtab) {
 			ret = arm_smmu_alloc_cd_tables(master);
 			if (ret) {
@@ -2589,9 +2567,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			master->domain = NULL;
 			goto out_list_del;
 		}
-	}
 
-	arm_smmu_install_ste_for_dev(master);
+		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
+		break;
+	case ARM_SMMU_DOMAIN_S2:
+		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
+		break;
+	case ARM_SMMU_DOMAIN_BYPASS:
+		arm_smmu_make_bypass_ste(&target);
+		break;
+	}
+	arm_smmu_install_ste_for_dev(master, &target);
 
 	arm_smmu_enable_ats(master);
 	goto out_unlock;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 11/19] iommu/arm-smmu-v3: Do not change the STE twice during arm_smmu_attach_dev()
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

This was needed because the STE code required the STE to be in
ABORT/BYPASS in order to program a cdtable or S2 STE. Now that the STE
code can automatically handle all transitions, we can remove this step
from the attach_dev flow.

A few small bugs exist because of this:

1) If the core code does BLOCKED -> UNMANAGED with disable_bypass=false
   then there will be a moment where the STE points at BYPASS. Since
   this can be done by VFIO/IOMMUFD it is a small security race.

2) If the core code does IDENTITY -> DMA then any IOMMU_RESV_DIRECT
   regions will temporarily become BLOCKED. We'd like drivers to
   work in a way that allows IOMMU_RESV_DIRECT to be continuously
   functional during these transitions.

Make arm_smmu_release_device() put the STE back to the correct
ABORT/BYPASS setting. Fix a bug where an IOMMU_RESV_DIRECT region was
ignored on this path.

Note that this subtly depends on the prior arm_smmu_asid_lock change: the
STE must be put to non-paging before removing the device from the linked
list in order to avoid races with arm_smmu_share_asid().
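
A minimal sketch of the release-path restore this adds (mirroring the diff
below; require_direct reflects IOMMU_RESV_DIRECT regions):

	struct arm_smmu_ste target;

	/* Put the STE back to what arm_smmu_init_strtab() set up */
	if (disable_bypass && !dev->iommu->require_direct)
		arm_smmu_make_abort_ste(&target);
	else
		arm_smmu_make_bypass_ste(&target);
	arm_smmu_install_ste_for_dev(master, &target);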

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index ce1bbdba66c48a..b84cd91dc5e596 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2457,7 +2457,6 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
 	unsigned long flags;
-	struct arm_smmu_ste target;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	if (!smmu_domain)
@@ -2471,11 +2470,6 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	if (disable_bypass)
-		arm_smmu_make_abort_ste(&target);
-	else
-		arm_smmu_make_bypass_ste(&target);
-	arm_smmu_install_ste_for_dev(master, &target);
 	/*
 	 * Clearing the CD entry isn't strictly required to detach the domain
 	 * since the table is uninstalled anyway, but it helps avoid confusion
@@ -2827,9 +2821,18 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 static void arm_smmu_release_device(struct device *dev)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_ste target;
 
 	if (WARN_ON(arm_smmu_master_sva_enabled(master)))
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
+
+	/* Put the STE back to what arm_smmu_init_strtab() sets */
+	if (disable_bypass && !dev->iommu->require_direct)
+		arm_smmu_make_abort_ste(&target);
+	else
+		arm_smmu_make_bypass_ste(&target);
+	arm_smmu_install_ste_for_dev(master, &target);
+
 	arm_smmu_detach_dev(master);
 	arm_smmu_disable_pasid(master);
 	arm_smmu_remove_master(master);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 12/19] iommu/arm-smmu-v3: Put writing the context descriptor in the right order
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Get closer to the IOMMU API ideal that changes between domains can be
hitless. The ordering for the CD table entry is not entirely clean from
this perspective.

When switching away from a STE with a CD table programmed in it, we should
write the new STE first, then clear any old data in the CD entry.

If we are programming a CD table into a STE for the first time, then the CD
entry should be programmed before the STE is loaded.

If we are replacing a CD table entry when the STE already points at the CD
entry, then we just need to do the make/break sequence.
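
In outline, the two interesting orderings in the attach path look like this
(a condensed sketch of the diff below, not a literal excerpt):

	/* First time the STE points at a CD table: CD first, then STE */
	arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
	arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
	arm_smmu_install_ste_for_dev(master, &target);

	/* Switching away from a CD table: new STE first, then clear the CD */
	arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
	arm_smmu_install_ste_for_dev(master, &target);
	if (master->cd_table.cdtab)
		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);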

Lift this code out of arm_smmu_detach_dev() so it can all be sequenced
properly. The only other caller is arm_smmu_release_device() and it is
going to free the cdtable anyhow, so it doesn't matter what is in it.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 ++++++++++++++-------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b84cd91dc5e596..540f38bb44873e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2470,14 +2470,6 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	/*
-	 * Clearing the CD entry isn't strictly required to detach the domain
-	 * since the table is uninstalled anyway, but it helps avoid confusion
-	 * in the call to arm_smmu_write_ctx_desc on the next attach (which
-	 * expects the entry to be empty).
-	 */
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1 && master->cd_table.cdtab)
-		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
 }
 
 static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
@@ -2554,6 +2546,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 				master->domain = NULL;
 				goto out_list_del;
 			}
+		} else {
+			/*
+			 * arm_smmu_write_ctx_desc() relies on the entry being
+			 * invalid to work, clear any existing entry.
+			 */
+			ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
+						      NULL);
+			if (ret) {
+				master->domain = NULL;
+				goto out_list_del;
+			}
 		}
 
 		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
@@ -2563,15 +2566,23 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		}
 
 		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
+		arm_smmu_install_ste_for_dev(master, &target);
 		break;
 	case ARM_SMMU_DOMAIN_S2:
 		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
+		arm_smmu_install_ste_for_dev(master, &target);
+		if (master->cd_table.cdtab)
+			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
+						      NULL);
 		break;
 	case ARM_SMMU_DOMAIN_BYPASS:
 		arm_smmu_make_bypass_ste(&target);
+		arm_smmu_install_ste_for_dev(master, &target);
+		if (master->cd_table.cdtab)
+			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
+						      NULL);
 		break;
 	}
-	arm_smmu_install_ste_for_dev(master, &target);
 
 	arm_smmu_enable_ats(master);
 	goto out_unlock;
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 13/19] iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats()
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

The caller already has the domain; just pass it in. A following patch will
remove master->domain.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 540f38bb44873e..34893801d05c73 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2369,12 +2369,12 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
 	return dev_is_pci(dev) && pci_ats_supported(to_pci_dev(dev));
 }
 
-static void arm_smmu_enable_ats(struct arm_smmu_master *master)
+static void arm_smmu_enable_ats(struct arm_smmu_master *master,
+				struct arm_smmu_domain *smmu_domain)
 {
 	size_t stu;
 	struct pci_dev *pdev;
 	struct arm_smmu_device *smmu = master->smmu;
-	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	/* Don't enable ATS at the endpoint if it's not enabled in the STE */
 	if (!master->ats_enabled)
@@ -2390,10 +2390,9 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
 		dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
 }
 
-static void arm_smmu_disable_ats(struct arm_smmu_master *master)
+static void arm_smmu_disable_ats(struct arm_smmu_master *master,
+				 struct arm_smmu_domain *smmu_domain)
 {
-	struct arm_smmu_domain *smmu_domain = master->domain;
-
 	if (!master->ats_enabled)
 		return;
 
@@ -2462,7 +2461,7 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 	if (!smmu_domain)
 		return;
 
-	arm_smmu_disable_ats(master);
+	arm_smmu_disable_ats(master, smmu_domain);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_del(&master->domain_head);
@@ -2584,7 +2583,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		break;
 	}
 
-	arm_smmu_enable_ats(master);
+	arm_smmu_enable_ats(master, smmu_domain);
 	goto out_unlock;
 
 out_list_del:
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 14/19] iommu/arm-smmu-v3: Remove arm_smmu_master->domain
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Introducing global statics which are of type struct iommu_domain, not
struct arm_smmu_domain, makes it difficult to retain
arm_smmu_master->domain, as it can no longer point to an IDENTITY or
BLOCKED domain.

The only place that uses the value is arm_smmu_detach_dev(). Change things
to work like other drivers and call iommu_get_domain_for_dev() to obtain
the current domain.
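
The detach path then starts out roughly like this (a sketch of the diff
below):

	struct iommu_domain *domain = iommu_get_domain_for_dev(master->dev);
	struct arm_smmu_domain *smmu_domain;

	if (!domain)
		return;
	smmu_domain = to_smmu_domain(domain);
	arm_smmu_disable_ats(master, smmu_domain);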

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 21 +++++++--------------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 2 files changed, 7 insertions(+), 15 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 34893801d05c73..26d3200c127450 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2455,19 +2455,20 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
+	struct iommu_domain *domain = iommu_get_domain_for_dev(master->dev);
+	struct arm_smmu_domain *smmu_domain;
 	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = master->domain;
 
-	if (!smmu_domain)
+	if (!domain)
 		return;
 
+	smmu_domain = to_smmu_domain(domain);
 	arm_smmu_disable_ats(master, smmu_domain);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_del(&master->domain_head);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
-	master->domain = NULL;
 	master->ats_enabled = false;
 }
 
@@ -2521,8 +2522,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	arm_smmu_detach_dev(master);
 
-	master->domain = smmu_domain;
-
 	/*
 	 * The SMMU does not support enabling ATS with bypass. When the STE is
 	 * in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests and
@@ -2541,10 +2540,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	case ARM_SMMU_DOMAIN_S1:
 		if (!master->cd_table.cdtab) {
 			ret = arm_smmu_alloc_cd_tables(master);
-			if (ret) {
-				master->domain = NULL;
+			if (ret)
 				goto out_list_del;
-			}
 		} else {
 			/*
 			 * arm_smmu_write_ctx_desc() relies on the entry being
@@ -2552,17 +2549,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			 */
 			ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
 						      NULL);
-			if (ret) {
-				master->domain = NULL;
+			if (ret)
 				goto out_list_del;
-			}
 		}
 
 		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
-		if (ret) {
-			master->domain = NULL;
+		if (ret)
 			goto out_list_del;
-		}
 
 		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
 		arm_smmu_install_ste_for_dev(master, &target);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 1be0c1151c50c3..21f2f73501019a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -695,7 +695,6 @@ struct arm_smmu_stream {
 struct arm_smmu_master {
 	struct arm_smmu_device		*smmu;
 	struct device			*dev;
-	struct arm_smmu_domain		*domain;
 	struct list_head		domain_head;
 	struct arm_smmu_stream		*streams;
 	/* Locked by the iommu core using the group mutex */
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 15/19] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Move to the new static global for identity domains. Move all the logic out
of arm_smmu_attach_dev() into an identity-only function.
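
The identity domain becomes a static singleton wired into arm_smmu_ops,
along the lines of the diff below:

	static const struct iommu_domain_ops arm_smmu_identity_ops = {
		.attach_dev = arm_smmu_attach_dev_identity,
	};

	static struct iommu_domain arm_smmu_identity_domain = {
		.type = IOMMU_DOMAIN_IDENTITY,
		.ops = &arm_smmu_identity_ops,
	};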

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 +++++++++++++++------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 26d3200c127450..1e03bdedfabad1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2149,8 +2149,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 		return arm_smmu_sva_domain_alloc();
 
 	if (type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_DMA &&
-	    type != IOMMU_DOMAIN_IDENTITY)
+	    type != IOMMU_DOMAIN_DMA)
 		return NULL;
 
 	/*
@@ -2258,11 +2257,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 
-	if (domain->type == IOMMU_DOMAIN_IDENTITY) {
-		smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
-		return 0;
-	}
-
 	/* Restrict the stage to what we can actually support */
 	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
 		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
@@ -2459,7 +2453,7 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 	struct arm_smmu_domain *smmu_domain;
 	unsigned long flags;
 
-	if (!domain)
+	if (!domain || !(domain->type & __IOMMU_DOMAIN_PAGING))
 		return;
 
 	smmu_domain = to_smmu_domain(domain);
@@ -2522,15 +2516,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	arm_smmu_detach_dev(master);
 
-	/*
-	 * The SMMU does not support enabling ATS with bypass. When the STE is
-	 * in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests and
-	 * Translated transactions are denied as though ATS is disabled for the
-	 * stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
-	 * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
-	 */
-	if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
-		master->ats_enabled = arm_smmu_ats_supported(master);
+	master->ats_enabled = arm_smmu_ats_supported(master);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_add(&master->domain_head, &smmu_domain->devices);
@@ -2567,13 +2553,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
 						      NULL);
 		break;
-	case ARM_SMMU_DOMAIN_BYPASS:
-		arm_smmu_make_bypass_ste(&target);
-		arm_smmu_install_ste_for_dev(master, &target);
-		if (master->cd_table.cdtab)
-			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
-						      NULL);
-		break;
 	}
 
 	arm_smmu_enable_ats(master, smmu_domain);
@@ -2589,6 +2568,60 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	return ret;
 }
 
+static int arm_smmu_attach_dev_ste(struct device *dev,
+				   struct arm_smmu_ste *ste)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+	if (arm_smmu_master_sva_enabled(master))
+		return -EBUSY;
+
+	/*
+	 * Do not allow any ASID to be changed while we are working on the STE,
+	 * otherwise we could miss invalidations.
+	 */
+	mutex_lock(&arm_smmu_asid_lock);
+
+	/*
+	 * The SMMU does not support enabling ATS with bypass/abort. When the
+	 * STE is in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests
+	 * and Translated transactions are denied as though ATS is disabled for
+	 * the stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
+	 * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
+	 */
+	arm_smmu_detach_dev(master);
+
+	arm_smmu_install_ste_for_dev(master, ste);
+	mutex_unlock(&arm_smmu_asid_lock);
+
+	/*
+	 * This has to be done after removing the master from the
+	 * arm_smmu_domain->devices to avoid races updating the same context
+	 * descriptor from arm_smmu_share_asid().
+	 */
+	if (master->cd_table.cdtab)
+		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
+	return 0;
+}
+
+static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_make_bypass_ste(&ste);
+	return arm_smmu_attach_dev_ste(dev, &ste);
+}
+
+static const struct iommu_domain_ops arm_smmu_identity_ops = {
+	.attach_dev = arm_smmu_attach_dev_identity,
+};
+
+static struct iommu_domain arm_smmu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &arm_smmu_identity_ops,
+};
+
 static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
 			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
 			      int prot, gfp_t gfp, size_t *mapped)
@@ -2981,6 +3014,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
 }
 
 static struct iommu_ops arm_smmu_ops = {
+	.identity_domain	= &arm_smmu_identity_domain,
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
 	.probe_device		= arm_smmu_probe_device,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 21f2f73501019a..154808f96718df 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -712,7 +712,6 @@ struct arm_smmu_master {
 enum arm_smmu_domain_stage {
 	ARM_SMMU_DOMAIN_S1 = 0,
 	ARM_SMMU_DOMAIN_S2,
-	ARM_SMMU_DOMAIN_BYPASS,
 };
 
 struct arm_smmu_domain {
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 16/19] iommu/arm-smmu-v3: Add a global static BLOCKED domain
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Using the same design as the IDENTITY domain, install an
STRTAB_STE_0_CFG_ABORT STE.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 1e03bdedfabad1..d42d5d0c03f812 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2622,6 +2622,24 @@ static struct iommu_domain arm_smmu_identity_domain = {
 	.ops = &arm_smmu_identity_ops,
 };
 
+static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_make_abort_ste(&ste);
+	return arm_smmu_attach_dev_ste(dev, &ste);
+}
+
+static const struct iommu_domain_ops arm_smmu_blocked_ops = {
+	.attach_dev = arm_smmu_attach_dev_blocked,
+};
+
+static struct iommu_domain arm_smmu_blocked_domain = {
+	.type = IOMMU_DOMAIN_BLOCKED,
+	.ops = &arm_smmu_blocked_ops,
+};
+
 static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
 			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
 			      int prot, gfp_t gfp, size_t *mapped)
@@ -3015,6 +3033,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
 
 static struct iommu_ops arm_smmu_ops = {
 	.identity_domain	= &arm_smmu_identity_domain,
+	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
 	.probe_device		= arm_smmu_probe_device,
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 17/19] iommu/arm-smmu-v3: Use the identity/blocked domain during release
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Consolidate some more code by having release call
arm_smmu_attach_dev_identity/blocked() instead of open coding this.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d42d5d0c03f812..95bb6cbe2fdb08 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2875,19 +2875,16 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 static void arm_smmu_release_device(struct device *dev)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
-	struct arm_smmu_ste target;
 
 	if (WARN_ON(arm_smmu_master_sva_enabled(master)))
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
 
 	/* Put the STE back to what arm_smmu_init_strtab() sets */
 	if (disable_bypass && !dev->iommu->require_direct)
-		arm_smmu_make_abort_ste(&target);
+		arm_smmu_attach_dev_blocked(&arm_smmu_blocked_domain, dev);
 	else
-		arm_smmu_make_bypass_ste(&target);
-	arm_smmu_install_ste_for_dev(master, &target);
+		arm_smmu_attach_dev_identity(&arm_smmu_identity_domain, dev);
 
-	arm_smmu_detach_dev(master);
 	arm_smmu_disable_pasid(master);
 	arm_smmu_remove_master(master);
 	if (master->cd_table.cdtab)
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 18/19] iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to finalize
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Instead of putting container_of() casts in the internals, use the proper
type in this call chain. This makes it easier to check that the two global
static domains are not leaking into call chains where they do not belong.

Passing in the smmu means the only caller no longer has to set it and then
unset it in the error path.
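
The caller side then reduces to something like this (a sketch of the diff
below; finalise sets smmu_domain->smmu only on success):

	if (!smmu_domain->smmu)
		ret = arm_smmu_domain_finalise(smmu_domain, smmu);
	else if (smmu_domain->smmu != smmu)
		ret = -EINVAL;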

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 32 ++++++++++-----------
 1 file changed, 16 insertions(+), 16 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 95bb6cbe2fdb08..6a4b6d23590e8f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -87,6 +87,8 @@ static struct arm_smmu_option_prop arm_smmu_options[] = {
 };
 
 static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu);
+static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
+				    struct arm_smmu_device *smmu);
 
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
@@ -2191,12 +2193,12 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
 	kfree(smmu_domain);
 }
 
-static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
+static int arm_smmu_domain_finalise_s1(struct arm_smmu_device *smmu,
+				       struct arm_smmu_domain *smmu_domain,
 				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	int ret;
 	u32 asid;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_ctx_desc *cd = &smmu_domain->cd;
 	typeof(&pgtbl_cfg->arm_lpae_s1_cfg.tcr) tcr = &pgtbl_cfg->arm_lpae_s1_cfg.tcr;
 
@@ -2228,11 +2230,11 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 	return ret;
 }
 
-static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
+static int arm_smmu_domain_finalise_s2(struct arm_smmu_device *smmu,
+				       struct arm_smmu_domain *smmu_domain,
 				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	int vmid;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg;
 
 	/* Reserve VMID 0 for stage-2 bypass STEs */
@@ -2245,17 +2247,17 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 	return 0;
 }
 
-static int arm_smmu_domain_finalise(struct iommu_domain *domain)
+static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
+				    struct arm_smmu_device *smmu)
 {
 	int ret;
 	unsigned long ias, oas;
 	enum io_pgtable_fmt fmt;
 	struct io_pgtable_cfg pgtbl_cfg;
 	struct io_pgtable_ops *pgtbl_ops;
-	int (*finalise_stage_fn)(struct arm_smmu_domain *,
+	int (*finalise_stage_fn)(struct arm_smmu_device *smmu,
+				 struct arm_smmu_domain *,
 				 struct io_pgtable_cfg *);
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
 
 	/* Restrict the stage to what we can actually support */
 	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
@@ -2294,17 +2296,18 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 	if (!pgtbl_ops)
 		return -ENOMEM;
 
-	domain->pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
-	domain->geometry.aperture_end = (1UL << pgtbl_cfg.ias) - 1;
-	domain->geometry.force_aperture = true;
+	smmu_domain->domain.pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
+	smmu_domain->domain.geometry.aperture_end = (1UL << pgtbl_cfg.ias) - 1;
+	smmu_domain->domain.geometry.force_aperture = true;
 
-	ret = finalise_stage_fn(smmu_domain, &pgtbl_cfg);
+	ret = finalise_stage_fn(smmu, smmu_domain, &pgtbl_cfg);
 	if (ret < 0) {
 		free_io_pgtable_ops(pgtbl_ops);
 		return ret;
 	}
 
 	smmu_domain->pgtbl_ops = pgtbl_ops;
+	smmu_domain->smmu = smmu;
 	return 0;
 }
 
@@ -2495,10 +2498,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	mutex_lock(&smmu_domain->init_mutex);
 
 	if (!smmu_domain->smmu) {
-		smmu_domain->smmu = smmu;
-		ret = arm_smmu_domain_finalise(domain);
-		if (ret)
-			smmu_domain->smmu = NULL;
+		ret = arm_smmu_domain_finalise(smmu_domain, smmu);
 	} else if (smmu_domain->smmu != smmu)
 		ret = -EINVAL;
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH 19/19] iommu/arm-smmu-v3: Convert to domain_alloc_paging()
  2023-10-11  0:33 ` Jason Gunthorpe
@ 2023-10-11  0:33   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-11  0:33 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Michael Shavit, Nicolin Chen

Now that the BLOCKED and IDENTITY behaviors are managed with their own
domains, change to the domain_alloc_paging() op.

For now SVA keeps using the old interface; eventually it will get its
own op that can pass in the device and mm_struct, which will let us
have a sane lifetime for the mmu_notifier.

Call arm_smmu_domain_finalise() early if dev is available.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 6a4b6d23590e8f..7c1dc96aa75d92 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2145,14 +2145,15 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
 
 static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 {
-	struct arm_smmu_domain *smmu_domain;
 
 	if (type == IOMMU_DOMAIN_SVA)
 		return arm_smmu_sva_domain_alloc();
+	return NULL;
+}
 
-	if (type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_DMA)
-		return NULL;
+static struct iommu_domain *arm_smmu_domain_alloc_paging(struct device *dev)
+{
+	struct arm_smmu_domain *smmu_domain;
 
 	/*
 	 * Allocate the domain and initialise some of its data structures.
@@ -2168,6 +2169,14 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	spin_lock_init(&smmu_domain->devices_lock);
 	INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
 
+	if (dev) {
+		struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+		if (arm_smmu_domain_finalise(smmu_domain, master->smmu)) {
+			kfree(smmu_domain);
+			return NULL;
+		}
+	}
 	return &smmu_domain->domain;
 }
 
@@ -3033,6 +3042,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
+	.domain_alloc_paging    = arm_smmu_domain_alloc_paging,
 	.probe_device		= arm_smmu_probe_device,
 	.release_device		= arm_smmu_release_device,
 	.device_group		= arm_smmu_device_group,
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-11  0:33   ` Jason Gunthorpe
@ 2023-10-12  8:10     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-12  8:10 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.
>
> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
>
> Thus, consolidate all the complexity here. Callers do not have to care
> about what STE transition they are doing, this function will handle
> everything optimally.
>
> Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
> required programming sequence to avoid creating an incoherent 'torn' STE
> in the HW caches. The update algorithm follows the same design that the
> driver already uses: it is safe to change bits that HW doesn't currently
> use and then do a single 64 bit update, with sync's in between.
>
> The basic idea is to express in a bitmask what bits the HW is actually
> using based on the V and CFG bits. Based on that mask we know what STE
> changes are safe and which are disruptive. We can count how many 64 bit
> QWORDS need a disruptive update and know if a step with V=0 is required.
>
> This gives two basic flows through the algorithm.
>
> If only a single 64 bit quantity needs disruptive replacement:
>  - Write the target value into all currently unused bits
>  - Write the single 64 bit quantity
>  - Zero the remaining different bits
>
> If multiple 64 bit quantities need disruptive replacement then do:
>  - Write V=0 to QWORD 0
>  - Write the entire STE except QWORD 0
>  - Write QWORD 0
>
> With HW flushes at each step, that can be skipped if the STE didn't change
> in that step.

This sounds pretty complicated... Is this complexity really required here?
Specifically, can we start with a naive version that always first
sets `V=0` before writing the STE? This still allows you to remove the
requirement that callers must first set the STE to abort
(supposedly to get rid of the arm_smmu_detach_dev call currently made
from arm_smmu_attach_dev) while being easier to digest.
The more sophisticated version can then be closer in the series to the
patch that requires it (supposedly this is to support replacing a
fully blocking/bypass STE with one that uses
STRTAB_STE_1_S1DSS_TERMINATE/STRTAB_STE_1_S1DSS_BYPASS when a pasid
domain is first attached?) at which point it's easier to reason about
its benefits and alternatives.

> @@ -1365,23 +1490,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                          STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>                          STRTAB_STE_2_S2R);
>
> -               dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +               target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>
>                 val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>         }
>
>         if (master->ats_enabled)
> -               dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +               target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>                                                  STRTAB_STE_1_EATS_TRANS));
>
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -       /* See comment in arm_smmu_write_ctx_desc() */
> -       WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -       /* It's likely that we'll want to use the new STE soon */
> -       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       target.data[0] = cpu_to_le64(val);

Can we get rid of this line and use target.data[0] everywhere above?
'val' isn't exactly a great name to describe the first word of the STE
and there's no need to defer writing data[0] anymore since this isn't
directly writing to the register.
(Feel free to ignore this if it's already addressed by subsequent patches)

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 12/19] iommu/arm-smmu-v3: Put writing the context descriptor in the right order
  2023-10-11  0:33   ` Jason Gunthorpe
@ 2023-10-12  9:01     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-12  9:01 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> If we are replacing a CD table entry when the STE already points at the CD
> entry then we just need to do the make/break sequence.

Do you mean when the STE already points at the CD table? What's the
make/break sequence?


>  static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> @@ -2554,6 +2546,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>                                 master->domain = NULL;
>                                 goto out_list_del;
>                         }
> +               } else {
> +                       /*
> +                        * arm_smmu_write_ctx_desc() relies on the entry being
> +                        * invalid to work, clear any existing entry.
> +                        */
> +                       ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
> +                                                     NULL);
> +                       if (ret) {
> +                               master->domain = NULL;
> +                               goto out_list_del;
> +                       }
>                 }
>
>                 ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
> @@ -2563,15 +2566,23 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>                 }
>
>                 arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
> +               arm_smmu_install_ste_for_dev(master, &target);

Even if it's handled correctly under the hood by the clever STE writing
logic, isn't it weird that we don't explicitly check whether the CD
table is already installed and skip arm_smmu_install_ste_for_dev in
that case?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-12  8:10     ` Michael Shavit
@ 2023-10-12 12:16       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 12:16 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Thu, Oct 12, 2023 at 04:10:50PM +0800, Michael Shavit wrote:

> This sounds pretty complicated....Is this complexity really required
> here?

It is first needed before 'iommu/arm-smmu-v3: Do not change the STE
twice during arm_smmu_attach_dev()' which is a couple of patches
further on.

Then it keeps getting relied on.

I don't think there is any simple answer here; the HW has this complex
requirement. The current code is also complex - arguably more complex
because of how leaky/fragile it is. There are even a couple of pages of
text in the spec describing how to do this, and they don't discuss the
hitless cases!

At the end this is only 32 lines and it replaces both
arm_smmu_write_ctx_desc() and arm_smmu_write_strtab_ent().

FWIW, I found the most difficult part to be the used bit calculation,
not the update algorithm. Difficult because it is hard to read and find
in the spec when things are IGNORED, but it is a "straightforward" job
of finding IGNORED cases and making the used bits 0.
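
The shape of it is roughly this (heavily trimmed, see
arm_smmu_get_ste_used() in the patch for the full thing):

	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
		return;			/* V=0: HW ignores everything else */

	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
	case STRTAB_STE_0_CFG_ABORT:
		break;			/* nothing beyond V/CFG is used */
	case STRTAB_STE_0_CFG_BYPASS:
		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
		break;
	/* ... the S1/S2 cases mark their own qwords ... */
	}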

> Specifically, can we start with a naive version that always first
> nukes `V=0` before writing the STE?

I'm a little worried doing so will subtly break things that are
currently working, as the current code does have cases which are
hitless.

Then we'd just need to change it anyhow in 5 patches or so.

> This still allows you to remove requirements that callers must have
> first set the STE to abort (supposedly to get rid of the
> arm_smmu_detach_dev call currently made from arm_smmu_attach_dev)
> while being easier to digest.  The more sophisticated version can
> then be closer in the series to the patch that requires it
> (supposedly this is to support replacing a fully blocking/bypass STE
> with one that uses
> STRTAB_STE_1_S1DSS_TERMINATE/STRTAB_STE_1_S1DSS_BYPASS when a pasid
> domain is first attached?) at which point it's easier to reason
> about its benefits and alternatives.

From memory there are many cases that use the full functionality:

 - IDENTITY -> DMA -> IDENTITY hitless with RESV_DIRECT
 - STE -> S1DSS -> STE hitless (PASID upgrade)
 - S1 -> BLOCKING -> S1 with active PASID hitless (iommufd case)
 - CD ASID change hitless (BTM S1 replacement)
 - CD quiet_cd hitless (SVA mm release)

Some of these are fragile and open coded today, e.g. the CD quiet_cd and
ASID changes both just edit the STE in place. At the end we always
build full target STE/CDs and always consistently store them.

This is a nice tool because we don't have to specially think about the
above 5 cases and painfully open code an FSM across several layers. We
just do it and it works. Then everything else that can be hitless also
just becomes hitless, even if we don't have a use case for it.

> Can we get rid of this line and use target.data[0] everywhere above?
> 'val' isn't exactly a great name to describe the first word of the STE
> and there's no need to defer writing data[0] anymore since this isn't
> directly writing to the register.
> (Feel free to ignore this if it's already addressed by subsequent patches)

Subsequent patches erase this function :)

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 12/19] iommu/arm-smmu-v3: Put writing the context descriptor in the right order
  2023-10-12  9:01     ` Michael Shavit
@ 2023-10-12 12:34       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-12 12:34 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Thu, Oct 12, 2023 at 05:01:16PM +0800, Michael Shavit wrote:
> On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > If we are replacing a CD table entry when the STE already points at the CD
> > entry then we just need to do the make/break sequence.
> 
> Do you mean when the STE already points at the CD table? 

Yes

> What's the make/break sequence?

When replacing a CD table entry at this point, the code makes the CD
table entry non-valid and then immediately makes it valid again. This
is because the CD code cannot (yet; ~10 patches later it does) handle a
Valid to Valid transition.
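
At this point in the series that is roughly:

	/* break: make the existing CD entry invalid */
	arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
	/* make: write the new CD and mark it valid again */
	arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);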

> > +               } else {
> > +                       /*
> > +                        * arm_smmu_write_ctx_desc() relies on the entry being
> > +                        * invalid to work, clear any existing entry.
> > +                        */
> > +                       ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
> > +                                                     NULL);
> > +                       if (ret) {
> > +                               master->domain = NULL;
> > +                               goto out_list_del;
> > +                       }
> >                 }
> >
> >                 ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
> > @@ -2563,15 +2566,23 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> >                 }
> >
> >                 arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
> > +               arm_smmu_install_ste_for_dev(master, &target);
> 
> Even if it's handled correctly under the hood by clever ste writing
> logic, isn't it weird that we don't explicitly check whether the CD
> table is already installed and skip arm_smmu_install_ste_for_dev in
> that case?

There is a design logic at work here...

At this layer in the code we think in terms of 'target state'. We know
what the correct STE must be, so we compute that full value and make
the HW use that value. The lower layer computes the steps required to
put the HW into the target state, which might be a NOP.

Trying to optimize the NOP here means this layer has to keep track
of what state the STE is currently in vs only tracking what state it
should be in. Avoiding that tracking is a main point of the new
programming logic.

This is a pretty common design pattern, "desired state" or "target
state".

Later on this becomes more complex as the CD table may be installed to
the STE but the S1DSS or EATS is not correct for S1 operation. Coding
it this way eventually trivially corrects those things as well. That
is something like 30 patches later.

Regards,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 01/19] iommu/arm-smmu-v3: Add a type for the STE
  2023-10-11  0:33   ` Jason Gunthorpe
@ 2023-10-13 10:37     ` Will Deacon
  -1 siblings, 0 replies; 134+ messages in thread
From: Will Deacon @ 2023-10-13 10:37 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy,
	Michael Shavit, Nicolin Chen

On Tue, Oct 10, 2023 at 09:33:07PM -0300, Jason Gunthorpe wrote:
> Instead of passing a naked __le16 * around to represent a STE wrap it in a
> "struct arm_smmu_ste" with an array of the correct size. This makes it
> much clearer which functions will comprise the "STE API".
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 54 ++++++++++-----------
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  7 ++-
>  2 files changed, 32 insertions(+), 29 deletions(-)

[...]

> @@ -2209,26 +2210,22 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
>  	return 0;
>  }
>  
> -static __le64 *arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
> +static struct arm_smmu_ste *
> +arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
>  {
> -	__le64 *step;
>  	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
>  
>  	if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
> -		struct arm_smmu_strtab_l1_desc *l1_desc;
>  		int idx;
>  
>  		/* Two-level walk */
>  		idx = (sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS;
> -		l1_desc = &cfg->l1_desc[idx];
> -		idx = (sid & ((1 << STRTAB_SPLIT) - 1)) * STRTAB_STE_DWORDS;
> -		step = &l1_desc->l2ptr[idx];
> +		return &cfg->l1_desc[idx].l2ptr[sid & ((1 << STRTAB_SPLIT) - 1)];
>  	} else {
>  		/* Simple linear lookup */
> -		step = &cfg->strtab[sid * STRTAB_STE_DWORDS];
> +		return (struct arm_smmu_ste *)&cfg
> +			       ->strtab[sid * STRTAB_STE_DWORDS];

Why not change the type of 'struct arm_smmu_strtab_cfg::strtab' at the same
time?

Will

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 01/19] iommu/arm-smmu-v3: Add a type for the STE
  2023-10-13 10:37     ` Will Deacon
@ 2023-10-13 14:00       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-13 14:00 UTC (permalink / raw)
  To: Will Deacon
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy,
	Michael Shavit, Nicolin Chen

On Fri, Oct 13, 2023 at 11:37:35AM +0100, Will Deacon wrote:

> > -static __le64 *arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
> > +static struct arm_smmu_ste *
> > +arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
> >  {
> > -	__le64 *step;
> >  	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
> >  
> >  	if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB) {
> > -		struct arm_smmu_strtab_l1_desc *l1_desc;
> >  		int idx;
> >  
> >  		/* Two-level walk */
> >  		idx = (sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS;
> > -		l1_desc = &cfg->l1_desc[idx];
> > -		idx = (sid & ((1 << STRTAB_SPLIT) - 1)) * STRTAB_STE_DWORDS;
> > -		step = &l1_desc->l2ptr[idx];
> > +		return &cfg->l1_desc[idx].l2ptr[sid & ((1 << STRTAB_SPLIT) - 1)];
> >  	} else {
> >  		/* Simple linear lookup */
> > -		step = &cfg->strtab[sid * STRTAB_STE_DWORDS];
> > +		return (struct arm_smmu_ste *)&cfg
> > +			       ->strtab[sid * STRTAB_STE_DWORDS];
> 
> Why not change the type of 'struct arm_smmu_strtab_cfg::strtab' at the same
> time?

It doesn't always point at a STE.

arm_smmu_init_strtab_2lvl() sets strtab to:

	l1size = cfg->num_l1_ents * (STRTAB_L1_DESC_DWORDS << 3);
	strtab = dmam_alloc_coherent(smmu->dev, l1size, &cfg->strtab_dma,
				     GFP_KERNEL);
	cfg->strtab = strtab;

And arm_smmu_init_strtab_linear() sets strtab to:

	size = (1 << smmu->sid_bits) * (STRTAB_STE_DWORDS << 3);
	strtab = dmam_alloc_coherent(smmu->dev, size, &cfg->strtab_dma,
				     GFP_KERNEL);
	cfg->strtab = strtab;

If you like, I can add this patch immediately after:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d5ba85034c1386..bdb559878615b8 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1559,7 +1559,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 		return 0;
 
 	size = 1 << (STRTAB_SPLIT + ilog2(STRTAB_STE_DWORDS) + 3);
-	strtab = &cfg->strtab[(sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS];
+	strtab = &cfg->strtab.l1_desc[sid >> STRTAB_SPLIT];
 
 	desc->span = STRTAB_SPLIT + 1;
 	desc->l2ptr = dmam_alloc_coherent(smmu->dev, size, &desc->l2ptr_dma,
@@ -2347,8 +2347,7 @@ arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
 		return &cfg->l1_desc[idx].l2ptr[sid & ((1 << STRTAB_SPLIT) - 1)];
 	} else {
 		/* Simple linear lookup */
-		return (struct arm_smmu_ste *)&cfg
-			       ->strtab[sid * STRTAB_STE_DWORDS];
+		return &cfg->strtab.linear[sid * STRTAB_STE_DWORDS];
 	}
 }
 
@@ -3421,17 +3420,15 @@ static int arm_smmu_init_l1_strtab(struct arm_smmu_device *smmu)
 {
 	unsigned int i;
 	struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
-	void *strtab = smmu->strtab_cfg.strtab;
 
 	cfg->l1_desc = devm_kcalloc(smmu->dev, cfg->num_l1_ents,
 				    sizeof(*cfg->l1_desc), GFP_KERNEL);
 	if (!cfg->l1_desc)
 		return -ENOMEM;
 
-	for (i = 0; i < cfg->num_l1_ents; ++i) {
-		arm_smmu_write_strtab_l1_desc(strtab, &cfg->l1_desc[i]);
-		strtab += STRTAB_L1_DESC_DWORDS << 3;
-	}
+	for (i = 0; i < cfg->num_l1_ents; ++i)
+		arm_smmu_write_strtab_l1_desc(
+			&smmu->strtab_cfg.strtab.l1_desc[i], &cfg->l1_desc[i]);
 
 	return 0;
 }
@@ -3463,7 +3460,7 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
 			l1size);
 		return -ENOMEM;
 	}
-	cfg->strtab = strtab;
+	cfg->strtab.l1_desc = strtab;
 
 	/* Configure strtab_base_cfg for 2 levels */
 	reg  = FIELD_PREP(STRTAB_BASE_CFG_FMT, STRTAB_BASE_CFG_FMT_2LVL);
@@ -3490,7 +3487,7 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 			size);
 		return -ENOMEM;
 	}
-	cfg->strtab = strtab;
+	cfg->strtab.linear = strtab;
 	cfg->num_l1_ents = 1 << smmu->sid_bits;
 
 	/* Configure strtab_base_cfg for a linear table covering all SIDs */
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 74f6f9e28c6e84..6d75adb1a72b4f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -620,7 +620,10 @@ struct arm_smmu_s2_cfg {
 };
 
 struct arm_smmu_strtab_cfg {
-	__le64				*strtab;
+	union {
+		struct arm_smmu_ste *linear;
+		__le64 *l1_desc;
+	} strtab;
 	dma_addr_t			strtab_dma;
 	struct arm_smmu_strtab_l1_desc	*l1_desc;
 	unsigned int			num_l1_ents;

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-11  0:33   ` Jason Gunthorpe
@ 2023-10-18 10:54     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-18 10:54 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.
>
> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
>
> Thus, consolidate all the complexity here. Callers do not have to care
> about what STE transition they are doing, this function will handle
> everything optimally.
>
> Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
> required programming sequence to avoid creating an incoherent 'torn' STE
> in the HW caches. The update algorithm follows the same design that the
> driver already uses: it is safe to change bits that HW doesn't currently
> use and then do a single 64 bit update, with sync's in between.
>
> The basic idea is to express in a bitmask what bits the HW is actually
> using based on the V and CFG bits. Based on that mask we know what STE
> changes are safe and which are disruptive. We can count how many 64 bit
> QWORDS need a disruptive update and know if a step with V=0 is required.
>
> This gives two basic flows through the algorithm.
>
> If only a single 64 bit quantity needs disruptive replacement:
>  - Write the target value into all currently unused bits
>  - Write the single 64 bit quantity
>  - Zero the remaining different bits
>
> If multiple 64 bit quantities need disruptive replacement then do:
>  - Write V=0 to QWORD 0
>  - Write the entire STE except QWORD 0
>  - Write QWORD 0
>
> With HW flushes at each step, that can be skipped if the STE didn't change
> in that step.
>
> At this point it generates the same sequence of updates as the current
> code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> extra sync (this seems to be an existing bug).
>
> Going forward this will use a V=0 transition instead of cycling through
> ABORT if a hitfull change is required. This seems more appropriate as ABORT
> will fail DMAs without any logging, but dropping a DMA due to transient
> V=0 is probably signaling a bug, so the C_BAD_STE is valuable.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 247 +++++++++++++++-----
>  1 file changed, 183 insertions(+), 64 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index bf7218adbc2822..6e6b1ebb5ac0ef 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -971,6 +971,69 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
>         arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>
> +/*
> + * Do one step along the coherent update algorithm. Each step either changes
> + * only bits that the HW isn't using or entirely changes 1 qword. It may take
> + * several iterations of this routine to make the full change.
> + */
> +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> +                                     const __le64 *target,
> +                                     const __le64 *target_used, __le64 *step,
> +                                     __le64 v_bit,
> +                                     unsigned int len)
> +{
> +       u8 step_used_diff = 0;
> +       u8 step_change = 0;
> +       unsigned int i;
> +
> +       /*
> +        * Compute a step that has all the bits currently unused by HW set to
> +        * their target values.
> +        */
> +       for (i = 0; i != len; i++) {
> +               step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
> +               if (cur[i] != step[i])
> +                       step_change |= 1 << i;
> +               /*
> +                * Each bit indicates if the step is incorrect compared to the
> +                * target, considering only the used bits in the target
> +                */
> +               if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
> +                       step_used_diff |= 1 << i;
> +       }
> +
> +       if (hweight8(step_used_diff) > 1) {
> +               /*
> +                * More than 1 qword is mismatched, this cannot be done without
> +                * a break. Clear the V bit and go again.
> +                */
> +               step[0] &= ~v_bit;
> +       } else if (!step_change && step_used_diff) {
> +               /*
> +                * Have exactly one critical qword, all the other qwords are set
> +                * correctly, so we can set this qword now.
> +                */
> +               i = ffs(step_used_diff) - 1;
> +               step[i] = target[i];
> +       } else if (!step_change) {
> +               /* cur == target, so all done */
> +               if (memcmp(cur, target, sizeof(*cur)) == 0)
> +                       return true;
Shouldn't this be len * sizeof(*cur)?

> +
> +               /*
> +                * All the used HW bits match, but unused bits are different.
> +                * Set them as well. Technically this isn't necessary but it
> +                * brings the entry to the full target state, so if there are
> +                * bugs in the mask calculation this will obscure them.
> +                */
> +               memcpy(step, target, len * sizeof(*step));
> +       }
> +
> +       for (i = 0; i != len; i++)
> +               WRITE_ONCE(cur[i], step[i]);
> +       return false;
> +}
> +
>  static void arm_smmu_sync_cd(struct arm_smmu_master *master,
>                              int ssid, bool leaf)
>  {
> @@ -1248,37 +1311,122 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
>         arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>
> +/*
> + * Based on the value of ent report which bits of the STE the HW will access. It
> + * would be nice if this was complete according to the spec, but minimally it
> + * has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
> +                                 struct arm_smmu_ste *used_bits)
> +{
> +       memset(used_bits, 0, sizeof(*used_bits));
> +
> +       used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
> +       if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +               return;
> +
> +       used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +       switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> +       case STRTAB_STE_0_CFG_ABORT:
> +               break;
> +       case STRTAB_STE_0_CFG_BYPASS:
> +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +               break;
> +       case STRTAB_STE_0_CFG_S1_TRANS:
> +               used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +                                                 STRTAB_STE_0_S1CTXPTR_MASK |
> +                                                 STRTAB_STE_0_S1CDMAX);
> +               used_bits->data[1] |=
> +                       cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +                                   STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +                                   STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +
> +               if (FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
> +                   STRTAB_STE_1_S1DSS_BYPASS)
> +                       used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);

Although the driver only explicitly sets SHCFG for bypass streams, my
reading of the spec is that it is also accessed for S1 and S2 STEs:
"The SMMU might convey attributes input from a device through this
process, so that the device might influence the final transaction
access, and input attributes might be overridden on a per-device basis
using the MTCFG/MemAttr, SHCFG, ALLOCCFG STE fields. The input
attribute, modified by these fields, is primarily useful for setting
the resulting output access attribute when both stage 1 and stage 2
translation is bypassed (no translation table descriptors to determine
attribute) but can also be useful for stage 2-only configurations in
which a device stream might have finer knowledge about the required
access behavior than the general virtual machine-global stage 2
translation tables."

> +               break;
> +       case STRTAB_STE_0_CFG_S2_TRANS:
> +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +               used_bits->data[2] |=
> +                       cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +                                   STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +                                   STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +               used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +               break;
> +
> +       default:
> +               memset(used_bits, 0xFF, sizeof(*used_bits));

Can we consider a WARN here, since this driver only ever uses one of
the above 4 values and we probably have a programming error if we see
something else?
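
E.g. something along these lines (only a sketch of the suggestion):

	default:
		memset(used_bits, 0xFF, sizeof(*used_bits));
		WARN_ON(true);
		break;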

> +       }
> +}
> +
> +static bool arm_smmu_write_ste_step(struct arm_smmu_ste *cur,
> +                                   const struct arm_smmu_ste *target,
> +                                   const struct arm_smmu_ste *target_used)
> +{
> +       struct arm_smmu_ste cur_used;
> +       struct arm_smmu_ste step;
> +
> +       arm_smmu_get_ste_used(cur, &cur_used);
> +       return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
> +                                        target_used->data, step.data,

What's up with requiring callers to allocate and provide step.data if
it's not used by any of the arm_smmu_write_entry_step callers?

> +                                        cpu_to_le64(STRTAB_STE_0_V),
This also looks a bit strange at this stage since CD entries aren't
yet supported... but sure.

> +                                        ARRAY_SIZE(cur->data));
> +}
> +
> +/*
> + * This algorithm updates any STE to any value without creating a situation
> + * where the HW can perceive a corrupted STE. HW is only required to provide 64
> + * bit atomicity for stores.
> + *
> + * In the most general case we can make any update by disrupting the STE (making
> + * it abort, or clearing the V bit) using a single qword store. Then all the
> + * other qwords can be written safely, and finally the full STE written.
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect.
> + */
> +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> +                              struct arm_smmu_ste *ste,
> +                              const struct arm_smmu_ste *target)
> +{
> +       struct arm_smmu_ste target_used;
> +       int i;
> +
> +       arm_smmu_get_ste_used(target, &target_used);
> +       /* Masks in arm_smmu_get_ste_used() are up to date */
> +       for (i = 0; i != ARRAY_SIZE(target->data); i++)
> +               WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> +
> +       while (true) {
> +               if (arm_smmu_write_ste_step(ste, target, &target_used))
> +                       break;
> +               arm_smmu_sync_ste_for_sid(smmu, sid);
> +       }
> +
> +       /* It's likely that we'll want to use the new STE soon */
> +       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
> +               struct arm_smmu_cmdq_ent
> +                       prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
> +                                        .prefetch = {
> +                                                .sid = sid,
> +                                        } };
> +
> +               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       }
> +}
> +
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                                       struct arm_smmu_ste *dst)
>  {
> -       /*
> -        * This is hideously complicated, but we only really care about
> -        * three cases at the moment:
> -        *
> -        * 1. Invalid (all zero) -> bypass/fault (init)
> -        * 2. Bypass/fault -> translation/bypass (attach)
> -        * 3. Translation/bypass -> bypass/fault (detach)
> -        *
> -        * Given that we can't update the STE atomically and the SMMU
> -        * doesn't read the thing in a defined order, that leaves us
> -        * with the following maintenance requirements:
> -        *
> -        * 1. Update Config, return (init time STEs aren't live)
> -        * 2. Write everything apart from dword 0, sync, write dword 0, sync
> -        * 3. Update Config, sync
> -        */
> -       u64 val = le64_to_cpu(dst->data[0]);
> -       bool ste_live = false;
> +       u64 val;
>         struct arm_smmu_device *smmu = master->smmu;
>         struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
>         struct arm_smmu_s2_cfg *s2_cfg = NULL;
>         struct arm_smmu_domain *smmu_domain = master->domain;
> -       struct arm_smmu_cmdq_ent prefetch_cmd = {
> -               .opcode         = CMDQ_OP_PREFETCH_CFG,
> -               .prefetch       = {
> -                       .sid    = sid,
> -               },
> -       };
> +       struct arm_smmu_ste target = {};
>
>         if (smmu_domain) {
>                 switch (smmu_domain->stage) {
> @@ -1293,22 +1441,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                 }
>         }
>
> -       if (val & STRTAB_STE_0_V) {
> -               switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
> -               case STRTAB_STE_0_CFG_BYPASS:
> -                       break;
> -               case STRTAB_STE_0_CFG_S1_TRANS:
> -               case STRTAB_STE_0_CFG_S2_TRANS:
> -                       ste_live = true;
> -                       break;
> -               case STRTAB_STE_0_CFG_ABORT:
> -                       BUG_ON(!disable_bypass);
> -                       break;
> -               default:
> -                       BUG(); /* STE corruption */
> -               }
> -       }
> -
>         /* Nuke the existing STE_0 value, as we're going to rewrite it */
>         val = STRTAB_STE_0_V;
>
> @@ -1319,16 +1451,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                 else
>                         val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
>
> -               dst->data[0] = cpu_to_le64(val);
> -               dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> +               target.data[0] = cpu_to_le64(val);
> +               target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>                                                 STRTAB_STE_1_SHCFG_INCOMING));
> -               dst->data[2] = 0; /* Nuke the VMID */
> -               /*
> -                * The SMMU can perform negative caching, so we must sync
> -                * the STE regardless of whether the old value was live.
> -                */
> -               if (smmu)
> -                       arm_smmu_sync_ste_for_sid(smmu, sid);
> +               target.data[2] = 0; /* Nuke the VMID */
> +               arm_smmu_write_ste(smmu, sid, dst, &target);
>                 return;
>         }
>
> @@ -1336,8 +1463,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                 u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>                         STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
>
> -               BUG_ON(ste_live);
> -               dst->data[1] = cpu_to_le64(
> +               target.data[1] = cpu_to_le64(
>                          FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
>                          FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>                          FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1346,7 +1472,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>
>                 if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>                     !master->stall_enabled)
> -                       dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> +                       target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>
>                 val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
>                         FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> @@ -1355,8 +1481,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>         }
>
>         if (s2_cfg) {
> -               BUG_ON(ste_live);
> -               dst->data[2] = cpu_to_le64(
> +               target.data[2] = cpu_to_le64(
>                          FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
>                          FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
>  #ifdef __BIG_ENDIAN
> @@ -1365,23 +1490,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                          STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>                          STRTAB_STE_2_S2R);
>
> -               dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +               target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>
>                 val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>         }
>
>         if (master->ats_enabled)
> -               dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +               target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>                                                  STRTAB_STE_1_EATS_TRANS));
>
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -       /* See comment in arm_smmu_write_ctx_desc() */
> -       WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -       /* It's likely that we'll want to use the new STE soon */
> -       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       target.data[0] = cpu_to_le64(val);
> +       arm_smmu_write_ste(smmu, sid, dst, &target);
>  }
>
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
> --
> 2.42.0
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-12 12:16       ` Jason Gunthorpe
@ 2023-10-18 11:05         ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-18 11:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Thu, Oct 12, 2023 at 8:16 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 12, 2023 at 04:10:50PM +0800, Michael Shavit wrote:
>
> > This sounds pretty complicated....Is this complexity really required
> > here?
>
> It is first needed before 'iommu/arm-smmu-v3: Do not change the STE
> twice during arm_smmu_attach_dev()' which is a couple of patches
> further on.
>
> Then it keeps getting relied on.
>
> I don't think there is any simple answer here; the HW has this complex
> requirement. The current code is also complex - arguably more complex
> because of how leaky/fragile it is. There are even a couple of pages of
> text in the spec describing how to do this, and it doesn't discuss the
> hitless cases!
>
> At the end this is only 32 lines and it replaces both
> arm_smmu_write_ctx_desc() and arm_smmu_write_strtab_ent().
>
> FWIW, I found the most difficult part to be the used bit calculation, not
> the update algorithm. Difficult because it is hard to read and find in
> the spec when things are IGNORED, but it is a "straightforward" job of
> finding IGNORED cases and making the used bits 0.

The update algorithm is the part I'm finding much harder to read and
review :). arm_smmu_write_entry_step in particular is hard to read
through; on top of which there are some subtle dependencies between loop
iterations that weren't obvious to grok:
* Relying on used_bits being recomputed after the first iteration, in
which V was cleared, so that more bits can now be set.
* The STE having to be synced between iterations to prevent broken STE
reads by the SMMU (there's a comment somewhere else in arm-smmu-v3.c
that would fit nicely here instead). But the caller is responsible for
doing that sync between iterations for some reason (supposedly to
support CD entries as well in the next series); a condensed sketch of
that caller-side loop follows below.
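
Roughly, condensing the while(true)/break form the patch uses (names as
in arm_smmu_write_ste()):

	arm_smmu_get_ste_used(target, &target_used);
	while (!arm_smmu_write_ste_step(ste, target, &target_used))
		/*
		 * Publish this step before computing the next one: each call
		 * re-derives cur_used from what the HW can now observe, so
		 * e.g. once V has been cleared more bits become safe to set.
		 */
		arm_smmu_sync_ste_for_sid(smmu, sid);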

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 15/19] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2023-10-11  0:33   ` Jason Gunthorpe
@ 2023-10-18 11:06     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-18 11:06 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> Move to the new static global for identity domains. Move all the logic out
> of arm_smmu_attach_dev into an identity only function.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 +++++++++++++++------
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
>  2 files changed, 58 insertions(+), 25 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 26d3200c127450..1e03bdedfabad1 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2149,8 +2149,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
>                 return arm_smmu_sva_domain_alloc();
>
>         if (type != IOMMU_DOMAIN_UNMANAGED &&
> -           type != IOMMU_DOMAIN_DMA &&
> -           type != IOMMU_DOMAIN_IDENTITY)
> +           type != IOMMU_DOMAIN_DMA)
>                 return NULL;
>
>         /*
> @@ -2258,11 +2257,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
>         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
>         struct arm_smmu_device *smmu = smmu_domain->smmu;
>
> -       if (domain->type == IOMMU_DOMAIN_IDENTITY) {
> -               smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
> -               return 0;
> -       }
> -
>         /* Restrict the stage to what we can actually support */
>         if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
>                 smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> @@ -2459,7 +2453,7 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
>         struct arm_smmu_domain *smmu_domain;
>         unsigned long flags;
>
> -       if (!domain)
> +       if (!domain || !(domain->type & __IOMMU_DOMAIN_PAGING))
>                 return;

It confused me why we're checking against __IOMMU_DOMAIN_PAGING instead
of IOMMU_DOMAIN_UNMANAGED/DMA to match domain_alloc, but OK, it's
clarified by the final patch in the series.
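
(For reference, both of those types include the paging bit, which is
what makes the quoted check work - from include/linux/iommu.h as of
this series:

	#define IOMMU_DOMAIN_UNMANAGED	(__IOMMU_DOMAIN_PAGING)
	#define IOMMU_DOMAIN_DMA	(__IOMMU_DOMAIN_PAGING | __IOMMU_DOMAIN_DMA_API)

so testing __IOMMU_DOMAIN_PAGING covers UNMANAGED and DMA while
excluding the new IDENTITY domain.)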


>
>         smmu_domain = to_smmu_domain(domain);
> @@ -2522,15 +2516,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>
>         arm_smmu_detach_dev(master);
>
> -       /*
> -        * The SMMU does not support enabling ATS with bypass. When the STE is
> -        * in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests and
> -        * Translated transactions are denied as though ATS is disabled for the
> -        * stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
> -        * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
> -        */
> -       if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
> -               master->ats_enabled = arm_smmu_ats_supported(master);
> +       master->ats_enabled = arm_smmu_ats_supported(master);
>
>         spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>         list_add(&master->domain_head, &smmu_domain->devices);
> @@ -2567,13 +2553,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>                         arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
>                                                       NULL);
>                 break;
> -       case ARM_SMMU_DOMAIN_BYPASS:
> -               arm_smmu_make_bypass_ste(&target);
> -               arm_smmu_install_ste_for_dev(master, &target);
> -               if (master->cd_table.cdtab)
> -                       arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
> -                                                     NULL);
> -               break;
>         }
>
>         arm_smmu_enable_ats(master, smmu_domain);
> @@ -2589,6 +2568,60 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>         return ret;
>  }
>
> +static int arm_smmu_attach_dev_ste(struct device *dev,
> +                                  struct arm_smmu_ste *ste)
> +{
> +       struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +
> +       if (arm_smmu_master_sva_enabled(master))
> +               return -EBUSY;
> +
> +       /*
> +        * Do not allow any ASID to be changed while we are working on the STE,
> +        * otherwise we could miss invalidations.
> +        */
> +       mutex_lock(&arm_smmu_asid_lock);
> +
> +       /*
> +        * The SMMU does not support enabling ATS with bypass/abort. When the
> +        * STE is in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests
> +        * and Translated transactions are denied as though ATS is disabled for
> +        * the stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
> +        * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
> +        */
> +       arm_smmu_detach_dev(master);
> +
> +       arm_smmu_install_ste_for_dev(master, ste);
> +       mutex_unlock(&arm_smmu_asid_lock);
> +
> +       /*
> +        * This has to be done after removing the master from the
> +        * arm_smmu_domain->devices to avoid races updating the same context
> +        * descriptor from arm_smmu_share_asid().
> +        */
> +       if (master->cd_table.cdtab)
> +               arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
> +       return 0;
> +}
> +
> +static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
> +                                       struct device *dev)
> +{
> +       struct arm_smmu_ste ste;
> +
> +       arm_smmu_make_bypass_ste(&ste);
> +       return arm_smmu_attach_dev_ste(dev, &ste);
> +}
> +
> +static const struct iommu_domain_ops arm_smmu_identity_ops = {
> +       .attach_dev = arm_smmu_attach_dev_identity,
> +};
> +
> +static struct iommu_domain arm_smmu_identity_domain = {
> +       .type = IOMMU_DOMAIN_IDENTITY,
> +       .ops = &arm_smmu_identity_ops,
> +};
> +
>  static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
>                               phys_addr_t paddr, size_t pgsize, size_t pgcount,
>                               int prot, gfp_t gfp, size_t *mapped)
> @@ -2981,6 +3014,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
>  }
>
>  static struct iommu_ops arm_smmu_ops = {
> +       .identity_domain        = &arm_smmu_identity_domain,
>         .capable                = arm_smmu_capable,
>         .domain_alloc           = arm_smmu_domain_alloc,
>         .probe_device           = arm_smmu_probe_device,
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 21f2f73501019a..154808f96718df 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -712,7 +712,6 @@ struct arm_smmu_master {
>  enum arm_smmu_domain_stage {
>         ARM_SMMU_DOMAIN_S1 = 0,
>         ARM_SMMU_DOMAIN_S2,
> -       ARM_SMMU_DOMAIN_BYPASS,
>  };
>
>  struct arm_smmu_domain {
> --
> 2.42.0
>

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-18 10:54     ` Michael Shavit
@ 2023-10-18 12:24       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 12:24 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 18, 2023 at 06:54:10PM +0800, Michael Shavit wrote:
> > +       } else if (!step_change) {
> > +               /* cur == target, so all done */
> > +               if (memcmp(cur, target, sizeof(*cur)) == 0)
> > +                       return true;
> Shouldn't this be len * sizeof(*cur)?

Ugh, yes, thank you. An earlier version had cur be a 'struct
arm_smmu_ste', I missed this when I changed it to allow reuse for the
CD path...

> > +       case STRTAB_STE_0_CFG_S1_TRANS:
> > +               used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> > +                                                 STRTAB_STE_0_S1CTXPTR_MASK |
> > +                                                 STRTAB_STE_0_S1CDMAX);
> > +               used_bits->data[1] |=
> > +                       cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> > +                                   STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> > +                                   STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> > +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> > +
> > +               if (FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
> > +                   STRTAB_STE_1_S1DSS_BYPASS)
> > +                       used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> 
> Although the driver only explicitly sets SHCFG for bypass streams, my
> reading of the spec is it is also accessed for S1 and S2 STEs:
> "The SMMU might convey attributes input from a device through this
> process, so that the device might influence the final transaction
> access, and input attributes might be overridden on a per-device basis
> using the MTCFG/MemAttr, SHCFG, ALLOCCFG STE fields. The input
> attribute, modified by these fields, is primarily useful for setting
> the resulting output access attribute when both stage 1 and stage 2
> translation is bypassed (no translation table descriptors to determine
> attribute) but can also be useful for stage 2-only configurations in
> which a device stream might have finer knowledge about the required
> access behavior than the general virtual machine-global stage 2
> translation tables."

Hm.. I struggled with this for a while.

There is some kind of issue here: we cannot have it both ways, where
the S1 translation on a PASID needs SHCFG=0 and the S1DSS_BYPASS needs
SHCFG=1. Either the S1 PASID ignores the field, e.g. because the IOPTE
supersedes it (which is what this patch assumes), the S1DSS doesn't
need it, or we cannot use S1DSS at all.

Let me see if we can get a deeper understanding here; it is a good
point.
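
Purely to illustrate the option being weighed (an assumption about the
spec, not something this patch does): the used-bits side of that would
just be marking SHCFG as accessed for the translating CFGs too, e.g.

	/* hypothetical: HW also inspects SHCFG for S1/S2 translating STEs */
	used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);

under the S1_TRANS/S2_TRANS cases, with the target STEs then having to
set an explicit SHCFG value.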

> > +       case STRTAB_STE_0_CFG_S2_TRANS:
> > +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> > +               used_bits->data[2] |=
> > +                       cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> > +                                   STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> > +                                   STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> > +               used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> > +               break;
> > +
> > +       default:
> > +               memset(used_bits, 0xFF, sizeof(*used_bits));
> 
> Can we consider a WARN here since this driver only ever uses one of
> the above 4 values and we probably have a programming error if we see
> something else.

Ok

> > +static bool arm_smmu_write_ste_step(struct arm_smmu_ste *cur,
> > +                                   const struct arm_smmu_ste *target,
> > +                                   const struct arm_smmu_ste *target_used)
> > +{
> > +       struct arm_smmu_ste cur_used;
> > +       struct arm_smmu_ste step;
> > +
> > +       arm_smmu_get_ste_used(cur, &cur_used);
> > +       return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
> > +                                        target_used->data, step.data,
> 
> What's up with requiring callers to allocate and provide step.data if
> it's not used by any of the arm_smmu_write_entry_step callers?

arm_smmu_write_entry_step requires a temporary buffer of len qwords -
since variable-length stack arrays (i.e. alloca) are forbidden in the
kernel, and kmalloc would be silly, the simplest solution was to have
the caller allocate it and then pass it in.

Alternatively we could have a max size temporary array inside
arm_smmu_write_entry_step() with some static asserts, but I thought
that was less clear.
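
For concreteness, that alternative would look roughly like this inside
the helper (a sketch only; it assumes the STE remains the largest entry
type the helper handles, which the existing STRTAB_STE_DWORDS and
CTXDESC_CD_DWORDS defines make checkable at build time):

	__le64 step[STRTAB_STE_DWORDS];	/* large enough for STEs and CDs */

	static_assert(STRTAB_STE_DWORDS >= CTXDESC_CD_DWORDS);
	if (WARN_ON(len > ARRAY_SIZE(step)))
		return true;	/* treat a programming error as "done" */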

> > +                                        cpu_to_le64(STRTAB_STE_0_V),
> This also looks a bit strange at this stage since CD entries aren't
> yet supported..... but sure.

Yeah, this function shim is for the later patch that adds one of these
for CD. Don't want to go and change stuff twice.

For reference the CD function from a later patch is:

static bool arm_smmu_write_cd_step(struct arm_smmu_cd *cur,
				   const struct arm_smmu_cd *target,
				   const struct arm_smmu_cd *target_used)
{
	struct arm_smmu_cd cur_used;
	struct arm_smmu_cd step;

	arm_smmu_get_cd_used(cur, &cur_used);
	return arm_smmu_write_entry_step(cur->data, cur_used.data, target->data,
					 target_used->data, step.data,
					 cpu_to_le64(CTXDESC_CD_0_V),
					 ARRAY_SIZE(cur->data));

}

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 15/19] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2023-10-18 11:06     ` Michael Shavit
@ 2023-10-18 12:26       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 12:26 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 18, 2023 at 07:06:55PM +0800, Michael Shavit wrote:
> On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > Move to the new static global for identity domains. Move all the logic out
> > of arm_smmu_attach_dev into an identity only function.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 +++++++++++++++------
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
> >  2 files changed, 58 insertions(+), 25 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 26d3200c127450..1e03bdedfabad1 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -2149,8 +2149,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
> >                 return arm_smmu_sva_domain_alloc();
> >
> >         if (type != IOMMU_DOMAIN_UNMANAGED &&
> > -           type != IOMMU_DOMAIN_DMA &&
> > -           type != IOMMU_DOMAIN_IDENTITY)
> > +           type != IOMMU_DOMAIN_DMA)
> >                 return NULL;
> >
> >         /*
> > @@ -2258,11 +2257,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
> >         struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> >         struct arm_smmu_device *smmu = smmu_domain->smmu;
> >
> > -       if (domain->type == IOMMU_DOMAIN_IDENTITY) {
> > -               smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
> > -               return 0;
> > -       }
> > -
> >         /* Restrict the stage to what we can actually support */
> >         if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
> >                 smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> > @@ -2459,7 +2453,7 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
> >         struct arm_smmu_domain *smmu_domain;
> >         unsigned long flags;
> >
> > -       if (!domain)
> > +       if (!domain || !(domain->type & __IOMMU_DOMAIN_PAGING))
> >                 return;
> 
> Confused me why we were checking against __IOMMU_DOMAIN_PAGING instead
>  of IOMMU_DOMAIN_UNMANAGED/DMA to match domain_alloc, but ok it's
> clarified by the final patch in the series.

Long term I am trying to remove DMA/UNMANAGED from the drivers (we are
actually getting quite close!). A domain created by
domain_alloc_paging (later patch) should be tested like this.
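
For reference the direction is roughly this (a sketch only, not the exact
later patch):

static struct iommu_domain *arm_smmu_domain_alloc_paging(struct device *dev)
{
	/* Only ever produces a __IOMMU_DOMAIN_PAGING domain */
	return arm_smmu_domain_alloc(IOMMU_DOMAIN_UNMANAGED);
}

so checking domain->type & __IOMMU_DOMAIN_PAGING is the test that keeps
working after the DMA/UNMANAGED types are gone from the driver.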

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-18 11:05         ` Michael Shavit
@ 2023-10-18 13:04           ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-18 13:04 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 18, 2023 at 07:05:49PM +0800, Michael Shavit wrote:

> > FWIW, I found the most difficult part the used bit calculation, not
> > the update algorithm. Difficult because it is hard to read and find in
> > the spec when things are INGORED, but it is a "straightforward" job of
> > finding INGORED cases and making the used bits 0.
> 
> The update algorithm is the part I'm finding much harder to read and
> review :) . arm_smmu_write_entry_step in particular is hard to read
> through; on top of which there's some subtle dependencies between loop
> iterations that weren't obvious to grok:

Yes, you have it right, it is basically a classic greedy
algorithm. Let's improve the comment.

> * Relying on the used_bits to be recomputed after the first iteration
> where V=0 was set to 0 so that more bits can now be set.
> * The STE having to be synced between iterations to prevent broken STE
> reads by the SMMU (there's a comment somewhere else in arm-smmu-v3.c
> that would fit nicely here instead). But the caller is responsible for
> calling this between iterations for some reason (supposedly to support
> CD entries as well in the next series)

Yes, for CD entry support.

How about:

/*
 * This algorithm updates any STE/CD to any value without creating a situation
 * where the HW can perceive a corrupted entry. HW is only required to have
 * 64 bit atomicity with stores from the CPU, while entries are many 64 bit
 * values long.
 *
 * The algorithm works by evolving the entry toward the target in a series of
 * steps. Each step synchronizes with the HW so that the HW can not see an entry
 * torn across two steps. Upon each call cur/cur_used reflect the current
 * synchronized value seen by the HW.
 *
 * During each step the HW can observe a torn entry that has any combination of
 * the step's old/new 64 bit words. The algorithm objective is for the HW
 * behavior to always be one of current behavior, V=0, or new behavior, during
 * each step, and across all steps.
 *
 * At each step one of three actions is chosen to evolve cur to target:
 *  - Update all unused bits with their target values.
 *    This relies on the IGNORED behavior described in the specification
 *  - Update a single 64-bit value
 *  - Update all unused bits and set V=0
 *
 * The last two actions will cause cur_used to change, which will then allow the
 * first action on the next step.
 *
 * In the most general case we can make any update in three steps:
 *  - Disrupting the entry (V=0)
 *  - Fill now unused bits, all bits except V
 *  - Make valid (V=1), single 64 bit store
 *
 * However this disrupts the HW while it is happening. There are several
 * interesting cases where a STE/CD can be updated without disturbing the HW
 * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
 * because the used bits don't intersect. We can detect this by calculating how
 * many 64 bit values need update after adjusting the unused bits and skip the
 * V=0 process.
 */
static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-18 12:24       ` Jason Gunthorpe
@ 2023-10-19 23:03         ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-19 23:03 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 18, 2023 at 09:24:35AM -0300, Jason Gunthorpe wrote:

> Let me see if we can get a deeper understanding here, it is a good
> point.

Nicolin found in the specification "13.5 Summary of
attribute/permission configuration fields" where the row for STE.SHCFG
says it is only read for "Bypass" or "Stage 2 Only".

In Stage 2 mode it is combined with the IOPTE (TTD?) and since the
kernel sets it to 0 (weakest) it basically looks like the IOPTE
overrides? In Stage 1 only mode (eg the PASID case I worried about) the
attribute only comes from the IOPTE.

So like this:

@@ -1393,12 +1393,14 @@ static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
                                    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
                used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
 
+               /* See 13.5 Summary of attribute/permission configuration fields */
                if (FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
                    STRTAB_STE_1_S1DSS_BYPASS)
                        used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
                break;
        case STRTAB_STE_0_CFG_S2_TRANS:
-               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+               used_bits->data[1] |=
+                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
                used_bits->data[2] |=
                        cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
                                    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
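
(Where SHCFG is used the value the driver programs stays the 'use incoming
attributes' encoding, roughly:

	ste->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
					       STRTAB_STE_1_SHCFG_INCOMING));

which is what the existing bypass path already writes.)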

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-18 13:04           ` Jason Gunthorpe
@ 2023-10-20  8:23             ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-20  8:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 18, 2023 at 9:05 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Oct 18, 2023 at 07:05:49PM +0800, Michael Shavit wrote:
>
> > > FWIW, I found the most difficult part the used bit calculation, not
> > > the update algorithm. Difficult because it is hard to read and find in
> > > the spec when things are INGORED, but it is a "straightforward" job of
> > > finding INGORED cases and making the used bits 0.
> >
> > The update algorithm is the part I'm finding much harder to read and
> > review :) . arm_smmu_write_entry_step in particular is hard to read
> > through; on top of which there's some subtle dependencies between loop
> > iterations that weren't obvious to grok:
>
> Yes, you have it right, it is basically a classic greedy
> algorithm. Let's improve the comment.
>
> > * Relying on the used_bits to be recomputed after the first iteration
> > where V=0 was set to 0 so that more bits can now be set.
> > * The STE having to be synced between iterations to prevent broken STE
> > reads by the SMMU (there's a comment somewhere else in arm-smmu-v3.c
> > that would fit nicely here instead). But the caller is responsible for
> > calling this between iterations for some reason (supposedly to support
> > CD entries as well in the next series)
>
> Yes, for CD entry support.
>
> How about:
>
> /*
>  * This algorithm updates any STE/CD to any value without creating a situation
>  * where the HW can percieve a corrupted entry. HW is only required to have a 64
>  * bit atomicity with stores from the CPU, while entires are many 64 bit values
>  * big.
>  *
>  * The algorithm works by evolving the entry toward the target in a series of
>  * steps. Each step synchronizes with the HW so that the HW can not see an entry
>  * torn across two steps. Upon each call cur/cur_used reflect the current
>  * synchronized value seen by the HW.
>  *
>  * During each step the HW can observe a torn entry that has any combination of
>  * the step's old/new 64 bit words. The algorithm objective is for the HW
>  * behavior to always be one of current behavior, V=0, or new behavior, during
>  * each step, and across all steps.
>  *
>  * At each step one of three actions is choosen to evolve cur to target:
>  *  - Update all unused bits with their target values.
>  *    This relies on the IGNORED behavior described in the specification
>  *  - Update a single 64-bit value
>  *  - Update all unused bits and set V=0
>  *
>  * The last two actions will cause cur_used to change, which will then allow the
>  * first action on the next step.
>  *
>  * In the most general case we can make any update in three steps:
>  *  - Disrupting the entry (V=0)
>  *  - Fill now unused bits, all bits except V
>  *  - Make valid (V=1), single 64 bit store
>  *
>  * However this disrupts the HW while it is happening. There are several
>  * interesting cases where a STE/CD can be updated without disturbing the HW
>  * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
>  * because the used bits don't intersect. We can detect this by calculating how
>  * many 64 bit values need update after adjusting the unused bits and skip the
>  * V=0 process.
>  */
> static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
>
> Jason

The comment helps a lot, thank you.

I do still have some final reservations: wouldn't it be clearer with
the loop un-rolled? After all it's only 3 steps in the worst case....
Something like:

+       arm_smmu_get_ste_used(target, &target_used);
+       arm_smmu_get_ste_used(cur, &cur_used);
+       if (!hitless_possible(target, target_used, cur_used, cur_used)) {
+               target->data[0] = STRTAB_STE_0_V;
+               arm_smmu_sync_ste_for_sid(smmu, sid);
+               /*
+                * The STE is now in abort where none of the bits except
+                * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
+                * all other words of the STE to be written without further
+                * disruption.
+                */
+               arm_smmu_get_ste_used(cur, &cur_used);
+       }
+       /* write bits in all positions unused by the STE */
+       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
+               /* (should probably optimize this away if no write needed) */
+               WRITE_ONCE(cur->data[i],
+                          (cur->data[i] & cur_used[i]) |
+                          (target->data[i] & ~cur_used[i]));
+       }
+       arm_smmu_sync_ste_for_sid(smmu, sid);
+       /*
+        * It should now be possible to make a single qword write to make
+        * the new configuration take effect.
+        */
+       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
+               if ((cur->data[i] & target_used[i]) !=
+                   (target->data[i] & target_used[i]))
+                       /*
+                        * TODO:
+                        * WARN_ONCE if this condition hits more than once in
+                        * the loop
+                        */
+                       WRITE_ONCE(cur->data[i],
+                                  (cur->data[i] & cur_used[i]) |
+                                  (target->data[i] & ~cur_used[i]));
+       }
+       arm_smmu_sync_ste_for_sid(smmu, sid);

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-20  8:23             ` Michael Shavit
@ 2023-10-20 11:39               ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-20 11:39 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Fri, Oct 20, 2023 at 04:23:44PM +0800, Michael Shavit wrote:
> The comment helps a lot thank you.
> 
> I do still have some final reservations: wouldn't it be clearer with
> the loop un-rolled? After all it's only 3 steps in the worst case....
> Something like:

I thought about that, but a big point for me was to consolidate the
algorithm between CD/STE. Inlining everything makes it much more
difficult to achieve this. Actually my first sketches were trying to
write it unrolled.

> +       arm_smmu_get_ste_used(target, &target_used);
> +       arm_smmu_get_ste_used(cur, &cur_used);
> +       if (!hitless_possible(target, target_used, cur_used, cur_used)) {

hitless_possible() requires the loop of the step function to calculate
it.
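
Roughly the step function has to compute something like this (a sketch of
the idea, not the literal code):

	unsigned int num_needed = 0;
	unsigned int i;

	for (i = 0; i != len; i++) {
		/* What this qword becomes after filling bits cur ignores */
		__le64 merged = (cur[i] & cur_used[i]) |
				(target[i] & ~cur_used[i]);

		/* Does it still differ in bits the target actually uses? */
		if ((merged & target_used[i]) != (target[i] & target_used[i]))
			num_needed++;
	}

0 means nothing left to do, 1 means a single 64 bit store finishes the
update hitlessly, more than 1 means we have to go through V=0 first.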

> +               target->data[0] = STRTAB_STE_0_V;
> +               arm_smmu_sync_ste_for_sid(smmu, sid);

I still like V=0 as I think we do want the event for this case.

> +               /*
> +                * The STE is now in abort where none of the bits except
> +                * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
> +                * all other words of the STE to be written without further
> +                * disruption.
> +                */
> +               arm_smmu_get_ste_used(cur, &cur_used);
> +       }
> +       /* write bits in all positions unused by the STE */
> +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> +               /* (should probably optimize this away if no write needed) */
> +               WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i])
> | (target->data[i] & ~cur_used[i]));
> +       }
> +       arm_smmu_sync_ste_for_sid(smmu, sid);

Yes, I wanted to avoid all the syncs if they are not required.

> +       /*
> +        * It should now be possible to make a single qword write to make the
> +        * the new configuration take effect.
> +        */
> +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> +               if ((cur->data[i] & target_used[i]) !=
> (target->data[i] & target_used[i]))
> +                       /*
> +                        * TODO:
> +                        * WARN_ONCE if this condition hits more than once in
> +                        * the loop
> +                        */
> +                       WRITE_ONCE(cur->data[i], (cur->data[i] &
> cur_used[i]) | (target->data[i] & ~cur_used[i]));
> +       }

> +       arm_smmu_sync_ste_for_sid(smmu, sid);

This needs to be optional too

And there is another optional 4th pass to set the unused target values
to 0.

Basically you have captured the core algorithm, but I think if you
fill in all the missing bits to get up to the same functionality it
will be longer and unsharable with the CD side.

You could perhaps take this approach and split it into 4 sharable step
functions:

	if (step1(target, target_used, cur_used, cur_used, len)) {
		arm_smmu_sync_ste_for_sid(smmu, sid);
		arm_smmu_get_ste_used(cur, &cur_used);
	}

	if (step2(target, target_used, cur_used, cur_used, len))
		arm_smmu_sync_ste_for_sid(smmu, sid);

	if (step3(target, target_used, cur_used, cur_used, len)) {
		arm_smmu_sync_ste_for_sid(smmu, sid);
		arm_smmu_get_ste_used(cur, &cur_used);
	}

	if (step4(target, target_used, cur_used, cur_used, len))
		arm_smmu_sync_ste_for_sid(smmu, sid);

To me this is inelegant as, if we only need to do step 3, we have to
redundantly scan the array 2 times. The rolled up version just goes
directly to step 3.

However this does convince me you've thought very carefully about this
and have not found a flaw in the design!

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-20 11:39               ` Jason Gunthorpe
@ 2023-10-23  8:36                 ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-23  8:36 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Fri, Oct 20, 2023 at 7:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Fri, Oct 20, 2023 at 04:23:44PM +0800, Michael Shavit wrote:
> > The comment helps a lot thank you.
> >
> > I do still have some final reservations: wouldn't it be clearer with
> > the loop un-rolled? After all it's only 3 steps in the worst case....
> > Something like:
>
> I thought about that, but a big point for me was to consolidate the
> algorithm between CD/STE. Inlining everything makes it much more
> difficult to achieve this. Actually my first sketches were trying to
> write it unrolled.
>
> > +       arm_smmu_get_ste_used(target, &target_used);
> > +       arm_smmu_get_ste_used(cur, &cur_used);
> > +       if (!hitless_possible(target, target_used, cur_used, cur_used)) {
>
> hitless possible requires the loop of the step function to calcuate
> it.

Possibly yes. Another option would be to have a list of transitions
(e.g. IDENTITY -> S1) we expect and want to be hitless and check
against that list. It defeats some of the purpose of your design, but
it's also not obvious to me that we really need such flexibility in
the first place.
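
(Very roughly I was imagining something like a table of the transitions we
actually care about:

	static const struct {
		u32 old_cfg;
		u32 new_cfg;
	} hitless_transitions[] = {
		{ STRTAB_STE_0_CFG_BYPASS, STRTAB_STE_0_CFG_S1_TRANS },
		{ STRTAB_STE_0_CFG_S1_TRANS, STRTAB_STE_0_CFG_BYPASS },
	};

checked in the writer, instead of deriving hitless-ness from the used
bits.)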

>
> > +               target->data[0] = STRTAB_STE_0_V;
> > +               arm_smmu_sync_ste_for_sid(smmu, sid);
>
> I still like V=0 as I think we do want the event for this case.

Oh yeah sure, I did not think too carefully about this.

>
> > +               /*
> > +                * The STE is now in abort where none of the bits except
> > +                * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
> > +                * all other words of the STE to be written without further
> > +                * disruption.
> > +                */
> > +               arm_smmu_get_ste_used(cur, &cur_used);
> > +       }
> > +       /* write bits in all positions unused by the STE */
> > +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> > +               /* (should probably optimize this away if no write needed) */
> > +               WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i])
> > | (target->data[i] & ~cur_used[i]));
> > +       }
> > +       arm_smmu_sync_ste_for_sid(smmu, sid);
>
> Yes, I wanted to avoid all the syncs if they are not required.

Should be easy enough to optimize away in this alternate version as well.

>
> > +       /*
> > +        * It should now be possible to make a single qword write to make the
> > +        * the new configuration take effect.
> > +        */
> > +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> > +               if ((cur->data[i] & target_used[i]) !=
> > (target->data[i] & target_used[i]))
> > +                       /*
> > +                        * TODO:
> > +                        * WARN_ONCE if this condition hits more than once in
> > +                        * the loop
> > +                        */
> > +                       WRITE_ONCE(cur->data[i], (cur->data[i] &
> > cur_used[i]) | (target->data[i] & ~cur_used[i]));
> > +       }
>
> > +       arm_smmu_sync_ste_for_sid(smmu, sid);
>
> This needs to be optional too
>
> And there is another optional 4th pass to set the unused target values
> to 0.
>
> Basically you have captured the core algorithm, but I think if you
> fill in all the missing bits to get up to the same functionality it
> will be longer and unsharable with the CD side.
>
> You could perhaps take this approach and split it into 4 sharable step
> functions:
>
>  if (step1(target, target_used, cur_used, cur_used, len)) {
>   arm_smmu_sync_ste_for_sid(smmu, sid);
>   arm_smmu_get_ste_used(cur, &cur_used);
>  }
>
>  if (step2(target, target_used, cur_used, cur_used, len))
>   arm_smmu_sync_ste_for_sid(smmu, sid);
>
>  if (step3(target, target_used, cur_used, cur_used, len)) {
>    arm_smmu_sync_ste_for_sid(smmu, sid);
>    arm_smmu_get_ste_used(cur, &cur_used);
>   }
>
>  if (step4(target, target_used, cur_used, cur_used, len))
>    arm_smmu_sync_ste_for_sid(smmu, sid);
>
> To me this is inelegant as if we only need to do step 3 we have to
> redundantly scan the array 2 times. The rolled up version just
> directly goes to step 3.
>
> However this does convince me you've thought very carefully about this
> and have not found a flaw in the design!

Right, it's more of a style/readability question so I'll leave it to
others to chime in further if needed :) .

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-23  8:36                 ` Michael Shavit
@ 2023-10-23 12:05                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-23 12:05 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Mon, Oct 23, 2023 at 04:36:36PM +0800, Michael Shavit wrote:
> On Fri, Oct 20, 2023 at 7:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Fri, Oct 20, 2023 at 04:23:44PM +0800, Michael Shavit wrote:
> > > The comment helps a lot thank you.
> > >
> > > I do still have some final reservations: wouldn't it be clearer with
> > > the loop un-rolled? After all it's only 3 steps in the worst case....
> > > Something like:
> >
> > I thought about that, but a big point for me was to consolidate the
> > algorithm between CD/STE. Inlining everything makes it much more
> > difficult to achieve this. Actually my first sketches were trying to
> > write it unrolled.
> >
> > > +       arm_smmu_get_ste_used(target, &target_used);
> > > +       arm_smmu_get_ste_used(cur, &cur_used);
> > > +       if (!hitless_possible(target, target_used, cur_used, cur_used)) {
> >
> > hitless possible requires the loop of the step function to calcuate
> > it.
> 
> Possibly yes. Another option would be to have a list of transitions
> (e.g. IDENTITY -> S1) we expect and want to be hitless and check
> against that list. It defeats some of the purpose of your design, but
> it's also not obvious to me that we really need such flexibility in
> the first place.

When we get to nesting, a change of the STE coming from a VM should be
hitless whenever the VM expected it to be hitless, as a matter of
preserving the HW semantics in emulation.

Even if you had a list, there are several different methods to do a
hitless update; this is more in the direction of the old code where
things were carefully open coded and fragile.

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 09/19] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2023-10-11  0:33   ` Jason Gunthorpe
@ 2023-10-24  2:44     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-24  2:44 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> The BTM support wants to be able to change the ASID of any smmu_domain.
> When it goes to do this it holds the arm_smmu_asid_lock and iterates over
> the target domain's devices list.
>
> During attach of a S1 domain we must ensure that the devices list and
> CD are in sync, otherwise we could miss CD updates or a parallel CD update
> could push an out of date CD.
>
> This is pretty complicated, and works today because arm_smmu_detach_dev()
> removes the CD table from the STE before working on the CD entries.
>
> The next patch will allow the CD table to remain in the STE, so solve this
> race by holding the lock for a longer period. The lock covers both the
> changes to the devices list and the CD table entries.
>
> Move arm_smmu_detach_dev() till after we have initialized the domain so
> the lock can be held for less time.
>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++---------
>  1 file changed, 14 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 2c06d3e3abe2b1..a29421f133a3c0 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2535,8 +2535,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>                 return -EBUSY;
>         }
>
> -       arm_smmu_detach_dev(master);
> -
>         mutex_lock(&smmu_domain->init_mutex);
>
>         if (!smmu_domain->smmu) {
> @@ -2549,7 +2547,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>
>         mutex_unlock(&smmu_domain->init_mutex);
>         if (ret)
> -               return ret;
> +               goto out_unlock;

Oh, I missed this earlier, but on a second look the asid_lock isn't
grabbed here yet, so this should stay as return ret (a minimal sketch
of the resulting flow follows the quoted diff below).


> +
> +       /*
> +        * Prevent arm_smmu_share_asid() from trying to change the ASID
> +        * of either the old or new domain while we are working on it.
> +        * This allows the STE and the smmu_domain->devices list to
> +        * be inconsistent during this routine.
> +        */
> +       mutex_lock(&arm_smmu_asid_lock);
> +
> +       arm_smmu_detach_dev(master);
>
>         master->domain = smmu_domain;
>
> @@ -2576,13 +2584,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>                         }
>                 }
>
> -               /*
> -                * Prevent SVA from concurrently modifying the CD or writing to
> -                * the CD entry
> -                */
> -               mutex_lock(&arm_smmu_asid_lock);
>                 ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
> -               mutex_unlock(&arm_smmu_asid_lock);
>                 if (ret) {
>                         master->domain = NULL;
>                         goto out_list_del;
> @@ -2592,13 +2594,15 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>         arm_smmu_install_ste_for_dev(master);
>
>         arm_smmu_enable_ats(master);
> -       return 0;
> +       goto out_unlock;
>
>  out_list_del:
>         spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>         list_del(&master->domain_head);
>         spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>
> +out_unlock:
> +       mutex_unlock(&arm_smmu_asid_lock);
>         return ret;
>  }
>
> --
> 2.42.0
>
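
To make the locking order concrete, a simplified sketch of the flow the
patch is aiming for, with the early return kept as a plain return
(finalise_domain() and install_domain() are hypothetical stand-ins for
the elided hunks; the quoted diff is authoritative):

static int attach_dev_sketch(struct arm_smmu_domain *smmu_domain,
                             struct arm_smmu_master *master)
{
        int ret;

        /* Domain finalisation only needs init_mutex, not the asid_lock. */
        mutex_lock(&smmu_domain->init_mutex);
        ret = finalise_domain(smmu_domain);             /* hypothetical */
        mutex_unlock(&smmu_domain->init_mutex);
        if (ret)
                return ret;     /* asid_lock not taken yet: plain return */

        /*
         * From here the devices list and the CD table must stay in sync,
         * so hold arm_smmu_asid_lock across both updates.
         */
        mutex_lock(&arm_smmu_asid_lock);
        arm_smmu_detach_dev(master);
        ret = install_domain(master, smmu_domain);      /* hypothetical */
        mutex_unlock(&arm_smmu_asid_lock);
        return ret;
}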

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 09/19] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2023-10-24  2:44     ` Michael Shavit
@ 2023-10-24  2:48       ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-10-24  2:48 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Oct 24, 2023 at 10:44 AM Michael Shavit <mshavit@google.com> wrote:
>
> On Wed, Oct 11, 2023 at 8:33 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > The BTM support wants to be able to change the ASID of any smmu_domain.
> > When it goes to do this it holds the arm_smmu_asid_lock and iterates over
> > the target domain's devices list.
> >
> > During attach of a S1 domain we must ensure that the devices list and
> > CD are in sync, otherwise we could miss CD updates or a parallel CD update
> > could push an out of date CD.
> >
> > This is pretty complicated, and works today because arm_smmu_detach_dev()
> > removes the CD table from the STE before working on the CD entries.
> >
> > The next patch will allow the CD table to remain in the STE, so solve this
> > race by holding the lock for a longer period. The lock covers both the
> > changes to the devices list and the CD table entries.
> >
> > Move arm_smmu_detach_dev() till after we have initialized the domain so
> > the lock can be held for less time.
> >
> > Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> > ---
> >  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 24 ++++++++++++---------
> >  1 file changed, 14 insertions(+), 10 deletions(-)
> >
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 2c06d3e3abe2b1..a29421f133a3c0 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -2535,8 +2535,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> >                 return -EBUSY;
> >         }
> >
> > -       arm_smmu_detach_dev(master);
> > -
> >         mutex_lock(&smmu_domain->init_mutex);
> >
> >         if (!smmu_domain->smmu) {
> > @@ -2549,7 +2547,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> >
> >         mutex_unlock(&smmu_domain->init_mutex);
> >         if (ret)
> > -               return ret;
> > +               goto out_unlock;
>
> Oh, missed this earlier but on a second look the asid_lock isn't
> grabbed here yet so this should stay as return ret.
>
I guess you must have noticed it too, since it's fixed in patch 09 of the
second series :).

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 09/19] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2023-10-24  2:44     ` Michael Shavit
@ 2023-10-24 11:50       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-10-24 11:50 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Oct 24, 2023 at 10:44:36AM +0800, Michael Shavit wrote:

> > @@ -2549,7 +2547,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
> >
> >         mutex_unlock(&smmu_domain->init_mutex);
> >         if (ret)
> > -               return ret;
> > +               goto out_unlock;
> 
> Oh, missed this earlier but on a second look the asid_lock isn't
> grabbed here yet so this should stay as return ret.

Yep, there is a hunk in a later patch fixing this; I moved it here

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-10-23  8:36                 ` Michael Shavit
@ 2023-12-15 20:26                   ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-15 20:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Mon, Oct 23, 2023 at 4:36 PM Michael Shavit <mshavit@google.com> wrote:
>
> On Fri, Oct 20, 2023 at 7:39 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Fri, Oct 20, 2023 at 04:23:44PM +0800, Michael Shavit wrote:
> > > The comment helps a lot thank you.
> > >
> > > I do still have some final reservations: wouldn't it be clearer with
> > > the loop un-rolled? After all it's only 3 steps in the worst case....
> > > Something like:
> >
> > I thought about that, but a big point for me was to consolidate the
> > algorithm between CD/STE. Inlining everything makes it much more
> > difficult to achieve this. Actually my first sketches were trying to
> > write it unrolled.
> >
> > > +       arm_smmu_get_ste_used(target, &target_used);
> > > +       arm_smmu_get_ste_used(cur, &cur_used);
> > > +       if (!hitless_possible(target, target_used, cur_used, cur_used)) {
> >
> > hitless possible requires the loop of the step function to calculate
> > it.
>
> Possibly yes. Another option would be to have a list of transitions
> (e.g. IDENTITY -> S1) we expect and want to be hitless and check
> against that list. It defeats some of the purpose of your design, but
> it's also not obvious to me that we really need such flexibility in
> the first place.
>
> >
> > > +               target->data[0] = STRTAB_STE_0_V;
> > > +               arm_smmu_sync_ste_for_sid(smmu, sid);
> >
> > I still like V=0 as I think we do want the event for this case.
>
> Oh yeah sure, I did not think too carefully about this.
>
> >
> > > +               /*
> > > +                * The STE is now in abort where none of the bits except
> > > +                * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
> > > +                * all other words of the STE to be written without further
> > > +                * disruption.
> > > +                */
> > > +               arm_smmu_get_ste_used(cur, &cur_used);
> > > +       }
> > > +       /* write bits in all positions unused by the STE */
> > > +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> > > +               /* (should probably optimize this away if no write needed) */
> > > +               WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i]) | (target->data[i] & ~cur_used[i]));
> > > +       }
> > > +       arm_smmu_sync_ste_for_sid(smmu, sid);
> >
> > Yes, I wanted to avoid all the syncs if they are not required.
>
> Should be easy enough to optimize away in this alternate version as well.
>
> >
> > > +       /*
> > > +        * It should now be possible to make a single qword write to make
> > > +        * the new configuration take effect.
> > > +        */
> > > +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> > > +               if ((cur->data[i] & target_used[i]) != (target->data[i] & target_used[i]))
> > > +                       /*
> > > +                        * TODO:
> > > +                        * WARN_ONCE if this condition hits more than once in
> > > +                        * the loop
> > > +                        */
> > > +                       WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i]) | (target->data[i] & ~cur_used[i]));
> > > +       }
> >
> > > +       arm_smmu_sync_ste_for_sid(smmu, sid);
> >
> > This needs to be optional too
> >
> > And there is another optional 4th pass to set the unused target values
> > to 0.
> >
> > Basically you have captured the core algorithm, but I think if you
> > fill in all the missing bits to get up to the same functionality it
> > will be longer and unsharable with the CD side.
> >
> > You could perhaps take this approach and split it into 4 sharable step
> > functions:
> >
> >  if (step1(target, target_used, cur_used, cur_used, len)) {
> >   arm_smmu_sync_ste_for_sid(smmu, sid);
> >   arm_smmu_get_ste_used(cur, &cur_used);
> >  }
> >
> >  if (step2(target, target_used, cur_used, cur_used, len))
> >   arm_smmu_sync_ste_for_sid(smmu, sid);
> >
> >  if (step3(target, target_used, cur_used, cur_used, len)) {
> >    arm_smmu_sync_ste_for_sid(smmu, sid);
> >    arm_smmu_get_ste_used(cur, &cur_used);
> >   }
> >
> >  if (step4(target, target_used, cur_used, cur_used, len))
> >    arm_smmu_sync_ste_for_sid(smmu, sid);
> >
> > To me this is inelegant as if we only need to do step 3 we have to
> > redundantly scan the array 2 times. The rolled up version just
> > directly goes to step 3.
> >
> > However this does convince me you've thought very carefully about this
> > and have not found a flaw in the design!
>
> Right, it's more of a style/readability question so I'll leave it to
> others to chime in further if needed :) .

Ok, I took a proper stab at trying to unroll the loop on the github
version of this patch (v3+).
As you suspected, it's not easy to re-use the unrolled version for
both STE and CD writing as we'd have to pass in callbacks for syncing
the STE/CD and recomputing arm_smmu_{get_ste/cd}_used. The flow is
still exactly the same in both cases however, and there is room to
extract some of the steps between sync operations into shareable
helpers for re-use between the two (I haven't bothered to do that so
that we can evaluate first before putting in more effort).
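
For illustration, such a shareable helper could just generalise the
set-unused-bits step from the diff below to plain __le64 arrays (a
sketch, not part of the posted patch):

/* Overlay target values into every bit position the HW currently ignores. */
static void arm_smmu_entry_set_unused_bits(__le64 *cur, const __le64 *cur_used,
                                           const __le64 *target,
                                           unsigned int len)
{
        unsigned int i;

        for (i = 0; i != len; i++)
                cur[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
}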

I personally find this quite a bit more readable as a sequential
series of steps instead of a loop. It also only requires 3 STE/CD
syncs in the worst case, compared to 4 in the loop version, since we
determine whether a hitless transition is possible before making any
writes to the STE (whereas the current patch optimistically starts
trying to transition to the target before discovering that it must
nuke the V bit first).
Are those both not worth a bit of code duplication between STE and CD writing?

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 98aa8cc17b58b..1c35599d944d7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,100 +971,6 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

-/*
- * This algorithm updates any STE/CD to any value without creating a situation
- * where the HW can percieve a corrupted entry. HW is only required to have a 64
- * bit atomicity with stores from the CPU, while entries are many 64 bit values
- * big.
- *
- * The algorithm works by evolving the entry toward the target in a series of
- * steps. Each step synchronizes with the HW so that the HW can not see an entry
- * torn across two steps. Upon each call cur/cur_used reflect the current
- * synchronized value seen by the HW.
- *
- * During each step the HW can observe a torn entry that has any combination of
- * the step's old/new 64 bit words. The algorithm objective is for the HW
- * behavior to always be one of current behavior, V=0, or new behavior, during
- * each step, and across all steps.
- *
- * At each step one of three actions is chosen to evolve cur to target:
- *  - Update all unused bits with their target values.
- *    This relies on the IGNORED behavior described in the specification
- *  - Update a single 64-bit value
- *  - Update all unused bits and set V=0
- *
- * The last two actions will cause cur_used to change, which will then allow the
- * first action on the next step.
- *
- * In the most general case we can make any update in three steps:
- *  - Disrupting the entry (V=0)
- *  - Fill now unused bits, all bits except V
- *  - Make valid (V=1), single 64 bit store
- *
- * However this disrupts the HW while it is happening. There are several
- * interesting cases where a STE/CD can be updated without disturbing the HW
- * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
- * because the used bits don't intersect. We can detect this by calculating how
- * many 64 bit values need update after adjusting the unused bits and skip the
- * V=0 process.
- */
-static bool arm_smmu_write_entry_next(__le64 *cur, const __le64 *cur_used,
-                                     const __le64 *target,
-                                     const __le64 *target_used, __le64 *step,
-                                     __le64 v_bit, unsigned int len)
-{
-       u8 step_used_diff = 0;
-       u8 step_change = 0;
-       unsigned int i;
-
-       /*
-        * Compute a step that has all the bits currently unused by HW set to
-        * their target values.
-        */
-       for (i = 0; i != len; i++) {
-               step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
-               if (cur[i] != step[i])
-                       step_change |= 1 << i;
-               /*
-                * Each bit indicates if the step is incorrect compared to the
-                * target, considering only the used bits in the target
-                */
-               if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
-                       step_used_diff |= 1 << i;
-       }
-
-       if (hweight8(step_used_diff) > 1) {
-               /*
-                * More than 1 qword is mismatched, this cannot be done without
-                * a break. Clear the V bit and go again.
-                */
-               step[0] &= ~v_bit;
-       } else if (!step_change && step_used_diff) {
-               /*
-                * Have exactly one critical qword, all the other qwords are set
-                * correctly, so we can set this qword now.
-                */
-               i = ffs(step_used_diff) - 1;
-               step[i] = target[i];
-       } else if (!step_change) {
-               /* cur == target, so all done */
-               if (memcmp(cur, target, len * sizeof(*cur)) == 0)
-                       return true;
-
-               /*
-                * All the used HW bits match, but unused bits are different.
-                * Set them as well. Technically this isn't necessary but it
-                * brings the entry to the full target state, so if there are
-                * bugs in the mask calculation this will obscure them.
-                */
-               memcpy(step, target, len * sizeof(*step));
-       }
-
-       for (i = 0; i != len; i++)
-               WRITE_ONCE(cur[i], step[i]);
-       return false;
-}
-
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
                             int ssid, bool leaf)
 {
@@ -1398,39 +1304,135 @@ static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
        }
 }

-static bool arm_smmu_write_ste_next(struct arm_smmu_ste *cur,
-                                   const struct arm_smmu_ste *target,
-                                   const struct arm_smmu_ste *target_used)
+/*
+ * Make bits of the current ste that aren't in use by the hardware equal to the
+ * target's bits.
+ */
+static void arm_smmu_ste_set_unused_bits(
+                                  struct arm_smmu_ste *cur,
+                                  const struct arm_smmu_ste *target)
 {
        struct arm_smmu_ste cur_used;
-       struct arm_smmu_ste step;
+       int i = 0;

        arm_smmu_get_ste_used(cur, &cur_used);
-       return arm_smmu_write_entry_next(cur->data, cur_used.data, target->data,
-                                        target_used->data, step.data,
-                                        cpu_to_le64(STRTAB_STE_0_V),
-                                        ARRAY_SIZE(cur->data));
+       for (i = 0; i < ARRAY_SIZE(cur->data); i++)
+               cur->data[i] = (cur->data[i] & cur_used.data[i]) |
+                              (target->data[i] & ~cur_used.data[i]);
 }

+/*
+ * Update the STE to the target configuration. The transition from the current
+ * STE to the target STE takes place over multiple steps that attempts to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have a 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification
+ */
 static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
                               struct arm_smmu_ste *ste,
                               const struct arm_smmu_ste *target)
 {
+       struct arm_smmu_ste staging_ste;
        struct arm_smmu_ste target_used;
-       int i;
+       int writes_required = 0;
+       u8 staging_used_diff = 0;
+       int i = 0;

+       /*
+        * Compute a staging ste that has all the bits currently unused by HW
+        * set to their target values, such that committing it to the ste table
+        * wouldn't disrupt the hardware.
+        */
+       memcpy(&staging_ste, ste, sizeof(staging_ste));
+       arm_smmu_ste_set_unused_bits(&staging_ste, target);
+
+       /*
+        * Determine if it's possible to reach the target configuration from the
+        * staged configuration in a single qword write (ignoring bits that are
+        * unused under the target configuration).
+        */
        arm_smmu_get_ste_used(target, &target_used);
-       /* Masks in arm_smmu_get_ste_used() are up to date */
-       for (i = 0; i != ARRAY_SIZE(target->data); i++)
-               WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
+       /*
+        * But first sanity check that masks in arm_smmu_get_ste_used() are up
+        * to date.
+        */
+       for (i = 0; i != ARRAY_SIZE(target->data); i++) {
+               if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
+                       target_used.data[i] |= target->data[i];
+       }

-       for (i = 0; true; i++) {
-               if (arm_smmu_write_ste_next(ste, target, &target_used))
-                       break;
-               arm_smmu_sync_ste_for_sid(smmu, sid);
-               if (WARN_ON(i == 4))
-                       break;
+       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
+               /*
+                * Each bit of staging_used_diff indicates the index of a qword
+                * within the staged ste that is incorrect compared to the
+                * target, considering only the used bits in the target
+                */
+               if ((staging_ste.data[i] &
+                   target_used.data[i]) != (target->data[i]))
+                       staging_used_diff |= 1 << i;
+       }
+       if (hweight8(staging_used_diff) > 1) {
+               /*
+                * More than 1 qword is mismatched and a hitless transition from
+                * the current ste to the target ste is not possible. Clear the
+                * V bit and recompute the staging STE.
+                * Because the V bit is cleared, the staging STE will be equal
+                * to the target STE except for the first qword.
+                */
+               staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
+               arm_smmu_ste_set_unused_bits(&staging_ste, target);
+               staging_used_diff = 1;
+       }
+
+       /*
+        * Commit the staging STE.
+        */
+       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
+               WRITE_ONCE(ste->data[i], staging_ste.data[i]);
+       arm_smmu_sync_ste_for_sid(smmu, sid);
+
+       /*
+        * It's now possible to switch to the target configuration with a write
+        * to a single qword. Make that switch now.
+        */
+       i = ffs(staging_used_diff) - 1;
+       WRITE_ONCE(ste->data[i], target->data[i]);
+       arm_smmu_sync_ste_for_sid(smmu, sid);
+
+       /*
+        * Some of the bits set under the previous configuration but unused
+        * under the target configuration might still be set. Clear them as
+        * well. Technically this isn't necessary but it brings the entry to
+        * the full target state, so if there are bugs in the mask calculation
+        * this will obscure them.
+        */
+       for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
+               if (ste->data[i] != target->data[i]) {
+                       WRITE_ONCE(ste->data[i], target->data[i]);
+               }
        }
+       arm_smmu_sync_ste_for_sid(smmu, sid);

        /* It's likely that we'll want to use the new STE soon */
        if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-15 20:26                   ` Michael Shavit
@ 2023-12-17 13:03                     ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-12-17 13:03 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:

> Ok, I took a proper stab at trying to unroll the loop on the github
> version of this patch (v3+)
> As you suspected, it's not easy to re-use the unrolled version for
> both STE and CD writing as we'd have to pass in callbacks for syncing
> the STE/CD and recomputing arm_smmu_{get_ste/cd}_used. 

Yes, that is why I structured it as an iterator.

> I personally find this quite a bit more readable as a sequential
> series of steps instead of a loop. It also only requires 3 STE/CD
> syncs in the worst case compared to 4 in the loop version since we

The existing version is max 3 as well; it works the same way, by
checking the number of critical qwords after computing the first step
in the hitless flow.

However, what you have below has the same problem as the first sketch:
it always does 3 syncs. The existing version fully minimizes the
syncs. That is why it is so complex to unroll: you have to check
before every sync whether the sync is needed at all.

This could probably be organized so that one shared function computes
the "plan" and then a CD/STE-specific copy executes the plan. It
avoids the loop, but all the code is still basically the same; there is
just more of it.
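
As a rough illustration of that idea, the shared function could fill in
something like the following (hypothetical type and field names, only to
show the shape a precomputed "plan" might take):

/*
 * Hypothetical precomputed plan: which qwords each sync-separated step
 * must store, and whether the entry has to pass through V=0 on the way.
 * A shared function would compute this from the used-bit masks; the STE
 * and CD writers would then execute it with their own sync callbacks.
 */
struct arm_smmu_entry_update_plan {
        bool    requires_break;         /* entry must pass through V=0 */
        u8      step1_qwords;           /* qwords to store before the 1st sync */
        u8      step2_qwords;           /* qwords to store before the 2nd sync */
};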

I'm fine with any of these ways

Jason

>  static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
>                                struct arm_smmu_ste *ste,
>                                const struct arm_smmu_ste *target)
>  {
> +       struct arm_smmu_ste staging_ste;
>         struct arm_smmu_ste target_used;
> +       int writes_required = 0;
> +       u8 staging_used_diff = 0;
> +       int i = 0;
> 
> +       /*
> +        * Compute a staging ste that has all the bits currently unused by HW
> +        * set to their target values, such that committing it to the ste table
> +        * wouldn't disrupt the hardware.
> +        */
> +       memcpy(&staging_ste, ste, sizeof(staging_ste));
> +       arm_smmu_ste_set_unused_bits(&staging_ste, target);
> +
> +       /*
> +        * Determine if it's possible to reach the target configuration from the
> +        * staged configuration in a single qword write (ignoring bits that are
> +        * unused under the target configuration).
> +        */
>         arm_smmu_get_ste_used(target, &target_used);
> -               WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> +       /*
> +        * But first sanity check that masks in arm_smmu_get_ste_used() are up
> +        * to date.
> +        */
> +       for (i = 0; i != ARRAY_SIZE(target->data); i++) {
> +               if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
> +                       target_used.data[i] |= target->data[i];
> +       }
> 
> +       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
> +               /*
> +                * Each bit of staging_used_diff indicates the index of a qword
> +                * within the staged ste that is incorrect compared to the
> +                * target, considering only the used bits in the target
> +                */
> +               if ((staging_ste.data[i] &
> +                   target_used.data[i]) != (target->data[i]))
> +                       staging_used_diff |= 1 << i;
> +       }
> +       if (hweight8(staging_used_diff) > 1) {
> +               /*
> +                * More than 1 qword is mismatched and a hitless transition from
> +                * the current ste to the target ste is not possible. Clear the
> +                * V bit and recompute the staging STE.
> +                * Because the V bit is cleared, the staging STE will be equal
> +                * to the target STE except for the first qword.
> +                */
> +               staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
> +               arm_smmu_ste_set_unused_bits(&staging_ste, target);
> +               staging_used_diff = 1;
> +       }
> +
> +       /*
> +        * Commit the staging STE.
> +        */
> +       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
> +               WRITE_ONCE(ste->data[i], staging_ste.data[i]);
> +       arm_smmu_sync_ste_for_sid(smmu, sid);
> +
> +       /*
> +        * It's now possible to switch to the target configuration with a write
> +        * to a single qword. Make that switch now.
> +        */
> +       i = ffs(staging_used_diff) - 1;
> +       WRITE_ONCE(ste->data[i], target->data[i]);
> +       arm_smmu_sync_ste_for_sid(smmu, sid);

The non-hitless flow doesn't look right to me; it should set V=0, then
load all qwords but qword 0, then load qword 0, in exactly that
sequence. If the goal is clarity then the two programming flows should
be explicitly spelled out.
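
For illustration, a minimal sketch of that break-before-make sequence,
reusing the helpers visible in the quoted code (an assumption-laden
sketch, not the final implementation):

static void arm_smmu_write_ste_break(struct arm_smmu_device *smmu, u32 sid,
                                     struct arm_smmu_ste *ste,
                                     const struct arm_smmu_ste *target)
{
        unsigned int i;

        /* 1) Break: clear V so the HW stops interpreting the other qwords. */
        WRITE_ONCE(ste->data[0], 0);
        arm_smmu_sync_ste_for_sid(smmu, sid);

        /* 2) Fill every qword except qword 0 while the entry is invalid. */
        for (i = 1; i != ARRAY_SIZE(ste->data); i++)
                WRITE_ONCE(ste->data[i], target->data[i]);
        arm_smmu_sync_ste_for_sid(smmu, sid);

        /* 3) Make: a single 64 bit store of qword 0 publishes V=1. */
        WRITE_ONCE(ste->data[0], target->data[0]);
        arm_smmu_sync_ste_for_sid(smmu, sid);
}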

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-17 13:03                     ` Jason Gunthorpe
@ 2023-12-18 12:35                       ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-18 12:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Sun, Dec 17, 2023 at 9:03 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:
>
> > Ok, I took a proper stab at trying to unroll the loop on the github
> > version of this patch (v3+)
> > As you suspected, it's not easy to re-use the unrolled version for
> > both STE and CD writing as we'd have to pass in callbacks for syncing
> > the STE/CD and recomputing arm_smmu_{get_ste/cd}_used.
>
> Yes, that is why I structured it as an iterator

On second thought, perhaps defining a helper class implementing
entry_sync() and entry_get_used_bits() might not be so bad?
It's a little bit more verbose, but avoids duplicating the
complicated parts.
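
For illustration, one possible shape of that helper (the struct names
below are hypothetical, not existing driver API; the two callbacks
correspond to the entry_get_used_bits()/entry_sync() idea above):

struct arm_smmu_entry_writer;

struct arm_smmu_entry_writer_ops {
	unsigned int num_entry_qwords;
	__le64 v_bit;
	/* Report which bits of the entry the HW currently interprets. */
	void (*get_used_bits)(const __le64 *entry, __le64 *used);
	/* Make the HW observe whatever has been written to the entry so far. */
	void (*sync)(struct arm_smmu_entry_writer *writer);
};

struct arm_smmu_entry_writer {
	const struct arm_smmu_entry_writer_ops *ops;
	struct arm_smmu_master *master;
};

The shared update algorithm would take a struct arm_smmu_entry_writer
plus the cur/target qword arrays, and the STE and CD paths would only
supply their own get_used_bits()/sync() implementations.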

> > I personally find this quite a bit more readable as a sequential
> > series of steps instead of a loop. It also only requires 3 STE/CD
> > syncs in the worst case compared to 4 in the loop version since we
>
> The existing version is max 3 as well, it works the same by
> checking the number of critical qwords after computing the first step
> in the hitless flow.

Gotcha. I was being lazy and assuming it was 4 based on the warning
added to the github version :).

>
> However, what you have below has the same problem as the first sketch,
> it always does 3 syncs. The existing version fully minimizes the
> syncs. It is why it is so complex to unroll it as you have to check
> before every sync if the sync is needed at all.

Hmmmm, AFAICT there are two optimizations I was missing:
1. The final clean-up loop may be a nop, in which case a sync isn't required.
2. We may be able to directly transition to the target state with a
single qword write from the very beginning, without going through any
intermediate STEs.

Both of these seem pretty easy to address in this version as well (a
sketch of the first follows below); or am I still overlooking a
scenario?
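
For the first one, a minimal sketch of the kind of check that could
skip the final sync when the clean-up pass changed nothing (same
locals as in arm_smmu_write_ste(); illustration only):

	bool cleanup_needed = false;

	for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
		if (ste->data[i] != target->data[i]) {
			WRITE_ONCE(ste->data[i], target->data[i]);
			cleanup_needed = true;
		}
	}
	if (cleanup_needed)
		arm_smmu_sync_ste_for_sid(smmu, sid);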

>
> This could probably be organized like this so one shared function
> computes the "plan" and then a cd/ste copy executes the plan. It
> avoids the loop but all the code is still basically the same, there is
> just more of it.
>
> I'm fine with any of these ways
>
> Jason
>
> >  static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> >                                struct arm_smmu_ste *ste,
> >                                const struct arm_smmu_ste *target)
> >  {
> > +       struct arm_smmu_ste staging_ste;
> >         struct arm_smmu_ste target_used;
> > +       int writes_required = 0;
> > +       u8 staging_used_diff = 0;
> > +       int i = 0;
> >
> > +       /*
> > +        * Compute a staging ste that has all the bits currently unused by HW
> > +        * set to their target values, such that committing it to the ste table
> > +        * wouldn't disrupt the hardware.
> > +        */
> > +       memcpy(&staging_ste, ste, sizeof(staging_ste));
> > +       arm_smmu_ste_set_unused_bits(&staging_ste, target);
> > +
> > +       /*
> > +        * Determine if it's possible to reach the target configuration from the
> > +        * staged configuration in a single qword write (ignoring bits that are
> > +        * unused under the target configuration).
> > +        */
> >         arm_smmu_get_ste_used(target, &target_used);
> > -               WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> > +       /*
> > +        * But first sanity check that masks in arm_smmu_get_ste_used() are up
> > +        * to date.
> > +        */
> > +       for (i = 0; i != ARRAY_SIZE(target->data); i++) {
> > +               if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
> > +                       target_used.data[i] |= target->data[i];
> > +       }
> >
> > +       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
> > +               /*
> > +                * Each bit of staging_used_diff indicates the index of a qword
> > +                * within the staged ste that is incorrect compared to the
> > +                * target, considering only the used bits in the target
> > +                */
> > +               if ((staging_ste.data[i] &
> > +                   target_used.data[i]) != (target->data[i]))
> > +                       staging_used_diff |= 1 << i;
> > +       }
> > +       if (hweight8(staging_used_diff) > 1) {
> > +               /*
> > +                * More than 1 qword is mismatched and a hitless transition from
> > +                * the current ste to the target ste is not possible. Clear the
> > +                * V bit and recompute the staging STE.
> > +                * Because the V bit is cleared, the staging STE will be equal
> > +                * to the target STE except for the first qword.
> > +                */
> > +               staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
> > +               arm_smmu_ste_set_unused_bits(&staging_ste, target);
> > +               staging_used_diff = 1;
> > +       }
> > +
> > +       /*
> > +        * Commit the staging STE.
> > +        */
> > +       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
> > +               WRITE_ONCE(ste->data[i], staging_ste.data[i]);
> > +       arm_smmu_sync_ste_for_sid(smmu, sid);
> > +
> > +       /*
> > +        * It's now possible to switch to the target configuration with a write
> > +        * to a single qword. Make that switch now.
> > +        */
> > +       i = ffs(staging_used_diff) - 1;
> > +       WRITE_ONCE(ste->data[i], target->data[i]);
> > +       arm_smmu_sync_ste_for_sid(smmu, sid);
>
> The non hitless flow doesn't look right to me, it should set v=0 then
> load all qwords but 0 then load 0, in exactly that sequence. If the
> goal is clarity then the two programming flows should be explicitly
> spelled out.

Ah yeah you're right. I forgot to set the other qwords in the staging
STE to have their final values in the non-hitless case.

The following should address the bug and optimization you pointed out
with minimal adjustments:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 98aa8cc17b58b..1c35599d944d7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,100 +971,6 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

-/*
- * This algorithm updates any STE/CD to any value without creating a situation
- * where the HW can percieve a corrupted entry. HW is only required to have a 64
- * bit atomicity with stores from the CPU, while entries are many 64 bit values
- * big.
- *
- * The algorithm works by evolving the entry toward the target in a series of
- * steps. Each step synchronizes with the HW so that the HW can not see an entry
- * torn across two steps. Upon each call cur/cur_used reflect the current
- * synchronized value seen by the HW.
- *
- * During each step the HW can observe a torn entry that has any combination of
- * the step's old/new 64 bit words. The algorithm objective is for the HW
- * behavior to always be one of current behavior, V=0, or new behavior, during
- * each step, and across all steps.
- *
- * At each step one of three actions is chosen to evolve cur to target:
- *  - Update all unused bits with their target values.
- *    This relies on the IGNORED behavior described in the specification
- *  - Update a single 64-bit value
- *  - Update all unused bits and set V=0
- *
- * The last two actions will cause cur_used to change, which will then allow the
- * first action on the next step.
- *
- * In the most general case we can make any update in three steps:
- *  - Disrupting the entry (V=0)
- *  - Fill now unused bits, all bits except V
- *  - Make valid (V=1), single 64 bit store
- *
- * However this disrupts the HW while it is happening. There are several
- * interesting cases where a STE/CD can be updated without disturbing the HW
- * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
- * because the used bits don't intersect. We can detect this by calculating how
- * many 64 bit values need update after adjusting the unused bits and skip the
- * V=0 process.
- */
-static bool arm_smmu_write_entry_next(__le64 *cur, const __le64 *cur_used,
-                                     const __le64 *target,
-                                     const __le64 *target_used, __le64 *step,
-                                     __le64 v_bit, unsigned int len)
-{
-       u8 step_used_diff = 0;
-       u8 step_change = 0;
-       unsigned int i;
-
-       /*
-        * Compute a step that has all the bits currently unused by HW set to
-        * their target values.
-        */
-       for (i = 0; i != len; i++) {
-               step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
-               if (cur[i] != step[i])
-                       step_change |= 1 << i;
-               /*
-                * Each bit indicates if the step is incorrect compared to the
-                * target, considering only the used bits in the target
-                */
-               if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
-                       step_used_diff |= 1 << i;
-       }
-
-       if (hweight8(step_used_diff) > 1) {
-               /*
-                * More than 1 qword is mismatched, this cannot be done without
-                * a break. Clear the V bit and go again.
-                */
-               step[0] &= ~v_bit;
-       } else if (!step_change && step_used_diff) {
-               /*
-                * Have exactly one critical qword, all the other qwords are set
-                * correctly, so we can set this qword now.
-                */
-               i = ffs(step_used_diff) - 1;
-               step[i] = target[i];
-       } else if (!step_change) {
-               /* cur == target, so all done */
-               if (memcmp(cur, target, len * sizeof(*cur)) == 0)
-                       return true;
-
-               /*
-                * All the used HW bits match, but unused bits are different.
-                * Set them as well. Technically this isn't necessary but it
-                * brings the entry to the full target state, so if there are
-                * bugs in the mask calculation this will obscure them.
-                */
-               memcpy(step, target, len * sizeof(*step));
-       }
-
-       for (i = 0; i != len; i++)
-               WRITE_ONCE(cur[i], step[i]);
-       return false;
-}
-
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
                             int ssid, bool leaf)
 {
@@ -1398,39 +1304,135 @@ static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
        }
 }

-static bool arm_smmu_write_ste_next(struct arm_smmu_ste *cur,
-                                   const struct arm_smmu_ste *target,
-                                   const struct arm_smmu_ste *target_used)
+/*
+ * Make bits of the current ste that aren't in use by the hardware equal to the
+ * target's bits.
+ */
+static void arm_smmu_ste_set_unused_bits(
+                                  struct arm_smmu_ste *cur,
+                                  const struct arm_smmu_ste *target)
 {
        struct arm_smmu_ste cur_used;
-       struct arm_smmu_ste step;
+       int i = 0;

        arm_smmu_get_ste_used(cur, &cur_used);
-       return arm_smmu_write_entry_next(cur->data, cur_used.data, target->data,
-                                        target_used->data, step.data,
-                                        cpu_to_le64(STRTAB_STE_0_V),
-                                        ARRAY_SIZE(cur->data));
+       for (i = 0; i < ARRAY_SIZE(cur->data); i++)
+               cur->data[i] = (cur->data[i] & cur_used.data[i]) |
+                              (target->data[i] & ~cur_used.data[i]);
 }

+/*
+ * Update the STE to the target configuration. The transition from the current
+ * STE to the target STE takes place over multiple steps that attempt to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have a 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification
+ */
 static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
                               struct arm_smmu_ste *ste,
                               const struct arm_smmu_ste *target)
 {
+       struct arm_smmu_ste staging_ste;
        struct arm_smmu_ste target_used;
-       int i;
+       int writes_required = 0;
+       u8 staging_used_diff = 0;
+       int i = 0;

+       /*
+        * Compute a staging ste that has all the bits currently unused by HW
+        * set to their target values, such that committing it to the ste table
+        * wouldn't disrupt the hardware.
+        */
+       memcpy(&staging_ste, ste, sizeof(staging_ste));
+       arm_smmu_ste_set_unused_bits(&staging_ste, target);
+
+       /*
+        * Determine if it's possible to reach the target configuration from the
+        * staged configuration in a single qword write (ignoring bits that are
+        * unused under the target configuration).
+        */
        arm_smmu_get_ste_used(target, &target_used);
-       /* Masks in arm_smmu_get_ste_used() are up to date */
-       for (i = 0; i != ARRAY_SIZE(target->data); i++)
-               WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
+       /*
+        * But first sanity check that masks in arm_smmu_get_ste_used() are up
+        * to date.
+        */
+       for (i = 0; i != ARRAY_SIZE(target->data); i++) {
+               if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
+                       target_used.data[i] |= target->data[i];
+       }

-       for (i = 0; true; i++) {
-               if (arm_smmu_write_ste_next(ste, target, &target_used))
-                       break;
-               arm_smmu_sync_ste_for_sid(smmu, sid);
-               if (WARN_ON(i == 4))
-                       break;
+       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
+               /*
+                * Each bit of staging_used_diff indicates the index of a qword
+                * within the staged ste that is incorrect compared to the
+                * target, considering only the used bits in the target
+                */
+               if ((staging_ste.data[i] &
+                   target_used.data[i]) != (target->data[i]))
+                       staging_used_diff |= 1 << i;
+       }
+       if (hweight8(staging_used_diff) > 1) {
+               /*
+                * More than 1 qword is mismatched and a hitless transition from
+                * the current ste to the target ste is not possible. Clear the
+                * V bit and recompute the staging STE.
+                * Because the V bit is cleared, the staging STE will be equal
+                * to the target STE except for the first qword.
+                */
+               staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
+               arm_smmu_ste_set_unused_bits(&staging_ste, target);
+               staging_used_diff = 1;
+       }
+
+       /*
+        * Commit the staging STE.
+        */
+       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
+               WRITE_ONCE(ste->data[i], staging_ste.data[i]);
+       arm_smmu_sync_ste_for_sid(smmu, sid);
+
+       /*
+        * It's now possible to switch to the target configuration with a write
+        * to a single qword. Make that switch now.
+        */
+       i = ffs(staging_used_diff) - 1;
+       WRITE_ONCE(ste->data[i], target->data[i]);
+       arm_smmu_sync_ste_for_sid(smmu, sid);
+
+       /*
+        * Some of the bits set under the previous configuration but unused
+        * under the target configuration might still be set. Clear them as
+        * well. Technically this isn't necessary but it brings the entry to
+        * the full target state, so if there are bugs in the mask calculation
+        * this will obscure them.
+        */
+       for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
+               if (ste->data[i] != target->data[i]) {
+                       WRITE_ONCE(ste->data[i], target->data[i]);
+               }
        }
+       arm_smmu_sync_ste_for_sid(smmu, sid);

        /* It's likely that we'll want to use the new STE soon */
        if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-18 12:35                       ` Michael Shavit
@ 2023-12-18 12:42                         ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-18 12:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Mon, Dec 18, 2023 at 8:35 PM Michael Shavit <mshavit@google.com> wrote:
> The following should address the bug and optimization you pointed out
> with minimal adjustments:
>

Sorry, that was the old patch. Here's the new one:

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 98aa8cc17b58b..de2e7d9e5919c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,100 +971,6 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

-/*
- * This algorithm updates any STE/CD to any value without creating a situation
- * where the HW can percieve a corrupted entry. HW is only required to have a 64
- * bit atomicity with stores from the CPU, while entries are many 64 bit values
- * big.
- *
- * The algorithm works by evolving the entry toward the target in a series of
- * steps. Each step synchronizes with the HW so that the HW can not see an entry
- * torn across two steps. Upon each call cur/cur_used reflect the current
- * synchronized value seen by the HW.
- *
- * During each step the HW can observe a torn entry that has any combination of
- * the step's old/new 64 bit words. The algorithm objective is for the HW
- * behavior to always be one of current behavior, V=0, or new behavior, during
- * each step, and across all steps.
- *
- * At each step one of three actions is chosen to evolve cur to target:
- *  - Update all unused bits with their target values.
- *    This relies on the IGNORED behavior described in the specification
- *  - Update a single 64-bit value
- *  - Update all unused bits and set V=0
- *
- * The last two actions will cause cur_used to change, which will then allow the
- * first action on the next step.
- *
- * In the most general case we can make any update in three steps:
- *  - Disrupting the entry (V=0)
- *  - Fill now unused bits, all bits except V
- *  - Make valid (V=1), single 64 bit store
- *
- * However this disrupts the HW while it is happening. There are several
- * interesting cases where a STE/CD can be updated without disturbing the HW
- * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
- * because the used bits don't intersect. We can detect this by calculating how
- * many 64 bit values need update after adjusting the unused bits and skip the
- * V=0 process.
- */
-static bool arm_smmu_write_entry_next(__le64 *cur, const __le64 *cur_used,
-				      const __le64 *target,
-				      const __le64 *target_used, __le64 *step,
-				      __le64 v_bit, unsigned int len)
-{
-	u8 step_used_diff = 0;
-	u8 step_change = 0;
-	unsigned int i;
-
-	/*
-	 * Compute a step that has all the bits currently unused by HW set to
-	 * their target values.
-	 */
-	for (i = 0; i != len; i++) {
-		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
-		if (cur[i] != step[i])
-			step_change |= 1 << i;
-		/*
-		 * Each bit indicates if the step is incorrect compared to the
-		 * target, considering only the used bits in the target
-		 */
-		if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
-			step_used_diff |= 1 << i;
-	}
-
-	if (hweight8(step_used_diff) > 1) {
-		/*
-		 * More than 1 qword is mismatched, this cannot be done without
-		 * a break. Clear the V bit and go again.
-		 */
-		step[0] &= ~v_bit;
-	} else if (!step_change && step_used_diff) {
-		/*
-		 * Have exactly one critical qword, all the other qwords are set
-		 * correctly, so we can set this qword now.
-		 */
-		i = ffs(step_used_diff) - 1;
-		step[i] = target[i];
-	} else if (!step_change) {
-		/* cur == target, so all done */
-		if (memcmp(cur, target, len * sizeof(*cur)) == 0)
-			return true;
-
-		/*
-		 * All the used HW bits match, but unused bits are different.
-		 * Set them as well. Technically this isn't necessary but it
-		 * brings the entry to the full target state, so if there are
-		 * bugs in the mask calculation this will obscure them.
-		 */
-		memcpy(step, target, len * sizeof(*step));
-	}
-
-	for (i = 0; i != len; i++)
-		WRITE_ONCE(cur[i], step[i]);
-	return false;
-}
-
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 			     int ssid, bool leaf)
 {
@@ -1398,40 +1304,160 @@ static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
 	}
 }

-static bool arm_smmu_write_ste_next(struct arm_smmu_ste *cur,
-				    const struct arm_smmu_ste *target,
-				    const struct arm_smmu_ste *target_used)
+/*
+ * Make bits of the current ste that aren't in use by the hardware equal to the
+ * target's bits.
+ */
+static void arm_smmu_ste_set_unused_bits(struct arm_smmu_ste *cur,
+					 const struct arm_smmu_ste *target)
 {
 	struct arm_smmu_ste cur_used;
-	struct arm_smmu_ste step;
+	int i = 0;

 	arm_smmu_get_ste_used(cur, &cur_used);
-	return arm_smmu_write_entry_next(cur->data, cur_used.data, target->data,
-					 target_used->data, step.data,
-					 cpu_to_le64(STRTAB_STE_0_V),
-					 ARRAY_SIZE(cur->data));
+	for (i = 0; i < ARRAY_SIZE(cur->data); i++)
+		cur->data[i] = (cur->data[i] & cur_used.data[i]) |
+			       (target->data[i] & ~cur_used.data[i]);
 }

+/*
+ * Each bit of the return value indicates the index of a qword within the ste
+ * that is incorrect compared to the target, considering only the used bits in
+ * the target
+ */
+static u8 arm_smmu_ste_used_qword_diff_indexes(const struct arm_smmu_ste *ste,
+					       const struct arm_smmu_ste *target)
+{
+	struct arm_smmu_ste target_used;
+	u8 qword_diff_indexes = 0;
+	int i = 0;
+
+	arm_smmu_get_ste_used(target, &target_used);
+	for (i = 0; i < ARRAY_SIZE(ste->data); i++) {
+		if ((ste->data[i] & target_used.data[i]) !=
+		    (target->data[i] & target_used.data[i]))
+			qword_diff_indexes |= 1 << i;
+	}
+	return qword_diff_indexes;
+}
+
+/*
+ * Update the STE to the target configuration. The transition from the current
+ * STE to the target STE takes place over multiple steps that attempt to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have a 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification.
+ */
 static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
 			       struct arm_smmu_ste *ste,
 			       const struct arm_smmu_ste *target)
 {
-	struct arm_smmu_ste target_used;
-	int i;
+	bool cleanup_sync_required = false;
+	struct arm_smmu_ste staging_ste;
+	u8 ste_qwords_used_diff = 0;
+	int i = 0;

-	arm_smmu_get_ste_used(target, &target_used);
-	/* Masks in arm_smmu_get_ste_used() are up to date */
-	for (i = 0; i != ARRAY_SIZE(target->data); i++)
-		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
+	ste_qwords_used_diff = arm_smmu_ste_used_qword_diff_indexes(ste, target);
+	if (WARN_ON_ONCE(ste_qwords_used_diff == 0))
+		return;

-	for (i = 0; true; i++) {
-		if (arm_smmu_write_ste_next(ste, target, &target_used))
-			break;
+	if (hweight8(ste_qwords_used_diff) > 1) {
+		/*
+		 * If transitioning to the target STE with a single qword write
+		 * isn't possible, then we must first transition to an
+		 * intermediate STE. The intermediate STE may either be an STE
+		 * that melds bits of the target STE into the current STE
+		 * without affecting bits used by the hardware under the current
+		 * configuration; or a breaking STE if a hitless transition to
+		 * the target isn't possible.
+		 */
+
+		/*
+		 * Compute a staging ste that has all the bits currently unused
+		 * by HW set to their target values, such that committing it to
+		 * the ste table wouldn't disrupt the hardware.
+		 */
+		memcpy(&staging_ste, ste, sizeof(staging_ste));
+		arm_smmu_ste_set_unused_bits(&staging_ste, target);
+
+		ste_qwords_used_diff =
+			arm_smmu_ste_used_qword_diff_indexes(&staging_ste, target);
+		if (hweight8(ste_qwords_used_diff) > 1) {
+			/*
+			 * More than 1 qword is mismatched between the staging
+			 * and target STE. A hitless transition to the target
+			 * ste is not possible. Set the staging STE to be equal
+			 * to the target STE, apart from the V bit's qword. As
+			 * long as the V bit is cleared first then writes to the
+			 * subsequent qwords will not further disrupt the
+			 * hardware.
+			 */
+			memcpy(&staging_ste, ste, sizeof(staging_ste));
+			staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
+			arm_smmu_ste_set_unused_bits(&staging_ste, target);
+			/*
+			 * After committing the staging STE, only the 0th qword
+			 * will differ from the target.
+			 */
+			ste_qwords_used_diff = 1;
+		}
+
+		/*
+		 * Commit the staging STE. Note that the iteration order
+		 * matters, as we may be committing a breaking STE in the
+		 * non-hitless case.
+		 */
+		for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
+			WRITE_ONCE(ste->data[i], staging_ste.data[i]);
 		arm_smmu_sync_ste_for_sid(smmu, sid);
-		if (WARN_ON(i == 4))
-			break;
 	}

+	/*
+	 * It's now possible to switch to the target configuration with a write
+	 * to a single qword. Make that switch now.
+	 */
+	i = ffs(ste_qwords_used_diff) - 1;
+	WRITE_ONCE(ste->data[i], target->data[i]);
+	arm_smmu_sync_ste_for_sid(smmu, sid);
+
+	/*
+	 * Some of the bits set under the previous configuration but unused
+	 * under the target configuration might still be set. Clear them as
+	 * well. Technically this isn't necessary but it brings the entry to
+	 * the full target state, so if there are bugs in the mask calculation
+	 * this will obscure them.
+	 */
+	for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
+		if (ste->data[i] != target->data[i]) {
+			WRITE_ONCE(ste->data[i], target->data[i]);
+			cleanup_sync_required = true;
+		}
+	}
+	if (cleanup_sync_required)
+		arm_smmu_sync_ste_for_sid(smmu, sid);
+
 	/* It's likely that we'll want to use the new STE soon */
 	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
 		struct arm_smmu_cmdq_ent
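
(Aside, for anyone following the control flow above: below is a minimal,
self-contained sketch of the used-qword-diff decision that picks between a
hitless single-qword update and a V=0 break. It is toy code rather than driver
code; the 4-qword entry layout and toy_used_bits() are invented for
illustration, and __builtin_popcount() stands in for hweight8().)

#include <stdint.h>
#include <stdio.h>

#define ENTRY_QWORDS 4

/* Invented "used bits" rule: qword 0 is always used, qword 1 only when V=1 */
static void toy_used_bits(const uint64_t *entry, uint64_t *used)
{
	used[0] = 0xffull;
	used[1] = (entry[0] & 1) ? ~0ull : 0;
	used[2] = 0;
	used[3] = 0;
}

/* Bit i is set when qword i differs from the target in the target's used
 * bits; this is the same quantity the patch calls ste_qwords_used_diff.
 */
static unsigned int used_qword_diff(const uint64_t *cur, const uint64_t *target)
{
	uint64_t target_used[ENTRY_QWORDS];
	unsigned int diff = 0;
	int i;

	toy_used_bits(target, target_used);
	for (i = 0; i < ENTRY_QWORDS; i++)
		if ((cur[i] & target_used[i]) != (target[i] & target_used[i]))
			diff |= 1u << i;
	return diff;
}

int main(void)
{
	uint64_t cur[ENTRY_QWORDS]    = { 0x05, 0x1111, 0, 0 };
	uint64_t target[ENTRY_QWORDS] = { 0x09, 0x2222, 0, 0 };
	unsigned int diff = used_qword_diff(cur, target);

	if (__builtin_popcount(diff) > 1)
		printf("diff=%#x: more than one qword differs -> V=0 break needed\n", diff);
	else
		printf("diff=%#x: one qword differs -> hitless single-store update\n", diff);
	return 0;
}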

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-18 12:35                       ` Michael Shavit
@ 2023-12-19 13:42                         ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-19 13:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Mon, Dec 18, 2023 at 8:35 PM Michael Shavit <mshavit@google.com> wrote:
>
> On Sun, Dec 17, 2023 at 9:03 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:
> >
> > > Ok, I took a proper stab at trying to unroll the loop on the github
> > > version of this patch (v3+)
> > > As you suspected, it's not easy to re-use the unrolled version for
> > > both STE and CD writing as we'd have to pass in callbacks for syncing
> > > the STE/CD and recomputing arm_smmu_{get_ste/cd}_used.
> >
> > Yes, that is why I structured it as an iterator
>
> On second thought, perhaps defining a helper class implementing
> entry_sync() and entry_get_used_bits() might not be so bad?
> It's a little bit more verbose, but avoids duplication of the
> complicated parts.

Gave this a try so that we have something more concrete to compare.
Consider the following two patches as alternatives to this patch and
patch "Make CD programming use arm_smmu_write_entry_step" from the
next part of the patch series.

STE programming patch
---
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b120d83668...1e17bff37f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,6 +971,174 @@
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

+struct arm_smmu_entry_writer;
+
+/**
+ * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
+ * @sync_entry: sync entry to the hardware after writing to it.
+ * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
+ *                   equal to the target's bits.
+ * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
+ *                               are incorrect compared to the target,
+ *                               considering only the used bits in the target.
+ *                               The set bits in the return value
represents the
+ *                               indexes of those qwords.
+ */
+struct arm_smmu_entry_writer_ops {
+       void (*sync_entry)(struct arm_smmu_entry_writer *);
+       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
+       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
+};
+
+struct arm_smmu_entry_writer {
+       struct arm_smmu_entry_writer_ops ops;
+       __le64 v_bit;
+       unsigned int entry_length;
+};
+
+static void arm_smmu_entry_set_unused_bits(__le64 *entry, const __le64 *target,
+                                          const __le64 *entry_used,
+                                          unsigned int length)
+{
+       int i = 0;
+
+       for (i = 0; i < length; i++)
+               entry[i] = (entry[i] & entry_used[i]) |
+                          (target[i] & ~entry_used[i]);
+}
+
+static u8 arm_smmu_entry_used_qword_diff_indexes(__le64 *entry,
+                                                const __le64 *target,
+                                                const __le64 *target_used,
+                                                unsigned int length)
+{
+       u8 qword_diff_indexes = 0;
+       int i = 0;
+
+       for (i = 0; i < length; i++) {
+               if ((entry[i] & target_used[i]) != (target[i] & target_used[i]))
+                       qword_diff_indexes |= 1 << i;
+       }
+       return qword_diff_indexes;
+}
+
+/*
+ * Update the STE/CD to the target configuration. The transition from the current
+ * entry to the target entry takes place over multiple steps that attempt to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have a 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification.
+ */
+static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
+                                __le64 *cur, const __le64 *target,
+                                __le64 *staging_entry)
+{
+       bool cleanup_sync_required = false;
+       u8 entry_qwords_used_diff = 0;
+       int i = 0;
+
+       entry_qwords_used_diff =
+               writer->ops.get_used_qword_diff_indexes(cur, target);
+       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
+               return;
+
+       if (hweight8(entry_qwords_used_diff) > 1) {
+               /*
+                * If transitioning to the target entry with a single qword
+                * write isn't possible, then we must first transition to an
+                * intermediate entry. The intermediate entry may either be an
+                * entry that melds bits of the target entry into the current
+                * entry without disrupting the hardware, or a breaking entry if
+                * a hitless transition to the target is impossible.
+                */
+
+               /*
+                * Compute a staging entry that has all the bits currently
+                * unused by HW set to their target values, such that committing
+                * it to the entry table wouldn't disrupt the hardware.
+                */
+               memcpy(staging_entry, cur,
+                      writer->entry_length * sizeof(*staging_entry));
+               writer->ops.set_unused_bits(staging_entry, target);
+
+               entry_qwords_used_diff =
+                       writer->ops.get_used_qword_diff_indexes(staging_entry,
+                                                               target);
+               if (hweight8(entry_qwords_used_diff) > 1) {
+                       /*
+                        * More than 1 qword is mismatched between the staging
+                        * and target entry. A hitless transition to the target
+                        * entry is not possible. Set the staging entry to be
+                        * equal to the target entry, apart from the V bit's
+                        * qword. As long as the V bit is cleared first then
+                        * writes to the subsequent qwords will not further
+                        * disrupt the hardware.
+                        */
+                       memcpy(staging_entry, target,
+                              writer->entry_length * sizeof(*staging_entry));
+                       staging_entry[0] &= ~writer->v_bit;
+                       /*
+                        * After committing the staging entry, only the 0th qword
+                        * will differ from the target.
+                        */
+                       entry_qwords_used_diff = 1;
+               }
+
+               /*
+                * Commit the staging entry. Note that the iteration order
+                * matters, as we may be committing a breaking entry in the
+                * non-hitless case. The 0th qword which holds the valid bit
+                * must be written first in that case.
+                */
+               for (i = 0; i != writer->entry_length; i++)
+                       WRITE_ONCE(cur[i], staging_entry[i]);
+               writer->ops.sync_entry(writer);
+       }
+
+       /*
+        * It's now possible to switch to the target configuration with a write
+        * to a single qword. Make that switch now.
+        */
+       i = ffs(entry_qwords_used_diff) - 1;
+       WRITE_ONCE(cur[i], target[i]);
+       writer->ops.sync_entry(writer);
+
+       /*
+        * Some of the bits set under the previous configuration but unused
+        * under the target configuration might still be set. Clear them as
+        * well. Technically this isn't necessary but it brings the entry to
+        * the full target state, so if there are bugs in the mask calculation
+        * this will obscure them.
+        */
+       for (i = 0; i != writer->entry_length; i++) {
+               if (cur[i] != target[i]) {
+                       WRITE_ONCE(cur[i], target[i]);
+                       cleanup_sync_required = true;
+               }
+       }
+       if (cleanup_sync_required)
+               writer->ops.sync_entry(writer);
+}
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
                             int ssid, bool leaf)
 {
@@ -1248,37 +1416,142 @@
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

+/*
+ * Based on the value of ent report which bits of the STE the HW will access.
+ * It would be nice if this was complete according to the spec, but minimally
+ * it has to capture the bits this driver uses.
+ */
+static void arm_smmu_get_ste_used(const __le64 *ent,
+                                 struct arm_smmu_ste *used_bits)
+{
+       memset(used_bits, 0, sizeof(*used_bits));
+
+       used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
+       if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
+               return;
+
+       /*
+        * If S1 is enabled S1DSS is valid, see 13.5 Summary of
+        * attribute/permission configuration fields for the SHCFG behavior.
+        */
+       if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
+           FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
+                   STRTAB_STE_1_S1DSS_BYPASS)
+               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+
+       used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+       switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
+       case STRTAB_STE_0_CFG_ABORT:
+               break;
+       case STRTAB_STE_0_CFG_BYPASS:
+               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+               break;
+       case STRTAB_STE_0_CFG_S1_TRANS:
+               used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+                                                 STRTAB_STE_0_S1CTXPTR_MASK |
+                                                 STRTAB_STE_0_S1CDMAX);
+               used_bits->data[1] |=
+                       cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+                                   STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+                                   STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+               break;
+       case STRTAB_STE_0_CFG_S2_TRANS:
+               used_bits->data[1] |=
+                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+               used_bits->data[2] |=
+                       cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+                                   STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+                                   STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+               used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+               break;
+
+       default:
+               memset(used_bits, 0xFF, sizeof(*used_bits));
+               WARN_ON(true);
+       }
+}
+
+struct arm_smmu_ste_writer {
+       struct arm_smmu_entry_writer writer;
+       struct arm_smmu_device *smmu;
+       u32 sid;
+};
+
+static void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
+{
+       struct arm_smmu_ste entry_used;
+       arm_smmu_get_ste_used(entry, &entry_used);
+
+       arm_smmu_entry_set_unused_bits(entry, target, entry_used.data,
+                                      ARRAY_SIZE(entry_used.data));
+}
+
+static u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
+                                              const __le64 *target)
+{
+       struct arm_smmu_ste target_used;
+
+       arm_smmu_get_ste_used(target, &target_used);
+       return arm_smmu_entry_used_qword_diff_indexes(
+               cur, target, target_used.data, ARRAY_SIZE(target_used.data));
+}
+
+static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
+{
+       struct arm_smmu_ste_writer *ste_writer =
+               container_of(writer, struct arm_smmu_ste_writer, writer);
+
+       arm_smmu_sync_ste_for_sid(ste_writer->smmu, ste_writer->sid);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
+       .sync_entry = arm_smmu_ste_writer_sync_entry,
+       .set_unused_bits = arm_smmu_ste_set_unused_bits,
+       .get_used_qword_diff_indexes = arm_smmu_ste_used_qword_diff_indexes,
+};
+
+static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
+                              struct arm_smmu_ste *ste,
+                              const struct arm_smmu_ste *target)
+{
+       struct arm_smmu_ste preallocated_staging_ste = {0};
+       struct arm_smmu_ste_writer ste_writer = {
+               .writer = {
+                       .ops = arm_smmu_ste_writer_ops,
+                       .v_bit = cpu_to_le64(STRTAB_STE_0_V),
+                       .entry_length = ARRAY_SIZE(ste->data),
+               },
+               .smmu = smmu,
+               .sid = sid,
+       };
+
+       arm_smmu_write_entry(&ste_writer.writer,
+                              ste->data,
+                              target->data,
+                              preallocated_staging_ste.data);
+
+       /* It's likely that we'll want to use the new STE soon */
+       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+               struct arm_smmu_cmdq_ent
+                       prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+                                        .prefetch = {
+                                                .sid = sid,
+                                        } };
+
+               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+       }
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
                                      struct arm_smmu_ste *dst)
 {
-       /*
-        * This is hideously complicated, but we only really care about
-        * three cases at the moment:
-        *
-        * 1. Invalid (all zero) -> bypass/fault (init)
-        * 2. Bypass/fault -> translation/bypass (attach)
-        * 3. Translation/bypass -> bypass/fault (detach)
-        *
-        * Given that we can't update the STE atomically and the SMMU
-        * doesn't read the thing in a defined order, that leaves us
-        * with the following maintenance requirements:
-        *
-        * 1. Update Config, return (init time STEs aren't live)
-        * 2. Write everything apart from dword 0, sync, write dword 0, sync
-        * 3. Update Config, sync
-        */
-       u64 val = le64_to_cpu(dst->data[0]);
-       bool ste_live = false;
+       u64 val;
        struct arm_smmu_device *smmu = master->smmu;
        struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
        struct arm_smmu_s2_cfg *s2_cfg = NULL;
        struct arm_smmu_domain *smmu_domain = master->domain;
-       struct arm_smmu_cmdq_ent prefetch_cmd = {
-               .opcode         = CMDQ_OP_PREFETCH_CFG,
-               .prefetch       = {
-                       .sid    = sid,
-               },
-       };
+       struct arm_smmu_ste target = {};

        if (smmu_domain) {
                switch (smmu_domain->stage) {
@@ -1293,22 +1566,6 @@
                }
        }

-       if (val & STRTAB_STE_0_V) {
-               switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-               case STRTAB_STE_0_CFG_BYPASS:
-                       break;
-               case STRTAB_STE_0_CFG_S1_TRANS:
-               case STRTAB_STE_0_CFG_S2_TRANS:
-                       ste_live = true;
-                       break;
-               case STRTAB_STE_0_CFG_ABORT:
-                       BUG_ON(!disable_bypass);
-                       break;
-               default:
-                       BUG(); /* STE corruption */
-               }
-       }
-
        /* Nuke the existing STE_0 value, as we're going to rewrite it */
        val = STRTAB_STE_0_V;

@@ -1319,16 +1576,11 @@
                else
                        val |= FIELD_PREP(STRTAB_STE_0_CFG,
STRTAB_STE_0_CFG_BYPASS);

-               dst->data[0] = cpu_to_le64(val);
-               dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+               target.data[0] = cpu_to_le64(val);
+               target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
                                                STRTAB_STE_1_SHCFG_INCOMING));
-               dst->data[2] = 0; /* Nuke the VMID */
-               /*
-                * The SMMU can perform negative caching, so we must sync
-                * the STE regardless of whether the old value was live.
-                */
-               if (smmu)
-                       arm_smmu_sync_ste_for_sid(smmu, sid);
+               target.data[2] = 0; /* Nuke the VMID */
+               arm_smmu_write_ste(smmu, sid, dst, &target);
                return;
        }

@@ -1336,8 +1588,7 @@
                u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
                        STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;

-               BUG_ON(ste_live);
-               dst->data[1] = cpu_to_le64(
+               target.data[1] = cpu_to_le64(
                         FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
                         FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
                         FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1597,7 @@

                if (smmu->features & ARM_SMMU_FEAT_STALLS &&
                    !master->stall_enabled)
-                       dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+                       target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);

                val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
                        FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1606,7 @@
        }

        if (s2_cfg) {
-               BUG_ON(ste_live);
-               dst->data[2] = cpu_to_le64(
+               target.data[2] = cpu_to_le64(
                         FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
                         FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1615,17 @@
                         STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
                         STRTAB_STE_2_S2R);

-               dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+               target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);

                val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
        }

        if (master->ats_enabled)
-               dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+               target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
                                                 STRTAB_STE_1_EATS_TRANS));

-       arm_smmu_sync_ste_for_sid(smmu, sid);
-       /* See comment in arm_smmu_write_ctx_desc() */
-       WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-       arm_smmu_sync_ste_for_sid(smmu, sid);
-
-       /* It's likely that we'll want to use the new STE soon */
-       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+       target.data[0] = cpu_to_le64(val);
+       arm_smmu_write_ste(smmu, sid, dst, &target);
 }

 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,





---
CD programming
---
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 55703a5d62...c849b26c43 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1219,6 +1219,86 @@
        return &l1_desc->l2ptr[idx];
 }

+static void arm_smmu_get_cd_used(const __le64 *ent,
+                                struct arm_smmu_cd *used_bits)
+{
+       memset(used_bits, 0, sizeof(*used_bits));
+
+       used_bits->data[0] = cpu_to_le64(CTXDESC_CD_0_V);
+       if (!(ent[0] & cpu_to_le64(CTXDESC_CD_0_V)))
+               return;
+       memset(used_bits, 0xFF, sizeof(*used_bits));
+
+       /* EPD0 means T0SZ/TG0/IR0/OR0/SH0/TTB0 are IGNORED */
+       if (ent[0] & cpu_to_le64(CTXDESC_CD_0_TCR_EPD0)) {
+               used_bits->data[0] &= ~cpu_to_le64(
+                       CTXDESC_CD_0_TCR_T0SZ | CTXDESC_CD_0_TCR_TG0 |
+                       CTXDESC_CD_0_TCR_IRGN0 | CTXDESC_CD_0_TCR_ORGN0 |
+                       CTXDESC_CD_0_TCR_SH0);
+               used_bits->data[1] &= ~cpu_to_le64(CTXDESC_CD_1_TTB0_MASK);
+       }
+}
+
+struct arm_smmu_cd_writer {
+       struct arm_smmu_entry_writer writer;
+       struct arm_smmu_master *master;
+       int ssid;
+};
+
+static void arm_smmu_cd_set_unused_bits(__le64 *entry, const __le64 *target)
+{
+       struct arm_smmu_cd entry_used;
+       arm_smmu_get_cd_used(entry, &entry_used);
+
+       arm_smmu_entry_set_unused_bits(entry, target, entry_used.data,
+                                      ARRAY_SIZE(entry_used.data));
+}
+
+static u8 arm_smmu_cd_used_qword_diff_indexes(__le64 *cur,
+                                              const __le64 *target)
+{
+       struct arm_smmu_cd target_used;
+
+       arm_smmu_get_cd_used(target, &target_used);
+       return arm_smmu_entry_used_qword_diff_indexes(
+               cur, target, target_used.data, ARRAY_SIZE(target_used.data));
+}
+
+static void arm_smmu_cd_writer_sync_entry(struct arm_smmu_entry_writer *writer)
+{
+       struct arm_smmu_cd_writer *cd_writer =
+               container_of(writer, struct arm_smmu_cd_writer, writer);
+
+       arm_smmu_sync_cd(cd_writer->master, cd_writer->ssid, true);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_cd_writer_ops = {
+       .sync_entry = arm_smmu_cd_writer_sync_entry,
+       .set_unused_bits = arm_smmu_cd_set_unused_bits,
+       .get_used_qword_diff_indexes = arm_smmu_cd_used_qword_diff_indexes,
+};
+
+static void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
+                                   struct arm_smmu_cd *cdptr,
+                                   const struct arm_smmu_cd *target)
+{
+       struct arm_smmu_cd preallocated_staging_cd = {0};
+       struct arm_smmu_cd_writer cd_writer = {
+               .writer = {
+                       .ops = arm_smmu_cd_writer_ops,
+                       .v_bit = cpu_to_le64(CTXDESC_CD_0_V),
+                       .entry_length = ARRAY_SIZE(cdptr->data),
+               },
+               .master = master,
+               .ssid = ssid,
+       };
+
+       arm_smmu_write_entry(&cd_writer.writer,
+                              cdptr->data,
+                              target->data,
+                              preallocated_staging_cd.data);
+}
+
 int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
                            struct arm_smmu_ctx_desc *cd)
 {
@@ -1235,16 +1315,19 @@
         */
        u64 val;
        bool cd_live;
-       struct arm_smmu_cd *cdptr;
+       struct arm_smmu_cd target;
+       struct arm_smmu_cd *cdptr = &target;
+       struct arm_smmu_cd *cd_table_entry;
        struct arm_smmu_ctx_desc_cfg *cd_table = &master->cd_table;

        if (WARN_ON(ssid >= (1 << cd_table->s1cdmax)))
                return -E2BIG;

-       cdptr = arm_smmu_get_cd_ptr(master, ssid);
-       if (!cdptr)
+       cd_table_entry = arm_smmu_get_cd_ptr(master, ssid);
+       if (!cd_table_entry)
                return -ENOMEM;

+       target = *cd_table_entry;
        val = le64_to_cpu(cdptr->data[0]);
        cd_live = !!(val & CTXDESC_CD_0_V);

@@ -1264,13 +1347,6 @@
                cdptr->data[2] = 0;
                cdptr->data[3] = cpu_to_le64(cd->mair);

-               /*
-                * STE may be live, and the SMMU might read dwords of this CD in any
-                * order. Ensure that it observes valid values before reading
-                * V=1.
-                */
-               arm_smmu_sync_cd(master, ssid, true);
-
                val = cd->tcr |
 #ifdef __BIG_ENDIAN
                        CTXDESC_CD_0_ENDI |
@@ -1284,18 +1360,8 @@
                if (cd_table->stall_enabled)
                        val |= CTXDESC_CD_0_S;
        }
-
-       /*
-        * The SMMU accesses 64-bit values atomically. See IHI0070Ca 3.21.3
-        * "Configuration structures and configuration invalidation completion"
-        *
-        *   The size of single-copy atomic reads made by the SMMU is
-        *   IMPLEMENTATION DEFINED but must be at least 64 bits. Any single
-        *   field within an aligned 64-bit span of a structure can be altered
-        *   without first making the structure invalid.
-        */
-       WRITE_ONCE(cdptr->data[0], cpu_to_le64(val));
-       arm_smmu_sync_cd(master, ssid, true);
+       cdptr->data[0] = cpu_to_le64(val);
+       arm_smmu_write_cd_entry(master, ssid, cd_table_entry, &target);
        return 0;
 }

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
@ 2023-12-19 13:42                         ` Michael Shavit
  0 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-19 13:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Mon, Dec 18, 2023 at 8:35 PM Michael Shavit <mshavit@google.com> wrote:
>
> On Sun, Dec 17, 2023 at 9:03 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:
> >
> > > Ok, I took a proper stab at trying to unroll the loop on the github
> > > version of this patch (v3+)
> > > As you suspected, it's not easy to re-use the unrolled version for
> > > both STE and CD writing as we'd have to pass in callbacks for syncing
> > > the STE/CD and recomputing arm_smmu_{get_ste/cd}_used.
> >
> > Yes, that is why I structured it as an iterator
>
> On second thought, perhaps defining a helper class implementing
> entry_sync() and entry_get_used_bits() might not be so bad?
> It's a little bit more verbose, but avoids duplicating the
> complicated parts.

Gave this a try so that we have something more concrete to compare.
Consider the following two patches as alternatives to this patch and
patch "Make CD programming use arm_smmu_write_entry_step" from the
next part of the patch series.
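
For orientation, both patches hang off the same small "helper class": an ops
struct plus a wrapper recovered with container_of(). A minimal sketch of the
shape (illustrative comments only; the authoritative definitions are in the
patches below):

	struct arm_smmu_entry_writer;

	/* Callbacks that abstract the STE/CD specific parts of the update. */
	struct arm_smmu_entry_writer_ops {
		/* Issue the CFGI + sync so the HW observes the current entry state */
		void (*sync_entry)(struct arm_smmu_entry_writer *writer);
		/* Copy target values into every bit the HW is not currently using */
		void (*set_unused_bits)(__le64 *entry, const __le64 *target);
		/* Bitmap of qword indexes whose HW-used bits still differ from target */
		u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
	};

	struct arm_smmu_entry_writer {
		struct arm_smmu_entry_writer_ops ops;
		__le64 v_bit;			/* position of the V bit in qword 0 */
		unsigned int entry_length;	/* entry size in qwords */
	};

An STE or CD specific writer embeds struct arm_smmu_entry_writer and uses
container_of() in its callbacks to get back at the smmu/sid (STE) or
master/ssid (CD) needed for the sync.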

STE programming patch
---
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b120d83668...1e17bff37f 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,6 +971,174 @@
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

+struct arm_smmu_entry_writer;
+
+/**
+ * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
+ * @sync_entry: sync entry to the hardware after writing to it.
+ * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
+ *                   equal to the target's bits.
+ * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
+ *                               are incorrect compared to the target,
+ *                               considering only the used bits in the target.
+ *                               The set bits in the return value represents the
+ *                               indexes of those qwords.
+ */
+struct arm_smmu_entry_writer_ops {
+       void (*sync_entry)(struct arm_smmu_entry_writer *);
+       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
+       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
+};
+
+struct arm_smmu_entry_writer {
+       struct arm_smmu_entry_writer_ops ops;
+       __le64 v_bit;
+       unsigned int entry_length;
+};
+
+static void arm_smmu_entry_set_unused_bits(__le64 *entry, const __le64 *target,
+                                          const __le64 *entry_used,
+                                          unsigned int length)
+{
+       int i = 0;
+
+       for (i = 0; i < length; i++)
+               entry[i] = (entry[i] & entry_used[i]) |
+                          (target[i] & ~entry_used[i]);
+}
+
+static u8 arm_smmu_entry_used_qword_diff_indexes(__le64 *entry,
+                                                const __le64 *target,
+                                                const __le64 *target_used,
+                                                unsigned int length)
+{
+       u8 qword_diff_indexes = 0;
+       int i = 0;
+
+       for (i = 0; i < length; i++) {
+               if ((entry[i] & target_used[i]) != (target[i] & target_used[i]))
+                       qword_diff_indexes |= 1 << i;
+       }
+       return qword_diff_indexes;
+}
+
+/*
+ * Update the STE/CD to the target configuration. The transition from the
+ * current entry to the target entry takes place over multiple steps that
+ * attempt to make the transition hitless if possible. This function takes
+ * care not to create a situation where the HW can perceive a corrupted
+ * entry. HW is only required to have a 64 bit atomicity with stores from the
+ * CPU, while entries are many 64 bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an
+ * entry torn across two steps. During each step the HW can observe a torn
+ * entry that has any combination of the step's old/new 64 bit words. The
+ * algorithm objective is for the HW behavior to always be one of current
+ * behavior, V=0, or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification.
+ */
+static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
+                                __le64 *cur, const __le64 *target,
+                                __le64 *staging_entry)
+{
+       bool cleanup_sync_required = false;
+       u8 entry_qwords_used_diff = 0;
+       int i = 0;
+
+       entry_qwords_used_diff =
+               writer->ops.get_used_qword_diff_indexes(cur, target);
+       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
+               return;
+
+       if (hweight8(entry_qwords_used_diff) > 1) {
+               /*
+                * If transitioning to the target entry with a single qword
+                * write isn't possible, then we must first transition to an
+                * intermediate entry. The intermediate entry may either be an
+                * entry that melds bits of the target entry into the current
+                * entry without disrupting the hardware, or a breaking entry if
+                * a hitless transition to the target is impossible.
+                */
+
+               /*
+                * Compute a staging entry that has all the bits currently
+                * unused by HW set to their target values, such that committing
+                * it to the entry table wouldn't disrupt the hardware.
+                */
+               memcpy(staging_entry, cur, writer->entry_length);
+               writer->ops.set_unused_bits(staging_entry, target);
+
+               entry_qwords_used_diff =
+                       writer->ops.get_used_qword_diff_indexes(staging_entry,
+                                                               target);
+               if (hweight8(entry_qwords_used_diff) > 1) {
+                       /*
+                        * More than 1 qword is mismatched between the staging
+                        * and target entry. A hitless transition to the target
+                        * entry is not possible. Set the staging entry to be
+                        * equal to the target entry, apart from the V bit's
+                        * qword. As long as the V bit is cleared first then
+                        * writes to the subsequent qwords will not further
+                        * disrupt the hardware.
+                        */
+                       memcpy(staging_entry, target, writer->entry_length);
+                       staging_entry[0] &= ~writer->v_bit;
+                       /*
+                        * After committing the staging entry, only the 0th qword
+                        * will differ from the target.
+                        */
+                       entry_qwords_used_diff = 1;
+               }
+
+               /*
+                * Commit the staging entry. Note that the iteration order
+                * matters, as we may be committing a breaking entry in the
+                * non-hitless case. The 0th qword which holds the valid bit
+                * must be written first in that case.
+                */
+               for (i = 0; i != writer->entry_length; i++)
+                       WRITE_ONCE(cur[i], staging_entry[i]);
+               writer->ops.sync_entry(writer);
+       }
+
+       /*
+        * It's now possible to switch to the target configuration with a write
+        * to a single qword. Make that switch now.
+        */
+       i = ffs(entry_qwords_used_diff) - 1;
+       WRITE_ONCE(cur[i], target[i]);
+       writer->ops.sync_entry(writer);
+
+       /*
+        * Some of the bits set under the previous configuration but unused
+        * under the target configuration might still be set. Clear them as
+        * well. Technically this isn't necessary but it brings the entry to
+        * the full target state, so if there are bugs in the mask calculation
+        * this will obscure them.
+        */
+       for (i = 0; i != writer->entry_length; i++) {
+               if (cur[i] != target[i]) {
+                       WRITE_ONCE(cur[i], target[i]);
+                       cleanup_sync_required = true;
+               }
+       }
+       if (cleanup_sync_required)
+               writer->ops.sync_entry(writer);
+}
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
                             int ssid, bool leaf)
 {
@@ -1248,37 +1416,142 @@
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

+/*
+ * Based on the value of ent report which bits of the STE the HW will access.
+ * It would be nice if this was complete according to the spec, but minimally
+ * it has to capture the bits this driver uses.
+ */
+static void arm_smmu_get_ste_used(const __le64 *ent,
+                                 struct arm_smmu_ste *used_bits)
+{
+       memset(used_bits, 0, sizeof(*used_bits));
+
+       used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
+       if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
+               return;
+
+       /*
+        * If S1 is enabled S1DSS is valid, see 13.5 Summary of
+        * attribute/permission configuration fields for the SHCFG behavior.
+        */
+       if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
+           FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
+                   STRTAB_STE_1_S1DSS_BYPASS)
+               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+
+       used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+       switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
+       case STRTAB_STE_0_CFG_ABORT:
+               break;
+       case STRTAB_STE_0_CFG_BYPASS:
+               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+               break;
+       case STRTAB_STE_0_CFG_S1_TRANS:
+               used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+                                                 STRTAB_STE_0_S1CTXPTR_MASK |
+                                                 STRTAB_STE_0_S1CDMAX);
+               used_bits->data[1] |=
+                       cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+                                   STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+                                   STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+               break;
+       case STRTAB_STE_0_CFG_S2_TRANS:
+               used_bits->data[1] |=
+                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+               used_bits->data[2] |=
+                       cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+                                   STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+                                   STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+               used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+               break;
+
+       default:
+               memset(used_bits, 0xFF, sizeof(*used_bits));
+               WARN_ON(true);
+       }
+}
+
+struct arm_smmu_ste_writer {
+       struct arm_smmu_entry_writer writer;
+       struct arm_smmu_device *smmu;
+       u32 sid;
+};
+
+static void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
+{
+       struct arm_smmu_ste entry_used;
+       arm_smmu_get_ste_used(entry, &entry_used);
+
+       arm_smmu_entry_set_unused_bits(entry, target, entry_used.data,
+                                      ARRAY_SIZE(entry_used.data));
+}
+
+static u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
+                                              const __le64 *target)
+{
+       struct arm_smmu_ste target_used;
+
+       arm_smmu_get_ste_used(target, &target_used);
+       return arm_smmu_entry_used_qword_diff_indexes(
+               cur, target, target_used.data, ARRAY_SIZE(target_used.data));
+}
+
+static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
+{
+       struct arm_smmu_ste_writer *ste_writer =
+               container_of(writer, struct arm_smmu_ste_writer, writer);
+
+       arm_smmu_sync_ste_for_sid(ste_writer->smmu, ste_writer->sid);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
+       .sync_entry = arm_smmu_ste_writer_sync_entry,
+       .set_unused_bits = arm_smmu_ste_set_unused_bits,
+       .get_used_qword_diff_indexes = arm_smmu_ste_used_qword_diff_indexes,
+};
+
+static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
+                              struct arm_smmu_ste *ste,
+                              const struct arm_smmu_ste *target)
+{
+       struct arm_smmu_ste preallocated_staging_ste = {0};
+       struct arm_smmu_ste_writer ste_writer = {
+               .writer = {
+                       .ops = arm_smmu_ste_writer_ops,
+                       .v_bit = cpu_to_le64(STRTAB_STE_0_V),
+                       .entry_length = ARRAY_SIZE(ste->data),
+               },
+               .smmu = smmu,
+               .sid = sid,
+       };
+
+       arm_smmu_write_entry(&ste_writer.writer,
+                              ste->data,
+                              target->data,
+                              preallocated_staging_ste.data);
+
+       /* It's likely that we'll want to use the new STE soon */
+       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+               struct arm_smmu_cmdq_ent
+                       prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+                                        .prefetch = {
+                                                .sid = sid,
+                                        } };
+
+               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+       }
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
                                      struct arm_smmu_ste *dst)
 {
-       /*
-        * This is hideously complicated, but we only really care about
-        * three cases at the moment:
-        *
-        * 1. Invalid (all zero) -> bypass/fault (init)
-        * 2. Bypass/fault -> translation/bypass (attach)
-        * 3. Translation/bypass -> bypass/fault (detach)
-        *
-        * Given that we can't update the STE atomically and the SMMU
-        * doesn't read the thing in a defined order, that leaves us
-        * with the following maintenance requirements:
-        *
-        * 1. Update Config, return (init time STEs aren't live)
-        * 2. Write everything apart from dword 0, sync, write dword 0, sync
-        * 3. Update Config, sync
-        */
-       u64 val = le64_to_cpu(dst->data[0]);
-       bool ste_live = false;
+       u64 val;
        struct arm_smmu_device *smmu = master->smmu;
        struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
        struct arm_smmu_s2_cfg *s2_cfg = NULL;
        struct arm_smmu_domain *smmu_domain = master->domain;
-       struct arm_smmu_cmdq_ent prefetch_cmd = {
-               .opcode         = CMDQ_OP_PREFETCH_CFG,
-               .prefetch       = {
-                       .sid    = sid,
-               },
-       };
+       struct arm_smmu_ste target = {};

        if (smmu_domain) {
                switch (smmu_domain->stage) {
@@ -1293,22 +1566,6 @@
                }
        }

-       if (val & STRTAB_STE_0_V) {
-               switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-               case STRTAB_STE_0_CFG_BYPASS:
-                       break;
-               case STRTAB_STE_0_CFG_S1_TRANS:
-               case STRTAB_STE_0_CFG_S2_TRANS:
-                       ste_live = true;
-                       break;
-               case STRTAB_STE_0_CFG_ABORT:
-                       BUG_ON(!disable_bypass);
-                       break;
-               default:
-                       BUG(); /* STE corruption */
-               }
-       }
-
        /* Nuke the existing STE_0 value, as we're going to rewrite it */
        val = STRTAB_STE_0_V;

@@ -1319,16 +1576,11 @@
                else
                        val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);

-               dst->data[0] = cpu_to_le64(val);
-               dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+               target.data[0] = cpu_to_le64(val);
+               target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
                                                STRTAB_STE_1_SHCFG_INCOMING));
-               dst->data[2] = 0; /* Nuke the VMID */
-               /*
-                * The SMMU can perform negative caching, so we must sync
-                * the STE regardless of whether the old value was live.
-                */
-               if (smmu)
-                       arm_smmu_sync_ste_for_sid(smmu, sid);
+               target.data[2] = 0; /* Nuke the VMID */
+               arm_smmu_write_ste(smmu, sid, dst, &target);
                return;
        }

@@ -1336,8 +1588,7 @@
                u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
                        STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;

-               BUG_ON(ste_live);
-               dst->data[1] = cpu_to_le64(
+               target.data[1] = cpu_to_le64(
                         FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
                         FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
                         FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1597,7 @@

                if (smmu->features & ARM_SMMU_FEAT_STALLS &&
                    !master->stall_enabled)
-                       dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+                       target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);

                val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
                        FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1606,7 @@
        }

        if (s2_cfg) {
-               BUG_ON(ste_live);
-               dst->data[2] = cpu_to_le64(
+               target.data[2] = cpu_to_le64(
                         FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
                         FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1615,17 @@
                         STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
                         STRTAB_STE_2_S2R);

-               dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+               target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);

                val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
        }

        if (master->ats_enabled)
-               dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+               target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
                                                 STRTAB_STE_1_EATS_TRANS));

-       arm_smmu_sync_ste_for_sid(smmu, sid);
-       /* See comment in arm_smmu_write_ctx_desc() */
-       WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-       arm_smmu_sync_ste_for_sid(smmu, sid);
-
-       /* It's likely that we'll want to use the new STE soon */
-       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+       target.data[0] = cpu_to_le64(val);
+       arm_smmu_write_ste(smmu, sid, dst, &target);
 }

 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,





---
CD programming
---
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 55703a5d62...c849b26c43 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1219,6 +1219,86 @@
        return &l1_desc->l2ptr[idx];
 }

+static void arm_smmu_get_cd_used(const __le64 *ent,
+                                struct arm_smmu_cd *used_bits)
+{
+       memset(used_bits, 0, sizeof(*used_bits));
+
+       used_bits->data[0] = cpu_to_le64(CTXDESC_CD_0_V);
+       if (!(ent[0] & cpu_to_le64(CTXDESC_CD_0_V)))
+               return;
+       memset(used_bits, 0xFF, sizeof(*used_bits));
+
+       /* EPD0 means T0SZ/TG0/IR0/OR0/SH0/TTB0 are IGNORED */
+       if (ent[0] & cpu_to_le64(CTXDESC_CD_0_TCR_EPD0)) {
+               used_bits->data[0] &= ~cpu_to_le64(
+                       CTXDESC_CD_0_TCR_T0SZ | CTXDESC_CD_0_TCR_TG0 |
+                       CTXDESC_CD_0_TCR_IRGN0 | CTXDESC_CD_0_TCR_ORGN0 |
+                       CTXDESC_CD_0_TCR_SH0);
+               used_bits->data[1] &= ~cpu_to_le64(CTXDESC_CD_1_TTB0_MASK);
+       }
+}
+
+struct arm_smmu_cd_writer {
+       struct arm_smmu_entry_writer writer;
+       struct arm_smmu_master *master;
+       int ssid;
+};
+
+static void arm_smmu_cd_set_unused_bits(__le64 *entry, const __le64 *target)
+{
+       struct arm_smmu_cd entry_used;
+       arm_smmu_get_cd_used(entry, &entry_used);
+
+       arm_smmu_entry_set_unused_bits(entry, target, entry_used.data,
+                                      ARRAY_SIZE(entry_used.data));
+}
+
+static u8 arm_smmu_cd_used_qword_diff_indexes(__le64 *cur,
+                                              const __le64 *target)
+{
+       struct arm_smmu_cd target_used;
+
+       arm_smmu_get_cd_used(target, &target_used);
+       return arm_smmu_entry_used_qword_diff_indexes(
+               cur, target, target_used.data, ARRAY_SIZE(target_used.data));
+}
+
+static void arm_smmu_cd_writer_sync_entry(struct arm_smmu_entry_writer *writer)
+{
+       struct arm_smmu_cd_writer *cd_writer =
+               container_of(writer, struct arm_smmu_cd_writer, writer);
+
+       arm_smmu_sync_cd(cd_writer->master, cd_writer->ssid, true);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_cd_writer_ops = {
+       .sync_entry = arm_smmu_cd_writer_sync_entry,
+       .set_unused_bits = arm_smmu_cd_set_unused_bits,
+       .get_used_qword_diff_indexes = arm_smmu_cd_used_qword_diff_indexes,
+};
+
+static void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
+                                   struct arm_smmu_cd *cdptr,
+                                   const struct arm_smmu_cd *target)
+{
+       struct arm_smmu_cd preallocated_staging_cd = {0};
+       struct arm_smmu_cd_writer cd_writer = {
+               .writer = {
+                       .ops = arm_smmu_cd_writer_ops,
+                       .v_bit = cpu_to_le64(CTXDESC_CD_0_V),
+                       .entry_length = ARRAY_SIZE(cdptr->data),
+               },
+               .master = master,
+               .ssid = ssid,
+       };
+
+       arm_smmu_write_entry(&cd_writer.writer,
+                              cdptr->data,
+                              target->data,
+                              preallocated_staging_cd.data);
+}
+
 int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
                            struct arm_smmu_ctx_desc *cd)
 {
@@ -1235,16 +1315,19 @@
         */
        u64 val;
        bool cd_live;
-       struct arm_smmu_cd *cdptr;
+       struct arm_smmu_cd target;
+       struct arm_smmu_cd *cdptr = &target;
+       struct arm_smmu_cd *cd_table_entry;
        struct arm_smmu_ctx_desc_cfg *cd_table = &master->cd_table;

        if (WARN_ON(ssid >= (1 << cd_table->s1cdmax)))
                return -E2BIG;

-       cdptr = arm_smmu_get_cd_ptr(master, ssid);
-       if (!cdptr)
+       cd_table_entry = arm_smmu_get_cd_ptr(master, ssid);
+       if (!cd_table_entry)
                return -ENOMEM;

+       target = *cd_table_entry;
        val = le64_to_cpu(cdptr->data[0]);
        cd_live = !!(val & CTXDESC_CD_0_V);

@@ -1264,13 +1347,6 @@
                cdptr->data[2] = 0;
                cdptr->data[3] = cpu_to_le64(cd->mair);

-               /*
-                * STE may be live, and the SMMU might read dwords of this CD in any
-                * order. Ensure that it observes valid values before reading
-                * V=1.
-                */
-               arm_smmu_sync_cd(master, ssid, true);
-
                val = cd->tcr |
 #ifdef __BIG_ENDIAN
                        CTXDESC_CD_0_ENDI |
@@ -1284,18 +1360,8 @@
                if (cd_table->stall_enabled)
                        val |= CTXDESC_CD_0_S;
        }
-
-       /*
-        * The SMMU accesses 64-bit values atomically. See IHI0070Ca 3.21.3
-        * "Configuration structures and configuration invalidation completion"
-        *
-        *   The size of single-copy atomic reads made by the SMMU is
-        *   IMPLEMENTATION DEFINED but must be at least 64 bits. Any single
-        *   field within an aligned 64-bit span of a structure can be altered
-        *   without first making the structure invalid.
-        */
-       WRITE_ONCE(cdptr->data[0], cpu_to_le64(val));
-       arm_smmu_sync_cd(master, ssid, true);
+       cdptr->data[0] = cpu_to_le64(val);
+       arm_smmu_write_cd_entry(master, ssid, cd_table_entry, &target);
        return 0;
 }


^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-19 13:42                         ` Michael Shavit
@ 2023-12-25 12:17                           ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-25 12:17 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Dec 19, 2023 at 9:42 PM Michael Shavit <mshavit@google.com> wrote:
>
>
> +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> +                                __le64 *cur, const __le64 *target,
> +                                __le64 *staging_entry)
> +{
> +       bool cleanup_sync_required = false;
> +       u8 entry_qwords_used_diff = 0;
> +       int i = 0;
> +
> +       entry_qwords_used_diff =
> +               writer->ops.get_used_qword_diff_indexes(cur, target);
> +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> +               return;
> +
> +       if (hweight8(entry_qwords_used_diff) > 1) {
> +               /*
> +                * If transitioning to the target entry with a single qword
> +                * write isn't possible, then we must first transition to an
> +                * intermediate entry. The intermediate entry may either be an
> +                * entry that melds bits of the target entry into the current
> +                * entry without disrupting the hardware, or a breaking entry if
> +                * a hitless transition to the target is impossible.
> +                */
> +
> +               /*
> +                * Compute a staging entry that has all the bits currently
> +                * unused by HW set to their target values, such that comitting
> +                * it to the entry table woudn't disrupt the hardware.
> +                */
> +               memcpy(staging_entry, cur, writer->entry_length);

This should be `memcpy(staging_entry, cur, writer->entry_length *
sizeof(cur[0]));`

>
> +               writer->ops.set_unused_bits(staging_entry, target);
> +
> +               entry_qwords_used_diff =
> +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> +                                                               target);
> +               if (hweight8(entry_qwords_used_diff) > 1) {
> +                       /*
> +                        * More than 1 qword is mismatched between the staging
> +                        * and target entry. A hitless transition to the target
> +                        * entry is not possible. Set the staging entry to be
> +                        * equal to the target entry, apart from the V bit's
> +                        * qword. As long as the V bit is cleared first then
> +                        * writes to the subsequent qwords will not further
> +                        * disrupt the hardware.
> +                        */
> +                       memcpy(staging_entry, target, writer->entry_length);

 ditto, this should be `memcpy(staging_entry, target,
writer->entry_length * sizeof(target[0]));`
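
Putting the two fixes together, the corrected calls would read (a sketch only;
entry_length counts qwords, so it has to be scaled to bytes for memcpy()):

	/* stage the current entry: entry_length qwords of sizeof(__le64) each */
	memcpy(staging_entry, cur, writer->entry_length * sizeof(cur[0]));
	...
	/* non-hitless path: stage the full target before clearing V */
	memcpy(staging_entry, target, writer->entry_length * sizeof(target[0]));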

^ permalink raw reply	[flat|nested] 134+ messages in thread


* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-25 12:17                           ` Michael Shavit
@ 2023-12-25 12:58                             ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-25 12:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

Ok I'm probably getting ahead of myself here, but on an nth re-read to
sanity check things I realized that this is pretty well suited to
unit-testing. In fact that's how I caught the bug that this last email
fixed.

(Sorry for duplicate email Jason, first email was accidentally off-list).

---
From 8b71430fd55a40203d600b29da93b413af4349ee Mon Sep 17 00:00:00 2001
From: Michael Shavit <mshavit@google.com>
Date: Fri, 22 Dec 2023 16:54:12 +0800
Subject: [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry

Add tests for some of the more common STE update operations that we
expect to see, as well as some artificial STE updates to test the edges
of arm_smmu_write_entry. These also serve as a record of which common
operation is expected to be hitless, and how many syncs they require.

arm_smmu_write_entry implements a generic algorithm that updates an
STE/CD to any other arbitrary STE/CD configuration. The update requires
a sequence of write+sync operations, with some invariants that must be
held true after each sync. arm_smmu_write_entry lends itself well to
unit-testing since the function's interaction with the STE/CD is already
abstracted by input callbacks that we can hook to introspect into the
sequence of operations. We can use these hooks to guarantee that
invariants are held throughout the entire update operation.

Signed-off-by: Michael Shavit <mshavit@google.com>
---

 drivers/iommu/Kconfig                         |   9 +
 drivers/iommu/arm/arm-smmu-v3/Makefile        |   2 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  | 344 ++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  41 +--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  38 +-
 5 files changed, 400 insertions(+), 34 deletions(-)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 7673bb82945b6..e4c4071115c8e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -405,6 +405,15 @@ config ARM_SMMU_V3_SVA
          Say Y here if your system supports SVA extensions such as PCIe PASID
          and PRI.

+config ARM_SMMU_V3_KUNIT_TEST
+       tristate "KUnit tests for arm-smmu-v3 driver"  if !KUNIT_ALL_TESTS
+       depends on ARM_SMMU_V3 && KUNIT
+       default KUNIT_ALL_TESTS
+       help
+         Enable this option to unit-test arm-smmu-v3 driver functions.
+
+         If unsure, say N.
+
 config S390_IOMMU
        def_bool y if S390 && PCI
        depends on S390 && PCI
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 54feb1ecccad8..014a997753a8a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -3,3 +3,5 @@ obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
 arm_smmu_v3-objs-y += arm-smmu-v3.o
 arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
+
+obj-$(CONFIG_ARM_SMMU_V3_KUNIT_TEST) += arm-smmu-v3-test.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
new file mode 100644
index 0000000000000..2e59e157bf528
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -0,0 +1,344 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+
+#include "arm-smmu-v3.h"
+
+struct arm_smmu_test_writer {
+       struct arm_smmu_entry_writer writer;
+       struct kunit *test;
+       __le64 *(*get_used_bits)(struct arm_smmu_test_writer *test_writer,
+                                const __le64 *entry);
+
+       const __le64 *init_entry;
+       const __le64 *target_entry;
+       __le64 *entry;
+
+       bool invalid_entry_written;
+       int num_syncs;
+};
+
+static bool arm_smmu_entry_differs_in_used_bits(const __le64 *entry,
+                                               const __le64 *used_bits,
+                                               const __le64 *target,
+                                               unsigned int length)
+{
+       bool differs = false;
+       int i;
+
+       for (i = 0; i < length; i++) {
+               if ((entry[i] & used_bits[i]) != target[i])
+                       differs = true;
+       }
+       return differs;
+}
+
+static void
+arm_smmu_test_writer_record_syncs(struct arm_smmu_entry_writer *writer)
+{
+       struct arm_smmu_test_writer *test_writer =
+               container_of(writer, struct arm_smmu_test_writer, writer);
+       __le64 *entry_used_bits;
+
+       pr_debug("STE value is now set to: ");
+       print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8,
+                            test_writer->entry,
+                            writer->entry_length * sizeof(*test_writer->entry),
+                            false);
+
+       test_writer->num_syncs += 1;
+       if (!(test_writer->entry[0] & writer->v_bit))
+               test_writer->invalid_entry_written = true;
+       else {
+               /*
+                * At any stage in a hitless transition, the entry must be
+                * equivalent to either the initial entry or the target entry
+                * when only considering the bits used by the current
+                * configuration.
+                */
+               entry_used_bits = test_writer->get_used_bits(
+                       test_writer, test_writer->entry);
+               KUNIT_EXPECT_FALSE(test_writer->test,
+                                  arm_smmu_entry_differs_in_used_bits(
+                                          test_writer->entry, entry_used_bits,
+                                          test_writer->init_entry,
+                                          writer->entry_length) &&
+                                          arm_smmu_entry_differs_in_used_bits(
+                                                  test_writer->entry,
+                                                  entry_used_bits,
+                                                  test_writer->target_entry,
+                                                  writer->entry_length));
+       }
+}
+
+static __le64 *
+arm_smmu_test_ste_writer_get_used_bits(struct arm_smmu_test_writer *test_writer,
+                                      const __le64 *entry)
+{
+       struct arm_smmu_ste *used_bits = kunit_kzalloc(
+               test_writer->test, sizeof(*used_bits), GFP_KERNEL);
+
+       arm_smmu_get_ste_used(entry, used_bits);
+       return used_bits->data;
+}
+
+static void
+arm_smmu_v3_test_ste_debug_print_used_bits(const struct arm_smmu_ste *ste)
+{
+       struct arm_smmu_ste used_bits = { 0 };
+
+       arm_smmu_get_ste_used(ste->data, &used_bits);
+       pr_debug("STE used bits: ");
+       print_hex_dump_debug(
+               "    ", DUMP_PREFIX_NONE, 16, 8, used_bits.data,
+               ARRAY_SIZE(used_bits.data) * sizeof(*used_bits.data), false);
+}
+
+static void arm_smmu_v3_test_ste_expect_transition(
+       struct kunit *test, const struct arm_smmu_ste *cur,
+       const struct arm_smmu_ste *target, int num_syncs_expected, bool hitless)
+{
+       struct arm_smmu_ste cur_copy;
+       struct arm_smmu_ste preallocated_staging_ste = { 0 };
+       struct arm_smmu_entry_writer_ops arm_smmu_test_writer_ops = {
+               .sync_entry = arm_smmu_test_writer_record_syncs,
+               .set_unused_bits = arm_smmu_ste_set_unused_bits,
+               .get_used_qword_diff_indexes =
+                       arm_smmu_ste_used_qword_diff_indexes,
+       };
+       struct arm_smmu_test_writer test_writer = {
+               .writer = {
+                       .ops = arm_smmu_test_writer_ops,
+                       .v_bit = cpu_to_le64(STRTAB_STE_0_V),
+                       .entry_length = ARRAY_SIZE(cur_copy.data),
+               },
+               .get_used_bits = arm_smmu_test_ste_writer_get_used_bits,
+               .test = test,
+               .init_entry = cur->data,
+               .target_entry = target->data,
+               .entry = cur_copy.data,
+               .num_syncs = 0,
+               .invalid_entry_written = false,
+
+       };
+       memcpy(&cur_copy, cur, sizeof(cur_copy));
+
+       pr_debug("STE initial value: ");
+       print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, cur_copy.data,
+                            ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+                            false);
+       arm_smmu_v3_test_ste_debug_print_used_bits(cur);
+       pr_debug("STE target value: ");
+       print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, target->data,
+                            ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+                            false);
+       arm_smmu_v3_test_ste_debug_print_used_bits(target);
+
+       arm_smmu_write_entry(&test_writer.writer, cur_copy.data, target->data,
+                            preallocated_staging_ste.data);
+
+       KUNIT_EXPECT_EQ(test, test_writer.invalid_entry_written, !hitless);
+       KUNIT_EXPECT_EQ(test, test_writer.num_syncs, num_syncs_expected);
+       KUNIT_EXPECT_MEMEQ(test, target->data, cur_copy.data,
+                          ARRAY_SIZE(cur_copy.data));
+}
+
+static void arm_smmu_v3_test_ste_expect_non_hitless_transition(
+       struct kunit *test, const struct arm_smmu_ste *cur,
+       const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+       arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+                                              num_syncs_expected, false);
+}
+
+static void arm_smmu_v3_test_ste_expect_hitless_transition(
+       struct kunit *test, const struct arm_smmu_ste *cur,
+       const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+       arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+                                              num_syncs_expected, true);
+}
+
+static const dma_addr_t fake_cdtab_dma_addr = 0xF0F0F0F0F0F0;
+
+static void arm_smmu_test_make_cdtable_ste(struct arm_smmu_ste *ste,
+                                          unsigned int s1dss,
+                                          const dma_addr_t dma_addr)
+{
+       struct arm_smmu_master master;
+       struct arm_smmu_ctx_desc_cfg cd_table;
+       struct arm_smmu_device smmu;
+
+       cd_table.cdtab_dma = dma_addr;
+       cd_table.s1cdmax = 0xFF;
+       cd_table.s1fmt = STRTAB_STE_0_S1FMT_64K_L2;
+       smmu.features = ARM_SMMU_FEAT_STALLS;
+       master.smmu = &smmu;
+
+       arm_smmu_make_cdtable_ste(ste, &master, &cd_table, true, s1dss);
+}
+
+struct arm_smmu_ste bypass_ste;
+struct arm_smmu_ste abort_ste;
+
+static int arm_smmu_v3_test_suite_init(struct kunit_suite *test)
+{
+       arm_smmu_make_bypass_ste(&bypass_ste);
+       arm_smmu_make_abort_ste(&abort_ste);
+
+       return 0;
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_abort(struct kunit *test)
+{
+       /*
+        * Bypass STEs have used bits in the first two qwords, while abort STEs
+        * only have used bits in the first qword. Transitioning from bypass to
+        * abort requires two syncs: the first to set the first qword and make
+        * the STE into an abort, the second to clean up the second qword.
+        */
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &bypass_ste, &abort_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_bypass(struct kunit *test)
+{
+       /*
+        * Transitioning from abort to bypass also requires two syncs: the first
+        * to set the second qword data required by the bypass STE, and the
+        * second to set the first qword and switch to bypass.
+        */
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &abort_ste, &bypass_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_abort(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &ste, &abort_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_cdtable(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &abort_ste, &ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_bypass(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &ste, &bypass_ste,
+               /*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_cdtable(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &bypass_ste, &ste,
+               /*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_s1dss_change(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+       struct arm_smmu_ste s1dss_bypass;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+                                      fake_cdtab_dma_addr);
+
+       /*
+        * Flipping s1dss on a CD table STE only involves changes to the second
+        * qword of an STE and can be done in a single write.
+        */
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &ste, &s1dss_bypass,
+               /*num_syncs_expected=*/1);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &s1dss_bypass, &ste,
+               /*num_syncs_expected=*/1);
+}
+
+static void
+arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass(struct kunit *test)
+{
+       struct arm_smmu_ste s1dss_bypass;
+
+       arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &s1dss_bypass, &bypass_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void
+arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass(struct kunit *test)
+{
+       struct arm_smmu_ste s1dss_bypass;
+
+       arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &bypass_ste, &s1dss_bypass,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_non_hitless(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+       struct arm_smmu_ste ste_2;
+
+       /*
+        * Although no flow resembles this in practice, one way to force an STE
+        * update to be non-hitless is to change its CD table pointer as well as
+        * s1 dss field in the same update.
+        * its S1DSS field in the same update.
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_test_make_cdtable_ste(&ste_2, STRTAB_STE_1_S1DSS_BYPASS,
+                                      0x4B4B4b4B4B);
+       arm_smmu_v3_test_ste_expect_non_hitless_transition(
+               test, &ste, &ste_2,
+               /*num_syncs_expected=*/2);
+}
+
+static struct kunit_case arm_smmu_v3_test_cases[] = {
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_abort),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_bypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_abort),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_cdtable),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_bypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_cdtable),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_s1dss_change),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_non_hitless),
+       {},
+};
+
+static struct kunit_suite arm_smmu_v3_test_module = {
+       .name = "arm-smmu-v3-kunit-test",
+       .suite_init = arm_smmu_v3_test_suite_init,
+       .test_cases = arm_smmu_v3_test_cases,
+};
+kunit_test_suites(&arm_smmu_v3_test_module);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0accd00ed1918..ec15a8c6a0f65 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -967,31 +967,6 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

-struct arm_smmu_entry_writer;
-
-/**
- * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
- * @sync_entry: sync entry to the hardware after writing to it.
- * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
- *                   equal to the target's bits.
- * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
- *                               are incorrect compared to the target,
- *                               considering only the used bits in the target.
- *                               The set bits in the return value represents the
- *                               indexes of those qwords.
- */
-struct arm_smmu_entry_writer_ops {
-       void (*sync_entry)(struct arm_smmu_entry_writer *);
-       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
-       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
-};
-
-struct arm_smmu_entry_writer {
-       struct arm_smmu_entry_writer_ops ops;
-       __le64 v_bit;
-       unsigned int entry_length;
-};
-
 static void arm_smmu_entry_set_unused_bits(__le64 *entry, const __le64 *target,
                                           const __le64 *entry_used,
                                           unsigned int length)
@@ -1046,7 +1021,7 @@ static u8 arm_smmu_entry_used_qword_diff_indexes(__le64 *entry,
  * V=0 process. This relies on the IGNORED behavior described in the
  * specification
  */
-static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
+void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
                                 __le64 *cur, const __le64 *target,
                                 __le64 *staging_entry)
 {
@@ -1466,8 +1441,8 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
  * would be nice if this was complete according to the spec, but minimally it
  * has to capture the bits this driver uses.
  */
-static void arm_smmu_get_ste_used(const __le64 *ent,
-                                 struct arm_smmu_ste *used_bits)
+void arm_smmu_get_ste_used(const __le64 *ent,
+                         struct arm_smmu_ste *used_bits)
 {
        memset(used_bits, 0, sizeof(*used_bits));

@@ -1523,7 +1498,7 @@ struct arm_smmu_ste_writer {
        u32 sid;
 };

-static void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
+void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
 {
        struct arm_smmu_ste entry_used;
        arm_smmu_get_ste_used(entry, &entry_used);
@@ -1532,7 +1507,7 @@ static void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
                                       ARRAY_SIZE(entry_used.data));
 }

-static u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
+u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
                                               const __le64 *target)
 {
        struct arm_smmu_ste target_used;
@@ -1588,7 +1563,7 @@ static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
        }
 }

-static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
 {
        memset(target, 0, sizeof(*target));
        target->data[0] = cpu_to_le64(
@@ -1596,7 +1571,7 @@ static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
                FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
 }

-static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 {
        memset(target, 0, sizeof(*target));
        target->data[0] = cpu_to_le64(
@@ -1606,7 +1581,7 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
                FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
 }

-static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
                                      struct arm_smmu_master *master,
                                      struct arm_smmu_ctx_desc_cfg *cd_table,
                                      bool ats_enabled, unsigned int s1dss)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 91b23437f4105..9789f18a04a59 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -752,6 +752,43 @@ struct arm_smmu_master_domain {
        u16 ssid;
 };

+struct arm_smmu_entry_writer;
+
+/**
+ * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
+ * @sync_entry: sync entry to the hardware after writing to it.
+ * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
+ *                   equal to the target's bits.
+ * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
+ *                               are incorrect compared to the target,
+ *                               considering only the used bits in the target.
+ *                               The set bits in the return value represents the
+ *                               indexes of those qwords.
+ */
+struct arm_smmu_entry_writer_ops {
+       void (*sync_entry)(struct arm_smmu_entry_writer *writer);
+       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
+       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
+};
+
+struct arm_smmu_entry_writer {
+       struct arm_smmu_entry_writer_ops ops;
+       __le64 v_bit;
+       unsigned int entry_length;
+};
+
+void arm_smmu_get_ste_used(const __le64 *ent, struct arm_smmu_ste *used_bits);
+void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target);
+u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur, const __le64 *target);
+void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer, __le64 *cur,
+                         const __le64 *target, __le64 *staging_entry);
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+                                     struct arm_smmu_master *master,
+                                     struct arm_smmu_ctx_desc_cfg *cd_table,
+                                     bool ats_enabled, unsigned int s1dss);
+
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
        return container_of(dom, struct arm_smmu_domain, domain);
@@ -783,7 +820,6 @@ void arm_smmu_make_s1_cd(struct arm_smmu_cd *target,
 void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
                             struct arm_smmu_cd *cdptr,
                             const struct arm_smmu_cd *target);
-
 int arm_smmu_set_pasid(struct arm_smmu_master *master,
                       struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
                       struct arm_smmu_cd *cd);

base-commit: 5c93358344002b351615b6f8c8c526a7ae83f72d
--
2.43.0.472.g3155946c3a-goog

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
@ 2023-12-25 12:58                             ` Michael Shavit
  0 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2023-12-25 12:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

Ok I'm probably getting ahead of myself here, but on an nth re-read to
sanity check things I realized that this is pretty well suited to
unit-testing. In fact that's how I caught the bug that this last email
fixed.

(Sorry for duplicate email Jason, first email was accidentally off-list).

---
From 8b71430fd55a40203d600b29da93b413af4349ee Mon Sep 17 00:00:00 2001
From: Michael Shavit <mshavit@google.com>
Date: Fri, 22 Dec 2023 16:54:12 +0800
Subject: [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry

Add tests for some of the more common STE update operations that we
expect to see, as well as some artificial STE updates to test the edges
of arm_smmu_write_entry. These also serve as a record of which common
operations are expected to be hitless, and how many syncs they require.

arm_smmu_write_entry implements a generic algorithm that updates an
STE/CD to any other arbitrary STE/CD configuration. The update requires
a sequence of write+sync operations, with some invariants that must be
held true after each sync. arm_smmu_write_entry lends itself well to
unit-testing since the function's interaction with the STE/CD is already
abstracted by input callbacks that we can hook to introspect into the
sequence of operations. We can use these hooks to guarantee that
invariants are held throughout the entire update operation.

Signed-off-by: Michael Shavit <mshavit@google.com>
---

 drivers/iommu/Kconfig                         |   9 +
 drivers/iommu/arm/arm-smmu-v3/Makefile        |   2 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  | 344 ++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  41 +--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  38 +-
 5 files changed, 400 insertions(+), 34 deletions(-)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 7673bb82945b6..e4c4071115c8e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -405,6 +405,15 @@ config ARM_SMMU_V3_SVA
          Say Y here if your system supports SVA extensions such as PCIe PASID
          and PRI.

+config ARM_SMMU_V3_KUNIT_TEST
+       tristate "KUnit tests for arm-smmu-v3 driver"  if !KUNIT_ALL_TESTS
+       depends on ARM_SMMU_V3 && KUNIT
+       default KUNIT_ALL_TESTS
+       help
+         Enable this option to unit-test arm-smmu-v3 driver functions.
+
+         If unsure, say N.
+
 config S390_IOMMU
        def_bool y if S390 && PCI
        depends on S390 && PCI
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile
b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 54feb1ecccad8..014a997753a8a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -3,3 +3,5 @@ obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
 arm_smmu_v3-objs-y += arm-smmu-v3.o
 arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
+
+obj-$(CONFIG_ARM_SMMU_V3_KUNIT_TEST) += arm-smmu-v3-test.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
new file mode 100644
index 0000000000000..2e59e157bf528
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -0,0 +1,344 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+
+#include "arm-smmu-v3.h"
+
+struct arm_smmu_test_writer {
+       struct arm_smmu_entry_writer writer;
+       struct kunit *test;
+       __le64 *(*get_used_bits)(struct arm_smmu_test_writer *test_writer,
+                                const __le64 *entry);
+
+       const __le64 *init_entry;
+       const __le64 *target_entry;
+       __le64 *entry;
+
+       bool invalid_entry_written;
+       int num_syncs;
+};
+
+static bool arm_smmu_entry_differs_in_used_bits(const __le64 *entry,
+                                               const __le64 *used_bits,
+                                               const __le64 *target,
+                                               unsigned int length)
+{
+       bool differs = false;
+       int i;
+
+       for (i = 0; i < length; i++) {
+               if ((entry[i] & used_bits[i]) != target[i])
+                       differs = true;
+       }
+       return differs;
+}
+
+static void
+arm_smmu_test_writer_record_syncs(struct arm_smmu_entry_writer *writer)
+{
+       struct arm_smmu_test_writer *test_writer =
+               container_of(writer, struct arm_smmu_test_writer, writer);
+       __le64 *entry_used_bits;
+
+       pr_debug("STE value is now set to: ");
+       print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8,
+                            test_writer->entry,
+                            writer->entry_length * sizeof(*test_writer->entry),
+                            false);
+
+       test_writer->num_syncs += 1;
+       if (!(test_writer->entry[0] & writer->v_bit))
+               test_writer->invalid_entry_written = true;
+       else {
+               /*
+                * At any stage in a hitless transition, the entry must be
+                * equivalent to either the initial entry or the target entry
+                * when only considering the bits used by the current
+                * configuration.
+                */
+               entry_used_bits = test_writer->get_used_bits(
+                       test_writer, test_writer->entry);
+               KUNIT_EXPECT_FALSE(test_writer->test,
+                                  arm_smmu_entry_differs_in_used_bits(
+                                          test_writer->entry, entry_used_bits,
+                                          test_writer->init_entry,
+                                          writer->entry_length) &&
+                                          arm_smmu_entry_differs_in_used_bits(
+                                                  test_writer->entry,
+                                                  entry_used_bits,
+                                                  test_writer->target_entry,
+                                                  writer->entry_length));
+       }
+}
+
+static __le64 *
+arm_smmu_test_ste_writer_get_used_bits(struct arm_smmu_test_writer
*test_writer,
+                                      const __le64 *entry)
+{
+       struct arm_smmu_ste *used_bits = kunit_kzalloc(
+               test_writer->test, sizeof(*used_bits), GFP_KERNEL);
+
+       arm_smmu_get_ste_used(entry, used_bits);
+       return used_bits->data;
+}
+
+static void
+arm_smmu_v3_test_ste_debug_print_used_bits(const struct arm_smmu_ste *ste)
+{
+       struct arm_smmu_ste used_bits = { 0 };
+
+       arm_smmu_get_ste_used(ste->data, &used_bits);
+       pr_debug("STE used bits: ");
+       print_hex_dump_debug(
+               "    ", DUMP_PREFIX_NONE, 16, 8, used_bits.data,
+               ARRAY_SIZE(used_bits.data) * sizeof(*used_bits.data), false);
+}
+
+static void arm_smmu_v3_test_ste_expect_transition(
+       struct kunit *test, const struct arm_smmu_ste *cur,
+       const struct arm_smmu_ste *target, int num_syncs_expected, bool hitless)
+{
+       struct arm_smmu_ste cur_copy;
+       struct arm_smmu_ste preallocated_staging_ste = { 0 };
+       struct arm_smmu_entry_writer_ops arm_smmu_test_writer_ops = {
+               .sync_entry = arm_smmu_test_writer_record_syncs,
+               .set_unused_bits = arm_smmu_ste_set_unused_bits,
+               .get_used_qword_diff_indexes =
+                       arm_smmu_ste_used_qword_diff_indexes,
+       };
+       struct arm_smmu_test_writer test_writer = {
+               .writer = {
+                       .ops = arm_smmu_test_writer_ops,
+                       .v_bit = cpu_to_le64(STRTAB_STE_0_V),
+                       .entry_length = ARRAY_SIZE(cur_copy.data),
+               },
+               .get_used_bits = arm_smmu_test_ste_writer_get_used_bits,
+               .test = test,
+               .init_entry = cur->data,
+               .target_entry = target->data,
+               .entry = cur_copy.data,
+               .num_syncs = 0,
+               .invalid_entry_written = false,
+
+       };
+       memcpy(&cur_copy, cur, sizeof(cur_copy));
+
+       pr_debug("STE initial value: ");
+       print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, cur_copy.data,
+                            ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+                            false);
+       arm_smmu_v3_test_ste_debug_print_used_bits(cur);
+       pr_debug("STE target value: ");
+       print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, target->data,
+                            ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+                            false);
+       arm_smmu_v3_test_ste_debug_print_used_bits(target);
+
+       arm_smmu_write_entry(&test_writer.writer, cur_copy.data, target->data,
+                            preallocated_staging_ste.data);
+
+       KUNIT_EXPECT_EQ(test, test_writer.invalid_entry_written, !hitless);
+       KUNIT_EXPECT_EQ(test, test_writer.num_syncs, num_syncs_expected);
+       KUNIT_EXPECT_MEMEQ(test, target->data, cur_copy.data,
+                          sizeof(cur_copy.data));
+}
+
+static void arm_smmu_v3_test_ste_expect_non_hitless_transition(
+       struct kunit *test, const struct arm_smmu_ste *cur,
+       const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+       arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+                                              num_syncs_expected, false);
+}
+
+static void arm_smmu_v3_test_ste_expect_hitless_transition(
+       struct kunit *test, const struct arm_smmu_ste *cur,
+       const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+       arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+                                              num_syncs_expected, true);
+}
+
+static const dma_addr_t fake_cdtab_dma_addr = 0xF0F0F0F0F0F0;
+
+static void arm_smmu_test_make_cdtable_ste(struct arm_smmu_ste *ste,
+                                          unsigned int s1dss,
+                                          const dma_addr_t dma_addr)
+{
+       /* Zero-init so the STE builder never reads stack garbage. */
+       struct arm_smmu_master master = {};
+       struct arm_smmu_ctx_desc_cfg cd_table = {};
+       struct arm_smmu_device smmu = {};
+
+       cd_table.cdtab_dma = dma_addr;
+       cd_table.s1cdmax = 0xFF;
+       cd_table.s1fmt = STRTAB_STE_0_S1FMT_64K_L2;
+       smmu.features = ARM_SMMU_FEAT_STALLS;
+       master.smmu = &smmu;
+
+       arm_smmu_make_cdtable_ste(ste, &master, &cd_table, true, s1dss);
+}
+
+static struct arm_smmu_ste bypass_ste;
+static struct arm_smmu_ste abort_ste;
+
+static int arm_smmu_v3_test_suite_init(struct kunit_suite *test)
+{
+       arm_smmu_make_bypass_ste(&bypass_ste);
+       arm_smmu_make_abort_ste(&abort_ste);
+
+       return 0;
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_abort(struct kunit *test)
+{
+       /*
+        * Bypass STEs have used bits in the first two qwords, while abort STEs
+        * only have used bits in the first qword. Transitioning from bypass to
+        * abort requires two syncs: the first to set the first qword and make
+        * the STE into an abort, the second to clean up the second qword.
+        */
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &bypass_ste, &abort_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_bypass(struct kunit *test)
+{
+       /*
+        * Transitioning from abort to bypass also requires two syncs: the first
+        * to set the second qword data required by the bypass STE, and the
+        * second to set the first qword and switch to bypass.
+        */
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &abort_ste, &bypass_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_abort(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &ste, &abort_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_cdtable(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &abort_ste, &ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_bypass(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &ste, &bypass_ste,
+               /*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_cdtable(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &bypass_ste, &ste,
+               /*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_s1dss_change(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+       struct arm_smmu_ste s1dss_bypass;
+
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+                                      fake_cdtab_dma_addr);
+
+       /*
+        * Flipping s1dss on a CD table STE only involves changes to the second
+        * qword of an STE and can be done in a single write.
+        */
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &ste, &s1dss_bypass,
+               /*num_syncs_expected=*/1);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &s1dss_bypass, &ste,
+               /*num_syncs_expected=*/1);
+}
+
+static void
+arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass(struct kunit *test)
+{
+       struct arm_smmu_ste s1dss_bypass;
+
+       arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &s1dss_bypass, &bypass_ste,
+               /*num_syncs_expected=*/2);
+}
+
+static void
+arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass(struct kunit *test)
+{
+       struct arm_smmu_ste s1dss_bypass;
+
+       arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_v3_test_ste_expect_hitless_transition(
+               test, &bypass_ste, &s1dss_bypass,
+               /*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_non_hitless(struct kunit *test)
+{
+       struct arm_smmu_ste ste;
+       struct arm_smmu_ste ste_2;
+
+       /*
+        * Although no flow resembles this in practice, one way to force an STE
+        * update to be non-hitless is to change its CD table pointer as well as
+        * its S1DSS field in the same update.
+        */
+       arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+                                      fake_cdtab_dma_addr);
+       arm_smmu_test_make_cdtable_ste(&ste_2, STRTAB_STE_1_S1DSS_BYPASS,
+                                      0x4B4B4b4B4B);
+       arm_smmu_v3_test_ste_expect_non_hitless_transition(
+               test, &ste, &ste_2,
+               /*num_syncs_expected=*/2);
+}
+
+static struct kunit_case arm_smmu_v3_test_cases[] = {
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_abort),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_bypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_abort),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_cdtable),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_bypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_cdtable),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_s1dss_change),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass),
+       KUNIT_CASE(arm_smmu_v3_write_ste_test_non_hitless),
+       {},
+};
+
+static struct kunit_suite arm_smmu_v3_test_module = {
+       .name = "arm-smmu-v3-kunit-test",
+       .suite_init = arm_smmu_v3_test_suite_init,
+       .test_cases = arm_smmu_v3_test_cases,
+};
+kunit_test_suites(&arm_smmu_v3_test_module);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0accd00ed1918..ec15a8c6a0f65 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -967,31 +967,6 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device
*smmu, u16 asid)
        arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }

-struct arm_smmu_entry_writer;
-
-/**
- * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
- * @sync_entry: sync entry to the hardware after writing to it.
- * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
- *                   equal to the target's bits.
- * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
- *                               are incorrect compared to the target,
- *                               considering only the used bits in the target.
- *                               The set bits in the return value
represents the
- *                               indexes of those qwords.
- */
-struct arm_smmu_entry_writer_ops {
-       void (*sync_entry)(struct arm_smmu_entry_writer *);
-       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
-       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
-};
-
-struct arm_smmu_entry_writer {
-       struct arm_smmu_entry_writer_ops ops;
-       __le64 v_bit;
-       unsigned int entry_length;
-};
-
 static void arm_smmu_entry_set_unused_bits(__le64 *entry, const __le64 *target,
                                           const __le64 *entry_used,
                                           unsigned int length)
@@ -1046,7 +1021,7 @@ static u8
arm_smmu_entry_used_qword_diff_indexes(__le64 *entry,
  * V=0 process. This relies on the IGNORED behavior described in the
  * specification
  */
-static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
+void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
                                 __le64 *cur, const __le64 *target,
                                 __le64 *staging_entry)
 {
@@ -1466,8 +1441,8 @@ static void arm_smmu_sync_ste_for_sid(struct
arm_smmu_device *smmu, u32 sid)
  * would be nice if this was complete according to the spec, but minimally it
  * has to capture the bits this driver uses.
  */
-static void arm_smmu_get_ste_used(const __le64 *ent,
-                                 struct arm_smmu_ste *used_bits)
+void arm_smmu_get_ste_used(const __le64 *ent,
+                         struct arm_smmu_ste *used_bits)
 {
        memset(used_bits, 0, sizeof(*used_bits));

@@ -1523,7 +1498,7 @@ struct arm_smmu_ste_writer {
        u32 sid;
 };

-static void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
+void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
 {
        struct arm_smmu_ste entry_used;
        arm_smmu_get_ste_used(entry, &entry_used);
@@ -1532,7 +1507,7 @@ static void arm_smmu_ste_set_unused_bits(__le64
*entry, const __le64 *target)
                                       ARRAY_SIZE(entry_used.data));
 }

-static u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
+u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
                                               const __le64 *target)
 {
        struct arm_smmu_ste target_used;
@@ -1588,7 +1563,7 @@ static void arm_smmu_write_ste(struct
arm_smmu_device *smmu, u32 sid,
        }
 }

-static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
 {
        memset(target, 0, sizeof(*target));
        target->data[0] = cpu_to_le64(
@@ -1596,7 +1571,7 @@ static void arm_smmu_make_abort_ste(struct
arm_smmu_ste *target)
                FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
 }

-static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 {
        memset(target, 0, sizeof(*target));
        target->data[0] = cpu_to_le64(
@@ -1606,7 +1581,7 @@ static void arm_smmu_make_bypass_ste(struct
arm_smmu_ste *target)
                FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
 }

-static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
                                      struct arm_smmu_master *master,
                                      struct arm_smmu_ctx_desc_cfg *cd_table,
                                      bool ats_enabled, unsigned int s1dss)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 91b23437f4105..9789f18a04a59 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -752,6 +752,43 @@ struct arm_smmu_master_domain {
        u16 ssid;
 };

+struct arm_smmu_entry_writer;
+
+/**
+ * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
+ * @sync_entry: sync entry to the hardware after writing to it.
+ * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
+ *                   equal to the target's bits.
+ * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
+ *                               are incorrect compared to the target,
+ *                               considering only the used bits in the target.
+ *                               The set bits in the return value
represents the
+ *                               indexes of those qwords.
+ */
+struct arm_smmu_entry_writer_ops {
+       void (*sync_entry)(struct arm_smmu_entry_writer *writer);
+       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
+       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
+};
+
+struct arm_smmu_entry_writer {
+       struct arm_smmu_entry_writer_ops ops;
+       __le64 v_bit;
+       unsigned int entry_length;
+};
+
+void arm_smmu_get_ste_used(const __le64 *ent, struct arm_smmu_ste *used_bits);
+void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target);
+u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur, const __le64 *target);
+void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer, __le64 *cur,
+                         const __le64 *target, __le64 *staging_entry);
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+                                     struct arm_smmu_master *master,
+                                     struct arm_smmu_ctx_desc_cfg *cd_table,
+                                     bool ats_enabled, unsigned int s1dss);
+
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
        return container_of(dom, struct arm_smmu_domain, domain);
@@ -783,7 +820,6 @@ void arm_smmu_make_s1_cd(struct arm_smmu_cd *target,
 void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
                             struct arm_smmu_cd *cdptr,
                             const struct arm_smmu_cd *target);
-
 int arm_smmu_set_pasid(struct arm_smmu_master *master,
                       struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
                       struct arm_smmu_cd *cd);

base-commit: 5c93358344002b351615b6f8c8c526a7ae83f72d
--
2.43.0.472.g3155946c3a-goog

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-25 12:58                             ` Michael Shavit
@ 2023-12-27 15:33                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-12-27 15:33 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Mon, Dec 25, 2023 at 08:58:06PM +0800, Michael Shavit wrote:
> Ok I'm probably getting ahead of myself here, but on an nth re-read to
> sanity check things I realized that this is pretty well suited to
> unit-testing. In fact that's how I caught the bug that this last email
> fixed.

Yeah, this is great! I was thinking of building it too! Especially I
think it helps with the concern over complexity if the whole thing is
covered by a unit test. That is far more maintainable than most kernel
code.

I'll look at this in detail next week

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-19 13:42                         ` Michael Shavit
@ 2023-12-27 15:46                           ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2023-12-27 15:46 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Dec 19, 2023 at 09:42:27PM +0800, Michael Shavit wrote:

> +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> +                                __le64 *cur, const __le64 *target,
> +                                __le64 *staging_entry)
> +{
> +       bool cleanup_sync_required = false;
> +       u8 entry_qwords_used_diff = 0;
> +       int i = 0;
> +
> +       entry_qwords_used_diff =
> +               writer->ops.get_used_qword_diff_indexes(cur, target);
> +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> +               return;

A no-change update is actually legal in the API, e.g. we can set the
same domain twice in a row. It should just do nothing.

If the goal is to improve readability I'd split this into smaller
functions and have the main function look like this:

       compute_used(..)
       if (hweight8(entry_qwords_used_diff) > 1) {
             set_v_0(..);
             set(qword_start=1,qword_end=N);
	     set(qword_start=0,qword_end=1); // V=1
       } else if (hweight8(entry_qwords_used_diff) == 1) {
             set_unused(..);
	     critical = ffs(..);
             set(qword_start=critical,qword_end=critical+1);
             set(qword_start=0,qword_end=N);
       } else { // hweight8 == 0
             set(qword_start=0,qword_end=N);
       }

Then the three different programming algorithms are entirely clear in
code. Make the generic set() function skip the sync if nothing
changed.
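
For concreteness, a minimal sketch of what such a set() helper could
look like, reusing the writer/ops names from the patch under discussion
(illustration only, not final code):

static bool entry_set(struct arm_smmu_entry_writer *writer, __le64 *entry,
		      const __le64 *target, unsigned int start,
		      unsigned int end)
{
	bool changed = false;
	unsigned int i;

	/* Only the qwords in [start, end) may change in this step. */
	for (i = start; i != end; i++) {
		if (entry[i] != target[i]) {
			WRITE_ONCE(entry[i], target[i]);
			changed = true;
		}
	}

	/* Skip the sync entirely when the step turned out to be a no-op. */
	if (changed)
		writer->ops.sync_entry(writer);
	return changed;
}

Each of the three paths above then becomes a short sequence of
entry_set() calls, and the no-change case naturally costs zero syncs.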

> +       if (hweight8(entry_qwords_used_diff) > 1) {
> +               /*
> +                * If transitioning to the target entry with a single qword
> +                * write isn't possible, then we must first transition to an
> +                * intermediate entry. The intermediate entry may either be an
> +                * entry that melds bits of the target entry into the current
> +                * entry without disrupting the hardware, or a breaking entry if
> +                * a hitless transition to the target is impossible.
> +                */
> +
> +               /*
> +                * Compute a staging entry that has all the bits currently
> +                * unused by HW set to their target values, such that comitting
> +                * it to the entry table woudn't disrupt the hardware.
> +                */
> +               memcpy(staging_entry, cur, writer->entry_length);
> +               writer->ops.set_unused_bits(staging_entry, target);
> +
> +               entry_qwords_used_diff =
> +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> +                                                               target);

Put the number of qwords directly in the ops struct and don't make this
an op. The above will need N=number of qwords as well.
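
i.e. something along these lines (purely illustrative, keeping the
other ops from the patch as-is):

struct arm_smmu_entry_writer_ops {
	/* Number of 64-bit qwords in the entry, e.g. 8 for an STE. */
	unsigned int num_entry_qwords;
	void (*sync_entry)(struct arm_smmu_entry_writer *writer);
	void (*set_unused_bits)(__le64 *entry, const __le64 *target);
};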

Regards,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-27 15:46                           ` Jason Gunthorpe
@ 2024-01-02  8:08                             ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-02  8:08 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Dec 19, 2023 at 09:42:27PM +0800, Michael Shavit wrote:
>
> > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > +                                __le64 *cur, const __le64 *target,
> > +                                __le64 *staging_entry)
> > +{
> > +       bool cleanup_sync_required = false;
> > +       u8 entry_qwords_used_diff = 0;
> > +       int i = 0;
> > +
> > +       entry_qwords_used_diff =
> > +               writer->ops.get_used_qword_diff_indexes(cur, target);
> > +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> > +               return;
>
> A no change update is actually API legal, eg we can set the same
> domain twice in a row. It should just do nothing.
>
> If the goal is to improve readability I'd split this into smaller
> functions and have the main function look like this:
>
>        compute_used(..)
>        if (hweight8(entry_qwords_used_diff) > 1) {
>              set_v_0(..);
>              set(qword_start=1,qword_end=N);
>              set(qword_start=0,qword_end=1); // V=1

This branch is probably a bit more complicated than that. It's a bit more like:
       if (hweight8(entry_qwords_used_diff) > 1) {
             compute_staging_entry(...);
             compute_used_diffs(...staging_entry...)
             if (hweight(entry_qwords_used_diff) > 1) {
                 set_v_0();
                 set(qword_start=1,qword_end=N);
                 set(qword_start=0,qword_end=1); // V=1
             } else {
                 set(qword_start=0, qword_end=N, staging_entry, entry)
                 critical = ffs(..);
                 set(qword_start=critical,qword_end=critical+1);
                 set(qword_start=0,qword_end=N);
             }
      }
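
(Here compute_staging_entry() is meant as roughly the following; the
name and the writer layout are just the ones used in this thread, not
settled code:)

static void compute_staging_entry(struct arm_smmu_entry_writer *writer,
				  const __le64 *cur, const __le64 *target,
				  __le64 *staging)
{
	/*
	 * Start from the live entry and fold in every target bit the HW is
	 * currently ignoring, so the result can be installed without
	 * disturbing ongoing translations.
	 */
	memcpy(staging, cur, writer->entry_length * sizeof(*staging));
	writer->ops.set_unused_bits(staging, target);
}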

>        } else if (hweight8(entry_qwords_used_diff) == 1) {
>              set_unused(..);
>              critical = ffs(..);
>              set(qword_start=critical,qword_end=critical+1);
>              set(qword_start=0,qword_end=N);

And then this branch is the case where you can directly switch to the
entry without first setting unused bits.

>        } else { // hweight8 == 0
>              set(qword_start=0,qword_end=N);
>        }
>
> Then the three different programming algorithms are entirely clear in
> code. Make the generic set() function skip the sync if nothing
> changed.
>
> > +       if (hweight8(entry_qwords_used_diff) > 1) {
> > +               /*
> > +                * If transitioning to the target entry with a single qword
> > +                * write isn't possible, then we must first transition to an
> > +                * intermediate entry. The intermediate entry may either be an
> > +                * entry that melds bits of the target entry into the current
> > +                * entry without disrupting the hardware, or a breaking entry if
> > +                * a hitless transition to the target is impossible.
> > +                */
> > +
> > +               /*
> > +                * Compute a staging entry that has all the bits currently
> > +                * unused by HW set to their target values, such that comitting
> > +                * it to the entry table woudn't disrupt the hardware.
> > +                */
> > +               memcpy(staging_entry, cur, writer->entry_length);
> > +               writer->ops.set_unused_bits(staging_entry, target);
> > +
> > +               entry_qwords_used_diff =
> > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > +                                                               target);
>
> Put the number qwords directly in the ops struct and don't make this
> an op.  Above will need N=number of qwords as well.

The reason I made get_used_qword_diff_indexes an op is that the
algorithm needs to compute the used_bits for several entries (the
current entry, the target entry, as well as the melded staging entry).
However, this requires a buffer to hold the output of the used_bits op.
We're already passing such a buffer to arm_smmu_write_entry for the
staging_entry, and I wasn't a fan of adding a second one.
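
To make the trade-off concrete, the op-less variant would look roughly
like this, with the core doing the diff itself but the caller having to
supply one more scratch buffer, and assuming a hypothetical get_used()
op that only fills in the used-bit mask (names are illustrative only):

static u8 used_qword_diff(struct arm_smmu_entry_writer *writer,
			  const __le64 *entry, const __le64 *target,
			  __le64 *used_scratch)
{
	u8 diff = 0;
	unsigned int i;

	/* used_scratch needs one mask qword per entry qword. */
	writer->ops.get_used(target, used_scratch);
	for (i = 0; i != writer->entry_length; i++) {
		if ((entry[i] & used_scratch[i]) !=
		    (target[i] & used_scratch[i]))
			diff |= 1 << i;
	}
	return diff;
}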

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-19 13:42                         ` Michael Shavit
@ 2024-01-02  8:13                           ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-02  8:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Dec 19, 2023 at 9:42 PM Michael Shavit <mshavit@google.com> wrote:
...
> +       if (hweight8(entry_qwords_used_diff) > 1) {
> +               /*
> +                * If transitioning to the target entry with a single qword
> +                * write isn't possible, then we must first transition to an
> +                * intermediate entry. The intermediate entry may either be an
> +                * entry that melds bits of the target entry into the current
> +                * entry without disrupting the hardware, or a breaking entry if
> +                * a hitless transition to the target is impossible.
> +                */
> +
> +               /*
> +                * Compute a staging entry that has all the bits currently
> +                * unused by HW set to their target values, such that comitting
> +                * it to the entry table woudn't disrupt the hardware.
> +                */
> +               memcpy(staging_entry, cur, writer->entry_length);
> +               writer->ops.set_unused_bits(staging_entry, target);
> +
> +               entry_qwords_used_diff =
> +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> +                                                               target);
> +               if (hweight8(entry_qwords_used_diff) > 1) {
> +                       /*
> +                        * More than 1 qword is mismatched between the staging
> +                        * and target entry. A hitless transition to the target
> +                        * entry is not possible. Set the staging entry to be
> +                        * equal to the target entry, apart from the V bit's
> +                        * qword. As long as the V bit is cleared first then
> +                        * writes to the subsequent qwords will not further
> +                        * disrupt the hardware.
> +                        */
> +                       memcpy(staging_entry, target, writer->entry_length);
> +                       staging_entry[0] &= ~writer->v_bit;
> +                       /*
> +                        * After comitting the staging entry, only the 0th qword
> +                        * will differ from the target.
> +                        */
> +                       entry_qwords_used_diff = 1;
> +               }
> +
> +               /*
> +                * Commit the staging entry. Note that the iteration order
> +                * matters, as we may be comitting a breaking entry in the
> +                * non-hitless case. The 0th qword which holds the valid bit
> +                * must be written first in that case.
> +                */
> +               for (i = 0; i != writer->entry_length; i++)
> +                       WRITE_ONCE(cur[i], staging_entry[i]);
> +               writer->ops.sync_entry(writer);

Realized while replying to your latest email that this is wrong (and
the unit-test as well!). It's not enough to just write the 0th qword
first if it's a breaking entry, it must also sync after that 0th qword
write.
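
In other words the breaking path has to be a full three-step sequence,
roughly like this (sketch only, reusing the writer names from the patch
quoted below):

static void write_breaking_entry(struct arm_smmu_entry_writer *writer,
				 __le64 *cur, const __le64 *target)
{
	unsigned int i;

	/* Clear V and sync so the HW stops interpreting the old entry. */
	WRITE_ONCE(cur[0], cur[0] & ~writer->v_bit);
	writer->ops.sync_entry(writer);

	/* Fill in the remaining qwords while the entry is invalid. */
	for (i = 1; i != writer->entry_length; i++)
		WRITE_ONCE(cur[i], target[i]);
	writer->ops.sync_entry(writer);

	/* A single 64-bit store then makes the entry valid again. */
	WRITE_ONCE(cur[0], target[0]);
	writer->ops.sync_entry(writer);
}

With that extra sync the non-hitless case ends up needing three syncs
rather than the two the current test expects.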

On Tue, Dec 19, 2023 at 9:42 PM Michael Shavit <mshavit@google.com> wrote:
>
> On Mon, Dec 18, 2023 at 8:35 PM Michael Shavit <mshavit@google.com> wrote:
> >
> > On Sun, Dec 17, 2023 at 9:03 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:
> > >
> > > > Ok, I took a proper stab at trying to unroll the loop on the github
> > > > version of this patch (v3+)
> > > > As you suspected, it's not easy to re-use the unrolled version for
> > > > both STE and CD writing as we'd have to pass in callbacks for syncing
> > > > the STE/CD and recomputing arm_smmu_{get_ste/cd}_used.
> > >
> > > Yes, that is why I structured it as an iterator
> >
> > On second thought, perhaps defining a helper class implementing
> > entry_sync() and entry_get_used_bits() might not be so bad?
> > It's a little bit more verbose, but avoids deduplication of the
> > complicated parts.
>
> Gave this a try so that we have something more concrete to compare.
> Consider the following two patches as alternatives to this patch and
> patch "Make CD programming use arm_smmu_write_entry_step" from the
> next part of the patch series.
>
> STE programming patch
> ---
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index b120d83668...1e17bff37f 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -971,6 +971,174 @@
>         arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>
> +struct arm_smmu_entry_writer;
> +
> +/**
> + * struct arm_smmu_entry_writer_ops - Helper class for writing a CD/STE entry.
> + * @sync_entry: sync entry to the hardware after writing to it.
> + * @set_unused_bits: Make bits of the entry that aren't in use by the hardware
> + *                   equal to the target's bits.
> + * @get_used_qword_diff_indexes: Compute the list of qwords in the entry that
> + *                               are incorrect compared to the target,
> + *                               considering only the used bits in the target.
> + *                               The set bits in the return value
> represents the
> + *                               indexes of those qwords.
> + */
> +struct arm_smmu_entry_writer_ops {
> +       void (*sync_entry)(struct arm_smmu_entry_writer *);
> +       void (*set_unused_bits)(__le64 *entry, const __le64 *target);
> +       u8 (*get_used_qword_diff_indexes)(__le64 *entry, const __le64 *target);
> +};
> +
> +struct arm_smmu_entry_writer {
> +       struct arm_smmu_entry_writer_ops ops;
> +       __le64 v_bit;
> +       unsigned int entry_length;
> +};
> +
> +static void arm_smmu_entry_set_unused_bits(__le64 *entry, const __le64 *target,
> +                                          const __le64 *entry_used,
> +                                          unsigned int length)
> +{
> +       int i = 0;
> +
> +       for (i = 0; i < length; i++)
> +               entry[i] = (entry[i] & entry_used[i]) |
> +                          (target[i] & ~entry_used[i]);
> +}
> +
> +static u8 arm_smmu_entry_used_qword_diff_indexes(__le64 *entry,
> +                                                const __le64 *target,
> +                                                const __le64 *target_used,
> +                                                unsigned int length)
> +{
> +       u8 qword_diff_indexes = 0;
> +       int i = 0;
> +
> +       for (i = 0; i < length; i++) {
> +               if ((entry[i] & target_used[i]) != (target[i] & target_used[i]))
> +                       qword_diff_indexes |= 1 << i;
> +       }
> +       return qword_diff_indexes;
> +}
> +
> +/*
> + * Update the STE/CD to the target configuration. The transition from
> the current
> + * entry to the target entry takes place over multiple steps that
> attempts to make
> + * the transition hitless if possible. This function takes care not to create a
> + * situation where the HW can perceive a corrupted entry. HW is only
> required to
> + * have a 64 bit atomicity with stores from the CPU, while entries are many 64
> + * bit values big.
> + *
> + * The algorithm works by evolving the entry toward the target in a series of
> + * steps. Each step synchronizes with the HW so that the HW can not
> see an entry
> + * torn across two steps. During each step the HW can observe a torn entry that
> + * has any combination of the step's old/new 64 bit words. The algorithm
> + * objective is for the HW behavior to always be one of current behavior, V=0,
> + * or new behavior.
> + *
> + * In the most general case we can make any update in three steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused bits, all bits except V
> + *  - Make valid (V=1), single 64 bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect. We can detect this by calculating how
> + * many 64 bit values need update after adjusting the unused bits and skip the
> + * V=0 process. This relies on the IGNORED behavior described in the
> + * specification
> + */
> +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> +                                __le64 *cur, const __le64 *target,
> +                                __le64 *staging_entry)
> +{
> +       bool cleanup_sync_required = false;
> +       u8 entry_qwords_used_diff = 0;
> +       int i = 0;
> +
> +       entry_qwords_used_diff =
> +               writer->ops.get_used_qword_diff_indexes(cur, target);
> +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> +               return;
> +
> +       if (hweight8(entry_qwords_used_diff) > 1) {
> +               /*
> +                * If transitioning to the target entry with a single qword
> +                * write isn't possible, then we must first transition to an
> +                * intermediate entry. The intermediate entry may either be an
> +                * entry that melds bits of the target entry into the current
> +                * entry without disrupting the hardware, or a breaking entry if
> +                * a hitless transition to the target is impossible.
> +                */
> +
> +               /*
> +                * Compute a staging entry that has all the bits currently
> +                * unused by HW set to their target values, such that comitting
> +                * it to the entry table woudn't disrupt the hardware.
> +                */
> +               memcpy(staging_entry, cur, writer->entry_length);
> +               writer->ops.set_unused_bits(staging_entry, target);
> +
> +               entry_qwords_used_diff =
> +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> +                                                               target);
> +               if (hweight8(entry_qwords_used_diff) > 1) {
> +                       /*
> +                        * More than 1 qword is mismatched between the staging
> +                        * and target entry. A hitless transition to the target
> +                        * entry is not possible. Set the staging entry to be
> +                        * equal to the target entry, apart from the V bit's
> +                        * qword. As long as the V bit is cleared first then
> +                        * writes to the subsequent qwords will not further
> +                        * disrupt the hardware.
> +                        */
> +                       memcpy(staging_entry, target, writer->entry_length);
> +                       staging_entry[0] &= ~writer->v_bit;
> +                       /*
> +                        * After comitting the staging entry, only the 0th qword
> +                        * will differ from the target.
> +                        */
> +                       entry_qwords_used_diff = 1;
> +               }
> +
> +               /*
> +                * Commit the staging entry. Note that the iteration order
> +                * matters, as we may be comitting a breaking entry in the
> +                * non-hitless case. The 0th qword which holds the valid bit
> +                * must be written first in that case.
> +                */
> +               for (i = 0; i != writer->entry_length; i++)
> +                       WRITE_ONCE(cur[i], staging_entry[i]);
> +               writer->ops.sync_entry(writer);
> +       }
> +
> +       /*
> +        * It's now possible to switch to the target configuration with a write
> +        * to a single qword. Make that switch now.
> +        */
> +       i = ffs(entry_qwords_used_diff) - 1;
> +       WRITE_ONCE(cur[i], target[i]);
> +       writer->ops.sync_entry(writer);
> +
> +       /*
> +        * Some of the bits set under the previous configuration but unused
> +        * under the target configuration might still be set. Clear them as
> +        * well. Technically this isn't necessary but it brings the entry to
> +        * the full target state, so if there are bugs in the mask calculation
> +        * this will obscure them.
> +        */
> +       for (i = 0; i != writer->entry_length; i++) {
> +               if (cur[i] != target[i]) {
> +                       WRITE_ONCE(cur[i], target[i]);
> +                       cleanup_sync_required = true;
> +               }
> +       }
> +       if (cleanup_sync_required)
> +               writer->ops.sync_entry(writer);
> +}
> +
>  static void arm_smmu_sync_cd(struct arm_smmu_master *master,
>                              int ssid, bool leaf)
>  {
> @@ -1248,37 +1416,142 @@
>         arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>
> +/*
> + * Based on the value of ent report which bits of the STE the HW will access.
> + * It would be nice if this was complete according to the spec, but minimally
> + * it has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(const __le64 *ent,
> +                                 struct arm_smmu_ste *used_bits)
> +{
> +       memset(used_bits, 0, sizeof(*used_bits));
> +
> +       used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
> +       if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +               return;
> +
> +       /*
> +        * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> +        * attribute/permission configuration fields for the SHCFG behavior.
> +        */
> +       if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
> +           FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
> +                   STRTAB_STE_1_S1DSS_BYPASS)
> +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +
> +       used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +       switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
> +       case STRTAB_STE_0_CFG_ABORT:
> +               break;
> +       case STRTAB_STE_0_CFG_BYPASS:
> +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +               break;
> +       case STRTAB_STE_0_CFG_S1_TRANS:
> +               used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +                                                 STRTAB_STE_0_S1CTXPTR_MASK |
> +                                                 STRTAB_STE_0_S1CDMAX);
> +               used_bits->data[1] |=
> +                       cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +                                   STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +                                   STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +               used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +               break;
> +       case STRTAB_STE_0_CFG_S2_TRANS:
> +               used_bits->data[1] |=
> +                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +               used_bits->data[2] |=
> +                       cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +                                   STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +                                   STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +               used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +               break;
> +
> +       default:
> +               memset(used_bits, 0xFF, sizeof(*used_bits));
> +               WARN_ON(true);
> +       }
> +}
> +
> +struct arm_smmu_ste_writer {
> +       struct arm_smmu_entry_writer writer;
> +       struct arm_smmu_device *smmu;
> +       u32 sid;
> +};
> +
> +static void arm_smmu_ste_set_unused_bits(__le64 *entry, const __le64 *target)
> +{
> +       struct arm_smmu_ste entry_used;
> +       arm_smmu_get_ste_used(entry, &entry_used);
> +
> +       arm_smmu_entry_set_unused_bits(entry, target, entry_used.data,
> +                                      ARRAY_SIZE(entry_used.data));
> +}
> +
> +static u8 arm_smmu_ste_used_qword_diff_indexes(__le64 *cur,
> +                                              const __le64 *target)
> +{
> +       struct arm_smmu_ste target_used;
> +
> +       arm_smmu_get_ste_used(target, &target_used);
> +       return arm_smmu_entry_used_qword_diff_indexes(
> +               cur, target, target_used.data, ARRAY_SIZE(target_used.data));
> +}
> +
> +static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
> +{
> +       struct arm_smmu_ste_writer *ste_writer =
> +               container_of(writer, struct arm_smmu_ste_writer, writer);
> +
> +       arm_smmu_sync_ste_for_sid(ste_writer->smmu, ste_writer->sid);
> +}
> +
> +static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
> +       .sync_entry = arm_smmu_ste_writer_sync_entry,
> +       .set_unused_bits = arm_smmu_ste_set_unused_bits,
> +       .get_used_qword_diff_indexes = arm_smmu_ste_used_qword_diff_indexes,
> +};
> +
> +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> +                              struct arm_smmu_ste *ste,
> +                              const struct arm_smmu_ste *target)
> +{
> +       struct arm_smmu_ste preallocated_staging_ste = {0};
> +       struct arm_smmu_ste_writer ste_writer = {
> +               .writer = {
> +                       .ops = arm_smmu_ste_writer_ops,
> +                       .v_bit = cpu_to_le64(STRTAB_STE_0_V),
> +                       .entry_length = ARRAY_SIZE(ste->data),
> +               },
> +               .smmu = smmu,
> +               .sid = sid,
> +       };
> +
> +       arm_smmu_write_entry(&ste_writer.writer,
> +                              ste->data,
> +                              target->data,
> +                              preallocated_staging_ste.data);
> +
> +       /* It's likely that we'll want to use the new STE soon */
> +       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
> +               struct arm_smmu_cmdq_ent
> +                       prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
> +                                        .prefetch = {
> +                                                .sid = sid,
> +                                        } };
> +
> +               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       }
> +}
> +
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                                       struct arm_smmu_ste *dst)
>  {
> -       /*
> -        * This is hideously complicated, but we only really care about
> -        * three cases at the moment:
> -        *
> -        * 1. Invalid (all zero) -> bypass/fault (init)
> -        * 2. Bypass/fault -> translation/bypass (attach)
> -        * 3. Translation/bypass -> bypass/fault (detach)
> -        *
> -        * Given that we can't update the STE atomically and the SMMU
> -        * doesn't read the thing in a defined order, that leaves us
> -        * with the following maintenance requirements:
> -        *
> -        * 1. Update Config, return (init time STEs aren't live)
> -        * 2. Write everything apart from dword 0, sync, write dword 0, sync
> -        * 3. Update Config, sync
> -        */
> -       u64 val = le64_to_cpu(dst->data[0]);
> -       bool ste_live = false;
> +       u64 val;
>         struct arm_smmu_device *smmu = master->smmu;
>         struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
>         struct arm_smmu_s2_cfg *s2_cfg = NULL;
>         struct arm_smmu_domain *smmu_domain = master->domain;
> -       struct arm_smmu_cmdq_ent prefetch_cmd = {
> -               .opcode         = CMDQ_OP_PREFETCH_CFG,
> -               .prefetch       = {
> -                       .sid    = sid,
> -               },
> -       };
> +       struct arm_smmu_ste target = {};
>
>         if (smmu_domain) {
>                 switch (smmu_domain->stage) {
> @@ -1293,22 +1566,6 @@
>                 }
>         }
>
> -       if (val & STRTAB_STE_0_V) {
> -               switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
> -               case STRTAB_STE_0_CFG_BYPASS:
> -                       break;
> -               case STRTAB_STE_0_CFG_S1_TRANS:
> -               case STRTAB_STE_0_CFG_S2_TRANS:
> -                       ste_live = true;
> -                       break;
> -               case STRTAB_STE_0_CFG_ABORT:
> -                       BUG_ON(!disable_bypass);
> -                       break;
> -               default:
> -                       BUG(); /* STE corruption */
> -               }
> -       }
> -
>         /* Nuke the existing STE_0 value, as we're going to rewrite it */
>         val = STRTAB_STE_0_V;
>
> @@ -1319,16 +1576,11 @@
>                 else
>                         val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
>
> -               dst->data[0] = cpu_to_le64(val);
> -               dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> +               target.data[0] = cpu_to_le64(val);
> +               target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>                                                 STRTAB_STE_1_SHCFG_INCOMING));
> -               dst->data[2] = 0; /* Nuke the VMID */
> -               /*
> -                * The SMMU can perform negative caching, so we must sync
> -                * the STE regardless of whether the old value was live.
> -                */
> -               if (smmu)
> -                       arm_smmu_sync_ste_for_sid(smmu, sid);
> +               target.data[2] = 0; /* Nuke the VMID */
> +               arm_smmu_write_ste(smmu, sid, dst, &target);
>                 return;
>         }
>
> @@ -1336,8 +1588,7 @@
>                 u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>                         STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
>
> -               BUG_ON(ste_live);
> -               dst->data[1] = cpu_to_le64(
> +               target.data[1] = cpu_to_le64(
>                          FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
>                          FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>                          FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1346,7 +1597,7 @@
>
>                 if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>                     !master->stall_enabled)
> -                       dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> +                       target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>
>                 val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
>                         FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> @@ -1355,8 +1606,7 @@
>         }
>
>         if (s2_cfg) {
> -               BUG_ON(ste_live);
> -               dst->data[2] = cpu_to_le64(
> +               target.data[2] = cpu_to_le64(
>                          FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
>                          FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
>  #ifdef __BIG_ENDIAN
> @@ -1365,23 +1615,17 @@
>                          STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>                          STRTAB_STE_2_S2R);
>
> -               dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +               target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>
>                 val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>         }
>
>         if (master->ats_enabled)
> -               dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +               target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>                                                  STRTAB_STE_1_EATS_TRANS));
>
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -       /* See comment in arm_smmu_write_ctx_desc() */
> -       WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -       /* It's likely that we'll want to use the new STE soon */
> -       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       target.data[0] = cpu_to_le64(val);
> +       arm_smmu_write_ste(smmu, sid, dst, &target);
>  }
>
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
>
>
>
>
>
> ---
> CD programming
> ---
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 55703a5d62...c849b26c43 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1219,6 +1219,86 @@
>         return &l1_desc->l2ptr[idx];
>  }
>
> +static void arm_smmu_get_cd_used(const __le64 *ent,
> +                                struct arm_smmu_cd *used_bits)
> +{
> +       memset(used_bits, 0, sizeof(*used_bits));
> +
> +       used_bits->data[0] = cpu_to_le64(CTXDESC_CD_0_V);
> +       if (!(ent[0] & cpu_to_le64(CTXDESC_CD_0_V)))
> +               return;
> +       memset(used_bits, 0xFF, sizeof(*used_bits));
> +
> +       /* EPD0 means T0SZ/TG0/IR0/OR0/SH0/TTB0 are IGNORED */
> +       if (ent[0] & cpu_to_le64(CTXDESC_CD_0_TCR_EPD0)) {
> +               used_bits->data[0] &= ~cpu_to_le64(
> +                       CTXDESC_CD_0_TCR_T0SZ | CTXDESC_CD_0_TCR_TG0 |
> +                       CTXDESC_CD_0_TCR_IRGN0 | CTXDESC_CD_0_TCR_ORGN0 |
> +                       CTXDESC_CD_0_TCR_SH0);
> +               used_bits->data[1] &= ~cpu_to_le64(CTXDESC_CD_1_TTB0_MASK);
> +       }
> +}
> +
> +struct arm_smmu_cd_writer {
> +       struct arm_smmu_entry_writer writer;
> +       struct arm_smmu_master *master;
> +       int ssid;
> +};
> +
> +static void arm_smmu_cd_set_unused_bits(__le64 *entry, const __le64 *target)
> +{
> +       struct arm_smmu_cd entry_used;
> +       arm_smmu_get_cd_used(entry, &entry_used);
> +
> +       arm_smmu_entry_set_unused_bits(entry, target, entry_used.data,
> +                                      ARRAY_SIZE(entry_used.data));
> +}
> +
> +static u8 arm_smmu_cd_used_qword_diff_indexes(__le64 *cur,
> +                                              const __le64 *target)
> +{
> +       struct arm_smmu_cd target_used;
> +
> +       arm_smmu_get_cd_used(target, &target_used);
> +       return arm_smmu_entry_used_qword_diff_indexes(
> +               cur, target, target_used.data, ARRAY_SIZE(target_used.data));
> +}
> +
> +static void arm_smmu_cd_writer_sync_entry(struct arm_smmu_entry_writer *writer)
> +{
> +       struct arm_smmu_cd_writer *cd_writer =
> +               container_of(writer, struct arm_smmu_cd_writer, writer);
> +
> +       arm_smmu_sync_cd(cd_writer->master, cd_writer->ssid, true);
> +}
> +
> +static const struct arm_smmu_entry_writer_ops arm_smmu_cd_writer_ops = {
> +       .sync_entry = arm_smmu_cd_writer_sync_entry,
> +       .set_unused_bits = arm_smmu_cd_set_unused_bits,
> +       .get_used_qword_diff_indexes = arm_smmu_cd_used_qword_diff_indexes,
> +};
> +
> +static void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
> +                                   struct arm_smmu_cd *cdptr,
> +                                   const struct arm_smmu_cd *target)
> +{
> +       struct arm_smmu_cd preallocated_staging_cd = {0};
> +       struct arm_smmu_cd_writer cd_writer = {
> +               .writer = {
> +                       .ops = arm_smmu_cd_writer_ops,
> +                       .v_bit = cpu_to_le64(CTXDESC_CD_0_V),
> +                       .entry_length = ARRAY_SIZE(cdptr->data),
> +               },
> +               .master = master,
> +               .ssid = ssid,
> +       };
> +
> +       arm_smmu_write_entry(&cd_writer.writer,
> +                              cdptr->data,
> +                              target->data,
> +                              preallocated_staging_cd.data);
> +}
> +
>  int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
>                             struct arm_smmu_ctx_desc *cd)
>  {
> @@ -1235,16 +1315,19 @@
>          */
>         u64 val;
>         bool cd_live;
> -       struct arm_smmu_cd *cdptr;
> +       struct arm_smmu_cd target;
> +       struct arm_smmu_cd *cdptr = &target;
> +       struct arm_smmu_cd *cd_table_entry;
>         struct arm_smmu_ctx_desc_cfg *cd_table = &master->cd_table;
>
>         if (WARN_ON(ssid >= (1 << cd_table->s1cdmax)))
>                 return -E2BIG;
>
> -       cdptr = arm_smmu_get_cd_ptr(master, ssid);
> -       if (!cdptr)
> +       cd_table_entry = arm_smmu_get_cd_ptr(master, ssid);
> +       if (!cd_table_entry)
>                 return -ENOMEM;
>
> +       target = *cd_table_entry;
>         val = le64_to_cpu(cdptr->data[0]);
>         cd_live = !!(val & CTXDESC_CD_0_V);
>
> @@ -1264,13 +1347,6 @@
>                 cdptr->data[2] = 0;
>                 cdptr->data[3] = cpu_to_le64(cd->mair);
>
> -               /*
> -                * STE may be live, and the SMMU might read dwords of this CD in any
> -                * order. Ensure that it observes valid values before reading
> -                * V=1.
> -                */
> -               arm_smmu_sync_cd(master, ssid, true);
> -
>                 val = cd->tcr |
>  #ifdef __BIG_ENDIAN
>                         CTXDESC_CD_0_ENDI |
> @@ -1284,18 +1360,8 @@
>                 if (cd_table->stall_enabled)
>                         val |= CTXDESC_CD_0_S;
>         }
> -
> -       /*
> -        * The SMMU accesses 64-bit values atomically. See IHI0070Ca 3.21.3
> -        * "Configuration structures and configuration invalidation completion"
> -        *
> -        *   The size of single-copy atomic reads made by the SMMU is
> -        *   IMPLEMENTATION DEFINED but must be at least 64 bits. Any single
> -        *   field within an aligned 64-bit span of a structure can be altered
> -        *   without first making the structure invalid.
> -        */
> -       WRITE_ONCE(cdptr->data[0], cpu_to_le64(val));
> -       arm_smmu_sync_cd(master, ssid, true);
> +       cdptr->data[0] = cpu_to_le64(val);
> +       arm_smmu_write_cd_entry(master, ssid, cd_table_entry, &target);
>         return 0;
>  }
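
As a quick illustration of the algorithm in the patch above, here is a
standalone sketch (userspace C, not part of the patch; the qword count,
masks and values are invented for the example) of how the used-qword
diff bitmap picks between doing nothing, a hitless single-qword update,
and a V=0 break-before-make sequence:

/*
 * Toy model of the used-qword diff computation. Real STEs/CDs are
 * larger and their used bits come from arm_smmu_get_ste_used() /
 * arm_smmu_get_cd_used(); everything below is illustrative only.
 */
#include <stdint.h>
#include <stdio.h>

#define NUM_QWORDS 4

/* Bit i is set when qword i differs in the bits the target actually uses */
static uint8_t used_qword_diff(const uint64_t *cur, const uint64_t *target,
			       const uint64_t *target_used)
{
	uint8_t diff = 0;
	int i;

	for (i = 0; i < NUM_QWORDS; i++)
		if ((cur[i] & target_used[i]) != (target[i] & target_used[i]))
			diff |= 1u << i;
	return diff;
}

int main(void)
{
	uint64_t cur[NUM_QWORDS]    = { 0x1, 0xaa, 0, 0 };
	uint64_t target[NUM_QWORDS] = { 0x5, 0xbb, 0, 0 };
	/* Pretend the target configuration only cares about qword 0 */
	uint64_t used[NUM_QWORDS]   = { 0xf, 0x0, 0, 0 };
	uint8_t diff = used_qword_diff(cur, target, used);

	if (!diff)
		printf("no change, nothing to do\n");
	else if (__builtin_popcount(diff) == 1)
		printf("hitless: rewrite qword %d and sync\n",
		       __builtin_ffs(diff) - 1);
	else
		printf("not hitless: V=0 break-before-make needed\n");
	return 0;
}

For the toy values this takes the hitless path, since only qword 0
differs in bits the target actually uses.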

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
@ 2024-01-02  8:13                           ` Michael Shavit
  0 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-02  8:13 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Dec 19, 2023 at 9:42 PM Michael Shavit <mshavit@google.com> wrote:
...
> +       if (hweight8(entry_qwords_used_diff) > 1) {
> +               /*
> +                * If transitioning to the target entry with a single qword
> +                * write isn't possible, then we must first transition to an
> +                * intermediate entry. The intermediate entry may either be an
> +                * entry that melds bits of the target entry into the current
> +                * entry without disrupting the hardware, or a breaking entry if
> +                * a hitless transition to the target is impossible.
> +                */
> +
> +               /*
> +                * Compute a staging entry that has all the bits currently
> +                * unused by HW set to their target values, such that committing
> +                * it to the entry table wouldn't disrupt the hardware.
> +                */
> +               memcpy(staging_entry, cur, writer->entry_length);
> +               writer->ops.set_unused_bits(staging_entry, target);
> +
> +               entry_qwords_used_diff =
> +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> +                                                               target);
> +               if (hweight8(entry_qwords_used_diff) > 1) {
> +                       /*
> +                        * More than 1 qword is mismatched between the staging
> +                        * and target entry. A hitless transition to the target
> +                        * entry is not possible. Set the staging entry to be
> +                        * equal to the target entry, apart from the V bit's
> +                        * qword. As long as the V bit is cleared first then
> +                        * writes to the subsequent qwords will not further
> +                        * disrupt the hardware.
> +                        */
> +                       memcpy(staging_entry, target, writer->entry_length);
> +                       staging_entry[0] &= ~writer->v_bit;
> +                       /*
> +                        * After committing the staging entry, only the 0th qword
> +                        * will differ from the target.
> +                        */
> +                       entry_qwords_used_diff = 1;
> +               }
> +
> +               /*
> +                * Commit the staging entry. Note that the iteration order
> +                * matters, as we may be committing a breaking entry in the
> +                * non-hitless case. The 0th qword which holds the valid bit
> +                * must be written first in that case.
> +                */
> +               for (i = 0; i != writer->entry_length; i++)
> +                       WRITE_ONCE(cur[i], staging_entry[i]);
> +               writer->ops.sync_entry(writer);

Realized while replying to your latest email that this is wrong (and
the unit-test as well!). It's not enough to just write the 0th qword
first if it's a breaking entry; we must also sync after that 0th qword
write.
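
For clarity, a rough standalone sketch of that ordering for the breaking
case (plain C; sync_entry() is a stub standing in for the CFGI plus
CMD_SYNC sequence, and the entry size is arbitrary), with the extra sync
right after the V=0 write:

#include <stdint.h>
#include <stdio.h>

#define NUM_QWORDS 4
#define V_BIT 0x1ULL

static void sync_entry(void)
{
	/* Stub: the driver would issue CFGI_STE/CFGI_CD plus CMD_SYNC here */
	printf("sync\n");
}

static void write_entry_break(uint64_t *cur, const uint64_t *target)
{
	int i;

	/* Clear V and sync first, so the HW stops using the entry... */
	cur[0] &= ~V_BIT;
	sync_entry();

	/* ...only then is it safe to scribble over the remaining qwords */
	for (i = 1; i < NUM_QWORDS; i++)
		cur[i] = target[i];
	sync_entry();

	/* Finally install qword 0, setting V=1 with a single 64-bit store */
	cur[0] = target[0];
	sync_entry();
}

int main(void)
{
	uint64_t cur[NUM_QWORDS] = { 0x1, 0x2, 0x3, 0x4 };
	uint64_t target[NUM_QWORDS] = { 0x11, 0x22, 0x33, 0x44 };

	write_entry_break(cur, target);
	return 0;
}

The first sync is the one missing from the commit loop quoted above.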

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-02  8:08                             ` Michael Shavit
@ 2024-01-02 14:48                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-02 14:48 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Jan 02, 2024 at 04:08:41PM +0800, Michael Shavit wrote:
> On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Dec 19, 2023 at 09:42:27PM +0800, Michael Shavit wrote:
> >
> > > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > > +                                __le64 *cur, const __le64 *target,
> > > +                                __le64 *staging_entry)
> > > +{
> > > +       bool cleanup_sync_required = false;
> > > +       u8 entry_qwords_used_diff = 0;
> > > +       int i = 0;
> > > +
> > > +       entry_qwords_used_diff =
> > > +               writer->ops.get_used_qword_diff_indexes(cur, target);
> > > +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> > > +               return;
> >
> > A no-change update is actually API legal, e.g. we can set the same
> > domain twice in a row. It should just do nothing.
> >
> > If the goal is to improve readability I'd split this into smaller
> > functions and have the main function look like this:
> >
> >        compute_used(..)
> >        if (hweight8(entry_qwords_used_diff) > 1) {
> >              set_v_0(..);
> >              set(qword_start=1,qword_end=N);
> >              set(qword_start=0,qword_end=1); // V=1
> 
> This branch is probably a bit more complicated than that. It's a bit more like:
>        if (hweight8(entry_qwords_used_diff) > 1) {
>              compute_staging_entry(...);
>              compute_used_diffs(...staging_entry...)
>              if (hweight(entry_qwords_used_diff) > 1) {
>                  set_v_0();
>                  set(qword_start=1,qword_end=N);
>                  set(qword_start=0,qword_end=1); // V=1
>              } else {
>                  set(qword_start=0, qword_end=N, staging_entry, entry)
>                  critical = ffs(..);
>                  set(qword_start=critical,qword_end=critical+1);
>                  set(qword_start=0,qword_end=N);
>              }
>       }
> 
> >        } else if (hweight8(entry_qwords_used_diff) == 1) {
> >              set_unused(..);
> >              critical = ffs(..);
> >              set(qword_start=critical,qword_end=critical+1);
> >              set(qword_start=0,qword_end=N);
> 
> And then this branch is the case where you can directly switch to the
> entry without first setting unused bits.

Don't make that a special case, just always set the unused bits. All
the setting functions should skip the sync if they didn't change the
entry, so we don't need to care if we call them needlessly.

There are only three programming sequences.

entry_qwords_used_diff should reflect required changes after setting
the unused bits.
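
Something like the following standalone sketch, with simplified types
and invented names rather than the final driver API: a single range
writer that skips the sync when it changed nothing, so all three
sequences can call it unconditionally, including for the no-change
case:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_QWORDS 4

static void sync_entry(void)
{
	/* Stub for the CFGI plus CMD_SYNC step */
	printf("sync\n");
}

/* Copy target[start..end) into cur; sync only if something actually changed */
static bool entry_set(uint64_t *cur, const uint64_t *target,
		      unsigned int start, unsigned int end)
{
	bool changed = false;
	unsigned int i;

	for (i = start; i < end; i++) {
		if (cur[i] != target[i]) {
			cur[i] = target[i];
			changed = true;
		}
	}
	if (changed)
		sync_entry();
	return changed;
}

int main(void)
{
	uint64_t cur[NUM_QWORDS] = { 1, 2, 3, 4 };
	uint64_t same[NUM_QWORDS] = { 1, 2, 3, 4 };

	/* Setting the same domain twice: nothing changes, so no sync is issued */
	entry_set(cur, same, 0, NUM_QWORDS);
	return 0;
}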

> > > +       if (hweight8(entry_qwords_used_diff) > 1) {
> > > +               /*
> > > +                * If transitioning to the target entry with a single qword
> > > +                * write isn't possible, then we must first transition to an
> > > +                * intermediate entry. The intermediate entry may either be an
> > > +                * entry that melds bits of the target entry into the current
> > > +                * entry without disrupting the hardware, or a breaking entry if
> > > +                * a hitless transition to the target is impossible.
> > > +                */
> > > +
> > > +               /*
> > > +                * Compute a staging entry that has all the bits currently
> > > +                * unused by HW set to their target values, such that committing
> > > +                * it to the entry table wouldn't disrupt the hardware.
> > > +                */
> > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > +
> > > +               entry_qwords_used_diff =
> > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > +                                                               target);
> >
> > Put the number qwords directly in the ops struct and don't make this
> > an op.  Above will need N=number of qwords as well.
> 
> The reason I made get_used_qword_diff_indexes an op is that the
> algorithm needs to compute the used_bits for several entries (the
> current entry, the target entry, and the melded staging entry).

Make getting the used bits the op.
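
Roughly this shape, as a sketch only (standalone C, simplified; the
field and helper names here are assumptions, not the eventual code):
get_used() and sync() are the per-entry-type ops, the qword count sits
alongside them, and the shared code derives the qword diff from the
used-bit masks itself:

#include <stdint.h>

#define MAX_ENTRY_QWORDS 8	/* assumed to cover both STEs and CDs */

struct arm_smmu_entry_writer;

struct arm_smmu_entry_writer_ops {
	unsigned int num_entry_qwords;
	uint64_t v_bit;
	/* Report which bits of entry the HW would access, per qword */
	void (*get_used)(const uint64_t *entry, uint64_t *used);
	/* Issue the CFGI plus CMD_SYNC for this entry */
	void (*sync)(struct arm_smmu_entry_writer *writer);
};

struct arm_smmu_entry_writer {
	const struct arm_smmu_entry_writer_ops *ops;
};

/* Shared code: which qwords differ, looking only at bits the target uses */
uint8_t used_qword_diff(struct arm_smmu_entry_writer *writer,
			const uint64_t *cur, const uint64_t *target)
{
	uint64_t used[MAX_ENTRY_QWORDS];
	uint8_t diff = 0;
	unsigned int i;

	writer->ops->get_used(target, used);
	for (i = 0; i != writer->ops->num_entry_qwords; i++)
		if ((cur[i] & used[i]) != (target[i] & used[i]))
			diff |= 1u << i;
	return diff;
}

The set_unused_bits() and qword-diff helpers would then live in the
shared code rather than in the ops.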

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
@ 2024-01-02 14:48                               ` Jason Gunthorpe
  0 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-02 14:48 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Jan 02, 2024 at 04:08:41PM +0800, Michael Shavit wrote:
> On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> > On Tue, Dec 19, 2023 at 09:42:27PM +0800, Michael Shavit wrote:
> >
> > > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > > +                                __le64 *cur, const __le64 *target,
> > > +                                __le64 *staging_entry)
> > > +{
> > > +       bool cleanup_sync_required = false;
> > > +       u8 entry_qwords_used_diff = 0;
> > > +       int i = 0;
> > > +
> > > +       entry_qwords_used_diff =
> > > +               writer->ops.get_used_qword_diff_indexes(cur, target);
> > > +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> > > +               return;
> >
> > A no change update is actually API legal, eg we can set the same
> > domain twice in a row. It should just do nothing.
> >
> > If the goal is to improve readability I'd split this into smaller
> > functions and have the main function look like this:
> >
> >        compute_used(..)
> >        if (hweight8(entry_qwords_used_diff) > 1) {
> >              set_v_0(..);
> >              set(qword_start=1,qword_end=N);
> >              set(qword_start=0,qword_end=1); // V=1
> 
> This branch is probably a bit more complicated than that. It's a bit more like:
>        if (hweight8(entry_qwords_used_diff) > 1) {
>              compute_staging_entry(...);
>              compute_used_diffs(...staging_entry...)
>              if (hweight(entry_qwords_used_diff) > 1) {
>                  set_v_0();
>                  set(qword_start=1,qword_end=N);
>                  set(qword_start=0,qword_end=1); // V=1
>              } else {
>                  set(qword_start=0, qword_end=N, staging_entry, entry)
>                  critical = ffs(..);
>                  set(qword_start=critical,qword_end=critical+1);
>                  set(qword_start=0,qword_end=N);
>              }
>       }
> 
> >        } else if (hweight8(entry_qwords_used_diff) == 1) {
> >              set_unused(..);
> >              critical = ffs(..);
> >              set(qword_start=critical,qword_end=critical+1);
> >              set(qword_start=0,qword_end=N);
> 
> And then this branch is the case where you can directly switch to the
> entry without first setting unused bits.

Don't make that a special case, just always set the unused bits. All
the setting functions should skip the sync if they didn't change the
entry, so we don't need to care if we call them needlessly.

There are only three programming sequences.

entry_qwords_used_diff should reflect required changes after setting
the unused bits.
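
A minimal sketch of the sync-skipping setter described here (set_qwords() and
the bare sync() callback are illustrative names, not from the posted series;
the entry_set() helper posted later in this thread has the same shape):

static bool set_qwords(__le64 *entry, const __le64 *target,
		       unsigned int start, unsigned int len,
		       void (*sync)(void))
{
	bool changed = false;
	unsigned int i;

	for (i = start; i != start + len; i++) {
		if (entry[i] != target[i]) {
			WRITE_ONCE(entry[i], target[i]);
			changed = true;
		}
	}
	/* Only pay for a sync if the entry actually changed */
	if (changed)
		sync();
	return changed;
}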

> > > +       if (hweight8(entry_qwords_used_diff) > 1) {
> > > +               /*
> > > +                * If transitioning to the target entry with a single qword
> > > +                * write isn't possible, then we must first transition to an
> > > +                * intermediate entry. The intermediate entry may either be an
> > > +                * entry that melds bits of the target entry into the current
> > > +                * entry without disrupting the hardware, or a breaking entry if
> > > +                * a hitless transition to the target is impossible.
> > > +                */
> > > +
> > > +               /*
> > > +                * Compute a staging entry that has all the bits currently
> > > +                * unused by HW set to their target values, such that committing
> > > +                * it to the entry table wouldn't disrupt the hardware.
> > > +                */
> > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > +
> > > +               entry_qwords_used_diff =
> > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > +                                                               target);
> >
> > Put the number of qwords directly in the ops struct and don't make this
> > an op.  Above will need N=number of qwords as well.
> 
> The reason I made get_used_qword_diff_indexes an op is because the
> algorithm needs to compute the used_bits for entries (for the current
> entry, the target entry as well as the melded-staging entry).

Make getting the used bits the op..

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-02  8:13                           ` Michael Shavit
@ 2024-01-02 14:48                             ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-02 14:48 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Jan 02, 2024 at 04:13:28PM +0800, Michael Shavit wrote:
> On Tue, Dec 19, 2023 at 9:42 PM Michael Shavit <mshavit@google.com> wrote:
> ...
> > +       if (hweight8(entry_qwords_used_diff) > 1) {
> > +               /*
> > +                * If transitioning to the target entry with a single qword
> > +                * write isn't possible, then we must first transition to an
> > +                * intermediate entry. The intermediate entry may either be an
> > +                * entry that melds bits of the target entry into the current
> > +                * entry without disrupting the hardware, or a breaking entry if
> > +                * a hitless transition to the target is impossible.
> > +                */
> > +
> > +               /*
> > +                * Compute a staging entry that has all the bits currently
> > +                * unused by HW set to their target values, such that committing
> > +                * it to the entry table wouldn't disrupt the hardware.
> > +                */
> > +               memcpy(staging_entry, cur, writer->entry_length);
> > +               writer->ops.set_unused_bits(staging_entry, target);
> > +
> > +               entry_qwords_used_diff =
> > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > +                                                               target);
> > +               if (hweight8(entry_qwords_used_diff) > 1) {
> > +                       /*
> > +                        * More than 1 qword is mismatched between the staging
> > +                        * and target entry. A hitless transition to the target
> > +                        * entry is not possible. Set the staging entry to be
> > +                        * equal to the target entry, apart from the V bit's
> > +                        * qword. As long as the V bit is cleared first then
> > +                        * writes to the subsequent qwords will not further
> > +                        * disrupt the hardware.
> > +                        */
> > +                       memcpy(staging_entry, target, writer->entry_length);
> > +                       staging_entry[0] &= ~writer->v_bit;
> > +                       /*
> > +                        * After committing the staging entry, only the 0th qword
> > +                        * will differ from the target.
> > +                        */
> > +                       entry_qwords_used_diff = 1;
> > +               }
> > +
> > +               /*
> > +                * Commit the staging entry. Note that the iteration order
> > +                * matters, as we may be committing a breaking entry in the
> > +                * non-hitless case. The 0th qword which holds the valid bit
> > +                * must be written first in that case.
> > +                */
> > +               for (i = 0; i != writer->entry_length; i++)
> > +                       WRITE_ONCE(cur[i], staging_entry[i]);
> > +               writer->ops.sync_entry(writer);
> 
> Realized while replying to your latest email that this is wrong (and
> the unit-test as well!). It's not enough to just write the 0th qword
> first if it's a breaking entry, it must also sync after that 0th qword
> write.

Right.

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread
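
To make the correction above concrete, here is a rough sketch, not from the
posted series, of the breaking-entry sequence with the extra sync after the
V=0 write; write_breaking_entry(), num_qwords, v_bit and the sync() callback
are illustrative names:

static void write_breaking_entry(__le64 *entry, const __le64 *target,
				 unsigned int num_qwords, __le64 v_bit,
				 void (*sync)(void))
{
	unsigned int i;

	/* 1) Break the entry; V=0 must be visible before anything else moves */
	WRITE_ONCE(entry[0], entry[0] & ~v_bit);
	sync();

	/* 2) Rewrite every qword except qword 0 while the entry is invalid */
	for (i = 1; i != num_qwords; i++)
		WRITE_ONCE(entry[i], target[i]);
	sync();

	/* 3) Install qword 0 last, setting V=1, and sync one final time */
	WRITE_ONCE(entry[0], target[0]);
	sync();
}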

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2023-12-27 15:46                           ` Jason Gunthorpe
@ 2024-01-03 15:42                             ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-03 15:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
>        } else { // hweight8 == 0
>              set(qword_start=0,qword_end=N);
>        }
About this branch: wouldn't it be more clear to explicitly do nothing
or warn if any of the bits differ? We shouldn't ever expect to see the
STE differ in unused-bits position.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-03 15:42                             ` Michael Shavit
@ 2024-01-03 15:49                               ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-03 15:49 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Jan 03, 2024 at 11:42:50PM +0800, Michael Shavit wrote:
> On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> >
> >        } else { // hweight8 == 0
> >              set(qword_start=0,qword_end=N);
> >        }
> About this branch: wouldn't it be more clear to explicitly do nothing
> or warn if any of the bits differ? We shouldn't ever expect to see the
> STE differ in unused-bits position.

Right, it should be impossible considering the used consistency check,
but if we do hit this case due to some future bug the right thing to do
is fix it up and keep going. A WARN_ONCE would be good.

If we get here and set() does actually make a change it will be to
change an unused bit which is 1 to a 0, and we've already taken the
position in this logic that this is something we want to correct.

As before since set should skip the sync if no bits change this is
effectively a NOP.
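
A minimal way to express that, and essentially the shape the final patch later
in this thread settles on, assuming an entry_set() helper that returns whether
it changed anything:

	/* Unused bits should already be 0, so this is expected to change nothing */
	WARN_ON_ONCE(entry_set(ops, entry, target, 0, ops->num_entry_qwords));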

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-03 15:49                               ` Jason Gunthorpe
@ 2024-01-03 16:47                                 ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-03 16:47 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Wed, Jan 3, 2024 at 11:49 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Wed, Jan 03, 2024 at 11:42:50PM +0800, Michael Shavit wrote:
> > On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > >        } else { // hweight8 == 0
> > >              set(qword_start=0,qword_end=N);
> > >        }
> > About this branch: wouldn't it be more clear to explicitly do nothing
> > or warn if any of the bits differ? We shouldn't ever expect to see the
> > STE differ in unused-bits position.
>
> Right, it should be impossible considering the used consistency check,
> but if we do hit this case for some future bug the right thing to do
> is fix it up and keep going. A WARN_ONCE would be good
>
> If we get here and set does actually make a change it will be to
> change an unused bit which is 1 to a 0, and we've already taken the
> position in this logic that is something we want to correct.
>
> As before since set should skip the sync if no bits change this is
> effectively a NOP.

Yeah, I realize it's a NOP; it's just a question of clarity. Having a
call to set() without a warn or a comment could mislead readers into
thinking that this is a valid branch rather than a bug-catching
branch.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-02 14:48                               ` Jason Gunthorpe
@ 2024-01-03 16:52                                 ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-03 16:52 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Tue, Jan 2, 2024 at 10:48 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Tue, Jan 02, 2024 at 04:08:41PM +0800, Michael Shavit wrote:
> > On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 09:42:27PM +0800, Michael Shavit wrote:
> > >
> > > > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > > > +                                __le64 *cur, const __le64 *target,
> > > > +                                __le64 *staging_entry)
> > > > +{
> > > > +       bool cleanup_sync_required = false;
> > > > +       u8 entry_qwords_used_diff = 0;
> > > > +       int i = 0;
> > > > +
> > > > +       entry_qwords_used_diff =
> > > > +               writer->ops.get_used_qword_diff_indexes(cur, target);
> > > > +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> > > > +               return;
> > >
> > > A no change update is actually API legal, eg we can set the same
> > > domain twice in a row. It should just do nothing.
> > >
> > > If the goal is to improve readability I'd split this into smaller
> > > functions and have the main function look like this:
> > >
> > >        compute_used(..)
> > >        if (hweight8(entry_qwords_used_diff) > 1) {
> > >              set_v_0(..);
> > >              set(qword_start=1,qword_end=N);
> > >              set(qword_start=0,qword_end=1); // V=1
> >
> > This branch is probably a bit more complicated than that. It's a bit more like:
> >        if (hweight8(entry_qwords_used_diff) > 1) {
> >              compute_staging_entry(...);
> >              compute_used_diffs(...staging_entry...)
> >              if (hweight(entry_qwords_used_diff) > 1) {
> >                  set_v_0();
> >                  set(qword_start=1,qword_end=N);
> >                  set(qword_start=0,qword_end=1); // V=1
> >              } else {
> >                  set(qword_start=0, qword_end=N, staging_entry, entry)
> >                  critical = ffs(..);
> >                  set(qword_start=critical,qword_end=critical+1);
> >                  set(qword_start=0,qword_end=N);
> >              }
> >       }
> >
> > >        } else if (hweight8(entry_qwords_used_diff) == 1) {
> > >              set_unused(..);
> > >              critical = ffs(..);
> > >              set(qword_start=critical,qword_end=critical+1);
> > >              set(qword_start=0,qword_end=N);
> >
> > And then this branch is the case where you can directly switch to the
> > entry without first setting unused bits.
>
> Don't make that a special case, just always set the unused bits. All
> the setting functions should skip the sync if they didn't change the
> entry, so we don't need to care if we call them needlessly.
>
> There are only three programming sequences.

The different cases (ignoring clean-up) from simplest to most complex are:
1. No change because the STE is already equal to the target.
2. Directly writing critical word because that's the only difference.
3. Setting unused bits then writing critical word.
4. Installing breaking STE, write other words, write critical word.

Case 2 could potentially be collapsed into case 3 if the routine that
sets unused bits skips over the critical word, so that it's a nop when
the only change is on that critical word.
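
As a sketch of that collapse (essentially what the final patch at the end of
this thread does), assuming an entry_set() helper that skips the sync when it
changed nothing and N qwords per entry:

	critical = ffs(used_qword_diff) - 1;
	/* Leave the critical qword alone so pure case 2 costs no extra sync */
	unused_update[critical] = entry[critical];
	entry_set(ops, entry, unused_update, 0, N); /* no-op, and no sync, in case 2 */
	entry_set(ops, entry, target, critical, 1); /* the single disruptive write */
	entry_set(ops, entry, target, 0, N);        /* clear bits now unused by the target */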

> entry_qwords_used_diff should reflect required changes after setting
> the unused bits.

Ohhhhhhh, I see. Your suggestion is essentially to move this block
into the first call to get_used_qword_diff_indexes:
> > > > +               /*
> > > > +                * Compute a staging entry that has all the bits currently
> > > > +                * unused by HW set to their target values, such that committing
> > > > +                * it to the entry table wouldn't disrupt the hardware.
> > > > +                */
> > > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > > +
> > > > +               entry_qwords_used_diff =
> > > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > > +                                                               target);

Such that:
if (hweight8(entry_qwords_used_diff) > 1) => non hitless
if (hweight8(entry_qwords_used_diff) > 0) => hitless, potentially by
first setting some unused bits in non-critical qwords.

>
> > > > +       if (hweight8(entry_qwords_used_diff) > 1) {
> > > > +               /*
> > > > +                * If transitioning to the target entry with a single qword
> > > > +                * write isn't possible, then we must first transition to an
> > > > +                * intermediate entry. The intermediate entry may either be an
> > > > +                * entry that melds bits of the target entry into the current
> > > > +                * entry without disrupting the hardware, or a breaking entry if
> > > > +                * a hitless transition to the target is impossible.
> > > > +                */
> > > > +
> > > > +               /*
> > > > +                * Compute a staging entry that has all the bits currently
> > > > +                * unused by HW set to their target values, such that committing
> > > > +                * it to the entry table wouldn't disrupt the hardware.
> > > > +                */
> > > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > > +
> > > > +               entry_qwords_used_diff =
> > > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > > +                                                               target);
> > >
> > > Put the number of qwords directly in the ops struct and don't make this
> > > an op.  Above will need N=number of qwords as well.
> >
> > The reason I made get_used_qword_diff_indexes an op is because the
> > algorithm needs to compute the used_bits for entries (for the current
> > entry, the target entry as well as the melded-staging entry).
>
> Make getting the used bits the op..

Right, I initially tried making get_used_bits the op but the problem
is where to store the output used_bits without dynamic allocation.
Introducing .get_used_qword_diff_indexes and .set_unused_bits bypasses
the issue. I agree it's a bit weird though.
Are you suggesting that get_used_bits()'s output would use storage
from its ops struct? With the requirement that a new call to
get_used_bits() invalidates the result of the previous one?

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-03 16:52                                 ` Michael Shavit
@ 2024-01-03 17:50                                   ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-03 17:50 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Thu, Jan 04, 2024 at 12:52:48AM +0800, Michael Shavit wrote:
> > > And then this branch is the case where you can directly switch to the
> > > entry without first setting unused bits.
> >
> > Don't make that a special case, just always set the unused bits. All
> > the setting functions should skip the sync if they didn't change the
> > entry, so we don't need to care if we call them needlessly.
> >
> > There are only three programming sequences.
> 
> The different cases (ignoring clean-up) from simplest to least are:
> 1. No change because the STE is already equal to the target.
> 2. Directly writing critical word because that's the only difference.
> 3. Setting unused bits then writing critical word.
> 4. Installing breaking STE, write other words, write critical word.

Right

> Case 2. could potentially be collapsed into 3. if the routine that
> sets unused bits skips over the critical word, so that it's a nop when
> the only change is on that critical word.

Right

> > entry_qwords_used_diff should reflect required changes after setting
> > the unused bits.
> 
> Ohhhhhhh, I see. Your suggestion is essentially to move this block
> into the first call to get_used_qword_diff_indexes:
> > > > > +               /*
> > > > > +                * Compute a staging entry that has all the bits currently
> > > > > +                * unused by HW set to their target values, such that committing
> > > > > +                * it to the entry table wouldn't disrupt the hardware.
> > > > > +                */
> > > > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > > > +
> > > > > +               entry_qwords_used_diff =
> > > > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > > > +                                                               target);
> 
> Such that:
> if (hweight8(entry_qwords_used_diff) > 1) => non hitless
> if (hweight8(entry_qwords_used_diff) > 0) => hitless, potentially by
> first setting some unused bits in non-critical qwords.

Yes, sorry it was unclear. Here is the full version of what I mean:

struct arm_smmu_entry_writer_ops {
	unsigned int num_entry_qwords;
	__le64 v_bit;
	void (*get_used)(const __le64 *entry, __le64 *used);
	void (*sync)(void);
};

enum {
	NUM_ENTRY_QWORDS =
		((sizeof(struct arm_smmu_ste) > sizeof(struct arm_smmu_cd)) ?
			 sizeof(struct arm_smmu_ste) :
			 sizeof(struct arm_smmu_cd)) /
		sizeof(u64)
};

static bool entry_set(const struct arm_smmu_entry_writer_ops *ops,
		      __le64 *entry, const __le64 *target, unsigned int start,
		      unsigned int len)
{
	bool changed = false;

	entry = entry + start;
	target = target + start;
	for (; len != 0; len--, entry++, target++) {
		if (*entry != *target) {
			WRITE_ONCE(*entry, *target);
			changed = true;
		}
	}

	if (changed)
		ops->sync();
	return changed;
}

/*
 * Figure out if we can do a hitless update of entry to become target. Returns a
 * bit mask where 1 indicates that qword needs to be set disruptively.
 * unused_update is an intermediate value of entry that has unused bits set to
 * their new values.
 */
static u8 compute_qword_diff(const struct arm_smmu_entry_writer_ops *ops,
			     const __le64 *entry, const __le64 *target,
			     __le64 *unused_update)
{
	__le64 target_used[NUM_ENTRY_QWORDS];
	__le64 cur_used[NUM_ENTRY_QWORDS];
	u8 used_qword_diff = 0;
	unsigned int i;

	ops->get_used(entry, cur_used);
	ops->get_used(target, target_used);

	for (i = 0; i != ops->num_entry_qwords; i++) {
		/*
		 * Masks are up to date, the make functions are not allowed to
		 * set a bit to 1 if the used function doesn't say it is used.
		 */
		WARN_ON_ONCE(target[i] & ~target_used[i]);

		/* Bits can change because they are not currently being used */
		unused_update[i] = (entry[i] & cur_used[i]) |
				   (target[i] & ~cur_used[i]);
		/*
		 * Each bit indicates that a used bit in a qword needs to be
		 * changed after unused_update is applied.
		 */
		if ((unused_update[i] & target_used[i]) !=
		    (target[i] & target_used[i]))
			used_qword_diff |= 1 << i;
	}
	return used_qword_diff;
}

static void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
				 __le64 *entry, const __le64 *target)
{
	__le64 unused_update[NUM_ENTRY_QWORDS];
	u8 used_qword_diff;

	used_qword_diff = compute_qword_diff(ops, entry, target, unused_update);
	if (hweight8(used_qword_diff) > 1) {
		/*
		 * At least two qwords need their used bits to be changed. This
		 * requires a breaking update, zero the V bit, write all qwords
		 * but 0, then set qword 0
		 */
		unused_update[0] = entry[0] & (~ops->v_bit);
		entry_set(ops, entry, unused_update, 0, 1);
		entry_set(ops, entry, target, 1, ops->num_entry_qwords - 1);
		entry_set(ops, entry, target, 0, 1);
	} else if (hweight8(used_qword_diff) == 1) {
		/*
		 * Only one qword needs its used bits to be changed. This is a
		 * hitless update, update all bits the current STE is ignoring
		 * to their new values, then update a single qword to change the
		 * STE and finally zero out any bits that are now unused.
		 */
		entry_set(ops, entry, unused_update, 0, ops->num_entry_qwords);
		entry_set(ops, entry, target, ffs(used_qword_diff) - 1, 1);
		entry_set(ops, entry, target, 0, ops->num_entry_qwords);
	} else {
		/*
		 * If everything is working properly this shouldn't do anything
		 * as unused bits should always be 0 and thus
		 * can't change.
		 */
		WARN_ON_ONCE(entry_set(ops, entry, target, 0,
				       ops->num_entry_qwords));
	}
}

I'm fine with this; if you think it is better, please sort out the rest
of the bits and send me a diff and I'll integrate it.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* [PATCH] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-03 17:50                                   ` Jason Gunthorpe
@ 2024-01-06  8:36                                     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-06  8:36 UTC (permalink / raw)
  To: jgg; +Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

From: Jason Gunthorpe <jgg@nvidia.com>

As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than currently.

Thus, consolidate all the complexity here. Callers do not have to care
about what STE transition they are doing, this function will handle
everything optimally.

Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
required programming sequence to avoid creating an incoherent 'torn' STE
in the HW caches. The update algorithm follows the same design that the
driver already uses: it is safe to change bits that HW doesn't currently
use and then do a single 64 bit update, with syncs in between.

The basic idea is to express in a bitmask what bits the HW is actually
using based on the V and CFG bits. Based on that mask we know what STE
changes are safe and which are disruptive. We can count how many 64 bit
QWORDS need a disruptive update and know if a step with V=0 is required.

This gives two basic flows through the algorithm.

If only a single 64 bit quantity needs disruptive replacement:
 - Write the target value into all currently unused bits
 - Write the single 64 bit quantity
 - Zero the remaining different bits

If multiple 64 bit quantities need disruptive replacement then do:
 - Write V=0 to QWORD 0
 - Write the entire STE except QWORD 0
 - Write QWORD 0

With HW flushes at each step, that can be skipped if the STE didn't change
in that step.

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT if a hitfull change is required. This seems more appropriate as ABORT
will fail DMAs without any logging, but dropping a DMA due to transient
V=0 is probably signaling a bug, so the C_BAD_STE is valuable.

A large part of this design is motivated by supporting the IOMMU driver
API expectations for hitless STE updates on the following sequences:

 - IDENTITY -> DMA -> IDENTITY hitless with RESV_DIRECT
 - STE -> S1DSS -> STE hitless (PASID upgrade)
 - S1 -> BLOCKING -> S1 with active PASID hitless (iommufd case)
 - NESTING -> NESTING (eg to change S1DSS, change CD table pointers, etc)
 - CD ASID change hitless (BTM S1 replacement)
 - CD quiet_cd hitless (SVA mm release)

In addition, this supports cases with VMs where STE transitions are quite
broad and the VM may be assuming hitless behavior, as the native HW can do.

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Michael Shavit <mshavit@google.com>

---

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 317 ++++++++++++++++----
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  15 +
 2 files changed, 268 insertions(+), 64 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b120d836681c1..f663d2c11b8d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,6 +971,142 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+static bool entry_set(const struct arm_smmu_entry_writer_ops *ops,
+		      __le64 *entry, const __le64 *target, unsigned int start,
+		      unsigned int len)
+{
+	bool changed = false;
+	unsigned int i;
+
+	for (i = start; len != 0; len--, i++) {
+		if (entry[i] != target[i]) {
+			WRITE_ONCE(entry[i], target[i]);
+			changed = true;
+		}
+	}
+
+	if (changed)
+		ops->sync(ops);
+	return changed;
+}
+
+#define NUM_ENTRY_QWORDS (sizeof_field(struct arm_smmu_ste, data) / sizeof(u64))
+
+/*
+ * Figure out if we can do a hitless update of entry to become target. Returns a
+ * bit mask where 1 indicates that qword needs to be set disruptively.
+ * unused_update is an intermediate value of entry that has unused bits set to
+ * their new values.
+ */
+static u8 compute_qword_diff(const struct arm_smmu_entry_writer_ops *ops,
+			     const __le64 *entry, const __le64 *target,
+			     __le64 *unused_update)
+{
+	__le64 target_used[NUM_ENTRY_QWORDS];
+	__le64 cur_used[NUM_ENTRY_QWORDS];
+	u8 used_qword_diff = 0;
+	unsigned int i;
+
+	ops->get_used(ops, entry, cur_used);
+	ops->get_used(ops, target, target_used);
+
+	for (i = 0; i != ops->num_entry_qwords; i++) {
+		/*
+		 * Masks are up to date, the make functions are not allowed to
+		 * set a bit to 1 if the used function doesn't say it is used.
+		 */
+		WARN_ON_ONCE(target[i] & ~target_used[i]);
+
+		/* Bits can change because they are not currently being used */
+		unused_update[i] = (entry[i] & cur_used[i]) |
+				   (target[i] & ~cur_used[i]);
+		/*
+		 * Each bit indicates that a used bit in a qword needs to be
+		 * changed after unused_update is applied.
+		 */
+		if ((unused_update[i] & target_used[i]) !=
+		    (target[i] & target_used[i]))
+			used_qword_diff |= 1 << i;
+	}
+	return used_qword_diff;
+}
+
+/*
+ * Update the STE/CD to the target configuration. The transition from the current
+ * entry to the target entry takes place over multiple steps that attempt to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification
+ */
+void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
+			  __le64 *entry, const __le64 *target)
+{
+	__le64 unused_update[NUM_ENTRY_QWORDS];
+	u8 used_qword_diff;
+	unsigned int critical_qword_index;
+
+	used_qword_diff = compute_qword_diff(ops, entry, target, unused_update);
+	if (hweight8(used_qword_diff) > 1) {
+		/*
+		 * At least two qwords need their used bits to be changed. This
+		 * requires a breaking update, zero the V bit, write all qwords
+		 * but 0, then set qword 0
+		 */
+		unused_update[0] = entry[0] & (~ops->v_bit);
+		entry_set(ops, entry, unused_update, 0, 1);
+		entry_set(ops, entry, target, 1, ops->num_entry_qwords - 1);
+		entry_set(ops, entry, target, 0, 1);
+	} else if (hweight8(used_qword_diff) == 1) {
+		/*
+		 * Only one qword needs its used bits to be changed. This is a
+		 * hitless update, update all bits the current STE is ignoring
+		 * to their new values, then update a single qword to change the
+		 * STE and finally 0 out any bits that are now unused in the
+		 * target configuration.
+		 */
+		critical_qword_index = ffs(used_qword_diff) - 1;
+		/*
+		 * Skip writing unused bits in the critical qword since we'll be
+		 * writing it in the next step anyways. This can save a sync
+		 * when the only change is in that qword.
+		 */
+		unused_update[critical_qword_index] = entry[critical_qword_index];
+		entry_set(ops, entry, unused_update, 0, ops->num_entry_qwords);
+		entry_set(ops, entry, target, critical_qword_index, 1);
+		entry_set(ops, entry, target, 0, ops->num_entry_qwords);
+	} else {
+		/*
+		 * If everything is working properly this shouldn't do anything
+		 * as unused bits should always be 0 and thus can't change.
+		 */
+		WARN_ON_ONCE(entry_set(ops, entry, target, 0,
+				       ops->num_entry_qwords));
+	}
+}
+
+#undef NUM_ENTRY_QWORDS
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 			     int ssid, bool leaf)
 {
@@ -1248,37 +1384,119 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
-static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      struct arm_smmu_ste *dst)
+/*
+ * Based on the value of ent report which bits of the STE the HW will access. It
+ * would be nice if this was complete according to the spec, but minimally it
+ * has to capture the bits this driver uses.
+ */
+void arm_smmu_get_ste_used(const struct arm_smmu_entry_writer_ops *ops,
+			   const __le64 *ent, __le64 *used_bits)
 {
+	memset(used_bits, 0, ops->num_entry_qwords * sizeof(*used_bits));
+
+	used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
+	if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
+		return;
+
 	/*
-	 * This is hideously complicated, but we only really care about
-	 * three cases at the moment:
-	 *
-	 * 1. Invalid (all zero) -> bypass/fault (init)
-	 * 2. Bypass/fault -> translation/bypass (attach)
-	 * 3. Translation/bypass -> bypass/fault (detach)
-	 *
-	 * Given that we can't update the STE atomically and the SMMU
-	 * doesn't read the thing in a defined order, that leaves us
-	 * with the following maintenance requirements:
-	 *
-	 * 1. Update Config, return (init time STEs aren't live)
-	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
-	 * 3. Update Config, sync
+	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
+	 * attribute/permission configuration fields for the SHCFG behavior.
 	 */
-	u64 val = le64_to_cpu(dst->data[0]);
-	bool ste_live = false;
+	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
+	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
+		    STRTAB_STE_1_S1DSS_BYPASS)
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+
+	used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
+	case STRTAB_STE_0_CFG_ABORT:
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+		break;
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+					    STRTAB_STE_0_S1CTXPTR_MASK |
+					    STRTAB_STE_0_S1CDMAX);
+		used_bits[1] |=
+			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+		break;
+	case STRTAB_STE_0_CFG_S2_TRANS:
+		used_bits[1] |=
+			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+		used_bits[2] |=
+			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+		used_bits[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+		break;
+
+	default:
+		memset(used_bits, 0xFF,
+		       ops->num_entry_qwords * sizeof(*used_bits));
+		WARN_ON(true);
+	}
+}
+
+struct arm_smmu_ste_writer {
+	struct arm_smmu_entry_writer_ops ops;
+	struct arm_smmu_device *smmu;
+	u32 sid;
+};
+
+static void
+arm_smmu_ste_writer_sync_entry(const struct arm_smmu_entry_writer_ops *ops)
+{
+	struct arm_smmu_ste_writer *ste_writer =
+		container_of(ops, struct arm_smmu_ste_writer, ops);
+
+	arm_smmu_sync_ste_for_sid(ste_writer->smmu, ste_writer->sid);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
+	.sync = arm_smmu_ste_writer_sync_entry,
+	.get_used = arm_smmu_get_ste_used,
+	.v_bit = cpu_to_le64(STRTAB_STE_0_V),
+	.num_entry_qwords =
+		sizeof_field(struct arm_smmu_ste, data) / sizeof(u64),
+};
+
+static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
+			       struct arm_smmu_ste *ste,
+			       const struct arm_smmu_ste *target)
+{
+	struct arm_smmu_ste_writer ste_writer = {
+		.ops = arm_smmu_ste_writer_ops,
+		.smmu = smmu,
+		.sid = sid,
+	};
+
+	arm_smmu_write_entry(&ste_writer.ops, ste->data, target->data);
+
+	/* It's likely that we'll want to use the new STE soon */
+	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+		struct arm_smmu_cmdq_ent
+			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+					 .prefetch = {
+						 .sid = sid,
+					 } };
+
+		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	}
+}
+
+static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
+				      struct arm_smmu_ste *dst)
+{
+	u64 val;
 	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_cmdq_ent prefetch_cmd = {
-		.opcode		= CMDQ_OP_PREFETCH_CFG,
-		.prefetch	= {
-			.sid	= sid,
-		},
-	};
+	struct arm_smmu_ste target = {};
 
 	if (smmu_domain) {
 		switch (smmu_domain->stage) {
@@ -1293,22 +1511,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		}
 	}
 
-	if (val & STRTAB_STE_0_V) {
-		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-		case STRTAB_STE_0_CFG_BYPASS:
-			break;
-		case STRTAB_STE_0_CFG_S1_TRANS:
-		case STRTAB_STE_0_CFG_S2_TRANS:
-			ste_live = true;
-			break;
-		case STRTAB_STE_0_CFG_ABORT:
-			BUG_ON(!disable_bypass);
-			break;
-		default:
-			BUG(); /* STE corruption */
-		}
-	}
-
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
@@ -1319,16 +1521,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
-		dst->data[0] = cpu_to_le64(val);
-		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+		target.data[0] = cpu_to_le64(val);
+		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 						STRTAB_STE_1_SHCFG_INCOMING));
-		dst->data[2] = 0; /* Nuke the VMID */
-		/*
-		 * The SMMU can perform negative caching, so we must sync
-		 * the STE regardless of whether the old value was live.
-		 */
-		if (smmu)
-			arm_smmu_sync_ste_for_sid(smmu, sid);
+		target.data[2] = 0; /* Nuke the VMID */
+		arm_smmu_write_ste(smmu, sid, dst, &target);
 		return;
 	}
 
@@ -1336,8 +1533,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
-		BUG_ON(ste_live);
-		dst->data[1] = cpu_to_le64(
+		target.data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1542,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1551,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s2_cfg) {
-		BUG_ON(ste_live);
-		dst->data[2] = cpu_to_le64(
+		target.data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1560,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-
-	/* It's likely that we'll want to use the new STE soon */
-	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	target.data[0] = cpu_to_le64(val);
+	arm_smmu_write_ste(smmu, sid, dst, &target);
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 27ddf1acd12ce..565a38a61333c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -739,6 +739,21 @@ struct arm_smmu_domain {
 	struct list_head		mmu_notifiers;
 };
 
+/* The following are exposed for testing purposes. */
+struct arm_smmu_entry_writer_ops;
+struct arm_smmu_entry_writer_ops {
+	unsigned int num_entry_qwords;
+	__le64 v_bit;
+	void (*get_used)(const struct arm_smmu_entry_writer_ops *ops,
+			 const __le64 *entry, __le64 *used);
+	void (*sync)(const struct arm_smmu_entry_writer_ops *ops);
+};
+
+void arm_smmu_get_ste_used(const struct arm_smmu_entry_writer_ops *ops,
+			   const __le64 *ent, __le64 *used_bits);
+void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
+			  __le64 *cur, const __le64 *target);
+
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
 	return container_of(dom, struct arm_smmu_domain, domain);

base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
prerequisite-patch-id: 3bc3d332ed043fbe64543bda7c7e734e19ba46aa
prerequisite-patch-id: bb900133a10e40d3136e104b19c430442c4e2647
prerequisite-patch-id: 9ec5907dd0348b00f9341a63490bdafd99a403ca
prerequisite-patch-id: dc50ec47974c35de431b80b83b501c4ca63758a3
prerequisite-patch-id: 371b31533a5abf8e1b8dc8568ffa455d16b611c6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 7743327071a8d8fb04cc43887fe61432f42eb60d
prerequisite-patch-id: c74e8e54bd5391ef40e0a92f25db0822b421dd6a
prerequisite-patch-id: 3ce8237727e2ce08261352c6b492a9bcf73651c4
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH] iommu/arm-smmu-v3: Make STE programming independent of the callers
@ 2024-01-06  8:36                                     ` Michael Shavit
  0 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-06  8:36 UTC (permalink / raw)
  To: jgg; +Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

From: Jason Gunthorpe <jgg@nvidia.com>

As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than currently.

Thus, consolidate all the complexity here. Callers do not have to care
about what STE transition they are doing; this function will handle
everything optimally.

Revise arm_smmu_write_strtab_ent() so it algorithmically computes the
required programming sequence to avoid creating an incoherent 'torn' STE
in the HW caches. The update algorithm follows the same design that the
driver already uses: it is safe to change bits that HW doesn't currently
use and then do a single 64 bit update, with syncs in between.

The basic idea is to express in a bitmask what bits the HW is actually
using based on the V and CFG bits. Based on that mask we know what STE
changes are safe and which are disruptive. We can count how many 64 bit
QWORDS need a disruptive update and know if a step with V=0 is required.

This gives two basic flows through the algorithm.

If only a single 64 bit quantity needs disruptive replacement:
 - Write the target value into all currently unused bits
 - Write the single 64 bit quantity
 - Zero the remaining different bits

If multiple 64 bit quantities need disruptive replacement then do:
 - Write V=0 to QWORD 0
 - Write the entire STE except QWORD 0
 - Write QWORD 0

A HW flush is issued at each step, and the flush can be skipped if the STE
didn't change in that step (a standalone sketch of this selection follows).
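
The sketch below is not kernel code: the two-qword layout, the masks and all
the values are invented purely for illustration, and the real driver derives
its used-bit masks from the V and CFG fields of the STE/CD. It only models
how the qword diff picks between the two flows:

  /* flow_select.c - toy model of compute_qword_diff() and flow selection */
  #include <stdint.h>
  #include <stdio.h>

  #define NUM_QWORDS 2
  #define V_BIT      0x1ULL

  /* Toy "used bits": only V is used while V=0, everything while V=1 */
  static void get_used(const uint64_t *ent, uint64_t *used)
  {
          used[0] = V_BIT;
          used[1] = 0;
          if (ent[0] & V_BIT)
                  used[0] = used[1] = ~0ULL;
  }

  static unsigned int qword_diff(const uint64_t *cur, const uint64_t *tgt,
                                 uint64_t *unused_update)
  {
          uint64_t cur_used[NUM_QWORDS], tgt_used[NUM_QWORDS];
          unsigned int diff = 0;

          get_used(cur, cur_used);
          get_used(tgt, tgt_used);
          for (int i = 0; i != NUM_QWORDS; i++) {
                  /* bits the HW ignores right now can take their new value */
                  unused_update[i] = (cur[i] & cur_used[i]) |
                                     (tgt[i] & ~cur_used[i]);
                  /* does a bit the HW will look at still need to change? */
                  if ((unused_update[i] & tgt_used[i]) != (tgt[i] & tgt_used[i]))
                          diff |= 1u << i;
          }
          return diff;
  }

  int main(void)
  {
          uint64_t cur[NUM_QWORDS] = { V_BIT | 0xA0, 0xBB };
          uint64_t tgt[NUM_QWORDS] = { V_BIT | 0xA0, 0xCC }; /* qword 1 differs */
          uint64_t unused_update[NUM_QWORDS];
          unsigned int diff = qword_diff(cur, tgt, unused_update);

          if (__builtin_popcount(diff) > 1)
                  printf("breaking update: V=0, write qwords 1..N, write qword 0\n");
          else if (diff)
                  printf("hitless: set unused bits, rewrite qword %d, zero leftovers\n",
                         __builtin_ctz(diff));
          else
                  printf("no used bit changes, nothing disruptive needed\n");
          return 0;
  }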

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT when a non-hitless change is required. This seems more appropriate:
ABORT fails DMAs without any logging, while a DMA dropped during the
transient V=0 window generates a C_BAD_STE event, and since such a DMA
probably signals a bug that event is valuable.

A large part of this design is motivated by supporting the IOMMU driver
API expectations for hitless STE updates on the following sequences:

 - IDENTITY -> DMA -> IDENTITY hitless with RESV_DIRECT
 - STE -> S1DSS -> STE hitless (PASID upgrade)
 - S1 -> BLOCKING -> S1 with active PASID hitless (iommufd case)
 - NESTING -> NESTING (eg to change S1DSS, change CD table pointers, etc)
 - CD ASID change hitless (BTM S1 replacement)
 - CD quiet_cd hitless (SVA mm release)

In addition, this supports cases with VMs where STE transitions are quite
broad and the VM may assume they are hitless, since the native HW can make
such changes hitlessly.

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Michael Shavit <mshavit@google.com>

---

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 317 ++++++++++++++++----
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  15 +
 2 files changed, 268 insertions(+), 64 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index b120d836681c1..f663d2c11b8d0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,6 +971,142 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+static bool entry_set(const struct arm_smmu_entry_writer_ops *ops,
+		      __le64 *entry, const __le64 *target, unsigned int start,
+		      unsigned int len)
+{
+	bool changed = false;
+	unsigned int i;
+
+	for (i = start; len != 0; len--, i++) {
+		if (entry[i] != target[i]) {
+			WRITE_ONCE(entry[i], target[i]);
+			changed = true;
+		}
+	}
+
+	if (changed)
+		ops->sync(ops);
+	return changed;
+}
+
+#define NUM_ENTRY_QWORDS (sizeof_field(struct arm_smmu_ste, data) / sizeof(u64))
+
+/*
+ * Figure out if we can do a hitless update of entry to become target. Returns a
+ * bit mask where 1 indicates that qword needs to be set disruptively.
+ * unused_update is an intermediate value of entry that has unused bits set to
+ * their new values.
+ */
+static u8 compute_qword_diff(const struct arm_smmu_entry_writer_ops *ops,
+			     const __le64 *entry, const __le64 *target,
+			     __le64 *unused_update)
+{
+	__le64 target_used[NUM_ENTRY_QWORDS];
+	__le64 cur_used[NUM_ENTRY_QWORDS];
+	u8 used_qword_diff = 0;
+	unsigned int i;
+
+	ops->get_used(ops, entry, cur_used);
+	ops->get_used(ops, target, target_used);
+
+	for (i = 0; i != ops->num_entry_qwords; i++) {
+		/*
+		 * Masks are up to date, the make functions are not allowed to
+		 * set a bit to 1 if the used function doesn't say it is used.
+		 */
+		WARN_ON_ONCE(target[i] & ~target_used[i]);
+
+		/* Bits can change because they are not currently being used */
+		unused_update[i] = (entry[i] & cur_used[i]) |
+				   (target[i] & ~cur_used[i]);
+		/*
+		 * Each bit indicates that a used bit in a qword needs to be
+		 * changed after unused_update is applied.
+		 */
+		if ((unused_update[i] & target_used[i]) !=
+		    (target[i] & target_used[i]))
+			used_qword_diff |= 1 << i;
+	}
+	return used_qword_diff;
+}
+
+/*
+ * Update the STE/CD to the target configuration. The transition from the current
+ * entry to the target entry takes place over multiple steps that attempt to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have a 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused bits, all bits except V
+ *  - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification
+ */
+void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
+			  __le64 *entry, const __le64 *target)
+{
+	__le64 unused_update[NUM_ENTRY_QWORDS];
+	u8 used_qword_diff;
+	unsigned int critical_qword_index;
+
+	used_qword_diff = compute_qword_diff(ops, entry, target, unused_update);
+	if (hweight8(used_qword_diff) > 1) {
+		/*
+		 * At least two qwords need their used bits to be changed. This
+		 * requires a breaking update, zero the V bit, write all qwords
+		 * but 0, then set qword 0
+		 */
+		unused_update[0] = entry[0] & (~ops->v_bit);
+		entry_set(ops, entry, unused_update, 0, 1);
+		entry_set(ops, entry, target, 1, ops->num_entry_qwords - 1);
+		entry_set(ops, entry, target, 0, 1);
+	} else if (hweight8(used_qword_diff) == 1) {
+		/*
+		 * Only one qword needs its used bits to be changed. This is a
+		 * hitless update, update all bits the current STE is ignoring
+		 * to their new values, then update a single qword to change the
+		 * STE and finally 0 out any bits that are now unused in the
+		 * target configuration.
+		 */
+		critical_qword_index = ffs(used_qword_diff) - 1;
+		/*
+		 * Skip writing unused bits in the critical qword since we'll be
+		 * writing it in the next step anyways. This can save a sync
+		 * when the only change is in that qword.
+		 */
+		unused_update[critical_qword_index] = entry[critical_qword_index];
+		entry_set(ops, entry, unused_update, 0, ops->num_entry_qwords);
+		entry_set(ops, entry, target, critical_qword_index, 1);
+		entry_set(ops, entry, target, 0, ops->num_entry_qwords);
+	} else {
+		/*
+		 * If everything is working properly this shouldn't do anything
+		 * as unused bits should always be 0 and thus can't change.
+		 */
+		WARN_ON_ONCE(entry_set(ops, entry, target, 0,
+				       ops->num_entry_qwords));
+	}
+}
+
+#undef NUM_ENTRY_QWORDS
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 			     int ssid, bool leaf)
 {
@@ -1248,37 +1384,119 @@ static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
-static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      struct arm_smmu_ste *dst)
+/*
+ * Based on the value of ent report which bits of the STE the HW will access. It
+ * would be nice if this was complete according to the spec, but minimally it
+ * has to capture the bits this driver uses.
+ */
+void arm_smmu_get_ste_used(const struct arm_smmu_entry_writer_ops *ops,
+			   const __le64 *ent, __le64 *used_bits)
 {
+	memset(used_bits, 0, ops->num_entry_qwords * sizeof(*used_bits));
+
+	used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
+	if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
+		return;
+
 	/*
-	 * This is hideously complicated, but we only really care about
-	 * three cases at the moment:
-	 *
-	 * 1. Invalid (all zero) -> bypass/fault (init)
-	 * 2. Bypass/fault -> translation/bypass (attach)
-	 * 3. Translation/bypass -> bypass/fault (detach)
-	 *
-	 * Given that we can't update the STE atomically and the SMMU
-	 * doesn't read the thing in a defined order, that leaves us
-	 * with the following maintenance requirements:
-	 *
-	 * 1. Update Config, return (init time STEs aren't live)
-	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
-	 * 3. Update Config, sync
+	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
+	 * attribute/permission configuration fields for the SHCFG behavior.
 	 */
-	u64 val = le64_to_cpu(dst->data[0]);
-	bool ste_live = false;
+	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
+	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
+		    STRTAB_STE_1_S1DSS_BYPASS)
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+
+	used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
+	case STRTAB_STE_0_CFG_ABORT:
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+		break;
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+					    STRTAB_STE_0_S1CTXPTR_MASK |
+					    STRTAB_STE_0_S1CDMAX);
+		used_bits[1] |=
+			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+		break;
+	case STRTAB_STE_0_CFG_S2_TRANS:
+		used_bits[1] |=
+			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+		used_bits[2] |=
+			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+		used_bits[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+		break;
+
+	default:
+		memset(used_bits, 0xFF,
+		       ops->num_entry_qwords * sizeof(*used_bits));
+		WARN_ON(true);
+	}
+}
+
+struct arm_smmu_ste_writer {
+	struct arm_smmu_entry_writer_ops ops;
+	struct arm_smmu_device *smmu;
+	u32 sid;
+};
+
+static void
+arm_smmu_ste_writer_sync_entry(const struct arm_smmu_entry_writer_ops *ops)
+{
+	struct arm_smmu_ste_writer *ste_writer =
+		container_of(ops, struct arm_smmu_ste_writer, ops);
+
+	arm_smmu_sync_ste_for_sid(ste_writer->smmu, ste_writer->sid);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
+	.sync = arm_smmu_ste_writer_sync_entry,
+	.get_used = arm_smmu_get_ste_used,
+	.v_bit = cpu_to_le64(STRTAB_STE_0_V),
+	.num_entry_qwords =
+		sizeof_field(struct arm_smmu_ste, data) / sizeof(u64),
+};
+
+static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
+			       struct arm_smmu_ste *ste,
+			       const struct arm_smmu_ste *target)
+{
+	struct arm_smmu_ste_writer ste_writer = {
+		.ops = arm_smmu_ste_writer_ops,
+		.smmu = smmu,
+		.sid = sid,
+	};
+
+	arm_smmu_write_entry(&ste_writer.ops, ste->data, target->data);
+
+	/* It's likely that we'll want to use the new STE soon */
+	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+		struct arm_smmu_cmdq_ent
+			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+					 .prefetch = {
+						 .sid = sid,
+					 } };
+
+		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	}
+}
+
+static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
+				      struct arm_smmu_ste *dst)
+{
+	u64 val;
 	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_cmdq_ent prefetch_cmd = {
-		.opcode		= CMDQ_OP_PREFETCH_CFG,
-		.prefetch	= {
-			.sid	= sid,
-		},
-	};
+	struct arm_smmu_ste target = {};
 
 	if (smmu_domain) {
 		switch (smmu_domain->stage) {
@@ -1293,22 +1511,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		}
 	}
 
-	if (val & STRTAB_STE_0_V) {
-		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-		case STRTAB_STE_0_CFG_BYPASS:
-			break;
-		case STRTAB_STE_0_CFG_S1_TRANS:
-		case STRTAB_STE_0_CFG_S2_TRANS:
-			ste_live = true;
-			break;
-		case STRTAB_STE_0_CFG_ABORT:
-			BUG_ON(!disable_bypass);
-			break;
-		default:
-			BUG(); /* STE corruption */
-		}
-	}
-
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
@@ -1319,16 +1521,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
-		dst->data[0] = cpu_to_le64(val);
-		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+		target.data[0] = cpu_to_le64(val);
+		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 						STRTAB_STE_1_SHCFG_INCOMING));
-		dst->data[2] = 0; /* Nuke the VMID */
-		/*
-		 * The SMMU can perform negative caching, so we must sync
-		 * the STE regardless of whether the old value was live.
-		 */
-		if (smmu)
-			arm_smmu_sync_ste_for_sid(smmu, sid);
+		target.data[2] = 0; /* Nuke the VMID */
+		arm_smmu_write_ste(smmu, sid, dst, &target);
 		return;
 	}
 
@@ -1336,8 +1533,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
-		BUG_ON(ste_live);
-		dst->data[1] = cpu_to_le64(
+		target.data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1346,7 +1542,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1355,8 +1551,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s2_cfg) {
-		BUG_ON(ste_live);
-		dst->data[2] = cpu_to_le64(
+		target.data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1365,23 +1560,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-
-	/* It's likely that we'll want to use the new STE soon */
-	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	target.data[0] = cpu_to_le64(val);
+	arm_smmu_write_ste(smmu, sid, dst, &target);
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 27ddf1acd12ce..565a38a61333c 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -739,6 +739,21 @@ struct arm_smmu_domain {
 	struct list_head		mmu_notifiers;
 };
 
+/* The following are exposed for testing purposes. */
+struct arm_smmu_entry_writer_ops;
+struct arm_smmu_entry_writer_ops {
+	unsigned int num_entry_qwords;
+	__le64 v_bit;
+	void (*get_used)(const struct arm_smmu_entry_writer_ops *ops,
+			 const __le64 *entry, __le64 *used);
+	void (*sync)(const struct arm_smmu_entry_writer_ops *ops);
+};
+
+void arm_smmu_get_ste_used(const struct arm_smmu_entry_writer_ops *ops,
+			   const __le64 *ent, __le64 *used_bits);
+void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
+			  __le64 *cur, const __le64 *target);
+
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
 	return container_of(dom, struct arm_smmu_domain, domain);

base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
prerequisite-patch-id: 3bc3d332ed043fbe64543bda7c7e734e19ba46aa
prerequisite-patch-id: bb900133a10e40d3136e104b19c430442c4e2647
prerequisite-patch-id: 9ec5907dd0348b00f9341a63490bdafd99a403ca
prerequisite-patch-id: dc50ec47974c35de431b80b83b501c4ca63758a3
prerequisite-patch-id: 371b31533a5abf8e1b8dc8568ffa455d16b611c6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 7743327071a8d8fb04cc43887fe61432f42eb60d
prerequisite-patch-id: c74e8e54bd5391ef40e0a92f25db0822b421dd6a
prerequisite-patch-id: 3ce8237727e2ce08261352c6b492a9bcf73651c4
-- 
2.43.0.472.g3155946c3a-goog


_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH] iommu/arm-smmu-v3: Make CD programming use arm_smmu_write_entry_step()
  2024-01-06  8:36                                     ` Michael Shavit
@ 2024-01-06  8:36                                       ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-06  8:36 UTC (permalink / raw)
  To: jgg; +Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

From: Jason Gunthorpe <jgg@nvidia.com>

CD table entries and STEs have the same essential programming sequence,
just with different types and sizes.

Have arm_smmu_write_ctx_desc() generate a target CD and call
arm_smmu_write_entry_step() to do the programming. Because the target CD
is generated by modifying the existing CD, this alone is not enough to
free the CD callers of the ordering requirements.

The following patches will make the rest of the CD flow mirror the STE
flow with precise CD contents generated in all cases.
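
As an aside, here is a tiny standalone sketch (not kernel code) of the
pattern this reuse relies on: one update routine parameterized by
per-entry-type ops. The names and the trivial copy loop below are invented
for illustration; the real arm_smmu_write_entry() keeps its multi-step
hitless algorithm and is simply handed arm_smmu_cd_writer_ops instead of
arm_smmu_ste_writer_ops:

  #include <stdint.h>
  #include <stdio.h>

  struct writer_ops {
          unsigned int num_qwords;
          void (*sync)(const struct writer_ops *ops);
  };

  /* stand-in for arm_smmu_write_entry(); the hitless logic is elided */
  static void write_entry(const struct writer_ops *ops, uint64_t *cur,
                          const uint64_t *target)
  {
          for (unsigned int i = 0; i != ops->num_qwords; i++)
                  cur[i] = target[i];
          ops->sync(ops);            /* invalidation specific to the entry type */
  }

  static void ste_sync(const struct writer_ops *ops) { printf("sync STE\n"); }
  static void cd_sync(const struct writer_ops *ops)  { printf("sync CD\n"); }

  int main(void)
  {
          const struct writer_ops ste_ops = { 8, ste_sync };
          const struct writer_ops cd_ops  = { 8, cd_sync };
          uint64_t ste[8] = { 0 }, ste_tgt[8] = { 1 };
          uint64_t cd[8]  = { 0 }, cd_tgt[8]  = { 1 };

          write_entry(&ste_ops, ste, ste_tgt);   /* same routine ...   */
          write_entry(&cd_ops, cd, cd_tgt);      /* ... for both types */
          return 0;
  }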

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Michael Shavit <mshavit@google.com>
---

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 90 +++++++++++++++------
 1 file changed, 67 insertions(+), 23 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index c9559c4075b4b..5a598500b5c6d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -23,6 +23,7 @@
 #include <linux/of.h>
 #include <linux/of_address.h>
 #include <linux/of_platform.h>
+#include <linux/minmax.h>
 #include <linux/pci.h>
 #include <linux/pci-ats.h>
 #include <linux/platform_device.h>
@@ -994,7 +995,9 @@ static bool entry_set(const struct arm_smmu_entry_writer_ops *ops,
 	return changed;
 }
 
-#define NUM_ENTRY_QWORDS (sizeof_field(struct arm_smmu_ste, data) / sizeof(u64))
+#define NUM_ENTRY_QWORDS (max(sizeof_field(struct arm_smmu_ste, data), \
+			     sizeof_field(struct arm_smmu_cd, data)) \
+			     / sizeof(u64))
 
 /*
  * Figure out if we can do a hitless update of entry to become target. Returns a
@@ -1187,6 +1190,61 @@ static struct arm_smmu_cd *arm_smmu_get_cd_ptr(struct arm_smmu_master *master,
 	return &l1_desc->l2ptr[idx];
 }
 
+static void arm_smmu_get_cd_used(const struct arm_smmu_entry_writer_ops *ops,
+				 const __le64 *ent, __le64 *used_bits)
+{
+	memset(used_bits, 0, ops->num_entry_qwords * sizeof(*used_bits));
+
+	used_bits[0] = cpu_to_le64(CTXDESC_CD_0_V);
+	if (!(ent[0] & cpu_to_le64(CTXDESC_CD_0_V)))
+		return;
+	memset(used_bits, 0xFF, ops->num_entry_qwords * sizeof(*used_bits));
+
+	/* EPD0 means T0SZ/TG0/IR0/OR0/SH0/TTB0 are IGNORED */
+	if (ent[0] & cpu_to_le64(CTXDESC_CD_0_TCR_EPD0)) {
+		used_bits[0] &= ~cpu_to_le64(
+			CTXDESC_CD_0_TCR_T0SZ | CTXDESC_CD_0_TCR_TG0 |
+			CTXDESC_CD_0_TCR_IRGN0 | CTXDESC_CD_0_TCR_ORGN0 |
+			CTXDESC_CD_0_TCR_SH0);
+		used_bits[1] &= ~cpu_to_le64(CTXDESC_CD_1_TTB0_MASK);
+	}
+}
+
+struct arm_smmu_cd_writer {
+	struct arm_smmu_entry_writer_ops ops;
+	struct arm_smmu_master *master;
+	int ssid;
+};
+
+static void arm_smmu_cd_writer_sync_entry(const struct arm_smmu_entry_writer_ops *ops)
+{
+	struct arm_smmu_cd_writer *cd_writer =
+		container_of(ops, struct arm_smmu_cd_writer, ops);
+
+	arm_smmu_sync_cd(cd_writer->master, cd_writer->ssid, true);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_cd_writer_ops = {
+	.sync = arm_smmu_cd_writer_sync_entry,
+	.get_used = arm_smmu_get_cd_used,
+	.v_bit = cpu_to_le64(CTXDESC_CD_0_V),
+	.num_entry_qwords =
+		sizeof_field(struct arm_smmu_cd, data) / sizeof(u64),
+};
+
+static void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
+				    struct arm_smmu_cd *cdptr,
+				    const struct arm_smmu_cd *target)
+{
+	struct arm_smmu_cd_writer cd_writer = {
+		.ops = arm_smmu_cd_writer_ops,
+		.master = master,
+		.ssid = ssid,
+	};
+
+	arm_smmu_write_entry(&cd_writer.ops, cdptr->data, target->data);
+}
+
 int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
 			    struct arm_smmu_ctx_desc *cd)
 {
@@ -1203,16 +1261,19 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
 	 */
 	u64 val;
 	bool cd_live;
-	struct arm_smmu_cd *cdptr;
+	struct arm_smmu_cd target;
+	struct arm_smmu_cd *cdptr = &target;
+	struct arm_smmu_cd *cd_table_entry;
 	struct arm_smmu_ctx_desc_cfg *cd_table = &master->cd_table;
 
 	if (WARN_ON(ssid >= (1 << cd_table->s1cdmax)))
 		return -E2BIG;
 
-	cdptr = arm_smmu_get_cd_ptr(master, ssid);
-	if (!cdptr)
+	cd_table_entry = arm_smmu_get_cd_ptr(master, ssid);
+	if (!cd_table_entry)
 		return -ENOMEM;
 
+	target = *cd_table_entry;
 	val = le64_to_cpu(cdptr->data[0]);
 	cd_live = !!(val & CTXDESC_CD_0_V);
 
@@ -1232,13 +1293,6 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
 		cdptr->data[2] = 0;
 		cdptr->data[3] = cpu_to_le64(cd->mair);
 
-		/*
-		 * STE may be live, and the SMMU might read dwords of this CD in any
-		 * order. Ensure that it observes valid values before reading
-		 * V=1.
-		 */
-		arm_smmu_sync_cd(master, ssid, true);
-
 		val = cd->tcr |
 #ifdef __BIG_ENDIAN
 			CTXDESC_CD_0_ENDI |
@@ -1252,18 +1306,8 @@ int arm_smmu_write_ctx_desc(struct arm_smmu_master *master, int ssid,
 		if (cd_table->stall_enabled)
 			val |= CTXDESC_CD_0_S;
 	}
-
-	/*
-	 * The SMMU accesses 64-bit values atomically. See IHI0070Ca 3.21.3
-	 * "Configuration structures and configuration invalidation completion"
-	 *
-	 *   The size of single-copy atomic reads made by the SMMU is
-	 *   IMPLEMENTATION DEFINED but must be at least 64 bits. Any single
-	 *   field within an aligned 64-bit span of a structure can be altered
-	 *   without first making the structure invalid.
-	 */
-	WRITE_ONCE(cdptr->data[0], cpu_to_le64(val));
-	arm_smmu_sync_cd(master, ssid, true);
+	cdptr->data[0] = cpu_to_le64(val);
+	arm_smmu_write_cd_entry(master, ssid, cd_table_entry, &target);
 	return 0;
 }
 

base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
prerequisite-patch-id: 3bc3d332ed043fbe64543bda7c7e734e19ba46aa
prerequisite-patch-id: bb900133a10e40d3136e104b19c430442c4e2647
prerequisite-patch-id: 9ec5907dd0348b00f9341a63490bdafd99a403ca
prerequisite-patch-id: dc50ec47974c35de431b80b83b501c4ca63758a3
prerequisite-patch-id: 371b31533a5abf8e1b8dc8568ffa455d16b611c6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 7743327071a8d8fb04cc43887fe61432f42eb60d
prerequisite-patch-id: c74e8e54bd5391ef40e0a92f25db0822b421dd6a
prerequisite-patch-id: 3ce8237727e2ce08261352c6b492a9bcf73651c4
prerequisite-patch-id: d6342ff93ec8850ce76e45f1e22d143208bfa13c
prerequisite-patch-id: 6d2c59c2fdb9ae9e09fb042148f57b12d5058c9e
prerequisite-patch-id: f86746e1c19fba223fe2e559fc0f3ecf6fc7cc47
prerequisite-patch-id: 2d43b690a831e369547d10cf08a8e785fc4c1b69
prerequisite-patch-id: ae154d0d43beba4483f29747aecceae853657561
prerequisite-patch-id: 1ac7f3a4007a4ff64813e1a117ee6f16c28695bc
prerequisite-patch-id: ed34d0ebe0b56869508698367a26bd9e913394eb
prerequisite-patch-id: 658bad2b9692a0f959ee73e2d3798a34f16c9f11
prerequisite-patch-id: 4d83a8451a41ee3d597f1e6be1457f695b738b76
prerequisite-patch-id: d3b421dc985d58dbaaef46ec6d16b4a2764424ea
prerequisite-patch-id: ac7aab762dcd10fcc241be07503abae66f5912c8
prerequisite-patch-id: 34877d560c1c74de6e6875bdd719dafebb620732
prerequisite-patch-id: 9864c8f72ae9de7d6caf90096cf015ad0199ea7e
prerequisite-patch-id: fa730102c85dc93ce0c9e7b4128d08dc09306192
prerequisite-patch-id: 8c1a8a32e9be9b282727985a542afe4766c4afd5
prerequisite-patch-id: ac25e540981c4015261293bd5502ab39f0b6d9e6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 245dbf34f0d60634846534ce846baa39ff91f6dc
prerequisite-patch-id: 879c03c00f0023fcddfc8194692cd5706be4b893
prerequisite-patch-id: 6aa6a678f8c0d9ff3ce278d27342742ec352e95d
prerequisite-patch-id: ccb225b386bb12bf442a8ac9096aabc4b2c6058c
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry
  2024-01-06  8:36                                     ` Michael Shavit
@ 2024-01-06  8:36                                       ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-06  8:36 UTC (permalink / raw)
  To: jgg; +Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

Add tests for some of the more common STE update operations that we
expect to see, as well as some artificial STE updates to test the edges
of arm_smmu_write_entry. These also serve as a record of which common
operations are expected to be hitless, and how many syncs they require.

arm_smmu_write_entry implements a generic algorithm that updates an
STE/CD to any other arbitrary STE/CD configuration. The update requires
a sequence of write+sync operations, with some invariants that must be
held true after each sync. arm_smmu_write_entry lends itself well to
unit-testing since the function's interaction with the STE/CD is already
abstracted by input callbacks that we can hook to introspect into the
sequence of operations. We can use these hooks to guarantee that
invariants are held throughout the entire update operation.
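
As a usage note, and assuming the standard kunit_tool workflow (none of the
paths or flags below are set up by this patch, so adjust them for your
tree), the suite can be selected with a hand-written .kunitconfig fragment
that enables the new Kconfig symbol:

  CONFIG_KUNIT=y
  CONFIG_ARM_SMMU_V3=y
  CONFIG_ARM_SMMU_V3_KUNIT_TEST=y

and then run under an arm64 build with something along the lines of:

  ./tools/testing/kunit/kunit.py run --arch=arm64 \
          --kunitconfig=/path/to/my-kunitconfig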

Signed-off-by: Michael Shavit <mshavit@google.com>
---

 drivers/iommu/Kconfig                         |   9 +
 drivers/iommu/arm/arm-smmu-v3/Makefile        |   2 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  | 329 ++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   6 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   7 +-
 5 files changed, 349 insertions(+), 4 deletions(-)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 7673bb82945b6..e4c4071115c8e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -405,6 +405,15 @@ config ARM_SMMU_V3_SVA
 	  Say Y here if your system supports SVA extensions such as PCIe PASID
 	  and PRI.
 
+config ARM_SMMU_V3_KUNIT_TEST
+	tristate "KUnit tests for arm-smmu-v3 driver"  if !KUNIT_ALL_TESTS
+	depends on ARM_SMMU_V3 && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable this option to unit-test arm-smmu-v3 driver functions.
+
+	  If unsure, say N.
+
 config S390_IOMMU
 	def_bool y if S390 && PCI
 	depends on S390 && PCI
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 54feb1ecccad8..014a997753a8a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -3,3 +3,5 @@ obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
 arm_smmu_v3-objs-y += arm-smmu-v3.o
 arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
+
+obj-$(CONFIG_ARM_SMMU_V3_KUNIT_TEST) += arm-smmu-v3-test.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
new file mode 100644
index 0000000000000..59ffcafb575fb
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+
+#include "arm-smmu-v3.h"
+
+struct arm_smmu_test_writer {
+	struct arm_smmu_entry_writer_ops ops;
+	struct kunit *test;
+	const __le64 *init_entry;
+	const __le64 *target_entry;
+	__le64 *entry;
+
+	bool invalid_entry_written;
+	int num_syncs;
+};
+
+static bool arm_smmu_entry_differs_in_used_bits(const __le64 *entry,
+						const __le64 *used_bits,
+						const __le64 *target,
+						unsigned int length)
+{
+	bool differs = false;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		if ((entry[i] & used_bits[i]) != target[i])
+			differs = true;
+	}
+	return differs;
+}
+
+static void
+arm_smmu_test_writer_record_syncs(const struct arm_smmu_entry_writer_ops *ops)
+{
+	struct arm_smmu_test_writer *test_writer =
+		container_of(ops, struct arm_smmu_test_writer, ops);
+	__le64 *entry_used_bits;
+
+	entry_used_bits = kunit_kzalloc(
+		test_writer->test,
+		sizeof(*entry_used_bits) * ops->num_entry_qwords, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test_writer->test, entry_used_bits);
+
+	pr_debug("STE value is now set to: ");
+	print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8,
+			     test_writer->entry,
+			     ops->num_entry_qwords * sizeof(*test_writer->entry),
+			     false);
+
+	test_writer->num_syncs += 1;
+	if (!(test_writer->entry[0] & ops->v_bit))
+		test_writer->invalid_entry_written = true;
+	else {
+		/*
+		 * At any stage in a hitless transition, the entry must be
+		 * equivalent to either the initial entry or the target entry
+		 * when only considering the bits used by the current
+		 * configuration.
+		 */
+		ops->get_used(ops,
+			test_writer->entry,
+			entry_used_bits);
+		KUNIT_EXPECT_FALSE(test_writer->test,
+				   arm_smmu_entry_differs_in_used_bits(
+					   test_writer->entry, entry_used_bits,
+					   test_writer->init_entry,
+					   ops->num_entry_qwords) &&
+					   arm_smmu_entry_differs_in_used_bits(
+						   test_writer->entry,
+						   entry_used_bits,
+						   test_writer->target_entry,
+						   ops->num_entry_qwords));
+	}
+}
+
+static void arm_smmu_v3_test_ste_debug_print_used_bits(
+	const struct arm_smmu_entry_writer_ops *ops,
+	const struct arm_smmu_ste *ste)
+{
+	struct arm_smmu_ste used_bits = { 0 };
+
+	arm_smmu_get_ste_used(ops, ste->data, used_bits.data);
+	pr_debug("STE used bits: ");
+	print_hex_dump_debug(
+		"    ", DUMP_PREFIX_NONE, 16, 8, used_bits.data,
+		ARRAY_SIZE(used_bits.data) * sizeof(*used_bits.data), false);
+}
+
+static void arm_smmu_v3_test_ste_expect_transition(
+	struct kunit *test, const struct arm_smmu_ste *cur,
+	const struct arm_smmu_ste *target, int num_syncs_expected, bool hitless)
+{
+	struct arm_smmu_ste cur_copy;
+	struct arm_smmu_test_writer test_writer = {
+		.ops = {
+			.v_bit = cpu_to_le64(STRTAB_STE_0_V),
+			.num_entry_qwords = ARRAY_SIZE(cur_copy.data),
+			.sync = arm_smmu_test_writer_record_syncs,
+			.get_used = arm_smmu_get_ste_used,
+		},
+		.test = test,
+		.init_entry = cur->data,
+		.target_entry = target->data,
+		.entry = cur_copy.data,
+		.num_syncs = 0,
+		.invalid_entry_written = false,
+
+	};
+	memcpy(&cur_copy, cur, sizeof(cur_copy));
+
+	pr_debug("STE initial value: ");
+	print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, cur_copy.data,
+			     ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+			     false);
+	arm_smmu_v3_test_ste_debug_print_used_bits(&test_writer.ops, cur);
+	pr_debug("STE target value: ");
+	print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, target->data,
+			     ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+			     false);
+	arm_smmu_v3_test_ste_debug_print_used_bits(&test_writer.ops, target);
+
+	arm_smmu_write_entry(&test_writer.ops, cur_copy.data, target->data);
+
+	KUNIT_EXPECT_EQ(test, test_writer.invalid_entry_written, !hitless);
+	KUNIT_EXPECT_EQ(test, test_writer.num_syncs, num_syncs_expected);
+	KUNIT_EXPECT_MEMEQ(test, target->data, cur_copy.data,
+			   ARRAY_SIZE(cur_copy.data));
+}
+
+static void arm_smmu_v3_test_ste_expect_non_hitless_transition(
+	struct kunit *test, const struct arm_smmu_ste *cur,
+	const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+	arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+					       num_syncs_expected, false);
+}
+
+static void arm_smmu_v3_test_ste_expect_hitless_transition(
+	struct kunit *test, const struct arm_smmu_ste *cur,
+	const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+	arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+					       num_syncs_expected, true);
+}
+
+static const dma_addr_t fake_cdtab_dma_addr = 0xF0F0F0F0F0F0;
+
+static void arm_smmu_test_make_cdtable_ste(struct arm_smmu_ste *ste,
+					   unsigned int s1dss,
+					   const dma_addr_t dma_addr)
+{
+	struct arm_smmu_master master;
+	struct arm_smmu_ctx_desc_cfg cd_table;
+	struct arm_smmu_device smmu;
+
+	cd_table.cdtab_dma = dma_addr;
+	cd_table.s1cdmax = 0xFF;
+	cd_table.s1fmt = STRTAB_STE_0_S1FMT_64K_L2;
+	smmu.features = ARM_SMMU_FEAT_STALLS;
+	master.smmu = &smmu;
+
+	arm_smmu_make_cdtable_ste(ste, &master, &cd_table, true, s1dss);
+}
+
+struct arm_smmu_ste bypass_ste;
+struct arm_smmu_ste abort_ste;
+
+static int arm_smmu_v3_test_suite_init(struct kunit_suite *test)
+{
+	arm_smmu_make_bypass_ste(&bypass_ste);
+	arm_smmu_make_abort_ste(&abort_ste);
+
+	return 0;
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_abort(struct kunit *test)
+{
+	/*
+	 * Bypass STEs has used bits in the first two Qwords, while abort STEs
+	 * only have used bits in the first QWord. Transitioning from bypass to
+	 * abort requires two syncs: the first to set the first qword and make
+	 * the STE into an abort, the second to clean up the second qword.
+	 */
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &bypass_ste, &abort_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_bypass(struct kunit *test)
+{
+	/*
+	 * Transitioning from abort to bypass also requires two syncs: the first
+	 * to set the second qword data required by the bypass STE, and the
+	 * second to set the first qword and switch to bypass.
+	 */
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &abort_ste, &bypass_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_abort(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &ste, &abort_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_cdtable(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &abort_ste, &ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_bypass(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &ste, &bypass_ste,
+		/*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_cdtable(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &bypass_ste, &ste,
+		/*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_s1dss_change(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+	struct arm_smmu_ste s1dss_bypass;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+				       fake_cdtab_dma_addr);
+
+	/*
+	 * Flipping s1dss on a CD table STE only involves changes to the second
+	 * qword of an STE and can be done in a single write.
+	 */
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &ste, &s1dss_bypass,
+		/*num_syncs_expected=*/1);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &s1dss_bypass, &ste,
+		/*num_syncs_expected=*/1);
+}
+
+static void
+arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass(struct kunit *test)
+{
+	struct arm_smmu_ste s1dss_bypass;
+
+	arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &s1dss_bypass, &bypass_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void
+arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass(struct kunit *test)
+{
+	struct arm_smmu_ste s1dss_bypass;
+
+	arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &bypass_ste, &s1dss_bypass,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_non_hitless(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+	struct arm_smmu_ste ste_2;
+
+	/*
+	 * Although no flow resembles this in practice, one way to force an STE
+	 * update to be non-hitless is to change its CD table pointer as well as
+	 * s1 dss field in the same update.
+	 */
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_test_make_cdtable_ste(&ste_2, STRTAB_STE_1_S1DSS_BYPASS,
+				       0x4B4B4b4B4B);
+	arm_smmu_v3_test_ste_expect_non_hitless_transition(
+		test, &ste, &ste_2,
+		/*num_syncs_expected=*/3);
+}
+
+static struct kunit_case arm_smmu_v3_test_cases[] = {
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_abort),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_bypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_abort),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_cdtable),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_bypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_cdtable),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_s1dss_change),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_non_hitless),
+	{},
+};
+
+static struct kunit_suite arm_smmu_v3_test_module = {
+	.name = "arm-smmu-v3-kunit-test",
+	.suite_init = arm_smmu_v3_test_suite_init,
+	.test_cases = arm_smmu_v3_test_cases,
+};
+kunit_test_suites(&arm_smmu_v3_test_module);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 065df42c86b28..e8630a317cc5e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1511,7 +1511,7 @@ static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
 	}
 }
 
-static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
 {
 	memset(target, 0, sizeof(*target));
 	target->data[0] = cpu_to_le64(
@@ -1519,7 +1519,7 @@ static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
 		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
 }
 
-static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 {
 	memset(target, 0, sizeof(*target));
 	target->data[0] = cpu_to_le64(
@@ -1529,7 +1529,7 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
 }
 
-static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
 				      struct arm_smmu_master *master,
 				      struct arm_smmu_ctx_desc_cfg *cd_table,
 				      bool ats_enabled, unsigned int s1dss)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 54a6af60800d2..eddd686645040 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -766,6 +766,12 @@ void arm_smmu_get_ste_used(const struct arm_smmu_entry_writer_ops *ops,
 			   const __le64 *ent, __le64 *used_bits);
 void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
 			  __le64 *cur, const __le64 *target);
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+				      struct arm_smmu_master *master,
+				      struct arm_smmu_ctx_desc_cfg *cd_table,
+				      bool ats_enabled, unsigned int s1dss);
 
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
@@ -798,7 +804,6 @@ void arm_smmu_make_s1_cd(struct arm_smmu_cd *target,
 void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
 			     struct arm_smmu_cd *cdptr,
 			     const struct arm_smmu_cd *target);
-
 int arm_smmu_set_pasid(struct arm_smmu_master *master,
 		       struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
 		       struct arm_smmu_cd *cd);

base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
prerequisite-patch-id: 3bc3d332ed043fbe64543bda7c7e734e19ba46aa
prerequisite-patch-id: bb900133a10e40d3136e104b19c430442c4e2647
prerequisite-patch-id: 9ec5907dd0348b00f9341a63490bdafd99a403ca
prerequisite-patch-id: dc50ec47974c35de431b80b83b501c4ca63758a3
prerequisite-patch-id: 371b31533a5abf8e1b8dc8568ffa455d16b611c6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 7743327071a8d8fb04cc43887fe61432f42eb60d
prerequisite-patch-id: c74e8e54bd5391ef40e0a92f25db0822b421dd6a
prerequisite-patch-id: 3ce8237727e2ce08261352c6b492a9bcf73651c4
prerequisite-patch-id: d6342ff93ec8850ce76e45f1e22d143208bfa13c
prerequisite-patch-id: 6d2c59c2fdb9ae9e09fb042148f57b12d5058c9e
prerequisite-patch-id: f86746e1c19fba223fe2e559fc0f3ecf6fc7cc47
prerequisite-patch-id: 2d43b690a831e369547d10cf08a8e785fc4c1b69
prerequisite-patch-id: ae154d0d43beba4483f29747aecceae853657561
prerequisite-patch-id: 1ac7f3a4007a4ff64813e1a117ee6f16c28695bc
prerequisite-patch-id: ed34d0ebe0b56869508698367a26bd9e913394eb
prerequisite-patch-id: 658bad2b9692a0f959ee73e2d3798a34f16c9f11
prerequisite-patch-id: 4d83a8451a41ee3d597f1e6be1457f695b738b76
prerequisite-patch-id: d3b421dc985d58dbaaef46ec6d16b4a2764424ea
prerequisite-patch-id: ac7aab762dcd10fcc241be07503abae66f5912c8
prerequisite-patch-id: 34877d560c1c74de6e6875bdd719dafebb620732
prerequisite-patch-id: 9864c8f72ae9de7d6caf90096cf015ad0199ea7e
prerequisite-patch-id: fa730102c85dc93ce0c9e7b4128d08dc09306192
prerequisite-patch-id: 8c1a8a32e9be9b282727985a542afe4766c4afd5
prerequisite-patch-id: ac25e540981c4015261293bd5502ab39f0b6d9e6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 245dbf34f0d60634846534ce846baa39ff91f6dc
prerequisite-patch-id: 879c03c00f0023fcddfc8194692cd5706be4b893
prerequisite-patch-id: 6aa6a678f8c0d9ff3ce278d27342742ec352e95d
prerequisite-patch-id: ccb225b386bb12bf442a8ac9096aabc4b2c6058c
prerequisite-patch-id: b6ba55a23631a83543d6abc75a13665c8d17a8a9
prerequisite-patch-id: b93c7d0e70d2bfe18a5fe3c444e2584c4268574a
prerequisite-patch-id: 049b8b92e1d5920dd67712b54d74f58f9db21244
prerequisite-patch-id: 1d014b01b316a06e116a08d7b1395e00673c8d5c
prerequisite-patch-id: 2d066a698eedeb5b5466095056812810d27f69c9
prerequisite-patch-id: f07cf696ae2e60cb6f4cc36828c4e7680a2b1b94
prerequisite-patch-id: c2059064e48ee1c541d43d3420d79ebab1205990
prerequisite-patch-id: 96a7e4869c5c7a6786387d09a77eb30574fdd354
prerequisite-patch-id: 6fc000e0534c9850283e65443e4df0df02c6c1cd
prerequisite-patch-id: f75c57a884b38f8fc61ef3737d6c9b5639497adc
prerequisite-patch-id: a07fd1675545f66f62152ddf1761463c4c2b2e17
prerequisite-patch-id: 5f8983e3a633d4c148a36584620d9473c563946c
prerequisite-patch-id: ad462723fb76d41e1e6f66003af2265b9c2b364a
prerequisite-patch-id: 946f07ca0236544523d4349670207e10e94b39ae
prerequisite-patch-id: 5da1224014c422b3423ff959318f2777b44b9175
prerequisite-patch-id: 958ac7ea7e001daf18aa62a3bacfd3746fd54d13
prerequisite-patch-id: 2d25b818974f17416479c9138b0b27acd6918444
prerequisite-patch-id: 21bf03fe577e3c6d6b712075ad954814d8a531ac
prerequisite-patch-id: 413192b0b6adb07ba90b9104b25a60de8190656d
prerequisite-patch-id: f6deff80e594f31469d40caae9cf809436dbf057
prerequisite-patch-id: 741a67b7b3511d378615126f2020c4c8466a7596
prerequisite-patch-id: 54640c82d0f87a7ffd054edeec4ec41e0e42f33d
prerequisite-patch-id: 6d46cbd6d73441b67c594f4af7bb6b0091fb6063
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry
@ 2024-01-06  8:36                                       ` Michael Shavit
  0 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-06  8:36 UTC (permalink / raw)
  To: jgg; +Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

Add tests for some of the more common STE update operations that we
expect to see, as well as some artificial STE updates to test the edges
of arm_smmu_write_entry. These also serve as a record of which common
operations are expected to be hitless, and how many syncs they require.

arm_smmu_write_entry implements a generic algorithm that updates an
STE/CD to any other arbitrary STE/CD configuration. The update requires
a sequence of write+sync operations, with some invariants that must be
held true after each sync. arm_smmu_write_entry lends itself well to
unit-testing since the function's interaction with the STE/CD is already
abstracted by input callbacks that we can hook to introspect into the
sequence of operations. We can use these hooks to guarantee that
invariants are held throughout the entire update operation.

Signed-off-by: Michael Shavit <mshavit@google.com>
---

 drivers/iommu/Kconfig                         |   9 +
 drivers/iommu/arm/arm-smmu-v3/Makefile        |   2 +
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c  | 329 ++++++++++++++++++
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |   6 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |   7 +-
 5 files changed, 349 insertions(+), 4 deletions(-)
 create mode 100644 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c

diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
index 7673bb82945b6..e4c4071115c8e 100644
--- a/drivers/iommu/Kconfig
+++ b/drivers/iommu/Kconfig
@@ -405,6 +405,15 @@ config ARM_SMMU_V3_SVA
 	  Say Y here if your system supports SVA extensions such as PCIe PASID
 	  and PRI.
 
+config ARM_SMMU_V3_KUNIT_TEST
+	tristate "KUnit tests for arm-smmu-v3 driver"  if !KUNIT_ALL_TESTS
+	depends on ARM_SMMU_V3 && KUNIT
+	default KUNIT_ALL_TESTS
+	help
+	  Enable this option to unit-test arm-smmu-v3 driver functions.
+
+	  If unsure, say N.
+
 config S390_IOMMU
 	def_bool y if S390 && PCI
 	depends on S390 && PCI
diff --git a/drivers/iommu/arm/arm-smmu-v3/Makefile b/drivers/iommu/arm/arm-smmu-v3/Makefile
index 54feb1ecccad8..014a997753a8a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/Makefile
+++ b/drivers/iommu/arm/arm-smmu-v3/Makefile
@@ -3,3 +3,5 @@ obj-$(CONFIG_ARM_SMMU_V3) += arm_smmu_v3.o
 arm_smmu_v3-objs-y += arm-smmu-v3.o
 arm_smmu_v3-objs-$(CONFIG_ARM_SMMU_V3_SVA) += arm-smmu-v3-sva.o
 arm_smmu_v3-objs := $(arm_smmu_v3-objs-y)
+
+obj-$(CONFIG_ARM_SMMU_V3_KUNIT_TEST) += arm-smmu-v3-test.o
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
new file mode 100644
index 0000000000000..59ffcafb575fb
--- /dev/null
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
@@ -0,0 +1,329 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+
+#include "arm-smmu-v3.h"
+
+struct arm_smmu_test_writer {
+	struct arm_smmu_entry_writer_ops ops;
+	struct kunit *test;
+	const __le64 *init_entry;
+	const __le64 *target_entry;
+	__le64 *entry;
+
+	bool invalid_entry_written;
+	int num_syncs;
+};
+
+static bool arm_smmu_entry_differs_in_used_bits(const __le64 *entry,
+						const __le64 *used_bits,
+						const __le64 *target,
+						unsigned int length)
+{
+	bool differs = false;
+	int i;
+
+	for (i = 0; i < length; i++) {
+		if ((entry[i] & used_bits[i]) != target[i])
+			differs = true;
+	}
+	return differs;
+}
+
+static void
+arm_smmu_test_writer_record_syncs(const struct arm_smmu_entry_writer_ops *ops)
+{
+	struct arm_smmu_test_writer *test_writer =
+		container_of(ops, struct arm_smmu_test_writer, ops);
+	__le64 *entry_used_bits;
+
+	entry_used_bits = kunit_kzalloc(
+		test_writer->test,
+		sizeof(*entry_used_bits) * ops->num_entry_qwords, GFP_KERNEL);
+	KUNIT_ASSERT_NOT_NULL(test_writer->test, entry_used_bits);
+
+	pr_debug("STE value is now set to: ");
+	print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8,
+			     test_writer->entry,
+			     ops->num_entry_qwords * sizeof(*test_writer->entry),
+			     false);
+
+	test_writer->num_syncs += 1;
+	if (!(test_writer->entry[0] & ops->v_bit))
+		test_writer->invalid_entry_written = true;
+	else {
+		/*
+		 * At any stage in a hitless transition, the entry must be
+		 * equivalent to either the initial entry or the target entry
+		 * when only considering the bits used by the current
+		 * configuration.
+		 */
+		ops->get_used(ops,
+			test_writer->entry,
+			entry_used_bits);
+		KUNIT_EXPECT_FALSE(test_writer->test,
+				   arm_smmu_entry_differs_in_used_bits(
+					   test_writer->entry, entry_used_bits,
+					   test_writer->init_entry,
+					   ops->num_entry_qwords) &&
+					   arm_smmu_entry_differs_in_used_bits(
+						   test_writer->entry,
+						   entry_used_bits,
+						   test_writer->target_entry,
+						   ops->num_entry_qwords));
+	}
+}
+
+static void arm_smmu_v3_test_ste_debug_print_used_bits(
+	const struct arm_smmu_entry_writer_ops *ops,
+	const struct arm_smmu_ste *ste)
+{
+	struct arm_smmu_ste used_bits = { 0 };
+
+	arm_smmu_get_ste_used(ops, ste->data, used_bits.data);
+	pr_debug("STE used bits: ");
+	print_hex_dump_debug(
+		"    ", DUMP_PREFIX_NONE, 16, 8, used_bits.data,
+		ARRAY_SIZE(used_bits.data) * sizeof(*used_bits.data), false);
+}
+
+static void arm_smmu_v3_test_ste_expect_transition(
+	struct kunit *test, const struct arm_smmu_ste *cur,
+	const struct arm_smmu_ste *target, int num_syncs_expected, bool hitless)
+{
+	struct arm_smmu_ste cur_copy;
+	struct arm_smmu_test_writer test_writer = {
+		.ops = {
+			.v_bit = cpu_to_le64(STRTAB_STE_0_V),
+			.num_entry_qwords = ARRAY_SIZE(cur_copy.data),
+			.sync = arm_smmu_test_writer_record_syncs,
+			.get_used = arm_smmu_get_ste_used,
+		},
+		.test = test,
+		.init_entry = cur->data,
+		.target_entry = target->data,
+		.entry = cur_copy.data,
+		.num_syncs = 0,
+		.invalid_entry_written = false,
+
+	};
+	memcpy(&cur_copy, cur, sizeof(cur_copy));
+
+	pr_debug("STE initial value: ");
+	print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, cur_copy.data,
+			     ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+			     false);
+	arm_smmu_v3_test_ste_debug_print_used_bits(&test_writer.ops, cur);
+	pr_debug("STE target value: ");
+	print_hex_dump_debug("    ", DUMP_PREFIX_NONE, 16, 8, target->data,
+			     ARRAY_SIZE(cur_copy.data) * sizeof(*cur_copy.data),
+			     false);
+	arm_smmu_v3_test_ste_debug_print_used_bits(&test_writer.ops, target);
+
+	arm_smmu_write_entry(&test_writer.ops, cur_copy.data, target->data);
+
+	KUNIT_EXPECT_EQ(test, test_writer.invalid_entry_written, !hitless);
+	KUNIT_EXPECT_EQ(test, test_writer.num_syncs, num_syncs_expected);
+	KUNIT_EXPECT_MEMEQ(test, target->data, cur_copy.data,
+			   ARRAY_SIZE(cur_copy.data));
+}
+
+static void arm_smmu_v3_test_ste_expect_non_hitless_transition(
+	struct kunit *test, const struct arm_smmu_ste *cur,
+	const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+	arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+					       num_syncs_expected, false);
+}
+
+static void arm_smmu_v3_test_ste_expect_hitless_transition(
+	struct kunit *test, const struct arm_smmu_ste *cur,
+	const struct arm_smmu_ste *target, int num_syncs_expected)
+{
+	arm_smmu_v3_test_ste_expect_transition(test, cur, target,
+					       num_syncs_expected, true);
+}
+
+static const dma_addr_t fake_cdtab_dma_addr = 0xF0F0F0F0F0F0;
+
+static void arm_smmu_test_make_cdtable_ste(struct arm_smmu_ste *ste,
+					   unsigned int s1dss,
+					   const dma_addr_t dma_addr)
+{
+	struct arm_smmu_master master = {};
+	struct arm_smmu_ctx_desc_cfg cd_table = {};
+	struct arm_smmu_device smmu = {};
+
+	cd_table.cdtab_dma = dma_addr;
+	cd_table.s1cdmax = 0xFF;
+	cd_table.s1fmt = STRTAB_STE_0_S1FMT_64K_L2;
+	smmu.features = ARM_SMMU_FEAT_STALLS;
+	master.smmu = &smmu;
+
+	arm_smmu_make_cdtable_ste(ste, &master, &cd_table, true, s1dss);
+}
+
+static struct arm_smmu_ste bypass_ste;
+static struct arm_smmu_ste abort_ste;
+
+static int arm_smmu_v3_test_suite_init(struct kunit_suite *test)
+{
+	arm_smmu_make_bypass_ste(&bypass_ste);
+	arm_smmu_make_abort_ste(&abort_ste);
+
+	return 0;
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_abort(struct kunit *test)
+{
+	/*
+	 * Bypass STEs have used bits in the first two qwords, while abort STEs
+	 * only have used bits in the first qword. Transitioning from bypass to
+	 * abort requires two syncs: the first to set the first qword and make
+	 * the STE into an abort, the second to clean up the second qword.
+	 */
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &bypass_ste, &abort_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_bypass(struct kunit *test)
+{
+	/*
+	 * Transitioning from abort to bypass also requires two syncs: the first
+	 * to set the second qword data required by the bypass STE, and the
+	 * second to set the first qword and switch to bypass.
+	 */
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &abort_ste, &bypass_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_abort(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &ste, &abort_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_abort_to_cdtable(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &abort_ste, &ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_to_bypass(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &ste, &bypass_ste,
+		/*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_bypass_to_cdtable(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &bypass_ste, &ste,
+		/*num_syncs_expected=*/3);
+}
+
+static void arm_smmu_v3_write_ste_test_cdtable_s1dss_change(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+	struct arm_smmu_ste s1dss_bypass;
+
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+				       fake_cdtab_dma_addr);
+
+	/*
+	 * Flipping s1dss on a CD table STE only involves changes to the second
+	 * qword of an STE and can be done in a single write.
+	 */
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &ste, &s1dss_bypass,
+		/*num_syncs_expected=*/1);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &s1dss_bypass, &ste,
+		/*num_syncs_expected=*/1);
+}
+
+static void
+arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass(struct kunit *test)
+{
+	struct arm_smmu_ste s1dss_bypass;
+
+	arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &s1dss_bypass, &bypass_ste,
+		/*num_syncs_expected=*/2);
+}
+
+static void
+arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass(struct kunit *test)
+{
+	struct arm_smmu_ste s1dss_bypass;
+
+	arm_smmu_test_make_cdtable_ste(&s1dss_bypass, STRTAB_STE_1_S1DSS_BYPASS,
+				       fake_cdtab_dma_addr);
+	arm_smmu_v3_test_ste_expect_hitless_transition(
+		test, &bypass_ste, &s1dss_bypass,
+		/*num_syncs_expected=*/2);
+}
+
+static void arm_smmu_v3_write_ste_test_non_hitless(struct kunit *test)
+{
+	struct arm_smmu_ste ste;
+	struct arm_smmu_ste ste_2;
+
+	/*
+	 * Although no flow resembles this in practice, one way to force an STE
+	 * update to be non-hitless is to change its CD table pointer as well as
+	 * its S1DSS field in the same update.
+	 */
+	arm_smmu_test_make_cdtable_ste(&ste, STRTAB_STE_1_S1DSS_SSID0,
+				       fake_cdtab_dma_addr);
+	arm_smmu_test_make_cdtable_ste(&ste_2, STRTAB_STE_1_S1DSS_BYPASS,
+				       0x4B4B4b4B4B);
+	arm_smmu_v3_test_ste_expect_non_hitless_transition(
+		test, &ste, &ste_2,
+		/*num_syncs_expected=*/3);
+}
+
+static struct kunit_case arm_smmu_v3_test_cases[] = {
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_abort),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_bypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_abort),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_abort_to_cdtable),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_to_bypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_bypass_to_cdtable),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_cdtable_s1dss_change),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_s1dssbypass_to_stebypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_stebypass_to_s1dssbypass),
+	KUNIT_CASE(arm_smmu_v3_write_ste_test_non_hitless),
+	{},
+};
+
+static struct kunit_suite arm_smmu_v3_test_module = {
+	.name = "arm-smmu-v3-kunit-test",
+	.suite_init = arm_smmu_v3_test_suite_init,
+	.test_cases = arm_smmu_v3_test_cases,
+};
+kunit_test_suites(&arm_smmu_v3_test_module);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 065df42c86b28..e8630a317cc5e 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1511,7 +1511,7 @@ static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
 	}
 }
 
-static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
 {
 	memset(target, 0, sizeof(*target));
 	target->data[0] = cpu_to_le64(
@@ -1519,7 +1519,7 @@ static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
 		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
 }
 
-static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 {
 	memset(target, 0, sizeof(*target));
 	target->data[0] = cpu_to_le64(
@@ -1529,7 +1529,7 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
 }
 
-static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
 				      struct arm_smmu_master *master,
 				      struct arm_smmu_ctx_desc_cfg *cd_table,
 				      bool ats_enabled, unsigned int s1dss)
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 54a6af60800d2..eddd686645040 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -766,6 +766,12 @@ void arm_smmu_get_ste_used(const struct arm_smmu_entry_writer_ops *ops,
 			   const __le64 *ent, __le64 *used_bits);
 void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
 			  __le64 *cur, const __le64 *target);
+void arm_smmu_make_abort_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target);
+void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+				      struct arm_smmu_master *master,
+				      struct arm_smmu_ctx_desc_cfg *cd_table,
+				      bool ats_enabled, unsigned int s1dss);
 
 static inline struct arm_smmu_domain *to_smmu_domain(struct iommu_domain *dom)
 {
@@ -798,7 +804,6 @@ void arm_smmu_make_s1_cd(struct arm_smmu_cd *target,
 void arm_smmu_write_cd_entry(struct arm_smmu_master *master, int ssid,
 			     struct arm_smmu_cd *cdptr,
 			     const struct arm_smmu_cd *target);
-
 int arm_smmu_set_pasid(struct arm_smmu_master *master,
 		       struct arm_smmu_domain *smmu_domain, ioasid_t pasid,
 		       struct arm_smmu_cd *cd);

base-commit: 2cc14f52aeb78ce3f29677c2de1f06c0e91471ab
prerequisite-patch-id: 3bc3d332ed043fbe64543bda7c7e734e19ba46aa
prerequisite-patch-id: bb900133a10e40d3136e104b19c430442c4e2647
prerequisite-patch-id: 9ec5907dd0348b00f9341a63490bdafd99a403ca
prerequisite-patch-id: dc50ec47974c35de431b80b83b501c4ca63758a3
prerequisite-patch-id: 371b31533a5abf8e1b8dc8568ffa455d16b611c6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 7743327071a8d8fb04cc43887fe61432f42eb60d
prerequisite-patch-id: c74e8e54bd5391ef40e0a92f25db0822b421dd6a
prerequisite-patch-id: 3ce8237727e2ce08261352c6b492a9bcf73651c4
prerequisite-patch-id: d6342ff93ec8850ce76e45f1e22d143208bfa13c
prerequisite-patch-id: 6d2c59c2fdb9ae9e09fb042148f57b12d5058c9e
prerequisite-patch-id: f86746e1c19fba223fe2e559fc0f3ecf6fc7cc47
prerequisite-patch-id: 2d43b690a831e369547d10cf08a8e785fc4c1b69
prerequisite-patch-id: ae154d0d43beba4483f29747aecceae853657561
prerequisite-patch-id: 1ac7f3a4007a4ff64813e1a117ee6f16c28695bc
prerequisite-patch-id: ed34d0ebe0b56869508698367a26bd9e913394eb
prerequisite-patch-id: 658bad2b9692a0f959ee73e2d3798a34f16c9f11
prerequisite-patch-id: 4d83a8451a41ee3d597f1e6be1457f695b738b76
prerequisite-patch-id: d3b421dc985d58dbaaef46ec6d16b4a2764424ea
prerequisite-patch-id: ac7aab762dcd10fcc241be07503abae66f5912c8
prerequisite-patch-id: 34877d560c1c74de6e6875bdd719dafebb620732
prerequisite-patch-id: 9864c8f72ae9de7d6caf90096cf015ad0199ea7e
prerequisite-patch-id: fa730102c85dc93ce0c9e7b4128d08dc09306192
prerequisite-patch-id: 8c1a8a32e9be9b282727985a542afe4766c4afd5
prerequisite-patch-id: ac25e540981c4015261293bd5502ab39f0b6d9e6
prerequisite-patch-id: 0000000000000000000000000000000000000000
prerequisite-patch-id: 245dbf34f0d60634846534ce846baa39ff91f6dc
prerequisite-patch-id: 879c03c00f0023fcddfc8194692cd5706be4b893
prerequisite-patch-id: 6aa6a678f8c0d9ff3ce278d27342742ec352e95d
prerequisite-patch-id: ccb225b386bb12bf442a8ac9096aabc4b2c6058c
prerequisite-patch-id: b6ba55a23631a83543d6abc75a13665c8d17a8a9
prerequisite-patch-id: b93c7d0e70d2bfe18a5fe3c444e2584c4268574a
prerequisite-patch-id: 049b8b92e1d5920dd67712b54d74f58f9db21244
prerequisite-patch-id: 1d014b01b316a06e116a08d7b1395e00673c8d5c
prerequisite-patch-id: 2d066a698eedeb5b5466095056812810d27f69c9
prerequisite-patch-id: f07cf696ae2e60cb6f4cc36828c4e7680a2b1b94
prerequisite-patch-id: c2059064e48ee1c541d43d3420d79ebab1205990
prerequisite-patch-id: 96a7e4869c5c7a6786387d09a77eb30574fdd354
prerequisite-patch-id: 6fc000e0534c9850283e65443e4df0df02c6c1cd
prerequisite-patch-id: f75c57a884b38f8fc61ef3737d6c9b5639497adc
prerequisite-patch-id: a07fd1675545f66f62152ddf1761463c4c2b2e17
prerequisite-patch-id: 5f8983e3a633d4c148a36584620d9473c563946c
prerequisite-patch-id: ad462723fb76d41e1e6f66003af2265b9c2b364a
prerequisite-patch-id: 946f07ca0236544523d4349670207e10e94b39ae
prerequisite-patch-id: 5da1224014c422b3423ff959318f2777b44b9175
prerequisite-patch-id: 958ac7ea7e001daf18aa62a3bacfd3746fd54d13
prerequisite-patch-id: 2d25b818974f17416479c9138b0b27acd6918444
prerequisite-patch-id: 21bf03fe577e3c6d6b712075ad954814d8a531ac
prerequisite-patch-id: 413192b0b6adb07ba90b9104b25a60de8190656d
prerequisite-patch-id: f6deff80e594f31469d40caae9cf809436dbf057
prerequisite-patch-id: 741a67b7b3511d378615126f2020c4c8466a7596
prerequisite-patch-id: 54640c82d0f87a7ffd054edeec4ec41e0e42f33d
prerequisite-patch-id: 6d46cbd6d73441b67c594f4af7bb6b0091fb6063
-- 
2.43.0.472.g3155946c3a-goog


^ permalink raw reply related	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-03 17:50                                   ` Jason Gunthorpe
@ 2024-01-06  8:50                                     ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-06  8:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

> I'm fine with this, if you think it is better please sort out the rest
> of the bits and send me a diff and I'll integrate it
>
> Thanks,
> Jason

Integrated and re-sent the 3 relevant patches; although git-send-email
gave two of them a different subject so they may appear as a different
thread depending on your email client.

Please note that I didn't update the commit description on the two
patches that you initially wrote. Also note that the Kunit test patch
does not yet add tests for any CD updates (outside of generic logic
which is shared with STE update). I'm on vacation for the next week
and haven't had a chance to expand the test coverage.
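
For context, a rough sketch of the shape such a CD-update case could take,
mirroring the STE tests; the helpers named here (arm_smmu_test_make_s1_cd,
arm_smmu_v3_test_cd_expect_hitless_transition) are assumed analogues and not
code from this series:

/* Hypothetical CD test: an S1 ASID re-label touches only used bits in a
 * single qword, so it should be hitless with a single sync. */
static void arm_smmu_v3_write_cd_test_s1_change_asid(struct kunit *test)
{
	struct arm_smmu_cd cd = {};
	struct arm_smmu_cd cd_2 = {};

	arm_smmu_test_make_s1_cd(&cd, /*asid=*/0x1);	/* assumed helper */
	arm_smmu_test_make_s1_cd(&cd_2, /*asid=*/0x2);
	arm_smmu_v3_test_cd_expect_hitless_transition(test, &cd, &cd_2,
						      /*num_syncs_expected=*/1);
}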

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-06  8:36                                     ` Michael Shavit
@ 2024-01-10 13:10                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-10 13:10 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

On Sat, Jan 06, 2024 at 04:36:14PM +0800, Michael Shavit wrote:
> +/*
> + * Update the STE/CD to the target configuration. The transition from the current
> + * entry to the target entry takes place over multiple steps that attempts to make
> + * the transition hitless if possible. This function takes care not to create a
> + * situation where the HW can perceive a corrupted entry. HW is only required to
> + * have a 64 bit atomicity with stores from the CPU, while entries are many 64
> + * bit values big.
> + *
> + * The algorithm works by evolving the entry toward the target in a series of
> + * steps. Each step synchronizes with the HW so that the HW can not see an entry
> + * torn across two steps. During each step the HW can observe a torn entry that
> + * has any combination of the step's old/new 64 bit words. The algorithm
> + * objective is for the HW behavior to always be one of current behavior, V=0,
> + * or new behavior.
> + *
> + * In the most general case we can make any update in three steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused bits, all bits except V
> + *  - Make valid (V=1), single 64 bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect. We can detect this by calculating how
> + * many 64 bit values need update after adjusting the unused bits and skip the
> + * V=0 process. This relies on the IGNORED behavior described in the
> + * specification
> + */

I edited this a bit more:


/*
 * Update the STE/CD to the target configuration. The transition from the
 * current entry to the target entry takes place over multiple steps that
 * attempt to make the transition hitless if possible. This function takes care
 * not to create a situation where the HW can perceive a corrupted entry. HW is
 * only required to have a 64 bit atomicity with stores from the CPU, while
 * entries are many 64 bit values big.
 *
 * The difference between the current value and the target value is analyzed to
 * determine which of three updates are required - disruptive, hitless or no
 * change.
 *
 * In the most general disruptive case we can make any update in three steps:
 *  - Disrupting the entry (V=0)
 *  - Fill now unused qwords, except qword 0 which contains V
 *  - Make qword 0 have the final value and valid (V=1) with a single 64
 *    bit store
 *
 * However this disrupts the HW while it is happening. There are several
 * interesting cases where a STE/CD can be updated without disturbing the HW
 * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
 * because the used bits don't intersect. We can detect this by calculating how
 * many 64 bit values need update after adjusting the unused bits and skip the
 * V=0 process. This relies on the IGNORED behavior described in the
 * specification.
 */

> +void arm_smmu_write_entry(const struct arm_smmu_entry_writer_ops *ops,
> +			  __le64 *entry, const __le64 *target)
> +{
> +	__le64 unused_update[NUM_ENTRY_QWORDS];
> +	u8 used_qword_diff;
> +	unsigned int critical_qword_index;
> +
> +	used_qword_diff = compute_qword_diff(ops, entry, target, unused_update);
> +	if (hweight8(used_qword_diff) > 1) {
> +		/*
> +		 * At least two qwords need their used bits to be changed. This
> +		 * requires a breaking update, zero the V bit, write all qwords
> +		 * but 0, then set qword 0
> +		 */
> +		unused_update[0] = entry[0] & (~ops->v_bit);
> +		entry_set(ops, entry, unused_update, 0, 1);
> +		entry_set(ops, entry, target, 1, ops->num_entry_qwords - 1);
> +		entry_set(ops, entry, target, 0, 1);
> +	} else if (hweight8(used_qword_diff) == 1) {
> +		/*
> +		 * Only one qword needs its used bits to be changed. This is a
> +		 * hitless update, update all bits the current STE is ignoring
> +		 * to their new values, then update a single qword to change the
> +		 * STE and finally 0 out any bits that are now unused in the
> +		 * target configuration.
> +		 */
> +		critical_qword_index = ffs(used_qword_diff) - 1;
> +		/*
> +		 * Skip writing unused bits in the critical qword since we'll be
> +		 * writing it in the next step anyways. This can save a sync
> +		 * when the only change is in that qword.
> +		 */
> +		unused_update[critical_qword_index] = entry[critical_qword_index];

Oh that is a neat improvement!

> +		entry_set(ops, entry, unused_update, 0, ops->num_entry_qwords);
> +		entry_set(ops, entry, target, critical_qword_index, 1);
> +		entry_set(ops, entry, target, 0, ops->num_entry_qwords);
> +	} else {
> +		/*
> +		 * If everything is working properly this shouldn't do anything
> +		 * as unused bits should always be 0 and thus can't change.
> +		 */
> +		WARN_ON_ONCE(entry_set(ops, entry, target, 0,
> +				       ops->num_entry_qwords));
> +	}
> +}
> +
> +#undef NUM_ENTRY_QWORDS

It is fine to keep the constant, it is reasonably named.

> +struct arm_smmu_ste_writer {
> +	struct arm_smmu_entry_writer_ops ops;
> +	struct arm_smmu_device *smmu;
> +	u32 sid;
> +};

I think the security-focused people will not be totally happy with writable
function pointers..

So I changed it into:

struct arm_smmu_entry_writer_ops;
struct arm_smmu_entry_writer {
	const struct arm_smmu_entry_writer_ops *ops;
	struct arm_smmu_master *master;
};

struct arm_smmu_entry_writer_ops {
	unsigned int num_entry_qwords;
	__le64 v_bit;
	void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
			 __le64 *used);
	void (*sync)(struct arm_smmu_entry_writer *writer);
};

(both ste and cd can use the master)
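
Something along these lines would be one hedged sketch of how the STE side
could then plug in: per-call state wrapped around arm_smmu_entry_writer, with
the ops table kept as const data so there are no writable function pointers.
The arm_smmu_ste_writer_get_used below is an assumed thin wrapper over
arm_smmu_get_ste_used:

struct arm_smmu_ste_writer {
	struct arm_smmu_entry_writer writer;
	u32 sid;
};

static void arm_smmu_ste_writer_sync(struct arm_smmu_entry_writer *writer)
{
	struct arm_smmu_ste_writer *ste_writer =
		container_of(writer, struct arm_smmu_ste_writer, writer);

	/* CMD_CFGI_STE + CMD_SYNC for this STE, as the driver already does
	 * after each programming step. */
	arm_smmu_sync_ste_for_sid(writer->master->smmu, ste_writer->sid);
}

static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
	.num_entry_qwords = sizeof(struct arm_smmu_ste) / sizeof(u64),
	.v_bit = cpu_to_le64(STRTAB_STE_0_V),
	.get_used = arm_smmu_ste_writer_get_used,	/* assumed wrapper */
	.sync = arm_smmu_ste_writer_sync,
};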

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH] iommu/arm-smmu-v3: Make CD programming use arm_smmu_write_entry_step()
  2024-01-06  8:36                                       ` Michael Shavit
@ 2024-01-10 13:34                                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-10 13:34 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

On Sat, Jan 06, 2024 at 04:36:15PM +0800, Michael Shavit wrote:
> From: Jason Gunthorpe <jgg@nvidia.com>
> 
> CD table entries and STE's have the same essential programming sequence,
> just with different types and sizes.
> 
> Have arm_smmu_write_ctx_desc() generate a target CD and call
> arm_smmu_write_entry_step() to do the programming. Due to the way the
> target CD is generated by modifying the existing CD this alone is not
> enough for the CD callers to be freed of the ordering requirements.
> 
> The following patches will make the rest of the CD flow mirror the STE
> flow with precise CD contents generated in all cases.
> 
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> Signed-off-by: Michael Shavit <mshavit@google.com>
> ---
> 
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 90 +++++++++++++++------
>  1 file changed, 67 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index c9559c4075b4b..5a598500b5c6d 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -23,6 +23,7 @@
>  #include <linux/of.h>
>  #include <linux/of_address.h>
>  #include <linux/of_platform.h>
> +#include <linux/minmax.h>
>  #include <linux/pci.h>
>  #include <linux/pci-ats.h>
>  #include <linux/platform_device.h>
> @@ -994,7 +995,9 @@ static bool entry_set(const struct arm_smmu_entry_writer_ops *ops,
>  	return changed;
>  }
>  
> -#define NUM_ENTRY_QWORDS (sizeof_field(struct arm_smmu_ste, data) / sizeof(u64))
> +#define NUM_ENTRY_QWORDS (max(sizeof_field(struct arm_smmu_ste, data), \
> +			     sizeof_field(struct arm_smmu_cd, data)) \
> +			     / sizeof(u64))

So, the reason I wrote it the other way, with the enum, is because
this isn't a constexpr in Linux. max() has some complex implementation
hidden inside.

An obvious consequence of this is you can't do something like:

static unsigned int foo[NUM_ENTRY_QWORDS]; // error: statement expression not allowed at file scope

Now, the question is what does the compiler do with an automatic stack
variable when it is not a constexpr but with optimization can be made
constant. Particularly will someone's checker (sparse perhaps?)
complain that this is a forbidden "variable length array" alloca?

At least latest gcc and clang are able to avoid the variable length
array, but I wonder if this is asking for trouble...
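
For reference, a minimal sketch of the enum-based constant mentioned above
(illustrative only; the hand-written value of 8 and the scratch array exist
purely for demonstration):

/* An enum (or #define) is an integer constant expression, unlike max(), so
 * it can size arrays at file scope; static_asserts keep the hand-written
 * value honest against both structures. */
enum { NUM_ENTRY_QWORDS = 8 };
static_assert(sizeof(struct arm_smmu_ste) == NUM_ENTRY_QWORDS * sizeof(u64));
static_assert(sizeof(struct arm_smmu_cd) == NUM_ENTRY_QWORDS * sizeof(u64));

static __le64 scratch[NUM_ENTRY_QWORDS];	/* fine at file scope, no VLA */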

Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry
  2024-01-06  8:36                                       ` Michael Shavit
@ 2024-01-12 16:36                                         ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-12 16:36 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

On Sat, Jan 06, 2024 at 04:36:16PM +0800, Michael Shavit wrote:
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
> new file mode 100644
> index 0000000000000..59ffcafb575fb
> --- /dev/null
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
> @@ -0,0 +1,329 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <kunit/test.h>

I added

* Copyright 2024 Google LLC.

Here, let me know if it should be something else

> +	arm_smmu_get_ste_used(ops, ste->data, used_bits.data);
> +	pr_debug("STE used bits: ");
> +	print_hex_dump_debug(
> +		"    ", DUMP_PREFIX_NONE, 16, 8, used_bits.data,
> +		ARRAY_SIZE(used_bits.data) * sizeof(*used_bits.data), false);

I fixed up a lot of these weird sizeof things all over the three patches

sizeof(struct arm_smmu_ste) is the correct way to get the size of the
HW structure, no need to peek into data. This is because we use the
struct as the pointer to an array so the whole struct must be
correctly sized.

ARRAY_SIZE(x.data)*(sizeof(*x.data)) == sizeof(x)

Sadly there is no ARRAY_SIZE_FIELD()
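
Purely for illustration, a hypothetical ARRAY_SIZE_FIELD() could be spelled
like this (no such macro exists in the kernel today):

/* sizeof on a dereferenced NULL pointer is never evaluated, so this stays a
 * compile-time expression. */
#define ARRAY_SIZE_FIELD(type, member) ARRAY_SIZE(((type *)NULL)->member)

/* ARRAY_SIZE_FIELD(struct arm_smmu_ste, data) * sizeof(u64)
 *	== sizeof(struct arm_smmu_ste) */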

I also made some hacky patches so smmuv3 would compile on x86 and ran
this kunit on x86 - looks fine to me

I'm going to put it in part 3, just because it is new and doesn't have
any RB/TB tags like the rest of part 1.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-06  8:50                                     ` Michael Shavit
@ 2024-01-12 19:45                                       ` Jason Gunthorpe
  -1 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-12 19:45 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Sat, Jan 06, 2024 at 04:50:48PM +0800, Michael Shavit wrote:
> > I'm fine with this, if you think it is better please sort out the rest
> > of the bits and send me a diff and I'll integrate it
> >
> > Thanks,
> > Jason
> 
> Integrated and re-sent the 3 relevant patches; although git-send-email
> gave two of them a different subject so they may appear as a different
> thread depending on your email client.
> 
> Please note that I didn't update the commit description on the two
> patches that you initially wrote. Also note that the Kunit test patch
> does not yet add tests for any CD updates (outside of generic logic
> which is shared with STE update). I'm on vacation for the next week
> and haven't had a chance to expand the test coverage.

Okay, I took care of it all and got the branch rebased onto something
closer to what v6.8-rc1 will look like. I made a bunch of cosmetic
changes and checked the unit test still works. I'll post the three
parts to the list when v6.8-rc1 comes out in a week's time.

The update is on my github:

https://github.com/jgunthorpe/linux/commits/smmuv3_newapi/

Here is the rewritten commit message:

iommu/arm-smmu-v3: Make STE programming independent of the callers

As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The iommu core APIs would ideally expect the driver to do a hitless change
of iommu_domain in a number of cases:

 - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
   for the RESV ranges

 - PASID upgrade has IDENTITY on the RID with no PASID then a PASID paging
   domain installed. The RID should not be impacted

 - PASID downgrade has IDENTITY on the RID and all PASIDs removed.
   The RID should not be impacted

 - RID does PAGING -> BLOCKING with active PASID, PASID's should not be
   impacted

 - NESTING -> NESTING for carrying all the above hitless cases in a VM
   into the hypervisor. To comprehensively emulate the HW in a VM we should
   assume the VM OS is running logic like this and expecting hitless updates
   to be relayed to real HW.

For CD updates arm_smmu_write_ctx_desc() has a similar comment explaining
how limited it is, and the driver does have a need for hitless CD updates:

 - SMMUv3 BTM S1 ASID re-label

 - SVA mm release should change the CD to answer not-present to all
   requests without allowing logging (EPD0)

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than currently.
At the end everything that can be hitless will be hitless, including all
of the above.

Introduce arm_smmu_write_entry() which will run through the multi-qword
programming sequence to avoid creating an incoherent 'torn' STE in the HW
caches. It automatically detects which of two algorithms to use:

1) The disruptive V=0 update described in the spec which disrupts the
   entry and does three syncs to make the change:
       - Write V=0 to QWORD 0
       - Write the entire STE except QWORD 0
       - Write QWORD 0

2) A hitless update algorithm that follows the same rationale that the driver
   already uses. It is safe to change IGNORED bits that HW doesn't use:
       - Write the target value into all currently unused bits
       - Write a single QWORD, this makes the new STE live atomically
       - Ensure now unused bits are 0

The detection of which path to use and the implementation of the hitless
update rely on a "used bitmask" describing what bits the HW is actually
using based on the V/CFG/etc bits. This flows from the spec language,
typically indicated as IGNORED.

Knowing which bits the HW is using we can update the bits it does not use
and then compute how many QWORDS need to be changed. If only one qword
needs to be updated the hitless algorithm is possible.
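
A simplified illustration of that detection step (this is not the driver
code; it only shows the idea under the assumption that the current entry's
unused bits have already been rewritten to their target values):

static u8 used_qword_diff(const __le64 *cur, const __le64 *cur_used,
			  const __le64 *target, unsigned int num_qwords)
{
	u8 diff = 0;
	unsigned int i;

	/* Only differences inside the currently used bits remain relevant. */
	for (i = 0; i != num_qwords; i++)
		if ((cur[i] & cur_used[i]) != (target[i] & cur_used[i]))
			diff |= 1 << i;

	/* hweight8(diff) == 1: hitless single-qword switch is possible.
	 * hweight8(diff) > 1: fall back to the disruptive V=0 sequence. */
	return diff;
}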

Later patches will include CD updates in this mechanism so make the
implementation generic using a struct arm_smmu_entry_writer and struct
arm_smmu_entry_writer_ops to abstract the differences between STE and CD
to be plugged in.

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT if a hitfull change is required. This seems more appropriate as ABORT
will fail DMAs without any logging, but dropping a DMA due to transient
V=0 is probably signaling a bug, so the C_BAD_STE is valuable.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
@ 2024-01-12 19:45                                       ` Jason Gunthorpe
  0 siblings, 0 replies; 134+ messages in thread
From: Jason Gunthorpe @ 2024-01-12 19:45 UTC (permalink / raw)
  To: Michael Shavit
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Nicolin Chen

On Sat, Jan 06, 2024 at 04:50:48PM +0800, Michael Shavit wrote:
> > I'm fine with this, if you think it is better please sort out the rest
> > of the bits and send me a diff and I'll integrate it
> >
> > Thanks,
> > Jason
> 
> Integrated and re-sent the 3 relevant patches; although git-send-email
> gave two of them a different subject so they may appear as a different
> thread depending on your email client.
> 
> Please note that I didn't update the commit description on the two
> patches that you initially wrote. Also note that the Kunit test patch
> does not yet add tests for any CD updates (outside of generic logic
> which is shared with STE update). I'm on vacation for the next week
> and haven't had a chance to expand the test coverage.

Okay, I took care of it all and got the branch rebased onto something
closer to what v6.8-rc1 will look like. I made a bunch of cosmetic
changes and checked that the unit test still works. I'll post the three
parts to the list when v6.8-rc1 comes out in a week's time.

The update is on my github:

https://github.com/jgunthorpe/linux/commits/smmuv3_newapi/

Here is the rewritten commit message:

iommu/arm-smmu-v3: Make STE programming independent of the callers

As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The iommu core APIs would ideally expect the driver to do a hitless change
of iommu_domain in a number of cases:

 - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
   for the RESV ranges

 - PASID upgrade has IDENTITY on the RID with no PASID, then a PASID paging
   domain installed. The RID should not be impacted

 - PASID downgrade has IDENTITY on the RID and all PASIDs removed.
   The RID should not be impacted

 - RID does PAGING -> BLOCKING with active PASIDs, the PASIDs should not be
   impacted

 - NESTING -> NESTING for carrying all the above hitless cases in a VM
   into the hypervisor. To comprehensively emulate the HW in a VM we should
   assume the VM OS is running logic like this and expecting hitless updates
   to be relayed to real HW.

For CD updates arm_smmu_write_ctx_desc() has a similar comment explaining
how limited it is, and the driver does have a need for hitless CD updates:

 - SMMUv3 BTM S1 ASID re-label

 - SVA mm release should change the CD to answer not-present to all
   requests without allowing logging (EPD0)

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than exist today.
At the end everything that can be hitless will be hitless, including all
of the above.

Introduce arm_smmu_write_entry() which will run through the multi-qword
programming sequence to avoid creating an incoherent 'torn' STE in the HW
caches. It automatically detects which of two algorithms to use:

1) The disruptive V=0 update described in the spec which disrupts the
   entry and does three syncs to make the change:
       - Write V=0 to QWORD 0
       - Write the entire STE except QWORD 0
       - Write QWORD 0

2) A hitless update algorithm that follows the same rationale that the driver
   already uses. It is safe to change IGNORED bits that HW doesn't use (both
   sequences are sketched in code just after this list):
       - Write the target value into all currently unused bits
       - Write a single QWORD, this makes the new STE live atomically
       - Ensure now unused bits are 0
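
As referenced in the list above, here is a rough sketch of both sequences.
Everything below is illustrative: entry_sync() stands in for whatever
invalidation + CMD_SYNC step the real driver issues, and the function names
and NUM_ENTRY_QWORDS are made up for this example.

#include <linux/types.h>

#define NUM_ENTRY_QWORDS 8

/* Placeholder for the STE/CD invalidation + CMD_SYNC done by the driver */
static void entry_sync(__le64 *entry)
{
}

/* Sequence 1: disruptive V=0 update, three syncs, invalid in between */
static void write_entry_v0(__le64 *entry, const __le64 *target)
{
	unsigned int i;

	entry[0] = 0;			/* clear V, the entry stops translating */
	entry_sync(entry);
	for (i = 1; i != NUM_ENTRY_QWORDS; i++)
		entry[i] = target[i];	/* fill everything but qword 0 */
	entry_sync(entry);
	entry[0] = target[0];		/* qword 0 last, V becomes valid */
	entry_sync(entry);
}

/*
 * Sequence 2: hitless update, possible when only qword 'crit' changes in
 * bits the HW currently uses. cur_used[] is the used bitmask of the
 * currently installed configuration.
 */
static void write_entry_hitless(__le64 *entry, const __le64 *cur_used,
				const __le64 *target, unsigned int crit)
{
	unsigned int i;

	/* Step 1: put target values into all currently IGNORED bits */
	for (i = 0; i != NUM_ENTRY_QWORDS; i++)
		entry[i] = (entry[i] & cur_used[i]) |
			   (target[i] & ~cur_used[i]);
	entry_sync(entry);

	/* Step 2: one qword write switches the HW to the new config */
	entry[crit] = target[crit];
	entry_sync(entry);

	/* Step 3: zero the bits the new configuration no longer uses */
	for (i = 0; i != NUM_ENTRY_QWORDS; i++)
		entry[i] = target[i];
	entry_sync(entry);
}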

The detection of which path to use and the implementation of the hitless
update rely on a "used bitmask" describing which bits the HW is actually
using based on the V/CFG/etc bits. This follows from the spec language,
where fields that are not in use are typically marked IGNORED.

Knowing which bits the HW is using, we can update the bits it does not use
and then compute how many qwords need to be changed. If only one qword
needs to be updated, the hitless algorithm is possible.

Later patches will include CD updates in this mechanism, so make the
implementation generic using a struct arm_smmu_entry_writer and struct
arm_smmu_entry_writer_ops that abstract the differences between STE and CD
so either can be plugged in.
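
The struct names above come from the patch itself; the members shown below
are only a guess at the general shape, not the patch's actual definition,
to illustrate how the shared sequencing logic can stay STE/CD agnostic.

#include <linux/types.h>

struct arm_smmu_entry_writer;

/* Per-entry-type hooks; an STE writer and a CD writer each supply one */
struct arm_smmu_entry_writer_ops {
	unsigned int num_entry_qwords;	/* e.g. 8 for an STE, fewer for a CD */
	__le64 v_bit;			/* location of the valid bit in qword 0 */
	/* Fill 'used' with the bits the HW inspects for this entry value */
	void (*get_used)(const __le64 *entry, __le64 *used);
	/* Invalidate caches and wait (CMD_SYNC) after each programming step */
	void (*sync)(struct arm_smmu_entry_writer *writer);
};

struct arm_smmu_entry_writer {
	const struct arm_smmu_entry_writer_ops *ops;
	/*
	 * Entry-type specific writers would embed this and carry their own
	 * state, e.g. the arm_smmu_master whose STE is being programmed.
	 */
};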

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT if a hitfull change is required. This seems more appropriate as ABORT
will fail DMAs without any logging, but dropping a DMA due to transient
V=0 is probably signaling a bug, so the C_BAD_STE is valuable.

^ permalink raw reply	[flat|nested] 134+ messages in thread

* Re: [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry
  2024-01-12 16:36                                         ` Jason Gunthorpe
@ 2024-01-16  9:23                                           ` Michael Shavit
  -1 siblings, 0 replies; 134+ messages in thread
From: Michael Shavit @ 2024-01-16  9:23 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, joro, linux-arm-kernel, robin.murphy, will, nicolinc

On Sat, Jan 13, 2024 at 12:36 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Sat, Jan 06, 2024 at 04:36:16PM +0800, Michael Shavit wrote:
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
> > new file mode 100644
> > index 0000000000000..59ffcafb575fb
> > --- /dev/null
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-test.c
> > @@ -0,0 +1,329 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <kunit/test.h>
>
> I added
>
> * Copyright 2024 Google LLC.
>
> Here, let me know if it should be something else

Thanks!

>
> > +     arm_smmu_get_ste_used(ops, ste->data, used_bits.data);
> > +     pr_debug("STE used bits: ");
> > +     print_hex_dump_debug(
> > +             "    ", DUMP_PREFIX_NONE, 16, 8, used_bits.data,
> > +             ARRAY_SIZE(used_bits.data) * sizeof(*used_bits.data), false);
>
> I fixed up a lot of these weird sizeof things all over the three patches
>
> sizeof(struct arm_smmu_ste) is the correct way to get the size of the
> HW structure, no need to peek into data. This is because we use the
> struct as the pointer to an array so the whole struct must be
> correctly sized.
>
> ARRAY_SIZE(x.data)*(sizeof(*x.data)) == sizeof(x)
>
> Sadly there is no ARRAY_SIZE_FIELD()

Makes sense.
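
For what it's worth, the equivalence being discussed can be written down as
a compile-time check. The struct layout assumes the STE is 8 qwords as in
the series; the assertion itself is plain C and not code from the patches.

#include <linux/build_bug.h>
#include <linux/stddef.h>
#include <linux/types.h>

struct arm_smmu_ste {
	__le64 data[8];		/* STRTAB_STE_DWORDS, as used by the series */
};

/* The whole-struct size and the array-of-qwords size are the same thing,
 * so sizeof(struct arm_smmu_ste) is the simpler spelling. */
static_assert(sizeof(struct arm_smmu_ste) ==
	      sizeof_field(struct arm_smmu_ste, data));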


>
> I also made some hacky patches so smmuv3 would compile on x86 and ran
> this kunit on x86 - looks fine to me
>
> I'm going to put it in part 3, just because it is new and doesn't have
> any RB/TB tags like the rest of part 1.
>
> Thanks,
> Jason

^ permalink raw reply	[flat|nested] 134+ messages in thread

end of thread, other threads:[~2024-01-16  9:24 UTC | newest]

Thread overview: 134+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-10-11  0:33 [PATCH 00/19] Update SMMUv3 to the modern iommu API (part 1/2) Jason Gunthorpe
2023-10-11  0:33 ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 01/19] iommu/arm-smmu-v3: Add a type for the STE Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-13 10:37   ` Will Deacon
2023-10-13 10:37     ` Will Deacon
2023-10-13 14:00     ` Jason Gunthorpe
2023-10-13 14:00       ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 02/19] iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent() Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 03/19] iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-12  8:10   ` Michael Shavit
2023-10-12  8:10     ` Michael Shavit
2023-10-12 12:16     ` Jason Gunthorpe
2023-10-12 12:16       ` Jason Gunthorpe
2023-10-18 11:05       ` Michael Shavit
2023-10-18 11:05         ` Michael Shavit
2023-10-18 13:04         ` Jason Gunthorpe
2023-10-18 13:04           ` Jason Gunthorpe
2023-10-20  8:23           ` Michael Shavit
2023-10-20  8:23             ` Michael Shavit
2023-10-20 11:39             ` Jason Gunthorpe
2023-10-20 11:39               ` Jason Gunthorpe
2023-10-23  8:36               ` Michael Shavit
2023-10-23  8:36                 ` Michael Shavit
2023-10-23 12:05                 ` Jason Gunthorpe
2023-10-23 12:05                   ` Jason Gunthorpe
2023-12-15 20:26                 ` Michael Shavit
2023-12-15 20:26                   ` Michael Shavit
2023-12-17 13:03                   ` Jason Gunthorpe
2023-12-17 13:03                     ` Jason Gunthorpe
2023-12-18 12:35                     ` Michael Shavit
2023-12-18 12:35                       ` Michael Shavit
2023-12-18 12:42                       ` Michael Shavit
2023-12-18 12:42                         ` Michael Shavit
2023-12-19 13:42                       ` Michael Shavit
2023-12-19 13:42                         ` Michael Shavit
2023-12-25 12:17                         ` Michael Shavit
2023-12-25 12:17                           ` Michael Shavit
2023-12-25 12:58                           ` Michael Shavit
2023-12-25 12:58                             ` Michael Shavit
2023-12-27 15:33                             ` Jason Gunthorpe
2023-12-27 15:33                               ` Jason Gunthorpe
2023-12-27 15:46                         ` Jason Gunthorpe
2023-12-27 15:46                           ` Jason Gunthorpe
2024-01-02  8:08                           ` Michael Shavit
2024-01-02  8:08                             ` Michael Shavit
2024-01-02 14:48                             ` Jason Gunthorpe
2024-01-02 14:48                               ` Jason Gunthorpe
2024-01-03 16:52                               ` Michael Shavit
2024-01-03 16:52                                 ` Michael Shavit
2024-01-03 17:50                                 ` Jason Gunthorpe
2024-01-03 17:50                                   ` Jason Gunthorpe
2024-01-06  8:36                                   ` [PATCH] " Michael Shavit
2024-01-06  8:36                                     ` Michael Shavit
2024-01-06  8:36                                     ` [PATCH] iommu/arm-smmu-v3: Make CD programming use arm_smmu_write_entry_step() Michael Shavit
2024-01-06  8:36                                       ` Michael Shavit
2024-01-10 13:34                                       ` Jason Gunthorpe
2024-01-10 13:34                                         ` Jason Gunthorpe
2024-01-06  8:36                                     ` [PATCH] iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry Michael Shavit
2024-01-06  8:36                                       ` Michael Shavit
2024-01-12 16:36                                       ` Jason Gunthorpe
2024-01-12 16:36                                         ` Jason Gunthorpe
2024-01-16  9:23                                         ` Michael Shavit
2024-01-16  9:23                                           ` Michael Shavit
2024-01-10 13:10                                     ` [PATCH] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
2024-01-10 13:10                                       ` Jason Gunthorpe
2024-01-06  8:50                                   ` [PATCH 04/19] " Michael Shavit
2024-01-06  8:50                                     ` Michael Shavit
2024-01-12 19:45                                     ` Jason Gunthorpe
2024-01-12 19:45                                       ` Jason Gunthorpe
2024-01-03 15:42                           ` Michael Shavit
2024-01-03 15:42                             ` Michael Shavit
2024-01-03 15:49                             ` Jason Gunthorpe
2024-01-03 15:49                               ` Jason Gunthorpe
2024-01-03 16:47                               ` Michael Shavit
2024-01-03 16:47                                 ` Michael Shavit
2024-01-02  8:13                         ` Michael Shavit
2024-01-02  8:13                           ` Michael Shavit
2024-01-02 14:48                           ` Jason Gunthorpe
2024-01-02 14:48                             ` Jason Gunthorpe
2023-10-18 10:54   ` Michael Shavit
2023-10-18 10:54     ` Michael Shavit
2023-10-18 12:24     ` Jason Gunthorpe
2023-10-18 12:24       ` Jason Gunthorpe
2023-10-19 23:03       ` Jason Gunthorpe
2023-10-19 23:03         ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 05/19] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 06/19] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste() Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 07/19] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 08/19] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste() Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 09/19] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-24  2:44   ` Michael Shavit
2023-10-24  2:44     ` Michael Shavit
2023-10-24  2:48     ` Michael Shavit
2023-10-24  2:48       ` Michael Shavit
2023-10-24 11:50     ` Jason Gunthorpe
2023-10-24 11:50       ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 10/19] iommu/arm-smmu-v3: Compute the STE only once for each master Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 11/19] iommu/arm-smmu-v3: Do not change the STE twice during arm_smmu_attach_dev() Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 12/19] iommu/arm-smmu-v3: Put writing the context descriptor in the right order Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-12  9:01   ` Michael Shavit
2023-10-12  9:01     ` Michael Shavit
2023-10-12 12:34     ` Jason Gunthorpe
2023-10-12 12:34       ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 13/19] iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats() Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 14/19] iommu/arm-smmu-v3: Remove arm_smmu_master->domain Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 15/19] iommu/arm-smmu-v3: Add a global static IDENTITY domain Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-18 11:06   ` Michael Shavit
2023-10-18 11:06     ` Michael Shavit
2023-10-18 12:26     ` Jason Gunthorpe
2023-10-18 12:26       ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 16/19] iommu/arm-smmu-v3: Add a global static BLOCKED domain Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 17/19] iommu/arm-smmu-v3: Use the identity/blocked domain during release Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 18/19] iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to finalize Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe
2023-10-11  0:33 ` [PATCH 19/19] iommu/arm-smmu-v3: Convert to domain_alloc_paging() Jason Gunthorpe
2023-10-11  0:33   ` Jason Gunthorpe

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.