* [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3)
@ 2024-01-25 23:57 Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
                   ` (15 more replies)
  0 siblings, 16 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

The SMMUv3 driver was originally written in 2015 when the iommu driver
facing API looked quite different. The API has evolved, especially lately,
and the driver has fallen behind.

This work aims to make the SMMUv3 driver the best IOMMU driver with
the most comprehensive implementation of the API. After all parts are
applied it addresses:

 - Global static BLOCKED and IDENTITY domains with 'never fail' attach
   semantics. BLOCKED is desired for efficient VFIO.

 - Support map before attach for PAGING iommu_domains.

 - attach_dev failure does not change the HW configuration.

 - Fully hitless transitions between IDENTITY -> DMA -> IDENTITY.
   The API has IOMMU_RESV_DIRECT which is expected to be
   continuously translating.

 - Safe transitions between PAGING -> BLOCKED that never temporarily pass
   through IDENTITY. This is required for iommufd security.

 - Full PASID API support including:
    - S1/SVA domains attached to PASIDs
    - IDENTITY/BLOCKED/S1 attached to RID
    - Change of the RID domain while PASIDs are attached

 - Streamlined SVA support using the core infrastructure

 - Hitless, whenever possible, change between two domains

 - iommufd IOMMU_GET_HW_INFO, IOMMU_HWPT_ALLOC_NEST_PARENT, and
   IOMMU_DOMAIN_NESTED support

Overall, these things are going to become more accessible to iommufd and
exposed to VMs, so it is important for the driver to have a robust
implementation of the API.

The work is split into three parts, with this part largely focusing on the
STE and building up to the BLOCKED & IDENTITY global static domains.

The second part largely focuses on the CD and builds up to having a common
PASID infrastructure that SVA and S1 domains equally use.

The third part has some random cleanups and the iommufd related parts.

Overall this takes the approach of turning the STE/CD programming upside
down: the CD/STE value is computed right at a driver callback function and
then pushed down into the programming logic. The programming logic hides the
details of the required tear-less CD/STE update. This makes the CD/STE
functions independent of the arm_smmu_domain, which makes it fairly
straightforward to untangle all the different call chains and add new ones.
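
As a rough illustration of that shape (a minimal user-space sketch with
invented names and a deliberately naive, non-hitless writer; not the
driver's actual functions or bit layouts), the callback computes a complete
target value and a single writer routine is the only code that touches the
live entry:

/* Hypothetical sketch only - simplified types and bit layout */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct ste { uint64_t data[8]; };

/* Callback-level code computes the complete target value up front. */
static void make_bypass_ste(struct ste *target)
{
	memset(target, 0, sizeof(*target));
	target->data[0] = 0x1;	/* stand-in for the V and CFG bits */
}

/* The programming layer owns the live entry; in the driver this is where
 * the hitless vs V=0 sequencing and the CFGI syncs live. */
static void write_ste(struct ste *live, const struct ste *target)
{
	unsigned int i;

	for (i = 0; i != 8; i++)
		if (live->data[i] != target->data[i])
			live->data[i] = target->data[i];
}

int main(void)
{
	struct ste live, target;

	memset(&live, 0, sizeof(live));
	make_bypass_ste(&target);
	write_ste(&live, &target);
	printf("qword 0 is now 0x%llx\n", (unsigned long long)live.data[0]);
	return 0;
}

In the driver the make step corresponds to the arm_smmu_make_*_ste()
helpers introduced in this series and the writer to
arm_smmu_write_ste()/arm_smmu_write_entry().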

Further, this frees the arm_smmu_domain related logic from keeping track
of what state the STE/CD is currently in so it can carefully sequence the
correct update. There are many new update pairs that are subtly introduced
as the work progresses.

The locking to support BTM via arm_smmu_asid_lock is a bit subtle right
now and patches throughout this work adjust and tighten this so that it is
clearer and doesn't get broken.

Once the lower STE layers no longer need to touch arm_smmu_domain we can
isolate struct arm_smmu_domain to be only used for PAGING domains, audit
all the to_smmu_domain() calls to be only in PAGING domain ops, and
introduce the normal global static BLOCKED/IDENTITY domains using the new
STE infrastructure. Part 2 will ultimately migrate SVA over to use
arm_smmu_domain as well.

All parts are on github:

 https://github.com/jgunthorpe/linux/commits/smmuv3_newapi

v4:
 - Rebase on v6.8-rc1. Patches 1-3 merged
 - Replace patch "Make STE programming independent of the callers" with
   Michael's version
    * Describe the core API desire for hitless updates
    * Replace the iterator with STE/CD specific function pointers.
      This lets the logic be written top down instead of rolled into an
      iterator
    * Optimize away a sync when the critical qword is the only qword
      to update
 - Pass master not smmu to arm_smmu_write_ste() throughout
 - arm_smmu_make_s2_domain_ste() should use data[1] = not |= since
   it is known to be zero
 - Return errno's from domain_alloc() paths
v3: https://lore.kernel.org/r/0-v3-d794f8d934da+411a-smmuv3_newapi_p1_jgg@nvidia.com
 - Use some local variables in arm_smmu_get_step_for_sid() for clarity
 - White space and spelling changes
 - Commit message updates
 - Keep master->domain_head initialized to avoid a list_del corruption
v2: https://lore.kernel.org/r/0-v2-de8b10590bf5+400-smmuv3_newapi_p1_jgg@nvidia.com
 - Rebased on v6.7-rc1
 - Improve the comment for arm_smmu_write_entry_step()
 - Fix the botched memcmp
 - Document the spec justification for the SHCFG exclusion in used
 - Include STRTAB_STE_1_SHCFG for STRTAB_STE_0_CFG_S2_TRANS in used
 - WARN_ON for unknown STEs in used
 - Fix error unwind in arm_smmu_attach_dev()
 - Whitespace, spelling, and checkpatch related items
v1: https://lore.kernel.org/r/0-v1-e289ca9121be+2be-smmuv3_newapi_p1_jgg@nvidia.com

Jason Gunthorpe (16):
  iommu/arm-smmu-v3: Make STE programming independent of the callers
  iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste()
  iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into
    functions
  iommu/arm-smmu-v3: Build the whole STE in
    arm_smmu_make_s2_domain_ste()
  iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  iommu/arm-smmu-v3: Compute the STE only once for each master
  iommu/arm-smmu-v3: Do not change the STE twice during
    arm_smmu_attach_dev()
  iommu/arm-smmu-v3: Put writing the context descriptor in the right
    order
  iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats()
  iommu/arm-smmu-v3: Remove arm_smmu_master->domain
  iommu/arm-smmu-v3: Add a global static IDENTITY domain
  iommu/arm-smmu-v3: Add a global static BLOCKED domain
  iommu/arm-smmu-v3: Use the identity/blocked domain during release
  iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to
    finalize
  iommu/arm-smmu-v3: Convert to domain_alloc_paging()

 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 747 +++++++++++++-------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |   4 -
 2 files changed, 510 insertions(+), 241 deletions(-)


base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
-- 
2.43.0



* [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-26  4:03   ` Michael Shavit
                     ` (2 more replies)
  2024-01-25 23:57 ` [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass Jason Gunthorpe
                   ` (14 subsequent siblings)
  15 siblings, 3 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

As the comment in arm_smmu_write_strtab_ent() explains, this routine has
been limited to only work correctly in certain scenarios that the caller
must ensure. Generally the caller must put the STE into ABORT or BYPASS
before attempting to program it to something else.

The iommu core APIs would ideally expect the driver to do a hitless change
of iommu_domain in a number of cases:

 - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
   for the RESV ranges

 - PASID upgrade has IDENTITY on the RID with no PASID, then a PASID paging
   domain is installed. The RID should not be impacted

 - PASID downgrade has IDENTITY on the RID and all PASIDs removed.
   The RID should not be impacted

 - RID does PAGING -> BLOCKING with active PASIDs, the PASIDs should not be
   impacted

 - NESTING -> NESTING for carrying all the above hitless cases in a VM
   into the hypervisor. To comprehensively emulate the HW in a VM we should
   assume the VM OS is running logic like this and expecting hitless updates
   to be relayed to real HW.

For CD updates arm_smmu_write_ctx_desc() has a similar comment explaining
how limited it is, and the driver does have a need for hitless CD updates:

 - SMMUv3 BTM S1 ASID re-label

 - SVA mm release should change the CD to answer not-present to all
   requests without allowing logging (EPD0)

The next patches/series are going to start removing some of this logic
from the callers, and add more complex state combinations than exist today.
At the end everything that can be hitless will be hitless, including all
of the above.

Introduce arm_smmu_write_entry() which will run through the multi-qword
programming sequence to avoid creating an incoherent 'torn' STE in the HW
caches. It automatically detects which of two algorithms to use:

1) The disruptive V=0 update described in the spec which disrupts the
   entry and does three syncs to make the change:
       - Write V=0 to QWORD 0
       - Write the entire STE except QWORD 0
       - Write QWORD 0

2) A hitless update algorithm that follows the same rationale that the driver
   already uses. It is safe to change IGNORED bits that HW doesn't use:
       - Write the target value into all currently unused bits
       - Write a single QWORD, this makes the new STE live atomically
       - Ensure now unused bits are 0

The detection of which path to use and the implementation of the hitless
update rely on a "used bitmask" describing what bits the HW is actually
using based on the V/CFG/etc bits. This flows from the spec language,
typically indicated as IGNORED.

Knowing which bits the HW is using we can update the bits it does not use
and then compute how many QWORDS need to be changed. If only one qword
needs to be updated the hitless algorithm is possible.
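
As an aside (not part of the patch), the decision can be modeled in plain
user-space C. The used-bits rules below are invented placeholders (the real
ones depend on V, CFG, S1DSS and so on), but the diff computation mirrors
the arm_smmu_entry_qword_diff() added below:

/* Toy model of the hitless-vs-disruptive decision, not the driver code */
#include <stdint.h>
#include <stdio.h>

#define NUM_QWORDS 4

/* Invented rule: bit 0 of qword 0 is V; when V=1 the HW "uses" the low
 * byte of every qword. The real rules come from the STE/CD layout. */
static void get_used(const uint64_t *entry, uint64_t *used)
{
	unsigned int i;

	for (i = 0; i != NUM_QWORDS; i++)
		used[i] = 0;
	used[0] = 0x1;
	if (!(entry[0] & 0x1))
		return;
	for (i = 0; i != NUM_QWORDS; i++)
		used[i] |= 0xff;
}

/* Returns a bitmask of qwords whose used bits must change, and fills
 * unused_update with the current entry plus every currently-unused bit
 * already set to its target value. */
static unsigned int qword_diff(const uint64_t *cur, const uint64_t *target,
			       uint64_t *unused_update)
{
	uint64_t cur_used[NUM_QWORDS], target_used[NUM_QWORDS];
	unsigned int diff = 0, i;

	get_used(cur, cur_used);
	get_used(target, target_used);
	for (i = 0; i != NUM_QWORDS; i++) {
		unused_update[i] = (cur[i] & cur_used[i]) |
				   (target[i] & ~cur_used[i]);
		if ((unused_update[i] & target_used[i]) != target[i])
			diff |= 1u << i;
	}
	return diff;
}

int main(void)
{
	uint64_t cur[NUM_QWORDS]    = { 0x01, 0x10, 0, 0 };
	uint64_t target[NUM_QWORDS] = { 0x03, 0x10, 0, 0 };
	uint64_t unused_update[NUM_QWORDS];
	unsigned int diff = qword_diff(cur, target, unused_update);

	if (__builtin_popcount(diff) > 1)
		printf("disruptive: V=0, write qwords 1..N, then qword 0\n");
	else if (diff)
		printf("hitless: critical qword %d\n", __builtin_ctz(diff));
	else
		printf("no used bits change\n");
	return 0;
}

With these toy rules only qword 0 differs in its used bits, so the single
critical qword path (the hitless update) would be chosen.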

Later patches will include CD updates in this mechanism so make the
implementation generic using a struct arm_smmu_entry_writer and struct
arm_smmu_entry_writer_ops to abstract the differences between STE and CD
so that either can be plugged in.

At this point it generates the same sequence of updates as the current
code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
extra sync (this seems to be an existing bug).

Going forward this will use a V=0 transition instead of cycling through
ABORT if a non-hitless change is required. This seems more appropriate as
ABORT will fail DMAs without any logging, but dropping a DMA due to a
transient V=0 is probably signaling a bug, so the C_BAD_STE event is
valuable.

Signed-off-by: Michael Shavit <mshavit@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 328 ++++++++++++++++----
 1 file changed, 261 insertions(+), 67 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 0ffb1cf17e0b2e..690742e8f173eb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
 	ARM_SMMU_MAX_MSIS,
 };
 
+struct arm_smmu_entry_writer_ops;
+struct arm_smmu_entry_writer {
+	const struct arm_smmu_entry_writer_ops *ops;
+	struct arm_smmu_master *master;
+};
+
+struct arm_smmu_entry_writer_ops {
+	unsigned int num_entry_qwords;
+	__le64 v_bit;
+	void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
+			 __le64 *used);
+	void (*sync)(struct arm_smmu_entry_writer *writer);
+};
+
+#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
+
 static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
 	[EVTQ_MSI_INDEX] = {
 		ARM_SMMU_EVTQ_IRQ_CFG0,
@@ -971,6 +987,140 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
 	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
 }
 
+/*
+ * Figure out if we can do a hitless update of entry to become target. Returns a
+ * bit mask where 1 indicates that qword needs to be set disruptively.
+ * unused_update is an intermediate value of entry that has unused bits set to
+ * their new values.
+ */
+static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
+				    const __le64 *entry, const __le64 *target,
+				    __le64 *unused_update)
+{
+	__le64 target_used[NUM_ENTRY_QWORDS] = {};
+	__le64 cur_used[NUM_ENTRY_QWORDS] = {};
+	u8 used_qword_diff = 0;
+	unsigned int i;
+
+	writer->ops->get_used(writer, entry, cur_used);
+	writer->ops->get_used(writer, target, target_used);
+
+	for (i = 0; i != writer->ops->num_entry_qwords; i++) {
+		/*
+		 * Check that masks are up to date, the make functions are not
+		 * allowed to set a bit to 1 if the used function doesn't say it
+		 * is used.
+		 */
+		WARN_ON_ONCE(target[i] & ~target_used[i]);
+
+		/* Bits can change because they are not currently being used */
+		unused_update[i] = (entry[i] & cur_used[i]) |
+				   (target[i] & ~cur_used[i]);
+		/*
+		 * Each bit indicates that a used bit in a qword needs to be
+		 * changed after unused_update is applied.
+		 */
+		if ((unused_update[i] & target_used[i]) != target[i])
+			used_qword_diff |= 1 << i;
+	}
+	return used_qword_diff;
+}
+
+static bool entry_set(struct arm_smmu_entry_writer *writer, __le64 *entry,
+		      const __le64 *target, unsigned int start,
+		      unsigned int len)
+{
+	bool changed = false;
+	unsigned int i;
+
+	for (i = start; len != 0; len--, i++) {
+		if (entry[i] != target[i]) {
+			WRITE_ONCE(entry[i], target[i]);
+			changed = true;
+		}
+	}
+
+	if (changed)
+		writer->ops->sync(writer);
+	return changed;
+}
+
+/*
+ * Update the STE/CD to the target configuration. The transition from the
+ * current entry to the target entry takes place over multiple steps that
+ * attempts to make the transition hitless if possible. This function takes care
+ * not to create a situation where the HW can perceive a corrupted entry. HW is
+ * only required to have a 64 bit atomicity with stores from the CPU, while
+ * entries are many 64 bit values big.
+ *
+ * The difference between the current value and the target value is analyzed to
+ * determine which of three updates are required - disruptive, hitless or no
+ * change.
+ *
+ * In the most general disruptive case we can make any update in three steps:
+ *  - Disrupting the entry (V=0)
+ *  - Fill now unused qwords, except qword 0 which contains V
+ *  - Make qword 0 have the final value and valid (V=1) with a single 64
+ *    bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification.
+ */
+static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
+				 __le64 *entry, const __le64 *target)
+{
+	unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
+	__le64 unused_update[NUM_ENTRY_QWORDS];
+	u8 used_qword_diff;
+
+	used_qword_diff =
+		arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
+	if (hweight8(used_qword_diff) > 1) {
+		/*
+		 * At least two qwords need their inuse bits to be changed. This
+		 * requires a breaking update, zero the V bit, write all qwords
+		 * but 0, then set qword 0
+		 */
+		unused_update[0] = entry[0] & (~writer->ops->v_bit);
+		entry_set(writer, entry, unused_update, 0, 1);
+		entry_set(writer, entry, target, 1, num_entry_qwords - 1);
+		entry_set(writer, entry, target, 0, 1);
+	} else if (hweight8(used_qword_diff) == 1) {
+		/*
+		 * Only one qword needs its used bits to be changed. This is a
+		 * hitless update, update all bits the current STE is ignoring
+		 * to their new values, then update a single "critical qword" to
+		 * change the STE and finally 0 out any bits that are now unused
+		 * in the target configuration.
+		 */
+		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
+
+		/*
+		 * Skip writing unused bits in the critical qword since we'll be
+		 * writing it in the next step anyways. This can save a sync
+		 * when the only change is in that qword.
+		 */
+		unused_update[critical_qword_index] =
+			entry[critical_qword_index];
+		entry_set(writer, entry, unused_update, 0, num_entry_qwords);
+		entry_set(writer, entry, target, critical_qword_index, 1);
+		entry_set(writer, entry, target, 0, num_entry_qwords);
+	} else {
+		/*
+		 * No inuse bit changed. Sanity check that all unused bits are 0
+		 * in the entry. The target was already sanity checked by
+		 * compute_qword_diff().
+		 */
+		WARN_ON_ONCE(
+			entry_set(writer, entry, target, 0, num_entry_qwords));
+	}
+}
+
 static void arm_smmu_sync_cd(struct arm_smmu_master *master,
 			     int ssid, bool leaf)
 {
@@ -1238,50 +1388,123 @@ arm_smmu_write_strtab_l1_desc(__le64 *dst, struct arm_smmu_strtab_l1_desc *desc)
 	WRITE_ONCE(*dst, cpu_to_le64(val));
 }
 
-static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
+struct arm_smmu_ste_writer {
+	struct arm_smmu_entry_writer writer;
+	u32 sid;
+};
+
+/*
+ * Based on the value of ent report which bits of the STE the HW will access. It
+ * would be nice if this was complete according to the spec, but minimally it
+ * has to capture the bits this driver uses.
+ */
+static void arm_smmu_get_ste_used(struct arm_smmu_entry_writer *writer,
+				  const __le64 *ent, __le64 *used_bits)
 {
+	used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
+	if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
+		return;
+
+	/*
+	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
+	 * attribute/permission configuration fields for the SHCFG behavior.
+	 */
+	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
+	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
+		    STRTAB_STE_1_S1DSS_BYPASS)
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+
+	used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
+	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
+	case STRTAB_STE_0_CFG_ABORT:
+		break;
+	case STRTAB_STE_0_CFG_BYPASS:
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
+		break;
+	case STRTAB_STE_0_CFG_S1_TRANS:
+		used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
+					    STRTAB_STE_0_S1CTXPTR_MASK |
+					    STRTAB_STE_0_S1CDMAX);
+		used_bits[1] |=
+			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
+				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
+				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
+		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
+		break;
+	case STRTAB_STE_0_CFG_S2_TRANS:
+		used_bits[1] |=
+			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
+		used_bits[2] |=
+			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
+				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
+				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
+		used_bits[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
+		break;
+
+	default:
+		memset(used_bits, 0xFF, sizeof(struct arm_smmu_ste));
+		WARN_ON(true);
+	}
+}
+
+static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
+{
+	struct arm_smmu_ste_writer *ste_writer =
+		container_of(writer, struct arm_smmu_ste_writer, writer);
 	struct arm_smmu_cmdq_ent cmd = {
 		.opcode	= CMDQ_OP_CFGI_STE,
 		.cfgi	= {
-			.sid	= sid,
+			.sid	= ste_writer->sid,
 			.leaf	= true,
 		},
 	};
 
-	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
+	arm_smmu_cmdq_issue_cmd_with_sync(writer->master->smmu, &cmd);
+}
+
+static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
+	.sync = arm_smmu_ste_writer_sync_entry,
+	.get_used = arm_smmu_get_ste_used,
+	.v_bit = cpu_to_le64(STRTAB_STE_0_V),
+	.num_entry_qwords = sizeof(struct arm_smmu_ste) / sizeof(u64),
+};
+
+static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
+			       struct arm_smmu_ste *ste,
+			       const struct arm_smmu_ste *target)
+{
+	struct arm_smmu_device *smmu = master->smmu;
+	struct arm_smmu_ste_writer ste_writer = {
+		.writer = {
+			.ops = &arm_smmu_ste_writer_ops,
+			.master = master,
+		},
+		.sid = sid,
+	};
+
+	arm_smmu_write_entry(&ste_writer.writer, ste->data, target->data);
+
+	/* It's likely that we'll want to use the new STE soon */
+	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
+		struct arm_smmu_cmdq_ent
+			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
+					 .prefetch = {
+						 .sid = sid,
+					 } };
+
+		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	}
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
-	/*
-	 * This is hideously complicated, but we only really care about
-	 * three cases at the moment:
-	 *
-	 * 1. Invalid (all zero) -> bypass/fault (init)
-	 * 2. Bypass/fault -> translation/bypass (attach)
-	 * 3. Translation/bypass -> bypass/fault (detach)
-	 *
-	 * Given that we can't update the STE atomically and the SMMU
-	 * doesn't read the thing in a defined order, that leaves us
-	 * with the following maintenance requirements:
-	 *
-	 * 1. Update Config, return (init time STEs aren't live)
-	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
-	 * 3. Update Config, sync
-	 */
-	u64 val = le64_to_cpu(dst->data[0]);
-	bool ste_live = false;
+	u64 val;
 	struct arm_smmu_device *smmu = master->smmu;
 	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
 	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_cmdq_ent prefetch_cmd = {
-		.opcode		= CMDQ_OP_PREFETCH_CFG,
-		.prefetch	= {
-			.sid	= sid,
-		},
-	};
+	struct arm_smmu_ste target = {};
 
 	if (smmu_domain) {
 		switch (smmu_domain->stage) {
@@ -1296,22 +1519,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		}
 	}
 
-	if (val & STRTAB_STE_0_V) {
-		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
-		case STRTAB_STE_0_CFG_BYPASS:
-			break;
-		case STRTAB_STE_0_CFG_S1_TRANS:
-		case STRTAB_STE_0_CFG_S2_TRANS:
-			ste_live = true;
-			break;
-		case STRTAB_STE_0_CFG_ABORT:
-			BUG_ON(!disable_bypass);
-			break;
-		default:
-			BUG(); /* STE corruption */
-		}
-	}
-
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
@@ -1322,16 +1529,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		else
 			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
-		dst->data[0] = cpu_to_le64(val);
-		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
+		target.data[0] = cpu_to_le64(val);
+		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
 						STRTAB_STE_1_SHCFG_INCOMING));
-		dst->data[2] = 0; /* Nuke the VMID */
-		/*
-		 * The SMMU can perform negative caching, so we must sync
-		 * the STE regardless of whether the old value was live.
-		 */
-		if (smmu)
-			arm_smmu_sync_ste_for_sid(smmu, sid);
+		target.data[2] = 0; /* Nuke the VMID */
+		arm_smmu_write_ste(master, sid, dst, &target);
 		return;
 	}
 
@@ -1339,8 +1541,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
 
-		BUG_ON(ste_live);
-		dst->data[1] = cpu_to_le64(
+		target.data[1] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
 			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
 			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
@@ -1349,7 +1550,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
 		    !master->stall_enabled)
-			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
+			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
 
 		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
 			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
@@ -1358,8 +1559,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	}
 
 	if (s2_cfg) {
-		BUG_ON(ste_live);
-		dst->data[2] = cpu_to_le64(
+		target.data[2] = cpu_to_le64(
 			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
 			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
 #ifdef __BIG_ENDIAN
@@ -1368,23 +1568,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
 			 STRTAB_STE_2_S2R);
 
-		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
 
 		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
 	}
 
 	if (master->ats_enabled)
-		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
+		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
 						 STRTAB_STE_1_EATS_TRANS));
 
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-	/* See comment in arm_smmu_write_ctx_desc() */
-	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
-	arm_smmu_sync_ste_for_sid(smmu, sid);
-
-	/* It's likely that we'll want to use the new STE soon */
-	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
-		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
+	target.data[0] = cpu_to_le64(val);
+	arm_smmu_write_ste(master, sid, dst, &target);
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
-- 
2.43.0



* [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-31 14:40   ` Mostafa Saleh
  2024-01-25 23:57 ` [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste() Jason Gunthorpe
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

This allows writing the flow of arm_smmu_write_strtab_ent() around abort
and bypass domains more naturally.

Note that the core code no longer supplies NULL domains, though there is
still a flow in the driver that ends up in arm_smmu_write_strtab_ent() with
a NULL domain. A later patch will remove it.

Remove the duplicate calculation of the STE in arm_smmu_init_bypass_stes()
and remove the force parameter. arm_smmu_rmr_install_bypass_ste() can now
simply invoke arm_smmu_make_bypass_ste() directly.

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 89 +++++++++++----------
 1 file changed, 47 insertions(+), 42 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 690742e8f173eb..38bcb4ed1fccc1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1496,6 +1496,24 @@ static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
 	}
 }
 
+static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
+{
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
+}
+
+static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
+{
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS));
+	target->data[1] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
@@ -1506,37 +1524,31 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 	struct arm_smmu_domain *smmu_domain = master->domain;
 	struct arm_smmu_ste target = {};
 
-	if (smmu_domain) {
-		switch (smmu_domain->stage) {
-		case ARM_SMMU_DOMAIN_S1:
-			cd_table = &master->cd_table;
-			break;
-		case ARM_SMMU_DOMAIN_S2:
-			s2_cfg = &smmu_domain->s2_cfg;
-			break;
-		default:
-			break;
-		}
+	if (!smmu_domain) {
+		if (disable_bypass)
+			arm_smmu_make_abort_ste(&target);
+		else
+			arm_smmu_make_bypass_ste(&target);
+		arm_smmu_write_ste(master, sid, dst, &target);
+		return;
+	}
+
+	switch (smmu_domain->stage) {
+	case ARM_SMMU_DOMAIN_S1:
+		cd_table = &master->cd_table;
+		break;
+	case ARM_SMMU_DOMAIN_S2:
+		s2_cfg = &smmu_domain->s2_cfg;
+		break;
+	case ARM_SMMU_DOMAIN_BYPASS:
+		arm_smmu_make_bypass_ste(&target);
+		arm_smmu_write_ste(master, sid, dst, &target);
+		return;
 	}
 
 	/* Nuke the existing STE_0 value, as we're going to rewrite it */
 	val = STRTAB_STE_0_V;
 
-	/* Bypass/fault */
-	if (!smmu_domain || !(cd_table || s2_cfg)) {
-		if (!smmu_domain && disable_bypass)
-			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
-		else
-			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
-
-		target.data[0] = cpu_to_le64(val);
-		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
-						STRTAB_STE_1_SHCFG_INCOMING));
-		target.data[2] = 0; /* Nuke the VMID */
-		arm_smmu_write_ste(master, sid, dst, &target);
-		return;
-	}
-
 	if (cd_table) {
 		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
 			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
@@ -1582,21 +1594,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 }
 
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
-				      unsigned int nent, bool force)
+				      unsigned int nent)
 {
 	unsigned int i;
-	u64 val = STRTAB_STE_0_V;
-
-	if (disable_bypass && !force)
-		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
-	else
-		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
 
 	for (i = 0; i < nent; ++i) {
-		strtab->data[0] = cpu_to_le64(val);
-		strtab->data[1] = cpu_to_le64(FIELD_PREP(
-			STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
-		strtab->data[2] = 0;
+		if (disable_bypass)
+			arm_smmu_make_abort_ste(strtab);
+		else
+			arm_smmu_make_bypass_ste(strtab);
 		strtab++;
 	}
 }
@@ -1624,7 +1630,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
 		return -ENOMEM;
 	}
 
-	arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT, false);
+	arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT);
 	arm_smmu_write_strtab_l1_desc(strtab, desc);
 	return 0;
 }
@@ -3243,7 +3249,7 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 	reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
 	cfg->strtab_base_cfg = reg;
 
-	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents, false);
+	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
 	return 0;
 }
 
@@ -3954,7 +3960,6 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
 	iort_get_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
 
 	list_for_each_entry(e, &rmr_list, list) {
-		struct arm_smmu_ste *step;
 		struct iommu_iort_rmr_data *rmr;
 		int ret, i;
 
@@ -3967,8 +3972,8 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
 				continue;
 			}
 
-			step = arm_smmu_get_step_for_sid(smmu, rmr->sids[i]);
-			arm_smmu_init_bypass_stes(step, 1, true);
+			arm_smmu_make_bypass_ste(
+				arm_smmu_get_step_for_sid(smmu, rmr->sids[i]));
 		}
 	}
 
-- 
2.43.0



* [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste()
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-29 15:07   ` Shameerali Kolothum Thodi
  2024-01-25 23:57 ` [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions Jason Gunthorpe
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Logically arm_smmu_init_strtab_linear() is the function that allocates and
populates the stream table with the initial value of the STEs. After this
function returns the stream table should be fully ready.

arm_smmu_rmr_install_bypass_ste() adjusts the initial stream table to force
any SIDs that the FW says have IOMMU_RESV_DIRECT to use bypass. This
ensures there is no disruption to the identity mapping during boot.

Put arm_smmu_rmr_install_bypass_ste() into arm_smmu_init_strtab_linear(),
as it already executes immediately after arm_smmu_init_strtab_linear().

No functional change intended.

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 38bcb4ed1fccc1..df8fc7b87a7907 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -102,6 +102,8 @@ static struct arm_smmu_option_prop arm_smmu_options[] = {
 	{ 0, NULL},
 };
 
+static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu);
+
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
 	int i = 0;
@@ -3250,6 +3252,9 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
 	cfg->strtab_base_cfg = reg;
 
 	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
+
+	/* Check for RMRs and install bypass STEs if any */
+	arm_smmu_rmr_install_bypass_ste(smmu);
 	return 0;
 }
 
@@ -4063,9 +4068,6 @@ static int arm_smmu_device_probe(struct platform_device *pdev)
 	/* Record our private device structure */
 	platform_set_drvdata(pdev, smmu);
 
-	/* Check for RMRs and install bypass STEs if any */
-	arm_smmu_rmr_install_bypass_ste(smmu);
-
 	/* Reset the device */
 	ret = arm_smmu_device_reset(smmu, bypass);
 	if (ret)
-- 
2.43.0



* [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (2 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste() Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-31 14:50   ` Mostafa Saleh
  2024-01-25 23:57 ` [PATCH v4 05/16] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste() Jason Gunthorpe
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

This is preparation to move the STE calculation higher up into the call
chain and remove arm_smmu_write_strtab_ent(). These new functions will be
called directly from attach_dev.

Reviewed-by: Moritz Fischer <mdf@kernel.org>
Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 115 +++++++++++---------
 1 file changed, 62 insertions(+), 53 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index df8fc7b87a7907..910156881423e0 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1516,13 +1516,68 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
 		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
 }
 
+static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
+				      struct arm_smmu_master *master,
+				      struct arm_smmu_ctx_desc_cfg *cd_table)
+{
+	struct arm_smmu_device *smmu = master->smmu;
+
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
+		FIELD_PREP(STRTAB_STE_0_S1FMT, cd_table->s1fmt) |
+		(cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
+		FIELD_PREP(STRTAB_STE_0_S1CDMAX, cd_table->s1cdmax));
+
+	target->data[1] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
+		FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+		FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
+		FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
+		((smmu->features & ARM_SMMU_FEAT_STALLS &&
+		  !master->stall_enabled) ?
+			 STRTAB_STE_1_S1STALLD :
+			 0) |
+		FIELD_PREP(STRTAB_STE_1_EATS,
+			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0) |
+		FIELD_PREP(STRTAB_STE_1_STRW,
+			   (smmu->features & ARM_SMMU_FEAT_E2H) ?
+				   STRTAB_STE_1_STRW_EL2 :
+				   STRTAB_STE_1_STRW_NSEL1));
+}
+
+static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
+					struct arm_smmu_master *master,
+					struct arm_smmu_domain *smmu_domain)
+{
+	struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
+
+	memset(target, 0, sizeof(*target));
+	target->data[0] = cpu_to_le64(
+		STRTAB_STE_0_V |
+		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS));
+
+	target->data[1] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_1_EATS,
+			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
+
+	target->data[2] = cpu_to_le64(
+		FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
+		FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
+		STRTAB_STE_2_S2AA64 |
+#ifdef __BIG_ENDIAN
+		STRTAB_STE_2_S2ENDI |
+#endif
+		STRTAB_STE_2_S2PTW |
+		STRTAB_STE_2_S2R);
+
+	target->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+}
+
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 				      struct arm_smmu_ste *dst)
 {
-	u64 val;
-	struct arm_smmu_device *smmu = master->smmu;
-	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
-	struct arm_smmu_s2_cfg *s2_cfg = NULL;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 	struct arm_smmu_ste target = {};
 
@@ -1537,61 +1592,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
 
 	switch (smmu_domain->stage) {
 	case ARM_SMMU_DOMAIN_S1:
-		cd_table = &master->cd_table;
+		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
 		break;
 	case ARM_SMMU_DOMAIN_S2:
-		s2_cfg = &smmu_domain->s2_cfg;
+		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
 		break;
 	case ARM_SMMU_DOMAIN_BYPASS:
 		arm_smmu_make_bypass_ste(&target);
-		arm_smmu_write_ste(master, sid, dst, &target);
-		return;
+		break;
 	}
-
-	/* Nuke the existing STE_0 value, as we're going to rewrite it */
-	val = STRTAB_STE_0_V;
-
-	if (cd_table) {
-		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
-			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
-
-		target.data[1] = cpu_to_le64(
-			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
-			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
-			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
-			 FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
-			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
-
-		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
-		    !master->stall_enabled)
-			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
-
-		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
-			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
-			FIELD_PREP(STRTAB_STE_0_S1CDMAX, cd_table->s1cdmax) |
-			FIELD_PREP(STRTAB_STE_0_S1FMT, cd_table->s1fmt);
-	}
-
-	if (s2_cfg) {
-		target.data[2] = cpu_to_le64(
-			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
-			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
-#ifdef __BIG_ENDIAN
-			 STRTAB_STE_2_S2ENDI |
-#endif
-			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
-			 STRTAB_STE_2_S2R);
-
-		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
-
-		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
-	}
-
-	if (master->ats_enabled)
-		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
-						 STRTAB_STE_1_EATS_TRANS));
-
-	target.data[0] = cpu_to_le64(val);
 	arm_smmu_write_ste(master, sid, dst, &target);
 }
 
-- 
2.43.0



* [PATCH v4 05/16] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste()
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (3 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-02-01 11:34   ` Mostafa Saleh
  2024-01-25 23:57 ` [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev Jason Gunthorpe
                   ` (10 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Half the code was living in arm_smmu_domain_finalise_s2(), so just move it
here and take the values directly from the pgtbl_ops instead of storing
copies.

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 ++++++++++++---------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 --
 2 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 910156881423e0..9a95d0f1494223 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1552,6 +1552,11 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 					struct arm_smmu_domain *smmu_domain)
 {
 	struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
+	const struct io_pgtable_cfg *pgtbl_cfg =
+		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
+	typeof(&pgtbl_cfg->arm_lpae_s2_cfg.vtcr) vtcr =
+		&pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
+	u64 vtcr_val;
 
 	memset(target, 0, sizeof(*target));
 	target->data[0] = cpu_to_le64(
@@ -1562,9 +1567,16 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 		FIELD_PREP(STRTAB_STE_1_EATS,
 			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
 
+	vtcr_val = FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, vtcr->tsz) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, vtcr->sl) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, vtcr->irgn) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, vtcr->orgn) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, vtcr->sh) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, vtcr->tg) |
+		   FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, vtcr->ps);
 	target->data[2] = cpu_to_le64(
 		FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
-		FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
+		FIELD_PREP(STRTAB_STE_2_VTCR, vtcr_val) |
 		STRTAB_STE_2_S2AA64 |
 #ifdef __BIG_ENDIAN
 		STRTAB_STE_2_S2ENDI |
@@ -1572,7 +1584,8 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 		STRTAB_STE_2_S2PTW |
 		STRTAB_STE_2_S2R);
 
-	target->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
+	target->data[3] = cpu_to_le64(pgtbl_cfg->arm_lpae_s2_cfg.vttbr &
+				      STRTAB_STE_3_S2TTB_MASK);
 }
 
 static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
@@ -2328,7 +2341,6 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 	int vmid;
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg;
-	typeof(&pgtbl_cfg->arm_lpae_s2_cfg.vtcr) vtcr;
 
 	/* Reserve VMID 0 for stage-2 bypass STEs */
 	vmid = ida_alloc_range(&smmu->vmid_map, 1, (1 << smmu->vmid_bits) - 1,
@@ -2336,16 +2348,7 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 	if (vmid < 0)
 		return vmid;
 
-	vtcr = &pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
 	cfg->vmid	= (u16)vmid;
-	cfg->vttbr	= pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
-	cfg->vtcr	= FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, vtcr->tsz) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, vtcr->sl) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, vtcr->irgn) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, vtcr->orgn) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, vtcr->sh) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, vtcr->tg) |
-			  FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, vtcr->ps);
 	return 0;
 }
 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 65fb388d51734d..eb669121f1954d 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -609,8 +609,6 @@ struct arm_smmu_ctx_desc_cfg {
 
 struct arm_smmu_s2_cfg {
 	u16				vmid;
-	u64				vttbr;
-	u64				vtcr;
 };
 
 struct arm_smmu_strtab_cfg {
-- 
2.43.0



* [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (4 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 05/16] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste() Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-02-01 12:15   ` Mostafa Saleh
  2024-01-25 23:57 ` [PATCH v4 07/16] iommu/arm-smmu-v3: Compute the STE only once for each master Jason Gunthorpe
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

The BTM support wants to be able to change the ASID of any smmu_domain.
When it goes to do this it holds the arm_smmu_asid_lock and iterates over
the target domain's devices list.

During attach of an S1 domain we must ensure that the devices list and
CD are in sync, otherwise we could miss CD updates or a parallel CD update
could push an out of date CD.

This is pretty complicated, and almost works today because
arm_smmu_detach_dev() removes the master from the linked list before
working on the CD entries, preventing parallel update of the CD.

However, it does have an issue where the CD can remain programmed while the
domain appears to be unattached. arm_smmu_share_asid() will then not clear
any CD entries and will install its own CD entry with the same ASID
concurrently. This creates a small race window where the IOMMU can see two
ASIDs pointing to different translations.

Solve this by wrapping most of the attach flow in the
arm_smmu_asid_lock. This locks more than strictly needed to prepare for
the next patch which will reorganize the order of the linked list, STE and
CD changes.

Move arm_smmu_detach_dev() until after we have initialized the domain so
the lock can be held for less time.

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 22 ++++++++++++---------
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 9a95d0f1494223..539ef380f457fa 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2612,8 +2612,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		return -EBUSY;
 	}
 
-	arm_smmu_detach_dev(master);
-
 	mutex_lock(&smmu_domain->init_mutex);
 
 	if (!smmu_domain->smmu) {
@@ -2628,6 +2626,16 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	if (ret)
 		return ret;
 
+	/*
+	 * Prevent arm_smmu_share_asid() from trying to change the ASID
+	 * of either the old or new domain while we are working on it.
+	 * This allows the STE and the smmu_domain->devices list to
+	 * be inconsistent during this routine.
+	 */
+	mutex_lock(&arm_smmu_asid_lock);
+
+	arm_smmu_detach_dev(master);
+
 	master->domain = smmu_domain;
 
 	/*
@@ -2653,13 +2661,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			}
 		}
 
-		/*
-		 * Prevent SVA from concurrently modifying the CD or writing to
-		 * the CD entry
-		 */
-		mutex_lock(&arm_smmu_asid_lock);
 		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
-		mutex_unlock(&arm_smmu_asid_lock);
 		if (ret) {
 			master->domain = NULL;
 			goto out_list_del;
@@ -2669,13 +2671,15 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	arm_smmu_install_ste_for_dev(master);
 
 	arm_smmu_enable_ats(master);
-	return 0;
+	goto out_unlock;
 
 out_list_del:
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_del(&master->domain_head);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
+out_unlock:
+	mutex_unlock(&arm_smmu_asid_lock);
 	return ret;
 }
 
-- 
2.43.0



* [PATCH v4 07/16] iommu/arm-smmu-v3: Compute the STE only once for each master
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (5 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-02-01 12:18   ` Mostafa Saleh
  2024-01-25 23:57 ` [PATCH v4 08/16] iommu/arm-smmu-v3: Do not change the STE twice during arm_smmu_attach_dev() Jason Gunthorpe
                   ` (8 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Currently arm_smmu_install_ste_for_dev() iterates over every SID and
computes from scratch an identical STE. Every SID should have the same STE
contents. Turn this inside out so that the STE is supplied by the caller
and arm_smmu_install_ste_for_dev() simply installs it to every SID.

This is possible now that the STE generation does not determine what
sequence should be used to program it.

This allows splitting the STE calculation up according to the call site,
which following patches will make use of, and removes the confusing NULL
domain special case that only supported arm_smmu_detach_dev().

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 57 ++++++++-------------
 1 file changed, 22 insertions(+), 35 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 539ef380f457fa..cf3e348cb9abe1 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -1588,35 +1588,6 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
 				      STRTAB_STE_3_S2TTB_MASK);
 }
 
-static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
-				      struct arm_smmu_ste *dst)
-{
-	struct arm_smmu_domain *smmu_domain = master->domain;
-	struct arm_smmu_ste target = {};
-
-	if (!smmu_domain) {
-		if (disable_bypass)
-			arm_smmu_make_abort_ste(&target);
-		else
-			arm_smmu_make_bypass_ste(&target);
-		arm_smmu_write_ste(master, sid, dst, &target);
-		return;
-	}
-
-	switch (smmu_domain->stage) {
-	case ARM_SMMU_DOMAIN_S1:
-		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
-		break;
-	case ARM_SMMU_DOMAIN_S2:
-		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
-		break;
-	case ARM_SMMU_DOMAIN_BYPASS:
-		arm_smmu_make_bypass_ste(&target);
-		break;
-	}
-	arm_smmu_write_ste(master, sid, dst, &target);
-}
-
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
 				      unsigned int nent)
 {
@@ -2439,7 +2410,8 @@ arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
 	}
 }
 
-static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
+static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
+					 const struct arm_smmu_ste *target)
 {
 	int i, j;
 	struct arm_smmu_device *smmu = master->smmu;
@@ -2456,7 +2428,7 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
 		if (j < i)
 			continue;
 
-		arm_smmu_write_strtab_ent(master, sid, step);
+		arm_smmu_write_ste(master, sid, step, target);
 	}
 }
 
@@ -2563,6 +2535,7 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
 	unsigned long flags;
+	struct arm_smmu_ste target;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	if (!smmu_domain)
@@ -2576,7 +2549,11 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	arm_smmu_install_ste_for_dev(master);
+	if (disable_bypass)
+		arm_smmu_make_abort_ste(&target);
+	else
+		arm_smmu_make_bypass_ste(&target);
+	arm_smmu_install_ste_for_dev(master, &target);
 	/*
 	 * Clearing the CD entry isn't strictly required to detach the domain
 	 * since the table is uninstalled anyway, but it helps avoid confusion
@@ -2591,6 +2568,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 {
 	int ret = 0;
 	unsigned long flags;
+	struct arm_smmu_ste target;
 	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
 	struct arm_smmu_device *smmu;
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
@@ -2652,7 +2630,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	list_add(&master->domain_head, &smmu_domain->devices);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
+	switch (smmu_domain->stage) {
+	case ARM_SMMU_DOMAIN_S1:
 		if (!master->cd_table.cdtab) {
 			ret = arm_smmu_alloc_cd_tables(master);
 			if (ret) {
@@ -2666,9 +2645,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			master->domain = NULL;
 			goto out_list_del;
 		}
-	}
 
-	arm_smmu_install_ste_for_dev(master);
+		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
+		break;
+	case ARM_SMMU_DOMAIN_S2:
+		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
+		break;
+	case ARM_SMMU_DOMAIN_BYPASS:
+		arm_smmu_make_bypass_ste(&target);
+		break;
+	}
+	arm_smmu_install_ste_for_dev(master, &target);
 
 	arm_smmu_enable_ats(master);
 	goto out_unlock;
-- 
2.43.0



* [PATCH v4 08/16] iommu/arm-smmu-v3: Do not change the STE twice during arm_smmu_attach_dev()
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (6 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 07/16] iommu/arm-smmu-v3: Compute the STE only once for each master Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 09/16] iommu/arm-smmu-v3: Put writing the context descriptor in the right order Jason Gunthorpe
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

This was needed because the STE code required the STE to be in
ABORT/BYPASS in order to program a cdtable or S2 STE. Now that the STE
code can automatically handle all transitions, this intermediate step can
be removed from the attach_dev flow.

A few small bugs exist because of the intermediate step:

1) If the core code does BLOCKED -> UNMANAGED with disable_bypass=false
   then there will be a moment where the STE points at BYPASS. Since
   this can be done by VFIO/IOMMUFD it is a small security race.

2) If the core code does IDENTITY -> DMA then any IOMMU_RESV_DIRECT
   regions will temporarily become BLOCKED. We'd like drivers to
   work in a way that allows IOMMU_RESV_DIRECT to be continuously
   functional during these transitions.

Make arm_smmu_release_device() put the STE back to the correct
ABORT/BYPASS setting. Fix a bug where an IOMMU_RESV_DIRECT region was
ignored on this path.
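
For illustration only (not part of the patch): using the helpers introduced
earlier in the series, the change to the attach path can be sketched roughly
as follows, the old flow writing the STE twice and the new flow once:

	/* Old flow: arm_smmu_detach_dev() wrote an interim STE first */
	arm_smmu_detach_dev(master);			/* STE -> BYPASS or ABORT */
	arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
	arm_smmu_install_ste_for_dev(master, &target);	/* second STE write */

	/* New flow: detach no longer touches the STE, so the only write is the
	 * final one and there is no transient BYPASS/ABORT window.
	 */
	arm_smmu_detach_dev(master);			/* list/ATS teardown only */
	arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
	arm_smmu_install_ste_for_dev(master, &target);	/* single STE write */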

As noted before, the reordering of the linked list/STE/CD changes is OK
against concurrent arm_smmu_share_asid() because of the
arm_smmu_asid_lock.

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index cf3e348cb9abe1..bf5698643afc50 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2535,7 +2535,6 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
 	unsigned long flags;
-	struct arm_smmu_ste target;
 	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	if (!smmu_domain)
@@ -2549,11 +2548,6 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	if (disable_bypass)
-		arm_smmu_make_abort_ste(&target);
-	else
-		arm_smmu_make_bypass_ste(&target);
-	arm_smmu_install_ste_for_dev(master, &target);
 	/*
 	 * Clearing the CD entry isn't strictly required to detach the domain
 	 * since the table is uninstalled anyway, but it helps avoid confusion
@@ -2901,9 +2895,18 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 static void arm_smmu_release_device(struct device *dev)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+	struct arm_smmu_ste target;
 
 	if (WARN_ON(arm_smmu_master_sva_enabled(master)))
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
+
+	/* Put the STE back to what arm_smmu_init_strtab() sets */
+	if (disable_bypass && !dev->iommu->require_direct)
+		arm_smmu_make_abort_ste(&target);
+	else
+		arm_smmu_make_bypass_ste(&target);
+	arm_smmu_install_ste_for_dev(master, &target);
+
 	arm_smmu_detach_dev(master);
 	arm_smmu_disable_pasid(master);
 	arm_smmu_remove_master(master);
-- 
2.43.0



* [PATCH v4 09/16] iommu/arm-smmu-v3: Put writing the context descriptor in the right order
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (7 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 08/16] iommu/arm-smmu-v3: Do not change the STE twice during arm_smmu_attach_dev() Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 10/16] iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats() Jason Gunthorpe
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Get closer to the IOMMU API ideal that changes between domains can be
hitless. The ordering for the CD table entry is not entirely clean from
this perspective.

When switching away from a STE with a CD table programmed in it, we should
write the new STE first, then clear any old data in the CD entry.

If we are programming a CD table into a STE for the first time, then the
CD entry should be programmed before the STE is loaded.

If we are replacing a CD table entry when the STE already points at the CD
entry, then we just need to do the make/break sequence.

Lift this code out of arm_smmu_detach_dev() so it can all be sequenced
properly. The only other caller is arm_smmu_release_device() and it is
going to free the cdtable anyhow, so it doesn't matter what is in it.
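
As a condensed sketch (taken from the diff below rather than new code;
allocation and error handling omitted), the per-stage ordering becomes:

	switch (smmu_domain->stage) {
	case ARM_SMMU_DOMAIN_S1:
		/* The CD entry is made valid before the STE points at the table */
		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
		arm_smmu_install_ste_for_dev(master, &target);
		break;
	case ARM_SMMU_DOMAIN_S2:
		/* The STE stops referencing the CD table before the CD is cleared */
		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
		arm_smmu_install_ste_for_dev(master, &target);
		if (master->cd_table.cdtab)
			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
		break;
	}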

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 29 ++++++++++++++-------
 1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index bf5698643afc50..09b40ba35a9cee 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2548,14 +2548,6 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 
 	master->domain = NULL;
 	master->ats_enabled = false;
-	/*
-	 * Clearing the CD entry isn't strictly required to detach the domain
-	 * since the table is uninstalled anyway, but it helps avoid confusion
-	 * in the call to arm_smmu_write_ctx_desc on the next attach (which
-	 * expects the entry to be empty).
-	 */
-	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1 && master->cd_table.cdtab)
-		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
 }
 
 static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
@@ -2632,6 +2624,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 				master->domain = NULL;
 				goto out_list_del;
 			}
+		} else {
+			/*
+			 * arm_smmu_write_ctx_desc() relies on the entry being
+			 * invalid to work, clear any existing entry.
+			 */
+			ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
+						      NULL);
+			if (ret) {
+				master->domain = NULL;
+				goto out_list_del;
+			}
 		}
 
 		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
@@ -2641,15 +2644,23 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		}
 
 		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
+		arm_smmu_install_ste_for_dev(master, &target);
 		break;
 	case ARM_SMMU_DOMAIN_S2:
 		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
+		arm_smmu_install_ste_for_dev(master, &target);
+		if (master->cd_table.cdtab)
+			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
+						      NULL);
 		break;
 	case ARM_SMMU_DOMAIN_BYPASS:
 		arm_smmu_make_bypass_ste(&target);
+		arm_smmu_install_ste_for_dev(master, &target);
+		if (master->cd_table.cdtab)
+			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
+						      NULL);
 		break;
 	}
-	arm_smmu_install_ste_for_dev(master, &target);
 
 	arm_smmu_enable_ats(master);
 	goto out_unlock;
-- 
2.43.0



* [PATCH v4 10/16] iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats()
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (8 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 09/16] iommu/arm-smmu-v3: Put writing the context descriptor in the right order Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 11/16] iommu/arm-smmu-v3: Remove arm_smmu_master->domain Jason Gunthorpe
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

The caller already has the domain, just pass it in. A following patch will
remove master->domain.

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 09b40ba35a9cee..d7b0cea140f12b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2447,12 +2447,12 @@ static bool arm_smmu_ats_supported(struct arm_smmu_master *master)
 	return dev_is_pci(dev) && pci_ats_supported(to_pci_dev(dev));
 }
 
-static void arm_smmu_enable_ats(struct arm_smmu_master *master)
+static void arm_smmu_enable_ats(struct arm_smmu_master *master,
+				struct arm_smmu_domain *smmu_domain)
 {
 	size_t stu;
 	struct pci_dev *pdev;
 	struct arm_smmu_device *smmu = master->smmu;
-	struct arm_smmu_domain *smmu_domain = master->domain;
 
 	/* Don't enable ATS at the endpoint if it's not enabled in the STE */
 	if (!master->ats_enabled)
@@ -2468,10 +2468,9 @@ static void arm_smmu_enable_ats(struct arm_smmu_master *master)
 		dev_err(master->dev, "Failed to enable ATS (STU %zu)\n", stu);
 }
 
-static void arm_smmu_disable_ats(struct arm_smmu_master *master)
+static void arm_smmu_disable_ats(struct arm_smmu_master *master,
+				 struct arm_smmu_domain *smmu_domain)
 {
-	struct arm_smmu_domain *smmu_domain = master->domain;
-
 	if (!master->ats_enabled)
 		return;
 
@@ -2540,7 +2539,7 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 	if (!smmu_domain)
 		return;
 
-	arm_smmu_disable_ats(master);
+	arm_smmu_disable_ats(master, smmu_domain);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_del(&master->domain_head);
@@ -2662,7 +2661,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 		break;
 	}
 
-	arm_smmu_enable_ats(master);
+	arm_smmu_enable_ats(master, smmu_domain);
 	goto out_unlock;
 
 out_list_del:
-- 
2.43.0



* [PATCH v4 11/16] iommu/arm-smmu-v3: Remove arm_smmu_master->domain
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (9 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 10/16] iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats() Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain Jason Gunthorpe
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Introducing global statics which are of type struct iommu_domain, not
struct arm_smmu_domain, makes it difficult to retain
arm_smmu_master->domain, as it can no longer point to an IDENTITY or
BLOCKED domain.

The only place that uses the value is arm_smmu_detach_dev(). Change things
to work like other drivers and call iommu_get_domain_for_dev() to obtain
the current domain.

The master->domain pointer was subtly protecting the domain_head from
being used while the master was not attached; instead, keep domain_head
INIT'd when the master is not attached to a domain, rather than leaving it
as garbage/zero.
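
For background (standard <linux/list.h> behaviour, not code added by this
patch): a list_head that is kept initialized can be passed to list_del_init()
whether or not it is currently linked, which is what makes the unconditional
INIT/teardown pattern safe:

	INIT_LIST_HEAD(&master->domain_head);	/* at probe: head points at itself */

	/*
	 * Later, attached or not, this is safe: it unlinks the entry if it is
	 * on a domain's devices list and leaves it re-initialized either way.
	 */
	list_del_init(&master->domain_head);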

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 26 ++++++++-------------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d7b0cea140f12b..f08cfa9b90b3eb 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2533,19 +2533,20 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
 
 static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 {
+	struct iommu_domain *domain = iommu_get_domain_for_dev(master->dev);
+	struct arm_smmu_domain *smmu_domain;
 	unsigned long flags;
-	struct arm_smmu_domain *smmu_domain = master->domain;
 
-	if (!smmu_domain)
+	if (!domain)
 		return;
 
+	smmu_domain = to_smmu_domain(domain);
 	arm_smmu_disable_ats(master, smmu_domain);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
-	list_del(&master->domain_head);
+	list_del_init(&master->domain_head);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
-	master->domain = NULL;
 	master->ats_enabled = false;
 }
 
@@ -2599,8 +2600,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	arm_smmu_detach_dev(master);
 
-	master->domain = smmu_domain;
-
 	/*
 	 * The SMMU does not support enabling ATS with bypass. When the STE is
 	 * in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests and
@@ -2619,10 +2618,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	case ARM_SMMU_DOMAIN_S1:
 		if (!master->cd_table.cdtab) {
 			ret = arm_smmu_alloc_cd_tables(master);
-			if (ret) {
-				master->domain = NULL;
+			if (ret)
 				goto out_list_del;
-			}
 		} else {
 			/*
 			 * arm_smmu_write_ctx_desc() relies on the entry being
@@ -2630,17 +2627,13 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			 */
 			ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
 						      NULL);
-			if (ret) {
-				master->domain = NULL;
+			if (ret)
 				goto out_list_del;
-			}
 		}
 
 		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
-		if (ret) {
-			master->domain = NULL;
+		if (ret)
 			goto out_list_del;
-		}
 
 		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
 		arm_smmu_install_ste_for_dev(master, &target);
@@ -2666,7 +2659,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 out_list_del:
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
-	list_del(&master->domain_head);
+	list_del_init(&master->domain_head);
 	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
 
 out_unlock:
@@ -2867,6 +2860,7 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 	master->dev = dev;
 	master->smmu = smmu;
 	INIT_LIST_HEAD(&master->bonds);
+	INIT_LIST_HEAD(&master->domain_head);
 	dev_iommu_priv_set(dev, master);
 
 	ret = arm_smmu_insert_master(smmu, master);
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index eb669121f1954d..6b63ea7dae72da 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -695,7 +695,6 @@ struct arm_smmu_stream {
 struct arm_smmu_master {
 	struct arm_smmu_device		*smmu;
 	struct device			*dev;
-	struct arm_smmu_domain		*domain;
 	struct list_head		domain_head;
 	struct arm_smmu_stream		*streams;
 	/* Locked by the iommu core using the group mutex */
-- 
2.43.0



* [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (10 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 11/16] iommu/arm-smmu-v3: Remove arm_smmu_master->domain Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-29 18:11   ` Shameerali Kolothum Thodi
  2024-01-25 23:57 ` [PATCH v4 13/16] iommu/arm-smmu-v3: Add a global static BLOCKED domain Jason Gunthorpe
                   ` (3 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Move to the new static global for identity domains. Move all the logic out
of arm_smmu_attach_dev into an identity-only function.

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 +++++++++++++++------
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
 2 files changed, 58 insertions(+), 25 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f08cfa9b90b3eb..d35bf9655c9b1b 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2226,8 +2226,7 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 		return arm_smmu_sva_domain_alloc();
 
 	if (type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_DMA &&
-	    type != IOMMU_DOMAIN_IDENTITY)
+	    type != IOMMU_DOMAIN_DMA)
 		return NULL;
 
 	/*
@@ -2335,11 +2334,6 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
 	struct arm_smmu_device *smmu = smmu_domain->smmu;
 
-	if (domain->type == IOMMU_DOMAIN_IDENTITY) {
-		smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
-		return 0;
-	}
-
 	/* Restrict the stage to what we can actually support */
 	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
 		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
@@ -2537,7 +2531,7 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
 	struct arm_smmu_domain *smmu_domain;
 	unsigned long flags;
 
-	if (!domain)
+	if (!domain || !(domain->type & __IOMMU_DOMAIN_PAGING))
 		return;
 
 	smmu_domain = to_smmu_domain(domain);
@@ -2600,15 +2594,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 
 	arm_smmu_detach_dev(master);
 
-	/*
-	 * The SMMU does not support enabling ATS with bypass. When the STE is
-	 * in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests and
-	 * Translated transactions are denied as though ATS is disabled for the
-	 * stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
-	 * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
-	 */
-	if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
-		master->ats_enabled = arm_smmu_ats_supported(master);
+	master->ats_enabled = arm_smmu_ats_supported(master);
 
 	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
 	list_add(&master->domain_head, &smmu_domain->devices);
@@ -2645,13 +2631,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
 						      NULL);
 		break;
-	case ARM_SMMU_DOMAIN_BYPASS:
-		arm_smmu_make_bypass_ste(&target);
-		arm_smmu_install_ste_for_dev(master, &target);
-		if (master->cd_table.cdtab)
-			arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
-						      NULL);
-		break;
 	}
 
 	arm_smmu_enable_ats(master, smmu_domain);
@@ -2667,6 +2646,60 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	return ret;
 }
 
+static int arm_smmu_attach_dev_ste(struct device *dev,
+				   struct arm_smmu_ste *ste)
+{
+	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+
+	if (arm_smmu_master_sva_enabled(master))
+		return -EBUSY;
+
+	/*
+	 * Do not allow any ASID to be changed while we are working on the STE,
+	 * otherwise we could miss invalidations.
+	 */
+	mutex_lock(&arm_smmu_asid_lock);
+
+	/*
+	 * The SMMU does not support enabling ATS with bypass/abort. When the
+	 * STE is in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests
+	 * and Translated transactions are denied as though ATS is disabled for
+	 * the stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
+	 * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
+	 */
+	arm_smmu_detach_dev(master);
+
+	arm_smmu_install_ste_for_dev(master, ste);
+	mutex_unlock(&arm_smmu_asid_lock);
+
+	/*
+	 * This has to be done after removing the master from the
+	 * arm_smmu_domain->devices to avoid races updating the same context
+	 * descriptor from arm_smmu_share_asid().
+	 */
+	if (master->cd_table.cdtab)
+		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, NULL);
+	return 0;
+}
+
+static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_make_bypass_ste(&ste);
+	return arm_smmu_attach_dev_ste(dev, &ste);
+}
+
+static const struct iommu_domain_ops arm_smmu_identity_ops = {
+	.attach_dev = arm_smmu_attach_dev_identity,
+};
+
+static struct iommu_domain arm_smmu_identity_domain = {
+	.type = IOMMU_DOMAIN_IDENTITY,
+	.ops = &arm_smmu_identity_ops,
+};
+
 static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
 			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
 			      int prot, gfp_t gfp, size_t *mapped)
@@ -3056,6 +3089,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
 }
 
 static struct iommu_ops arm_smmu_ops = {
+	.identity_domain	= &arm_smmu_identity_domain,
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
 	.probe_device		= arm_smmu_probe_device,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
index 6b63ea7dae72da..23baf117e7e4b5 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
@@ -712,7 +712,6 @@ struct arm_smmu_master {
 enum arm_smmu_domain_stage {
 	ARM_SMMU_DOMAIN_S1 = 0,
 	ARM_SMMU_DOMAIN_S2,
-	ARM_SMMU_DOMAIN_BYPASS,
 };
 
 struct arm_smmu_domain {
-- 
2.43.0



* [PATCH v4 13/16] iommu/arm-smmu-v3: Add a global static BLOCKED domain
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (11 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 14/16] iommu/arm-smmu-v3: Use the identity/blocked domain during release Jason Gunthorpe
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Using the same design as the IDENTITY domain, install an
STRTAB_STE_0_CFG_ABORT STE.

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index d35bf9655c9b1b..15e305253ddbb3 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2700,6 +2700,24 @@ static struct iommu_domain arm_smmu_identity_domain = {
 	.ops = &arm_smmu_identity_ops,
 };
 
+static int arm_smmu_attach_dev_blocked(struct iommu_domain *domain,
+					struct device *dev)
+{
+	struct arm_smmu_ste ste;
+
+	arm_smmu_make_abort_ste(&ste);
+	return arm_smmu_attach_dev_ste(dev, &ste);
+}
+
+static const struct iommu_domain_ops arm_smmu_blocked_ops = {
+	.attach_dev = arm_smmu_attach_dev_blocked,
+};
+
+static struct iommu_domain arm_smmu_blocked_domain = {
+	.type = IOMMU_DOMAIN_BLOCKED,
+	.ops = &arm_smmu_blocked_ops,
+};
+
 static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned long iova,
 			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
 			      int prot, gfp_t gfp, size_t *mapped)
@@ -3090,6 +3108,7 @@ static void arm_smmu_remove_dev_pasid(struct device *dev, ioasid_t pasid)
 
 static struct iommu_ops arm_smmu_ops = {
 	.identity_domain	= &arm_smmu_identity_domain,
+	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
 	.probe_device		= arm_smmu_probe_device,
-- 
2.43.0



* [PATCH v4 14/16] iommu/arm-smmu-v3: Use the identity/blocked domain during release
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (12 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 13/16] iommu/arm-smmu-v3: Add a global static BLOCKED domain Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 15/16] iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to finalize Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 16/16] iommu/arm-smmu-v3: Convert to domain_alloc_paging() Jason Gunthorpe
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Consolidate some more code by having release call
arm_smmu_attach_dev_identity/blocked() instead of open coding this.

Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 15e305253ddbb3..92a72ec6ee974a 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2950,19 +2950,16 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 static void arm_smmu_release_device(struct device *dev)
 {
 	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
-	struct arm_smmu_ste target;
 
 	if (WARN_ON(arm_smmu_master_sva_enabled(master)))
 		iopf_queue_remove_device(master->smmu->evtq.iopf, dev);
 
 	/* Put the STE back to what arm_smmu_init_strtab() sets */
 	if (disable_bypass && !dev->iommu->require_direct)
-		arm_smmu_make_abort_ste(&target);
+		arm_smmu_attach_dev_blocked(&arm_smmu_blocked_domain, dev);
 	else
-		arm_smmu_make_bypass_ste(&target);
-	arm_smmu_install_ste_for_dev(master, &target);
+		arm_smmu_attach_dev_identity(&arm_smmu_identity_domain, dev);
 
-	arm_smmu_detach_dev(master);
 	arm_smmu_disable_pasid(master);
 	arm_smmu_remove_master(master);
 	if (master->cd_table.cdtab)
-- 
2.43.0



* [PATCH v4 15/16] iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to finalize
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (13 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 14/16] iommu/arm-smmu-v3: Use the identity/blocked domain during release Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  2024-01-25 23:57 ` [PATCH v4 16/16] iommu/arm-smmu-v3: Convert to domain_alloc_paging() Jason Gunthorpe
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Instead of putting container_of() casts in the internals, use the proper
type in this call chain. This makes it easier to check that the two global
static domains are not leaking into call chains they should not.

Passing the smmu avoids the only caller having to set it and then unset it
in the error path.
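
Condensed from the diff below (not new code), the signature change is:

	/* Before: the internals start from the core's iommu_domain */
	static int arm_smmu_domain_finalise(struct iommu_domain *domain);

	/*
	 * After: the driver types are passed explicitly, and smmu_domain->smmu
	 * is only assigned once finalise has succeeded.
	 */
	static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
					    struct arm_smmu_device *smmu);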

Reviewed-by: Michael Shavit <mshavit@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 34 ++++++++++-----------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 92a72ec6ee974a..f4543bf0c18a49 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -103,6 +103,8 @@ static struct arm_smmu_option_prop arm_smmu_options[] = {
 };
 
 static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu);
+static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
+				    struct arm_smmu_device *smmu);
 
 static void parse_driver_options(struct arm_smmu_device *smmu)
 {
@@ -2268,12 +2270,12 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
 	kfree(smmu_domain);
 }
 
-static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
+static int arm_smmu_domain_finalise_s1(struct arm_smmu_device *smmu,
+				       struct arm_smmu_domain *smmu_domain,
 				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	int ret;
 	u32 asid;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_ctx_desc *cd = &smmu_domain->cd;
 	typeof(&pgtbl_cfg->arm_lpae_s1_cfg.tcr) tcr = &pgtbl_cfg->arm_lpae_s1_cfg.tcr;
 
@@ -2305,11 +2307,11 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
 	return ret;
 }
 
-static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
+static int arm_smmu_domain_finalise_s2(struct arm_smmu_device *smmu,
+				       struct arm_smmu_domain *smmu_domain,
 				       struct io_pgtable_cfg *pgtbl_cfg)
 {
 	int vmid;
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
 	struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg;
 
 	/* Reserve VMID 0 for stage-2 bypass STEs */
@@ -2322,17 +2324,17 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
 	return 0;
 }
 
-static int arm_smmu_domain_finalise(struct iommu_domain *domain)
+static int arm_smmu_domain_finalise(struct arm_smmu_domain *smmu_domain,
+				    struct arm_smmu_device *smmu)
 {
 	int ret;
 	unsigned long ias, oas;
 	enum io_pgtable_fmt fmt;
 	struct io_pgtable_cfg pgtbl_cfg;
 	struct io_pgtable_ops *pgtbl_ops;
-	int (*finalise_stage_fn)(struct arm_smmu_domain *,
-				 struct io_pgtable_cfg *);
-	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
-	struct arm_smmu_device *smmu = smmu_domain->smmu;
+	int (*finalise_stage_fn)(struct arm_smmu_device *smmu,
+				 struct arm_smmu_domain *smmu_domain,
+				 struct io_pgtable_cfg *pgtbl_cfg);
 
 	/* Restrict the stage to what we can actually support */
 	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
@@ -2371,17 +2373,18 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain)
 	if (!pgtbl_ops)
 		return -ENOMEM;
 
-	domain->pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
-	domain->geometry.aperture_end = (1UL << pgtbl_cfg.ias) - 1;
-	domain->geometry.force_aperture = true;
+	smmu_domain->domain.pgsize_bitmap = pgtbl_cfg.pgsize_bitmap;
+	smmu_domain->domain.geometry.aperture_end = (1UL << pgtbl_cfg.ias) - 1;
+	smmu_domain->domain.geometry.force_aperture = true;
 
-	ret = finalise_stage_fn(smmu_domain, &pgtbl_cfg);
+	ret = finalise_stage_fn(smmu, smmu_domain, &pgtbl_cfg);
 	if (ret < 0) {
 		free_io_pgtable_ops(pgtbl_ops);
 		return ret;
 	}
 
 	smmu_domain->pgtbl_ops = pgtbl_ops;
+	smmu_domain->smmu = smmu;
 	return 0;
 }
 
@@ -2573,10 +2576,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
 	mutex_lock(&smmu_domain->init_mutex);
 
 	if (!smmu_domain->smmu) {
-		smmu_domain->smmu = smmu;
-		ret = arm_smmu_domain_finalise(domain);
-		if (ret)
-			smmu_domain->smmu = NULL;
+		ret = arm_smmu_domain_finalise(smmu_domain, smmu);
 	} else if (smmu_domain->smmu != smmu)
 		ret = -EINVAL;
 
-- 
2.43.0



* [PATCH v4 16/16] iommu/arm-smmu-v3: Convert to domain_alloc_paging()
  2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
                   ` (14 preceding siblings ...)
  2024-01-25 23:57 ` [PATCH v4 15/16] iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to finalize Jason Gunthorpe
@ 2024-01-25 23:57 ` Jason Gunthorpe
  15 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 23:57 UTC (permalink / raw)
  To: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Now that the BLOCKED and IDENTITY behaviors are managed with their own
domains, change to the domain_alloc_paging() op.

For now SVA keeps using the old interface; eventually it will get its own
op that can pass in the device and mm_struct, which will let us have a
sane lifetime for the mmu_notifier.

Call arm_smmu_domain_finalise() early if dev is available.
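
A reduced sketch of the new allocation path, condensed from the diff below
(initialisation of the domain's lists and locks elided):

	static struct iommu_domain *arm_smmu_domain_alloc_paging(struct device *dev)
	{
		struct arm_smmu_domain *smmu_domain;

		smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
		if (!smmu_domain)
			return ERR_PTR(-ENOMEM);	/* this op returns ERR_PTR, not NULL */

		if (dev) {
			struct arm_smmu_master *master = dev_iommu_priv_get(dev);
			int ret;

			ret = arm_smmu_domain_finalise(smmu_domain, master->smmu);
			if (ret) {
				kfree(smmu_domain);
				return ERR_PTR(ret);
			}
		}
		return &smmu_domain->domain;
	}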

Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
Tested-by: Nicolin Chen <nicolinc@nvidia.com>
Tested-by: Moritz Fischer <moritzf@google.com>
Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 22 ++++++++++++++++-----
 1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index f4543bf0c18a49..f890abe95e57f7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2222,14 +2222,15 @@ static bool arm_smmu_capable(struct device *dev, enum iommu_cap cap)
 
 static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 {
-	struct arm_smmu_domain *smmu_domain;
 
 	if (type == IOMMU_DOMAIN_SVA)
 		return arm_smmu_sva_domain_alloc();
+	return ERR_PTR(-EOPNOTSUPP);
+}
 
-	if (type != IOMMU_DOMAIN_UNMANAGED &&
-	    type != IOMMU_DOMAIN_DMA)
-		return NULL;
+static struct iommu_domain *arm_smmu_domain_alloc_paging(struct device *dev)
+{
+	struct arm_smmu_domain *smmu_domain;
 
 	/*
 	 * Allocate the domain and initialise some of its data structures.
@@ -2238,13 +2239,23 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type)
 	 */
 	smmu_domain = kzalloc(sizeof(*smmu_domain), GFP_KERNEL);
 	if (!smmu_domain)
-		return NULL;
+		return ERR_PTR(-ENOMEM);
 
 	mutex_init(&smmu_domain->init_mutex);
 	INIT_LIST_HEAD(&smmu_domain->devices);
 	spin_lock_init(&smmu_domain->devices_lock);
 	INIT_LIST_HEAD(&smmu_domain->mmu_notifiers);
 
+	if (dev) {
+		struct arm_smmu_master *master = dev_iommu_priv_get(dev);
+		int ret;
+
+		ret = arm_smmu_domain_finalise(smmu_domain, master->smmu);
+		if (ret) {
+			kfree(smmu_domain);
+			return ERR_PTR(ret);
+		}
+	}
 	return &smmu_domain->domain;
 }
 
@@ -3108,6 +3119,7 @@ static struct iommu_ops arm_smmu_ops = {
 	.blocked_domain		= &arm_smmu_blocked_domain,
 	.capable		= arm_smmu_capable,
 	.domain_alloc		= arm_smmu_domain_alloc,
+	.domain_alloc_paging    = arm_smmu_domain_alloc_paging,
 	.probe_device		= arm_smmu_probe_device,
 	.release_device		= arm_smmu_release_device,
 	.device_group		= arm_smmu_device_group,
-- 
2.43.0



* Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
@ 2024-01-26  4:03   ` Michael Shavit
  2024-01-29 19:53   ` Moritz Fischer
  2024-01-30 22:42   ` Mostafa Saleh
  2 siblings, 0 replies; 39+ messages in thread
From: Michael Shavit @ 2024-01-26  4:03 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Nicolin Chen, patches,
	Shameer Kolothum

On Fri, Jan 26, 2024 at 7:57 AM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.
>
> The iommu core APIs would ideally expect the driver to do a hitless change
> of iommu_domain in a number of cases:
>
>  - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
>    for the RESV ranges
>
>  - PASID upgrade has IDENTITY on the RID with no PASID, then a PASID paging
>    domain installed. The RID should not be impacted
>
>  - PASID downgrade has IDENTITY on the RID and all PASIDs removed.
>    The RID should not be impacted
>
>  - RID does PAGING -> BLOCKING with active PASID, PASIDs should not be
>    impacted
>
>  - NESTING -> NESTING for carrying all the above hitless cases in a VM
>    into the hypervisor. To comprehensively emulate the HW in a VM we should
>    assume the VM OS is running logic like this and expecting hitless updates
>    to be relayed to real HW.
>
> For CD updates arm_smmu_write_ctx_desc() has a similar comment explaining
> how limited it is, and the driver does have a need for hitless CD updates:
>
>  - SMMUv3 BTM S1 ASID re-label
>
>  - SVA mm release should change the CD to answer not-present to all
>    requests without allowing logging (EPD0)
>
> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
> At the end everything that can be hitless will be hitless, including all
> of the above.
>
> Introduce arm_smmu_write_entry() which will run through the multi-qword
> programming sequence to avoid creating an incoherent 'torn' STE in the HW
> caches. It automatically detects which of two algorithms to use:
>
> 1) The disruptive V=0 update described in the spec which disrupts the
>    entry and does three syncs to make the change:
>        - Write V=0 to QWORD 0
>        - Write the entire STE except QWORD 0
>        - Write QWORD 0
>
>  2) A hitless update algorithm that follows the same rationale that the driver
>    already uses. It is safe to change IGNORED bits that HW doesn't use:
>        - Write the target value into all currently unused bits
>        - Write a single QWORD, this makes the new STE live atomically
>        - Ensure now unused bits are 0
>
> The detection of which path to use and the implementation of the hitless
> update rely on a "used bitmask" describing what bits the HW is actually
> using based on the V/CFG/etc bits. This flows from the spec language,
> typically indicated as IGNORED.
>
> Knowing which bits the HW is using we can update the bits it does not use
> and then compute how many QWORDS need to be changed. If only one qword
> needs to be updated the hitless algorithm is possible.
>
> Later patches will include CD updates in this mechanism so make the
> implementation generic using a struct arm_smmu_entry_writer and struct
> arm_smmu_entry_writer_ops to abstract the differences between STE and CD
> to be plugged in.
>
> At this point it generates the same sequence of updates as the current
> code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> extra sync (this seems to be an existing bug).
>
> Going forward this will use a V=0 transition instead of cycling through
> ABORT if a hitfull change is required. This seems more appropriate as ABORT
> will fail DMAs without any logging, but dropping a DMA due to transient
> V=0 is probably signaling a bug, so the C_BAD_STE is valuable.
>
> Signed-off-by: Michael Shavit <mshavit@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 328 ++++++++++++++++----
>  1 file changed, 261 insertions(+), 67 deletions(-)
>
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 0ffb1cf17e0b2e..690742e8f173eb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
>         ARM_SMMU_MAX_MSIS,
>  };
>
> +struct arm_smmu_entry_writer_ops;
> +struct arm_smmu_entry_writer {
> +       const struct arm_smmu_entry_writer_ops *ops;
> +       struct arm_smmu_master *master;
> +};
> +
> +struct arm_smmu_entry_writer_ops {
> +       unsigned int num_entry_qwords;
> +       __le64 v_bit;
> +       void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
> +                        __le64 *used);
> +       void (*sync)(struct arm_smmu_entry_writer *writer);
> +};
> +
> +#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
> +
>  static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
>         [EVTQ_MSI_INDEX] = {
>                 ARM_SMMU_EVTQ_IRQ_CFG0,
> @@ -971,6 +987,140 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
>         arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>
> +/*
> + * Figure out if we can do a hitless update of entry to become target. Returns a
> + * bit mask where 1 indicates that qword needs to be set disruptively.
> + * unused_update is an intermediate value of entry that has unused bits set to
> + * their new values.
> + */
> +static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
> +                                   const __le64 *entry, const __le64 *target,
> +                                   __le64 *unused_update)
> +{
> +       __le64 target_used[NUM_ENTRY_QWORDS] = {};
> +       __le64 cur_used[NUM_ENTRY_QWORDS] = {};
> +       u8 used_qword_diff = 0;
> +       unsigned int i;
> +
> +       writer->ops->get_used(writer, entry, cur_used);
> +       writer->ops->get_used(writer, target, target_used);
> +
> +       for (i = 0; i != writer->ops->num_entry_qwords; i++) {
> +               /*
> +                * Check that masks are up to date, the make functions are not
> +                * allowed to set a bit to 1 if the used function doesn't say it
> +                * is used.
> +                */
> +               WARN_ON_ONCE(target[i] & ~target_used[i]);
> +
> +               /* Bits can change because they are not currently being used */
> +               unused_update[i] = (entry[i] & cur_used[i]) |
> +                                  (target[i] & ~cur_used[i]);
> +               /*
> +                * Each bit indicates that a used bit in a qword needs to be
> +                * changed after unused_update is applied.
> +                */
> +               if ((unused_update[i] & target_used[i]) != target[i])
> +                       used_qword_diff |= 1 << i;
> +       }
> +       return used_qword_diff;
> +}
> +
> +static bool entry_set(struct arm_smmu_entry_writer *writer, __le64 *entry,
> +                     const __le64 *target, unsigned int start,
> +                     unsigned int len)
> +{
> +       bool changed = false;
> +       unsigned int i;
> +
> +       for (i = start; len != 0; len--, i++) {
> +               if (entry[i] != target[i]) {
> +                       WRITE_ONCE(entry[i], target[i]);
> +                       changed = true;
> +               }
> +       }
> +
> +       if (changed)
> +               writer->ops->sync(writer);
> +       return changed;
> +}
> +
> +/*
> + * Update the STE/CD to the target configuration. The transition from the
> + * current entry to the target entry takes place over multiple steps that
> + * attempt to make the transition hitless if possible. This function takes care
> + * not to create a situation where the HW can perceive a corrupted entry. HW is
> + * only required to have a 64 bit atomicity with stores from the CPU, while
> + * entries are many 64 bit values big.
> + *
> + * The difference between the current value and the target value is analyzed to
> + * determine which of three updates are required - disruptive, hitless or no
> + * change.
> + *
> + * In the most general disruptive case we can make any update in three steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused qwords, except qword 0 which contains V
> + *  - Make qword 0 have the final value and valid (V=1) with a single 64
> + *    bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect. We can detect this by calculating how
> + * many 64 bit values need update after adjusting the unused bits and skip the
> + * V=0 process. This relies on the IGNORED behavior described in the
> + * specification.
> + */
> +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> +                                __le64 *entry, const __le64 *target)
> +{
> +       unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
> +       __le64 unused_update[NUM_ENTRY_QWORDS];
> +       u8 used_qword_diff;
> +
> +       used_qword_diff =
> +               arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> +       if (hweight8(used_qword_diff) > 1) {
> +               /*
> +                * At least two qwords need their inuse bits to be changed. This
> +                * requires a breaking update, zero the V bit, write all qwords
> +                * but 0, then set qword 0
> +                */
> +               unused_update[0] = entry[0] & (~writer->ops->v_bit);
> +               entry_set(writer, entry, unused_update, 0, 1);
> +               entry_set(writer, entry, target, 1, num_entry_qwords - 1);
> +               entry_set(writer, entry, target, 0, 1);
> +       } else if (hweight8(used_qword_diff) == 1) {
> +               /*
> +                * Only one qword needs its used bits to be changed. This is a
> +                * hitless update, update all bits the current STE is ignoring
> +                * to their new values, then update a single "critical qword" to
> +                * change the STE and finally 0 out any bits that are now unused
> +                * in the target configuration.
> +                */
> +               unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> +
> +               /*
> +                * Skip writing unused bits in the critical qword since we'll be
> +                * writing it in the next step anyways. This can save a sync
> +                * when the only change is in that qword.
> +                */
> +               unused_update[critical_qword_index] =
> +                       entry[critical_qword_index];
> +               entry_set(writer, entry, unused_update, 0, num_entry_qwords);
> +               entry_set(writer, entry, target, critical_qword_index, 1);
> +               entry_set(writer, entry, target, 0, num_entry_qwords);
> +       } else {
> +               /*
> +                * No inuse bit changed. Sanity check that all unused bits are 0
> +                * in the entry. The target was already sanity checked by
> +                * compute_qword_diff().
> +                */
> +               WARN_ON_ONCE(
> +                       entry_set(writer, entry, target, 0, num_entry_qwords));
> +       }
> +}
> +
>  static void arm_smmu_sync_cd(struct arm_smmu_master *master,
>                              int ssid, bool leaf)
>  {
> @@ -1238,50 +1388,123 @@ arm_smmu_write_strtab_l1_desc(__le64 *dst, struct arm_smmu_strtab_l1_desc *desc)
>         WRITE_ONCE(*dst, cpu_to_le64(val));
>  }
>
> -static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
> +struct arm_smmu_ste_writer {
> +       struct arm_smmu_entry_writer writer;
> +       u32 sid;
> +};
> +
> +/*
> + * Based on the value of ent report which bits of the STE the HW will access. It
> + * would be nice if this was complete according to the spec, but minimally it
> + * has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(struct arm_smmu_entry_writer *writer,
> +                                 const __le64 *ent, __le64 *used_bits)
>  {
> +       used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
> +       if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +               return;
> +
> +       /*
> +        * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> +        * attribute/permission configuration fields for the SHCFG behavior.
> +        */
> +       if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
> +           FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
> +                   STRTAB_STE_1_S1DSS_BYPASS)
> +               used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +
> +       used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +       switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
> +       case STRTAB_STE_0_CFG_ABORT:
> +               break;
> +       case STRTAB_STE_0_CFG_BYPASS:
> +               used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +               break;
> +       case STRTAB_STE_0_CFG_S1_TRANS:
> +               used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +                                           STRTAB_STE_0_S1CTXPTR_MASK |
> +                                           STRTAB_STE_0_S1CDMAX);
> +               used_bits[1] |=
> +                       cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +                                   STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +                                   STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +               used_bits[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +               break;
> +       case STRTAB_STE_0_CFG_S2_TRANS:
> +               used_bits[1] |=
> +                       cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +               used_bits[2] |=
> +                       cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +                                   STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +                                   STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +               used_bits[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +               break;
> +
> +       default:
> +               memset(used_bits, 0xFF, sizeof(struct arm_smmu_ste));
> +               WARN_ON(true);
> +       }
> +}
> +
> +static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
> +{
> +       struct arm_smmu_ste_writer *ste_writer =
> +               container_of(writer, struct arm_smmu_ste_writer, writer);
>         struct arm_smmu_cmdq_ent cmd = {
>                 .opcode = CMDQ_OP_CFGI_STE,
>                 .cfgi   = {
> -                       .sid    = sid,
> +                       .sid    = ste_writer->sid,
>                         .leaf   = true,
>                 },
>         };
>
> -       arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> +       arm_smmu_cmdq_issue_cmd_with_sync(writer->master->smmu, &cmd);
> +}
> +
> +static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
> +       .sync = arm_smmu_ste_writer_sync_entry,
> +       .get_used = arm_smmu_get_ste_used,
> +       .v_bit = cpu_to_le64(STRTAB_STE_0_V),
> +       .num_entry_qwords = sizeof(struct arm_smmu_ste) / sizeof(u64),
> +};
> +
> +static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
> +                              struct arm_smmu_ste *ste,
> +                              const struct arm_smmu_ste *target)
> +{
> +       struct arm_smmu_device *smmu = master->smmu;
> +       struct arm_smmu_ste_writer ste_writer = {
> +               .writer = {
> +                       .ops = &arm_smmu_ste_writer_ops,
> +                       .master = master,
> +               },
> +               .sid = sid,
> +       };
> +
> +       arm_smmu_write_entry(&ste_writer.writer, ste->data, target->data);
> +
> +       /* It's likely that we'll want to use the new STE soon */
> +       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
> +               struct arm_smmu_cmdq_ent
> +                       prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
> +                                        .prefetch = {
> +                                                .sid = sid,
> +                                        } };
> +
> +               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       }
>  }
>
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                                       struct arm_smmu_ste *dst)
>  {
> -       /*
> -        * This is hideously complicated, but we only really care about
> -        * three cases at the moment:
> -        *
> -        * 1. Invalid (all zero) -> bypass/fault (init)
> -        * 2. Bypass/fault -> translation/bypass (attach)
> -        * 3. Translation/bypass -> bypass/fault (detach)
> -        *
> -        * Given that we can't update the STE atomically and the SMMU
> -        * doesn't read the thing in a defined order, that leaves us
> -        * with the following maintenance requirements:
> -        *
> -        * 1. Update Config, return (init time STEs aren't live)
> -        * 2. Write everything apart from dword 0, sync, write dword 0, sync
> -        * 3. Update Config, sync
> -        */
> -       u64 val = le64_to_cpu(dst->data[0]);
> -       bool ste_live = false;
> +       u64 val;
>         struct arm_smmu_device *smmu = master->smmu;
>         struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
>         struct arm_smmu_s2_cfg *s2_cfg = NULL;
>         struct arm_smmu_domain *smmu_domain = master->domain;
> -       struct arm_smmu_cmdq_ent prefetch_cmd = {
> -               .opcode         = CMDQ_OP_PREFETCH_CFG,
> -               .prefetch       = {
> -                       .sid    = sid,
> -               },
> -       };
> +       struct arm_smmu_ste target = {};
>
>         if (smmu_domain) {
>                 switch (smmu_domain->stage) {
> @@ -1296,22 +1519,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                 }
>         }
>
> -       if (val & STRTAB_STE_0_V) {
> -               switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
> -               case STRTAB_STE_0_CFG_BYPASS:
> -                       break;
> -               case STRTAB_STE_0_CFG_S1_TRANS:
> -               case STRTAB_STE_0_CFG_S2_TRANS:
> -                       ste_live = true;
> -                       break;
> -               case STRTAB_STE_0_CFG_ABORT:
> -                       BUG_ON(!disable_bypass);
> -                       break;
> -               default:
> -                       BUG(); /* STE corruption */
> -               }
> -       }
> -
>         /* Nuke the existing STE_0 value, as we're going to rewrite it */
>         val = STRTAB_STE_0_V;
>
> @@ -1322,16 +1529,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                 else
>                         val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
>
> -               dst->data[0] = cpu_to_le64(val);
> -               dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> +               target.data[0] = cpu_to_le64(val);
> +               target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>                                                 STRTAB_STE_1_SHCFG_INCOMING));
> -               dst->data[2] = 0; /* Nuke the VMID */
> -               /*
> -                * The SMMU can perform negative caching, so we must sync
> -                * the STE regardless of whether the old value was live.
> -                */
> -               if (smmu)
> -                       arm_smmu_sync_ste_for_sid(smmu, sid);
> +               target.data[2] = 0; /* Nuke the VMID */
> +               arm_smmu_write_ste(master, sid, dst, &target);
>                 return;
>         }
>
> @@ -1339,8 +1541,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                 u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>                         STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
>
> -               BUG_ON(ste_live);
> -               dst->data[1] = cpu_to_le64(
> +               target.data[1] = cpu_to_le64(
>                          FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
>                          FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>                          FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1349,7 +1550,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>
>                 if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>                     !master->stall_enabled)
> -                       dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> +                       target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>
>                 val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
>                         FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> @@ -1358,8 +1559,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>         }
>
>         if (s2_cfg) {
> -               BUG_ON(ste_live);
> -               dst->data[2] = cpu_to_le64(
> +               target.data[2] = cpu_to_le64(
>                          FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
>                          FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
>  #ifdef __BIG_ENDIAN
> @@ -1368,23 +1568,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>                          STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>                          STRTAB_STE_2_S2R);
>
> -               dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +               target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>
>                 val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>         }
>
>         if (master->ats_enabled)
> -               dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +               target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>                                                  STRTAB_STE_1_EATS_TRANS));
>
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -       /* See comment in arm_smmu_write_ctx_desc() */
> -       WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -       arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -       /* It's likely that we'll want to use the new STE soon */
> -       if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -               arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +       target.data[0] = cpu_to_le64(val);
> +       arm_smmu_write_ste(master, sid, dst, &target);
>  }
>
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
> --
> 2.43.0
>
Reviewed-by: Michael Shavit <mshavit@google.com>

* RE: [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste()
  2024-01-25 23:57 ` [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste() Jason Gunthorpe
@ 2024-01-29 15:07   ` Shameerali Kolothum Thodi
  2024-01-29 15:43     ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-01-29 15:07 UTC (permalink / raw)
  To: Jason Gunthorpe, iommu, Joerg Roedel, linux-arm-kernel,
	Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen, patches



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 25, 2024 11:57 PM
> To: iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; linux-arm-
> kernel@lists.infradead.org; Robin Murphy <robin.murphy@arm.com>; Will
> Deacon <will@kernel.org>
> Cc: Moritz Fischer <mdf@kernel.org>; Moritz Fischer <moritzf@google.com>;
> Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Subject: [PATCH v4 03/16] iommu/arm-smmu-v3: Move
> arm_smmu_rmr_install_bypass_ste()
> 
> Logically arm_smmu_init_strtab_linear() is the function that allocates and
> populates the stream table with the initial value of the STEs. After this
> function returns the stream table should be fully ready.
> 
> arm_smmu_rmr_install_bypass_ste() adjusts the initial stream table to force
> any SIDs that the FW says have IOMMU_RESV_DIRECT to use bypass. This
> ensures there is no disruption to the identity mapping during boot.
> 
> Put arm_smmu_rmr_install_bypass_ste() into arm_smmu_init_strtab_linear(),
> it already executes immediately after arm_smmu_init_strtab_linear().
> 
> No functional change intended.

I think this actually changes the behavior and will cause a regression, as we
now install RMR SIDs only for the linear stream table, not for SMMUv3
configurations that use a 2-level stream table.

Please check.

Thanks,
Shameer

> 
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 38bcb4ed1fccc1..df8fc7b87a7907 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -102,6 +102,8 @@ static struct arm_smmu_option_prop
> arm_smmu_options[] = {
>  	{ 0, NULL},
>  };
> 
> +static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device
> *smmu);
> +
>  static void parse_driver_options(struct arm_smmu_device *smmu)
>  {
>  	int i = 0;
> @@ -3250,6 +3252,9 @@ static int arm_smmu_init_strtab_linear(struct
> arm_smmu_device *smmu)
>  	cfg->strtab_base_cfg = reg;
> 
>  	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
> +
> +	/* Check for RMRs and install bypass STEs if any */
> +	arm_smmu_rmr_install_bypass_ste(smmu);
>  	return 0;
>  }
> 
> @@ -4063,9 +4068,6 @@ static int arm_smmu_device_probe(struct
> platform_device *pdev)
>  	/* Record our private device structure */
>  	platform_set_drvdata(pdev, smmu);
> 
> -	/* Check for RMRs and install bypass STEs if any */
> -	arm_smmu_rmr_install_bypass_ste(smmu);
> -
>  	/* Reset the device */
>  	ret = arm_smmu_device_reset(smmu, bypass);
>  	if (ret)
> --
> 2.43.0


* Re: [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste()
  2024-01-29 15:07   ` Shameerali Kolothum Thodi
@ 2024-01-29 15:43     ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-29 15:43 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches

On Mon, Jan 29, 2024 at 03:07:21PM +0000, Shameerali Kolothum Thodi wrote:

> > Logically arm_smmu_init_strtab_linear() is the function that allocates and
> > populates the stream table with the initial value of the STEs. After this
> > function returns the stream table should be fully ready.
> > 
> > arm_smmu_rmr_install_bypass_ste() adjusts the initial stream table to force
> > any SIDs that the FW says have IOMMU_RESV_DIRECT to use bypass. This
> > ensures there is no disruption to the identity mapping during boot.
> > 
> > Put arm_smmu_rmr_install_bypass_ste() into arm_smmu_init_strtab_linear(),
> > it already executes immediately after arm_smmu_init_strtab_linear().
> > 
> > No functional change intended.
> 
> I think this actually changes the behavior and will cause a regression, as we
> now install RMR SIDs only for the linear stream table, not for SMMUv3
> configurations that use a 2-level stream table.

Oh you are right, it should be in arm_smmu_init_strtab()
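
Something like this (untested sketch, the rest of arm_smmu_init_strtab()'s
setup elided) so both the linear and 2-level paths get it:

static int arm_smmu_init_strtab(struct arm_smmu_device *smmu)
{
	int ret;

	if (smmu->features & ARM_SMMU_FEAT_2_LVL_STRTAB)
		ret = arm_smmu_init_strtab_2lvl(smmu);
	else
		ret = arm_smmu_init_strtab_linear(smmu);
	if (ret)
		return ret;

	/* ... existing strtab_base/vmid setup stays as is ... */

	/* Check for RMRs and install bypass STEs if any */
	arm_smmu_rmr_install_bypass_ste(smmu);
	return 0;
}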

Thanks!
Jason

* RE: [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2024-01-25 23:57 ` [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain Jason Gunthorpe
@ 2024-01-29 18:11   ` Shameerali Kolothum Thodi
  2024-01-29 18:37     ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-01-29 18:11 UTC (permalink / raw)
  To: Jason Gunthorpe, iommu, Joerg Roedel, linux-arm-kernel,
	Robin Murphy, Will Deacon
  Cc: Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen, patches



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Thursday, January 25, 2024 11:57 PM
> To: iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; linux-arm-
> kernel@lists.infradead.org; Robin Murphy <robin.murphy@arm.com>; Will
> Deacon <will@kernel.org>
> Cc: Moritz Fischer <mdf@kernel.org>; Moritz Fischer <moritzf@google.com>;
> Michael Shavit <mshavit@google.com>; Nicolin Chen <nicolinc@nvidia.com>;
> patches@lists.linux.dev; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>
> Subject: [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static
> IDENTITY domain
> 
> Move to the new static global for identity domains. Move all the logic out
> of arm_smmu_attach_dev into an identity only function.
> 
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 82 +++++++++++++++--
> ----
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 -
>  2 files changed, 58 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index f08cfa9b90b3eb..d35bf9655c9b1b 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2226,8 +2226,7 @@ static struct iommu_domain
> *arm_smmu_domain_alloc(unsigned type)
>  		return arm_smmu_sva_domain_alloc();
> 
>  	if (type != IOMMU_DOMAIN_UNMANAGED &&
> -	    type != IOMMU_DOMAIN_DMA &&
> -	    type != IOMMU_DOMAIN_IDENTITY)
> +	    type != IOMMU_DOMAIN_DMA)
>  		return NULL;
> 
>  	/*
> @@ -2335,11 +2334,6 @@ static int arm_smmu_domain_finalise(struct
> iommu_domain *domain)
>  	struct arm_smmu_domain *smmu_domain =
> to_smmu_domain(domain);
>  	struct arm_smmu_device *smmu = smmu_domain->smmu;
> 
> -	if (domain->type == IOMMU_DOMAIN_IDENTITY) {
> -		smmu_domain->stage = ARM_SMMU_DOMAIN_BYPASS;
> -		return 0;
> -	}
> -
>  	/* Restrict the stage to what we can actually support */
>  	if (!(smmu->features & ARM_SMMU_FEAT_TRANS_S1))
>  		smmu_domain->stage = ARM_SMMU_DOMAIN_S2;
> @@ -2537,7 +2531,7 @@ static void arm_smmu_detach_dev(struct
> arm_smmu_master *master)
>  	struct arm_smmu_domain *smmu_domain;
>  	unsigned long flags;
> 
> -	if (!domain)
> +	if (!domain || !(domain->type & __IOMMU_DOMAIN_PAGING))
>  		return;
> 
>  	smmu_domain = to_smmu_domain(domain);
> @@ -2600,15 +2594,7 @@ static int arm_smmu_attach_dev(struct
> iommu_domain *domain, struct device *dev)
> 
>  	arm_smmu_detach_dev(master);
> 
> -	/*
> -	 * The SMMU does not support enabling ATS with bypass. When the
> STE is
> -	 * in bypass (STE.Config[2:0] == 0b100), ATS Translation Requests and
> -	 * Translated transactions are denied as though ATS is disabled for
> the
> -	 * stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
> -	 * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
> -	 */
> -	if (smmu_domain->stage != ARM_SMMU_DOMAIN_BYPASS)
> -		master->ats_enabled = arm_smmu_ats_supported(master);
> +	master->ats_enabled = arm_smmu_ats_supported(master);
> 
>  	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>  	list_add(&master->domain_head, &smmu_domain->devices);
> @@ -2645,13 +2631,6 @@ static int arm_smmu_attach_dev(struct
> iommu_domain *domain, struct device *dev)
>  			arm_smmu_write_ctx_desc(master,
> IOMMU_NO_PASID,
>  						      NULL);
>  		break;
> -	case ARM_SMMU_DOMAIN_BYPASS:
> -		arm_smmu_make_bypass_ste(&target);
> -		arm_smmu_install_ste_for_dev(master, &target);
> -		if (master->cd_table.cdtab)
> -			arm_smmu_write_ctx_desc(master,
> IOMMU_NO_PASID,
> -						      NULL);
> -		break;
>  	}
> 
>  	arm_smmu_enable_ats(master, smmu_domain);
> @@ -2667,6 +2646,60 @@ static int arm_smmu_attach_dev(struct
> iommu_domain *domain, struct device *dev)
>  	return ret;
>  }
> 
> +static int arm_smmu_attach_dev_ste(struct device *dev,
> +				   struct arm_smmu_ste *ste)
> +{
> +	struct arm_smmu_master *master = dev_iommu_priv_get(dev);
> +
> +	if (arm_smmu_master_sva_enabled(master))
> +		return -EBUSY;
> +
> +	/*
> +	 * Do not allow any ASID to be changed while are working on the STE,
> +	 * otherwise we could miss invalidations.
> +	 */
> +	mutex_lock(&arm_smmu_asid_lock);
> +
> +	/*
> +	 * The SMMU does not support enabling ATS with bypass/abort.
> When the
> +	 * STE is in bypass (STE.Config[2:0] == 0b100), ATS Translation
> Requests
> +	 * and Translated transactions are denied as though ATS is disabled
> for
> +	 * the stream (STE.EATS == 0b00), causing F_BAD_ATS_TREQ and
> +	 * F_TRANSL_FORBIDDEN events (IHI0070Ea 5.2 Stream Table Entry).
> +	 */
> +	arm_smmu_detach_dev(master);
> +
> +	arm_smmu_install_ste_for_dev(master, ste);
> +	mutex_unlock(&arm_smmu_asid_lock);
> +
> +	/*
> +	 * This has to be done after removing the master from the
> +	 * arm_smmu_domain->devices to avoid races updating the same
> context
> +	 * descriptor from arm_smmu_share_asid().
> +	 */
> +	if (master->cd_table.cdtab)
> +		arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID,
> NULL);
> +	return 0;
> +}
> +
> +static int arm_smmu_attach_dev_identity(struct iommu_domain *domain,
> +					struct device *dev)
> +{
> +	struct arm_smmu_ste ste;
> +
> +	arm_smmu_make_bypass_ste(&ste);
> +	return arm_smmu_attach_dev_ste(dev, &ste);
> +}
> +
> +static const struct iommu_domain_ops arm_smmu_identity_ops = {
> +	.attach_dev = arm_smmu_attach_dev_identity,
> +};
> +
> +static struct iommu_domain arm_smmu_identity_domain = {
> +	.type = IOMMU_DOMAIN_IDENTITY,
> +	.ops = &arm_smmu_identity_ops,
> +};
> +
>  static int arm_smmu_map_pages(struct iommu_domain *domain, unsigned
> long iova,
>  			      phys_addr_t paddr, size_t pgsize, size_t pgcount,
>  			      int prot, gfp_t gfp, size_t *mapped)
> @@ -3056,6 +3089,7 @@ static void arm_smmu_remove_dev_pasid(struct
> device *dev, ioasid_t pasid)
>  }
> 
>  static struct iommu_ops arm_smmu_ops = {
> +	.identity_domain	= &arm_smmu_identity_domain,

This seems to create a problem when we have set the identity domain and
try to enable SVA for the device. Since there is no smmu_domain in this case
and there is no specific domain type check in the iommu_sva_bind_device() path,
it eventually crashes (hangs in my test) in:

iommu_sva_bind_device()
   ...
      arm_smmu_sva_set_dev_pasid()
        __arm_smmu_sva_bind()
           arm_smmu_mmu_notifier_get(smmu_domain, ..)  --> never exits the mmu notifier list loop.

I think we should check the domain type in iommu_sva_bind_device(), or later but
before trying to use smmu_domain.  At present (i.e. without this series) it returns an
error while we are trying to write the CD. But that looks too late as well.
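
Just to illustrate the kind of check I mean (untested, and where exactly it
should live is debatable), e.g. early in the SVA bind path in the driver:

	struct arm_smmu_master *master = dev_iommu_priv_get(dev);

	/* SVA is only usable when the RID is attached to an S1 paging domain */
	if (!master->domain || master->domain->stage != ARM_SMMU_DOMAIN_S1)
		return -EINVAL;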

Thanks,
Shameer



* Re: [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2024-01-29 18:11   ` Shameerali Kolothum Thodi
@ 2024-01-29 18:37     ` Jason Gunthorpe
  2024-01-30  8:35       ` Shameerali Kolothum Thodi
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-29 18:37 UTC (permalink / raw)
  To: Shameerali Kolothum Thodi
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches

On Mon, Jan 29, 2024 at 06:11:48PM +0000, Shameerali Kolothum Thodi wrote:

> > @@ -3056,6 +3089,7 @@ static void arm_smmu_remove_dev_pasid(struct
> > device *dev, ioasid_t pasid)
> >  }
> > 
> >  static struct iommu_ops arm_smmu_ops = {
> > +	.identity_domain	= &arm_smmu_identity_domain,
> 
> This seems to create a problem when we have set the identity domain and
> try to enable SVA for the device. Since there is no smmu_domain in this case
> and there is no specific domain type check in the iommu_sva_bind_device() path,
> it eventually crashes (hangs in my test) in:

Yeah, that is a longstanding issue in the SVA implementation; it only
works if the RID is set to an S1 paging domain.

I cleaned it up here so that the SVA series was clearer:

https://lore.kernel.org/linux-iommu/1-v4-e7091cdd9e8d+43b1-smmuv3_newapi_p2_jgg@nvidia.com/

> iommu_sva_bind_device()
>    ...
>       arm_smmu_sva_set_dev_pasid()
>         __arm_smmu_sva_bind()
>            arm_smmu_mmu_notifier_get(smmu_domain, ..)  --> never exit the mmu notifier list loop.
> 
> I think we should check the domain type in iommu_sva_bind_device(), or later but
> before trying to use smmu_domain.  At present (i.e. without this series) it returns an
> error while we are trying to write the CD. But that looks too late as well.

Oh wow, is that how it worked? OK, I figured it was just broken but if
there was some error code that happened indirectly then let's move
the above patch ahead of this one.

Thanks,
Jason

* Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
  2024-01-26  4:03   ` Michael Shavit
@ 2024-01-29 19:53   ` Moritz Fischer
  2024-01-30 22:42   ` Mostafa Saleh
  2 siblings, 0 replies; 39+ messages in thread
From: Moritz Fischer @ 2024-01-29 19:53 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Michael Shavit, Nicolin Chen, patches,
	Shameer Kolothum

On Thu, Jan 25, 2024 at 07:57:11PM -0400, Jason Gunthorpe wrote:
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.

> The iommu core APIs would ideally expect the driver to do a hitless change
> of iommu_domain in a number of cases:

>   - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
>     for the RESV ranges

>   - PASID upgrade has IDENTITY on the RID with no PASID then a PASID paging
>     domain installed. The RID should not be impacted

>   - PASID downgrade has IDENTITY on the RID and all PASID's removed.
>     The RID should not be impacted

>   - RID does PAGING -> BLOCKING with active PASID, PASID's should not be
>     impacted

>   - NESTING -> NESTING for carrying all the above hitless cases in a VM
>     into the hypervisor. To comprehensively emulate the HW in a VM we  
> should
>     assume the VM OS is running logic like this and expecting hitless  
> updates
>     to be relayed to real HW.

> For CD updates arm_smmu_write_ctx_desc() has a similar comment explaining
> how limited it is, and the driver does have a need for hitless CD updates:

>   - SMMUv3 BTM S1 ASID re-label

>   - SVA mm release should change the CD to answer not-present to all
>     requests without allowing logging (EPD0)

> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
> At the end everything that can be hitless will be hitless, including all
> of the above.

> Introduce arm_smmu_write_entry() which will run through the multi-qword
> programming sequence to avoid creating an incoherent 'torn' STE in the HW
> caches. It automatically detects which of two algorithms to use:

> 1) The disruptive V=0 update described in the spec which disrupts the
>     entry and does three syncs to make the change:
>         - Write V=0 to QWORD 0
>         - Write the entire STE except QWORD 0
>         - Write QWORD 0

> 2) A hitless update algorithm that follows the same rationale that the  
> driver
>     already uses. It is safe to change IGNORED bits that HW doesn't use:
>         - Write the target value into all currently unused bits
>         - Write a single QWORD, this makes the new STE live atomically
>         - Ensure now unused bits are 0

> The detection of which path to use and the implementation of the hitless
> update rely on a "used bitmask" describing what bits the HW is actually
> using based on the V/CFG/etc bits. This flows from the spec language,
> typically indicated as IGNORED.

> Knowing which bits the HW is using we can update the bits it does not use
> and then compute how many QWORDS need to be changed. If only one qword
> needs to be updated the hitless algorithm is possible.

> Later patches will include CD updates in this mechanism so make the
> implementation generic using a struct arm_smmu_entry_writer and struct
> arm_smmu_entry_writer_ops to abstract the differences between STE and CD
> to be plugged in.

> At this point it generates the same sequence of updates as the current
> code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> extra sync (this seems to be an existing bug).

> Going forward this will use a V=0 transition instead of cycling through
> ABORT if a hitfull change is required. This seems more appropriate as  
> ABORT
> will fail DMAs without any logging, but dropping a DMA due to transient
> V=0 is probably signaling a bug, so the C_BAD_STE is valuable.

> Signed-off-by: Michael Shavit <mshavit@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 328 ++++++++++++++++----
>   1 file changed, 261 insertions(+), 67 deletions(-)

> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c  
> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 0ffb1cf17e0b2e..690742e8f173eb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
>   	ARM_SMMU_MAX_MSIS,
>   };

> +struct arm_smmu_entry_writer_ops;
> +struct arm_smmu_entry_writer {
> +	const struct arm_smmu_entry_writer_ops *ops;
> +	struct arm_smmu_master *master;
> +};
> +
> +struct arm_smmu_entry_writer_ops {
> +	unsigned int num_entry_qwords;
> +	__le64 v_bit;
> +	void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64  
> *entry,
> +			 __le64 *used);
> +	void (*sync)(struct arm_smmu_entry_writer *writer);
> +};
> +
> +#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
> +
>   static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
>   	[EVTQ_MSI_INDEX] = {
>   		ARM_SMMU_EVTQ_IRQ_CFG0,
> @@ -971,6 +987,140 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device  
> *smmu, u16 asid)
>   	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>   }

> +/*
> + * Figure out if we can do a hitless update of entry to become target.  
> Returns a
> + * bit mask where 1 indicates that qword needs to be set disruptively.
> + * unused_update is an intermediate value of entry that has unused bits  
> set to
> + * their new values.
> + */
> +static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
> +				    const __le64 *entry, const __le64 *target,
> +				    __le64 *unused_update)
> +{
> +	__le64 target_used[NUM_ENTRY_QWORDS] = {};
> +	__le64 cur_used[NUM_ENTRY_QWORDS] = {};
> +	u8 used_qword_diff = 0;
> +	unsigned int i;
> +
> +	writer->ops->get_used(writer, entry, cur_used);
> +	writer->ops->get_used(writer, target, target_used);
> +
> +	for (i = 0; i != writer->ops->num_entry_qwords; i++) {
> +		/*
> +		 * Check that masks are up to date, the make functions are not
> +		 * allowed to set a bit to 1 if the used function doesn't say it
> +		 * is used.
> +		 */
> +		WARN_ON_ONCE(target[i] & ~target_used[i]);
> +
> +		/* Bits can change because they are not currently being used */
> +		unused_update[i] = (entry[i] & cur_used[i]) |
> +				   (target[i] & ~cur_used[i]);
> +		/*
> +		 * Each bit indicates that a used bit in a qword needs to be
> +		 * changed after unused_update is applied.
> +		 */
> +		if ((unused_update[i] & target_used[i]) != target[i])
> +			used_qword_diff |= 1 << i;
> +	}
> +	return used_qword_diff;
> +}
> +
> +static bool entry_set(struct arm_smmu_entry_writer *writer, __le64  
> *entry,
> +		      const __le64 *target, unsigned int start,
> +		      unsigned int len)
> +{
> +	bool changed = false;
> +	unsigned int i;
> +
> +	for (i = start; len != 0; len--, i++) {
> +		if (entry[i] != target[i]) {
> +			WRITE_ONCE(entry[i], target[i]);
> +			changed = true;
> +		}
> +	}
> +
> +	if (changed)
> +		writer->ops->sync(writer);
> +	return changed;
> +}
> +
> +/*
> + * Update the STE/CD to the target configuration. The transition from the
> + * current entry to the target entry takes place over multiple steps that
> + * attempts to make the transition hitless if possible. This function  
> takes care
> + * not to create a situation where the HW can perceive a corrupted  
> entry. HW is
> + * only required to have a 64 bit atomicity with stores from the CPU,  
> while
> + * entries are many 64 bit values big.
> + *
> + * The difference between the current value and the target value is  
> analyzed to
> + * determine which of three updates are required - disruptive, hitless  
> or no
> + * change.
> + *
> + * In the most general disruptive case we can make any update in three  
> steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused qwords, except qword 0 which contains V
> + *  - Make qword 0 have the final value and valid (V=1) with a single 64
> + *    bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing  
> the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc)  
> or
> + * because the used bits don't intersect. We can detect this by  
> calculating how
> + * many 64 bit values need update after adjusting the unused bits and  
> skip the
> + * V=0 process. This relies on the IGNORED behavior described in the
> + * specification.
> + */
> +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> +				 __le64 *entry, const __le64 *target)
> +{
> +	unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
> +	__le64 unused_update[NUM_ENTRY_QWORDS];
> +	u8 used_qword_diff;
> +
> +	used_qword_diff =
> +		arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> +	if (hweight8(used_qword_diff) > 1) {
> +		/*
> +		 * At least two qwords need their inuse bits to be changed. This
> +		 * requires a breaking update, zero the V bit, write all qwords
> +		 * but 0, then set qword 0
> +		 */
> +		unused_update[0] = entry[0] & (~writer->ops->v_bit);
> +		entry_set(writer, entry, unused_update, 0, 1);
> +		entry_set(writer, entry, target, 1, num_entry_qwords - 1);
> +		entry_set(writer, entry, target, 0, 1);
> +	} else if (hweight8(used_qword_diff) == 1) {
> +		/*
> +		 * Only one qword needs its used bits to be changed. This is a
> +		 * hitless update, update all bits the current STE is ignoring
> +		 * to their new values, then update a single "critical qword" to
> +		 * change the STE and finally 0 out any bits that are now unused
> +		 * in the target configuration.
> +		 */
> +		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> +
> +		/*
> +		 * Skip writing unused bits in the critical qword since we'll be
> +		 * writing it in the next step anyways. This can save a sync
> +		 * when the only change is in that qword.
> +		 */
> +		unused_update[critical_qword_index] =
> +			entry[critical_qword_index];
> +		entry_set(writer, entry, unused_update, 0, num_entry_qwords);
> +		entry_set(writer, entry, target, critical_qword_index, 1);
> +		entry_set(writer, entry, target, 0, num_entry_qwords);
> +	} else {
> +		/*
> +		 * No inuse bit changed. Sanity check that all unused bits are 0
> +		 * in the entry. The target was already sanity checked by
> +		 * compute_qword_diff().
> +		 */
> +		WARN_ON_ONCE(
> +			entry_set(writer, entry, target, 0, num_entry_qwords));
> +	}
> +}
> +
>   static void arm_smmu_sync_cd(struct arm_smmu_master *master,
>   			     int ssid, bool leaf)
>   {
> @@ -1238,50 +1388,123 @@ arm_smmu_write_strtab_l1_desc(__le64 *dst,  
> struct arm_smmu_strtab_l1_desc *desc)
>   	WRITE_ONCE(*dst, cpu_to_le64(val));
>   }

> -static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32  
> sid)
> +struct arm_smmu_ste_writer {
> +	struct arm_smmu_entry_writer writer;
> +	u32 sid;
> +};
> +
> +/*
> + * Based on the value of ent report which bits of the STE the HW will  
> access. It
> + * would be nice if this was complete according to the spec, but  
> minimally it
> + * has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(struct arm_smmu_entry_writer *writer,
> +				  const __le64 *ent, __le64 *used_bits)
>   {
> +	used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
> +	if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +		return;
> +
> +	/*
> +	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> +	 * attribute/permission configuration fields for the SHCFG behavior.
> +	 */
> +	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
> +	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
> +		    STRTAB_STE_1_S1DSS_BYPASS)
> +		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +
> +	used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
> +	case STRTAB_STE_0_CFG_ABORT:
> +		break;
> +	case STRTAB_STE_0_CFG_BYPASS:
> +		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +		break;
> +	case STRTAB_STE_0_CFG_S1_TRANS:
> +		used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +					    STRTAB_STE_0_S1CTXPTR_MASK |
> +					    STRTAB_STE_0_S1CDMAX);
> +		used_bits[1] |=
> +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +		break;
> +	case STRTAB_STE_0_CFG_S2_TRANS:
> +		used_bits[1] |=
> +			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +		used_bits[2] |=
> +			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +		used_bits[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +		break;
> +
> +	default:
> +		memset(used_bits, 0xFF, sizeof(struct arm_smmu_ste));
> +		WARN_ON(true);
> +	}
> +}
> +
> +static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer  
> *writer)
> +{
> +	struct arm_smmu_ste_writer *ste_writer =
> +		container_of(writer, struct arm_smmu_ste_writer, writer);
>   	struct arm_smmu_cmdq_ent cmd = {
>   		.opcode	= CMDQ_OP_CFGI_STE,
>   		.cfgi	= {
> -			.sid	= sid,
> +			.sid	= ste_writer->sid,
>   			.leaf	= true,
>   		},
>   	};

> -	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> +	arm_smmu_cmdq_issue_cmd_with_sync(writer->master->smmu, &cmd);
> +}
> +
> +static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
> +	.sync = arm_smmu_ste_writer_sync_entry,
> +	.get_used = arm_smmu_get_ste_used,
> +	.v_bit = cpu_to_le64(STRTAB_STE_0_V),
> +	.num_entry_qwords = sizeof(struct arm_smmu_ste) / sizeof(u64),
> +};
> +
> +static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
> +			       struct arm_smmu_ste *ste,
> +			       const struct arm_smmu_ste *target)
> +{
> +	struct arm_smmu_device *smmu = master->smmu;
> +	struct arm_smmu_ste_writer ste_writer = {
> +		.writer = {
> +			.ops = &arm_smmu_ste_writer_ops,
> +			.master = master,
> +		},
> +		.sid = sid,
> +	};
> +
> +	arm_smmu_write_entry(&ste_writer.writer, ste->data, target->data);
> +
> +	/* It's likely that we'll want to use the new STE soon */
> +	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
> +		struct arm_smmu_cmdq_ent
> +			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
> +					 .prefetch = {
> +						 .sid = sid,
> +					 } };
> +
> +		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +	}
>   }

>   static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master,  
> u32 sid,
>   				      struct arm_smmu_ste *dst)
>   {
> -	/*
> -	 * This is hideously complicated, but we only really care about
> -	 * three cases at the moment:
> -	 *
> -	 * 1. Invalid (all zero) -> bypass/fault (init)
> -	 * 2. Bypass/fault -> translation/bypass (attach)
> -	 * 3. Translation/bypass -> bypass/fault (detach)
> -	 *
> -	 * Given that we can't update the STE atomically and the SMMU
> -	 * doesn't read the thing in a defined order, that leaves us
> -	 * with the following maintenance requirements:
> -	 *
> -	 * 1. Update Config, return (init time STEs aren't live)
> -	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
> -	 * 3. Update Config, sync
> -	 */
> -	u64 val = le64_to_cpu(dst->data[0]);
> -	bool ste_live = false;
> +	u64 val;
>   	struct arm_smmu_device *smmu = master->smmu;
>   	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
>   	struct arm_smmu_s2_cfg *s2_cfg = NULL;
>   	struct arm_smmu_domain *smmu_domain = master->domain;
> -	struct arm_smmu_cmdq_ent prefetch_cmd = {
> -		.opcode		= CMDQ_OP_PREFETCH_CFG,
> -		.prefetch	= {
> -			.sid	= sid,
> -		},
> -	};
> +	struct arm_smmu_ste target = {};

>   	if (smmu_domain) {
>   		switch (smmu_domain->stage) {
> @@ -1296,22 +1519,6 @@ static void arm_smmu_write_strtab_ent(struct  
> arm_smmu_master *master, u32 sid,
>   		}
>   	}

> -	if (val & STRTAB_STE_0_V) {
> -		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
> -		case STRTAB_STE_0_CFG_BYPASS:
> -			break;
> -		case STRTAB_STE_0_CFG_S1_TRANS:
> -		case STRTAB_STE_0_CFG_S2_TRANS:
> -			ste_live = true;
> -			break;
> -		case STRTAB_STE_0_CFG_ABORT:
> -			BUG_ON(!disable_bypass);
> -			break;
> -		default:
> -			BUG(); /* STE corruption */
> -		}
> -	}
> -
>   	/* Nuke the existing STE_0 value, as we're going to rewrite it */
>   	val = STRTAB_STE_0_V;

> @@ -1322,16 +1529,11 @@ static void arm_smmu_write_strtab_ent(struct  
> arm_smmu_master *master, u32 sid,
>   		else
>   			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);

> -		dst->data[0] = cpu_to_le64(val);
> -		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> +		target.data[0] = cpu_to_le64(val);
> +		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>   						STRTAB_STE_1_SHCFG_INCOMING));
> -		dst->data[2] = 0; /* Nuke the VMID */
> -		/*
> -		 * The SMMU can perform negative caching, so we must sync
> -		 * the STE regardless of whether the old value was live.
> -		 */
> -		if (smmu)
> -			arm_smmu_sync_ste_for_sid(smmu, sid);
> +		target.data[2] = 0; /* Nuke the VMID */
> +		arm_smmu_write_ste(master, sid, dst, &target);
>   		return;
>   	}

> @@ -1339,8 +1541,7 @@ static void arm_smmu_write_strtab_ent(struct  
> arm_smmu_master *master, u32 sid,
>   		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>   			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;

> -		BUG_ON(ste_live);
> -		dst->data[1] = cpu_to_le64(
> +		target.data[1] = cpu_to_le64(
>   			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
>   			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>   			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1349,7 +1550,7 @@ static void arm_smmu_write_strtab_ent(struct  
> arm_smmu_master *master, u32 sid,

>   		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>   		    !master->stall_enabled)
> -			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> +			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);

>   		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
>   			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> @@ -1358,8 +1559,7 @@ static void arm_smmu_write_strtab_ent(struct  
> arm_smmu_master *master, u32 sid,
>   	}

>   	if (s2_cfg) {
> -		BUG_ON(ste_live);
> -		dst->data[2] = cpu_to_le64(
> +		target.data[2] = cpu_to_le64(
>   			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
>   			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
>   #ifdef __BIG_ENDIAN
> @@ -1368,23 +1568,17 @@ static void arm_smmu_write_strtab_ent(struct  
> arm_smmu_master *master, u32 sid,
>   			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>   			 STRTAB_STE_2_S2R);

> -		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);

>   		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>   	}

>   	if (master->ats_enabled)
> -		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>   						 STRTAB_STE_1_EATS_TRANS));

> -	arm_smmu_sync_ste_for_sid(smmu, sid);
> -	/* See comment in arm_smmu_write_ctx_desc() */
> -	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -	arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -	/* It's likely that we'll want to use the new STE soon */
> -	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +	target.data[0] = cpu_to_le64(val);
> +	arm_smmu_write_ste(master, sid, dst, &target);
>   }

>   static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
> --
> 2.43.0

Reviewed-by: Moritz Fischer <moritzf@google.com>

* RE: [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain
  2024-01-29 18:37     ` Jason Gunthorpe
@ 2024-01-30  8:35       ` Shameerali Kolothum Thodi
  0 siblings, 0 replies; 39+ messages in thread
From: Shameerali Kolothum Thodi @ 2024-01-30  8:35 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches



> -----Original Message-----
> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Monday, January 29, 2024 6:38 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>
> Cc: iommu@lists.linux.dev; Joerg Roedel <joro@8bytes.org>; linux-arm-
> kernel@lists.infradead.org; Robin Murphy <robin.murphy@arm.com>; Will
> Deacon <will@kernel.org>; Moritz Fischer <mdf@kernel.org>; Moritz Fischer
> <moritzf@google.com>; Michael Shavit <mshavit@google.com>; Nicolin Chen
> <nicolinc@nvidia.com>; patches@lists.linux.dev
> Subject: Re: [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static
> IDENTITY domain
> 
> On Mon, Jan 29, 2024 at 06:11:48PM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> > > @@ -3056,6 +3089,7 @@ static void
> arm_smmu_remove_dev_pasid(struct
> > > device *dev, ioasid_t pasid)
> > >  }
> > >
> > >  static struct iommu_ops arm_smmu_ops = {
> > > +	.identity_domain	= &arm_smmu_identity_domain,
> >
> > This seems to create a problem when we have set the identity domain and
> > try to enable sva for the device. Since there is no smmu_domain for this
> case
> > and there is no specific domain type checking in iommu_sva_bind_device()
> path,
> > it eventually crashes(hangs in my test) in,
> 
> Yeah, that is a longstanding issue in the SVA implementation; it only
> works if the RID is set to an S1 paging domain.
> 
> I cleaned it up here so that the SVA series was clearer:
> 
> https://lore.kernel.org/linux-iommu/1-v4-e7091cdd9e8d+43b1-
> smmuv3_newapi_p2_jgg@nvidia.com/

Yes, this will do. But I think it is not complete. I will comment on that one.

> 
> > iommu_sva_bind_device()
> >    ...
> >       arm_smmu_sva_set_dev_pasid()
> >         __arm_smmu_sva_bind()
> >            arm_smmu_mmu_notifier_get(smmu_domain, ..)  --> never exit the
> mmu notifier list loop.
> >
> > I think we should check for the domain type in iommu_sva_bind_device()
> or later
> > before trying to use smmu_domain.  At present(ie, without this series) it
> returns error
> > while we are trying to write the CD. But that looks too late as well.
> 
> Oh wow, is that how it worked? OK, I figured it was just broken but if
> there was some error code that happened indirectly then let's move
> the above patch ahead of this one.

Yes, indirectly indeed :)
https://elixir.bootlin.com/linux/v6.8-rc2/source/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c#L1068

Thanks,
Shameer

* Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
  2024-01-26  4:03   ` Michael Shavit
  2024-01-29 19:53   ` Moritz Fischer
@ 2024-01-30 22:42   ` Mostafa Saleh
  2024-01-30 23:56     ` Jason Gunthorpe
  2 siblings, 1 reply; 39+ messages in thread
From: Mostafa Saleh @ 2024-01-30 22:42 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Hi Jason,

Besides the comment on v3 about the VMID:

On Thu, Jan 25, 2024 at 07:57:11PM -0400, Jason Gunthorpe wrote:
> As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> been limited to only work correctly in certain scenarios that the caller
> must ensure. Generally the caller must put the STE into ABORT or BYPASS
> before attempting to program it to something else.
> 
> The iommu core APIs would ideally expect the driver to do a hitless change
> of iommu_domain in a number of cases:
> 
>  - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
>    for the RESV ranges
> 
>  - PASID upgrade has IDENTITY on the RID with no PASID then a PASID paging
>    domain installed. The RID should not be impacted
> 
>  - PASID downgrade has IDENTITY on the RID and all PASID's removed.
>    The RID should not be impacted
> 
>  - RID does PAGING -> BLOCKING with active PASID, PASID's should not be
>    impacted
> 
>  - NESTING -> NESTING for carrying all the above hitless cases in a VM
>    into the hypervisor. To comprehensively emulate the HW in a VM we should
>    assume the VM OS is running logic like this and expecting hitless updates
>    to be relayed to real HW.

From my understanding, some of these cases are not implemented (at this point).
However, from what I see, most of these cases are related to switching from/to
identity, which the current driver would have to block in between. Is my
understanding correct?

As for NESTING -> NESTING, how is that achieved? (and why?)
AFAICT, VFIO will do BLOCKING in between any transition, and that domain
should never change while a device is assigned to a VM.

> For CD updates arm_smmu_write_ctx_desc() has a similar comment explaining
> how limited it is, and the driver does have a need for hitless CD updates:
> 
>  - SMMUv3 BTM S1 ASID re-label
> 
>  - SVA mm release should change the CD to answer not-present to all
>    requests without allowing logging (EPD0)
> 
> The next patches/series are going to start removing some of this logic
> from the callers, and add more complex state combinations than currently.
> At the end everything that can be hitless will be hitless, including all
> of the above.
> 
> Introduce arm_smmu_write_entry() which will run through the multi-qword
> programming sequence to avoid creating an incoherent 'torn' STE in the HW
> caches. It automatically detects which of two algorithms to use:
> 
> 1) The disruptive V=0 update described in the spec which disrupts the
>    entry and does three syncs to make the change:
>        - Write V=0 to QWORD 0
>        - Write the entire STE except QWORD 0
>        - Write QWORD 0
> 
> 2) A hitless update algorithm that follows the same rationale that the driver
>    already uses. It is safe to change IGNORED bits that HW doesn't use:
>        - Write the target value into all currently unused bits
>        - Write a single QWORD, this makes the new STE live atomically
>        - Ensure now unused bits are 0
> 
> The detection of which path to use and the implementation of the hitless
> update rely on a "used bitmask" describing what bits the HW is actually
> using based on the V/CFG/etc bits. This flows from the spec language,
> typically indicated as IGNORED.
> 
> Knowing which bits the HW is using we can update the bits it does not use
> and then compute how many QWORDS need to be changed. If only one qword
> needs to be updated the hitless algorithm is possible.
> 
> Later patches will include CD updates in this mechanism so make the
> implementation generic using a struct arm_smmu_entry_writer and struct
> arm_smmu_entry_writer_ops to abstract the differences between STE and CD
> to be plugged in.
> 
> At this point it generates the same sequence of updates as the current
> code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> extra sync (this seems to be an existing bug).
> 
> Going forward this will use a V=0 transition instead of cycling through
> ABORT if a hitfull change is required. This seems more appropriate as ABORT
> will fail DMAs without any logging, but dropping a DMA due to transient
> V=0 is probably signaling a bug, so the C_BAD_STE is valuable.
> 
> Signed-off-by: Michael Shavit <mshavit@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 328 ++++++++++++++++----
>  1 file changed, 261 insertions(+), 67 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 0ffb1cf17e0b2e..690742e8f173eb 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
>  	ARM_SMMU_MAX_MSIS,
>  };
>  
> +struct arm_smmu_entry_writer_ops;
> +struct arm_smmu_entry_writer {
> +	const struct arm_smmu_entry_writer_ops *ops;
> +	struct arm_smmu_master *master;

I see that only master->smmu is used; is there a reason why we have this
struct instead?
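
(i.e., would something like the below not be enough for what this patch needs,
assuming the master is only wanted by the later CD patches?)

struct arm_smmu_entry_writer {
	const struct arm_smmu_entry_writer_ops *ops;
	struct arm_smmu_device *smmu;
};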

> +};
> +
> +struct arm_smmu_entry_writer_ops {
> +	unsigned int num_entry_qwords;
> +	__le64 v_bit;
> +	void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
> +			 __le64 *used);

*writer is not used in this series; I think it would make more sense if
it's added in the patch that introduces using it.

> +	void (*sync)(struct arm_smmu_entry_writer *writer);
> +};
> +
> +#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
> +

Isn't that just STRTAB_STE_DWORDS? Also, it makes more sense not to tie
this to the struct but to the actual hardware description, which would
never change (while the struct can change).
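
(roughly:

#define NUM_ENTRY_QWORDS STRTAB_STE_DWORDS

so it stays tied to the architected STE size rather than the struct layout.)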

>  static phys_addr_t arm_smmu_msi_cfg[ARM_SMMU_MAX_MSIS][3] = {
>  	[EVTQ_MSI_INDEX] = {
>  		ARM_SMMU_EVTQ_IRQ_CFG0,
> @@ -971,6 +987,140 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
>  	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
>  }
>  
> +/*
> + * Figure out if we can do a hitless update of entry to become target. Returns a
> + * bit mask where 1 indicates that qword needs to be set disruptively.
> + * unused_update is an intermediate value of entry that has unused bits set to
> + * their new values.
> + */
> +static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
> +				    const __le64 *entry, const __le64 *target,
> +				    __le64 *unused_update)
> +{
> +	__le64 target_used[NUM_ENTRY_QWORDS] = {};
> +	__le64 cur_used[NUM_ENTRY_QWORDS] = {};
> +	u8 used_qword_diff = 0;
> +	unsigned int i;
> +
> +	writer->ops->get_used(writer, entry, cur_used);
> +	writer->ops->get_used(writer, target, target_used);
> +
> +	for (i = 0; i != writer->ops->num_entry_qwords; i++) {
> +		/*
> +		 * Check that masks are up to date, the make functions are not
> +		 * allowed to set a bit to 1 if the used function doesn't say it
> +		 * is used.
> +		 */
> +		WARN_ON_ONCE(target[i] & ~target_used[i]);
> +

I think this should be a BUG, as we don't know the consequences of such a change,
and this should never happen in a non-development kernel.

> +		/* Bits can change because they are not currently being used */
> +		unused_update[i] = (entry[i] & cur_used[i]) |
> +				   (target[i] & ~cur_used[i]);
> +		/*
> +		 * Each bit indicates that a used bit in a qword needs to be
> +		 * changed after unused_update is applied.
> +		 */
> +		if ((unused_update[i] & target_used[i]) != target[i])
> +			used_qword_diff |= 1 << i;
> +	}
> +	return used_qword_diff;
> +}
> +
> +static bool entry_set(struct arm_smmu_entry_writer *writer, __le64 *entry,
> +		      const __le64 *target, unsigned int start,
> +		      unsigned int len)
> +{
> +	bool changed = false;
> +	unsigned int i;
> +
> +	for (i = start; len != 0; len--, i++) {
> +		if (entry[i] != target[i]) {
> +			WRITE_ONCE(entry[i], target[i]);
> +			changed = true;
> +		}
> +	}
> +
> +	if (changed)
> +		writer->ops->sync(writer);
> +	return changed;
> +}
> +
> +/*
> + * Update the STE/CD to the target configuration. The transition from the
> + * current entry to the target entry takes place over multiple steps that
> + * attempts to make the transition hitless if possible. This function takes care
> + * not to create a situation where the HW can perceive a corrupted entry. HW is
> + * only required to have a 64 bit atomicity with stores from the CPU, while
> + * entries are many 64 bit values big.
> + *
> + * The difference between the current value and the target value is analyzed to
> + * determine which of three updates are required - disruptive, hitless or no
> + * change.
> + *
> + * In the most general disruptive case we can make any update in three steps:
> + *  - Disrupting the entry (V=0)
> + *  - Fill now unused qwords, except qword 0 which contains V
> + *  - Make qword 0 have the final value and valid (V=1) with a single 64
> + *    bit store
> + *
> + * However this disrupts the HW while it is happening. There are several
> + * interesting cases where a STE/CD can be updated without disturbing the HW
> + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> + * because the used bits don't intersect. We can detect this by calculating how
> + * many 64 bit values need update after adjusting the unused bits and skip the
> + * V=0 process. This relies on the IGNORED behavior described in the
> + * specification.
> + */
> +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> +				 __le64 *entry, const __le64 *target)
> +{
> +	unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
> +	__le64 unused_update[NUM_ENTRY_QWORDS];
> +	u8 used_qword_diff;
> +
> +	used_qword_diff =
> +		arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> +	if (hweight8(used_qword_diff) > 1) {
> +		/*
> +		 * At least two qwords need their inuse bits to be changed. This
> +		 * requires a breaking update, zero the V bit, write all qwords
> +		 * but 0, then set qword 0
> +		 */
> +		unused_update[0] = entry[0] & (~writer->ops->v_bit);
> +		entry_set(writer, entry, unused_update, 0, 1);
> +		entry_set(writer, entry, target, 1, num_entry_qwords - 1);
> +		entry_set(writer, entry, target, 0, 1);
> +	} else if (hweight8(used_qword_diff) == 1) {
> +		/*
> +		 * Only one qword needs its used bits to be changed. This is a
> +		 * hitless update, update all bits the current STE is ignoring
> +		 * to their new values, then update a single "critical qword" to
> +		 * change the STE and finally 0 out any bits that are now unused
> +		 * in the target configuration.
> +		 */
> +		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> +
> +		/*
> +		 * Skip writing unused bits in the critical qword since we'll be
> +		 * writing it in the next step anyways. This can save a sync
> +		 * when the only change is in that qword.
> +		 */
> +		unused_update[critical_qword_index] =
> +			entry[critical_qword_index];
> +		entry_set(writer, entry, unused_update, 0, num_entry_qwords);
> +		entry_set(writer, entry, target, critical_qword_index, 1);
> +		entry_set(writer, entry, target, 0, num_entry_qwords);

The STE is updated in 3 steps.
1) Update all bits from target (except the changed qword)
2) Update the changed qword
3) Remove the bits that are not used by the target STE.

In most cases we would issue a sync for 1) and 3) even though the hardware ignores
those updates. Is that really necessary, or am I missing something?
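
For reference, that is this call sequence in the hitless branch above, with my
reading of the entry_set() behavior as comments:

	/* 1) pre-set bits the current config ignores; entry_set() only issues
	 *    a sync here if some qword actually changed */
	entry_set(writer, entry, unused_update, 0, num_entry_qwords);
	/* 2) flip the single critical qword, making the new config live */
	entry_set(writer, entry, target, critical_qword_index, 1);
	/* 3) clear bits the new config no longer uses; again only synced if
	 *    something changed */
	entry_set(writer, entry, target, 0, num_entry_qwords);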

> +	} else {
> +		/*
> +		 * No inuse bit changed. Sanity check that all unused bits are 0
> +		 * in the entry. The target was already sanity checked by
> +		 * compute_qword_diff().
> +		 */
> +		WARN_ON_ONCE(
> +			entry_set(writer, entry, target, 0, num_entry_qwords));
> +	}
> +}
> +
>  static void arm_smmu_sync_cd(struct arm_smmu_master *master,
>  			     int ssid, bool leaf)
>  {
> @@ -1238,50 +1388,123 @@ arm_smmu_write_strtab_l1_desc(__le64 *dst, struct arm_smmu_strtab_l1_desc *desc)
>  	WRITE_ONCE(*dst, cpu_to_le64(val));
>  }
>  
> -static void arm_smmu_sync_ste_for_sid(struct arm_smmu_device *smmu, u32 sid)
> +struct arm_smmu_ste_writer {
> +	struct arm_smmu_entry_writer writer;
> +	u32 sid;
> +};
> +
> +/*
> + * Based on the value of ent report which bits of the STE the HW will access. It
> + * would be nice if this was complete according to the spec, but minimally it
> + * has to capture the bits this driver uses.
> + */
> +static void arm_smmu_get_ste_used(struct arm_smmu_entry_writer *writer,
> +				  const __le64 *ent, __le64 *used_bits)
>  {
> +	used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
> +	if (!(ent[0] & cpu_to_le64(STRTAB_STE_0_V)))
> +		return;
> +
> +	/*
> +	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> +	 * attribute/permission configuration fields for the SHCFG behavior.
> +	 */
> +	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0])) & 1 &&
> +	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent[1])) ==
> +		    STRTAB_STE_1_S1DSS_BYPASS)
> +		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +
> +	used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent[0]))) {
> +	case STRTAB_STE_0_CFG_ABORT:
> +		break;
> +	case STRTAB_STE_0_CFG_BYPASS:
> +		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> +		break;
> +	case STRTAB_STE_0_CFG_S1_TRANS:
> +		used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> +					    STRTAB_STE_0_S1CTXPTR_MASK |
> +					    STRTAB_STE_0_S1CDMAX);
> +		used_bits[1] |=
> +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> +		used_bits[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> +		break;
> +	case STRTAB_STE_0_CFG_S2_TRANS:
> +		used_bits[1] |=
> +			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> +		used_bits[2] |=
> +			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> +				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> +				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> +		used_bits[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> +		break;
> +
> +	default:
> +		memset(used_bits, 0xFF, sizeof(struct arm_smmu_ste));
> +		WARN_ON(true);
> +	}
> +}
> +
> +static void arm_smmu_ste_writer_sync_entry(struct arm_smmu_entry_writer *writer)
> +{
> +	struct arm_smmu_ste_writer *ste_writer =
> +		container_of(writer, struct arm_smmu_ste_writer, writer);
>  	struct arm_smmu_cmdq_ent cmd = {
>  		.opcode	= CMDQ_OP_CFGI_STE,
>  		.cfgi	= {
> -			.sid	= sid,
> +			.sid	= ste_writer->sid,
>  			.leaf	= true,
>  		},
>  	};
>  
> -	arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
> +	arm_smmu_cmdq_issue_cmd_with_sync(writer->master->smmu, &cmd);
> +}
> +
> +static const struct arm_smmu_entry_writer_ops arm_smmu_ste_writer_ops = {
> +	.sync = arm_smmu_ste_writer_sync_entry,
> +	.get_used = arm_smmu_get_ste_used,
> +	.v_bit = cpu_to_le64(STRTAB_STE_0_V),
> +	.num_entry_qwords = sizeof(struct arm_smmu_ste) / sizeof(u64),

Same, I think that's STRTAB_STE_DWORDS.
> +};
> +
> +static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
> +			       struct arm_smmu_ste *ste,
> +			       const struct arm_smmu_ste *target)
> +{
> +	struct arm_smmu_device *smmu = master->smmu;
> +	struct arm_smmu_ste_writer ste_writer = {
> +		.writer = {
> +			.ops = &arm_smmu_ste_writer_ops,
> +			.master = master,
> +		},
> +		.sid = sid,
> +	};
> +
> +	arm_smmu_write_entry(&ste_writer.writer, ste->data, target->data);
> +
> +	/* It's likely that we'll want to use the new STE soon */
> +	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {
> +		struct arm_smmu_cmdq_ent
> +			prefetch_cmd = { .opcode = CMDQ_OP_PREFETCH_CFG,
> +					 .prefetch = {
> +						 .sid = sid,
> +					 } };
> +
> +		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +	}
>  }
>  
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  				      struct arm_smmu_ste *dst)
>  {
> -	/*
> -	 * This is hideously complicated, but we only really care about
> -	 * three cases at the moment:
> -	 *
> -	 * 1. Invalid (all zero) -> bypass/fault (init)
> -	 * 2. Bypass/fault -> translation/bypass (attach)
> -	 * 3. Translation/bypass -> bypass/fault (detach)
> -	 *
> -	 * Given that we can't update the STE atomically and the SMMU
> -	 * doesn't read the thing in a defined order, that leaves us
> -	 * with the following maintenance requirements:
> -	 *
> -	 * 1. Update Config, return (init time STEs aren't live)
> -	 * 2. Write everything apart from dword 0, sync, write dword 0, sync
> -	 * 3. Update Config, sync
> -	 */
> -	u64 val = le64_to_cpu(dst->data[0]);
> -	bool ste_live = false;
> +	u64 val;
>  	struct arm_smmu_device *smmu = master->smmu;
>  	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
>  	struct arm_smmu_s2_cfg *s2_cfg = NULL;
>  	struct arm_smmu_domain *smmu_domain = master->domain;
> -	struct arm_smmu_cmdq_ent prefetch_cmd = {
> -		.opcode		= CMDQ_OP_PREFETCH_CFG,
> -		.prefetch	= {
> -			.sid	= sid,
> -		},
> -	};
> +	struct arm_smmu_ste target = {};
>  
>  	if (smmu_domain) {
>  		switch (smmu_domain->stage) {
> @@ -1296,22 +1519,6 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  		}
>  	}
>  
> -	if (val & STRTAB_STE_0_V) {
> -		switch (FIELD_GET(STRTAB_STE_0_CFG, val)) {
> -		case STRTAB_STE_0_CFG_BYPASS:
> -			break;
> -		case STRTAB_STE_0_CFG_S1_TRANS:
> -		case STRTAB_STE_0_CFG_S2_TRANS:
> -			ste_live = true;
> -			break;
> -		case STRTAB_STE_0_CFG_ABORT:
> -			BUG_ON(!disable_bypass);
> -			break;
> -		default:
> -			BUG(); /* STE corruption */
> -		}
> -	}
> -
>  	/* Nuke the existing STE_0 value, as we're going to rewrite it */
>  	val = STRTAB_STE_0_V;
>  
> @@ -1322,16 +1529,11 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  		else
>  			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
>  
> -		dst->data[0] = cpu_to_le64(val);
> -		dst->data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> +		target.data[0] = cpu_to_le64(val);
> +		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
>  						STRTAB_STE_1_SHCFG_INCOMING));
> -		dst->data[2] = 0; /* Nuke the VMID */
> -		/*
> -		 * The SMMU can perform negative caching, so we must sync
> -		 * the STE regardless of whether the old value was live.
> -		 */
> -		if (smmu)
> -			arm_smmu_sync_ste_for_sid(smmu, sid);
> +		target.data[2] = 0; /* Nuke the VMID */
> +		arm_smmu_write_ste(master, sid, dst, &target);
>  		return;
>  	}
>  
> @@ -1339,8 +1541,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>  			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
>  
> -		BUG_ON(ste_live);
> -		dst->data[1] = cpu_to_le64(
> +		target.data[1] = cpu_to_le64(
>  			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
>  			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
>  			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> @@ -1349,7 +1550,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  
>  		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
>  		    !master->stall_enabled)
> -			dst->data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> +			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
>  
>  		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
>  			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> @@ -1358,8 +1559,7 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  	}
>  
>  	if (s2_cfg) {
> -		BUG_ON(ste_live);
> -		dst->data[2] = cpu_to_le64(
> +		target.data[2] = cpu_to_le64(
>  			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
>  			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
>  #ifdef __BIG_ENDIAN
> @@ -1368,23 +1568,17 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
>  			 STRTAB_STE_2_S2R);
>  
> -		dst->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
>  
>  		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
>  	}
>  
>  	if (master->ats_enabled)
> -		dst->data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> +		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
>  						 STRTAB_STE_1_EATS_TRANS));
>  
> -	arm_smmu_sync_ste_for_sid(smmu, sid);
> -	/* See comment in arm_smmu_write_ctx_desc() */
> -	WRITE_ONCE(dst->data[0], cpu_to_le64(val));
> -	arm_smmu_sync_ste_for_sid(smmu, sid);
> -
> -	/* It's likely that we'll want to use the new STE soon */
> -	if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH))
> -		arm_smmu_cmdq_issue_cmd(smmu, &prefetch_cmd);
> +	target.data[0] = cpu_to_le64(val);
> +	arm_smmu_write_ste(master, sid, dst, &target);
>  }
>  
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
> -- 
> 2.43.0
>

Thanks,
Mostafa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-30 22:42   ` Mostafa Saleh
@ 2024-01-30 23:56     ` Jason Gunthorpe
  2024-01-31 14:34       ` Mostafa Saleh
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-30 23:56 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Tue, Jan 30, 2024 at 10:42:13PM +0000, Mostafa Saleh wrote:

> On Thu, Jan 25, 2024 at 07:57:11PM -0400, Jason Gunthorpe wrote:
> > As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> > been limited to only work correctly in certain scenarios that the caller
> > must ensure. Generally the caller must put the STE into ABORT or BYPASS
> > before attempting to program it to something else.
> > 
> > The iommu core APIs would ideally expect the driver to do a hitless change
> > of iommu_domain in a number of cases:
> > 
> >  - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
> >    for the RESV ranges
> > 
> >  - PASID upgrade has IDENTIY on the RID with no PASID then a PASID paging
> >    domain installed. The RID should not be impacted
> > 
> >  - PASID downgrade has IDENTIY on the RID and all PASID's removed.
> >    The RID should not be impacted
> > 
> >  - RID does PAGING -> BLOCKING with active PASID, PASID's should not be
> >    impacted
> > 
> >  - NESTING -> NESTING for carrying all the above hitless cases in a VM
> >    into the hypervisor. To comprehensively emulate the HW in a VM we should
> >    assume the VM OS is running logic like this and expecting hitless updates
> >    to be relayed to real HW.
> 
> From my understanding, some of these cases are not implemented (at this point).
> However, from what I see, most of these cases are related to switching from/to
> identity, which the current driver would have to block in between, is my
> understanding correct?

Basically

> As for NESTING -> NESTING,  how is that achieved? (and why?)

Through iommufd and it is necessary to reflect hitless transition from
the VM to the real HW. See VFIO_DEVICE_ATTACH_IOMMUFD_PT

> AFAICT, VFIO will do BLOCKING in between any transition, and that domain
> should never change while the a device is assigned to a VM.

It ultimately calls iommufd_device_replace() which avoids that. Old
vfio type1 users will force a blocking, but type1 will never support
nesting so it isn't relevant.
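
As a rough userspace illustration (assuming the device is opened via the
vfio cdev/iommufd flow and that new_hwpt_id came from a prior
IOMMU_HWPT_ALLOC), re-attaching an already-attached device goes through the
replace path instead of detach + attach:

#include <err.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

static void replace_hwpt(int device_fd, __u32 new_hwpt_id)
{
	struct vfio_device_attach_iommufd_pt attach = {
		.argsz = sizeof(attach),
		.pt_id = new_hwpt_id,
	};

	/* While already attached, this ends up in iommufd_device_replace(),
	 * so the device is never parked in a BLOCKED domain in between. */
	if (ioctl(device_fd, VFIO_DEVICE_ATTACH_IOMMUFD_PT, &attach))
		err(1, "VFIO_DEVICE_ATTACH_IOMMUFD_PT");
}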

> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index 0ffb1cf17e0b2e..690742e8f173eb 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
> >  	ARM_SMMU_MAX_MSIS,
> >  };
> >  
> > +struct arm_smmu_entry_writer_ops;
> > +struct arm_smmu_entry_writer {
> > +	const struct arm_smmu_entry_writer_ops *ops;
> > +	struct arm_smmu_master *master;
> 
> I see only master->smmu is used, is there a reason why we have this
> struct instead?

The CD patches in part 2 require the master because the CD entry
memory is shared across multiple CDs, so we iterate the SID list inside
the update. The STE is the opposite: each STE has its own memory, so we
iterate the SID list outside the update.
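
For reference, a rough sketch of that "outside the update" iteration for
STEs, modeled on the existing arm_smmu_install_ste_for_dev() (simplified;
the duplicate-SID handling for PCI aliases is omitted):

static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
{
	int i;

	for (i = 0; i < master->num_streams; ++i) {
		u32 sid = master->streams[i].id;
		struct arm_smmu_ste *step =
			arm_smmu_get_step_for_sid(master->smmu, sid);

		/* Each STE has its own memory, so each SID is programmed
		 * and synced independently by the writer. */
		arm_smmu_write_strtab_ent(master, sid, step);
	}
}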

> > +struct arm_smmu_entry_writer_ops {
> > +	unsigned int num_entry_qwords;
> > +	__le64 v_bit;
> > +	void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
> > +			 __le64 *used);
> 
> *writer is not used in this series, I think it would make more sense if
> it's added in the patch that introduce using it.

Ah, I guess, I think it is used in the test bench.
 
> > +	void (*sync)(struct arm_smmu_entry_writer *writer);
> > +};
> > +
> > +#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
> > +
> 
> Isn't that just STRTAB_STE_DWORDS, also it makes more sense to not tie
> this to the struct but with the actual hardware description that would
> never change (but the struct can change)

The struct and the HW description are the same. The struct size cannot
change. Broadly in the series STRTAB_STE_DWORDS is being dis-favoured
for sizeof(struct arm_smmu_ste) now that we have the struct.

After part 3 there are only two references left to that constant, so I
will likely change part 3 to remove it.

> > +/*
> > + * Figure out if we can do a hitless update of entry to become target. Returns a
> > + * bit mask where 1 indicates that qword needs to be set disruptively.
> > + * unused_update is an intermediate value of entry that has unused bits set to
> > + * their new values.
> > + */
> > +static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
> > +				    const __le64 *entry, const __le64 *target,
> > +				    __le64 *unused_update)
> > +{
> > +	__le64 target_used[NUM_ENTRY_QWORDS] = {};
> > +	__le64 cur_used[NUM_ENTRY_QWORDS] = {};
> > +	u8 used_qword_diff = 0;
> > +	unsigned int i;
> > +
> > +	writer->ops->get_used(writer, entry, cur_used);
> > +	writer->ops->get_used(writer, target, target_used);
> > +
> > +	for (i = 0; i != writer->ops->num_entry_qwords; i++) {
> > +		/*
> > +		 * Check that masks are up to date, the make functions are not
> > +		 * allowed to set a bit to 1 if the used function doesn't say it
> > +		 * is used.
> > +		 */
> > +		WARN_ON_ONCE(target[i] & ~target_used[i]);
> > +
> 
> I think this should be a BUG. As we don't know the consequence for such change,
> and this should never happen in a non-development kernel.

Guidance from Linus is to never use BUG, always use WARN_ON and try to
recover. If people are running in a high-sensitivity production
environment they should set the warn on panic feature to ensure any
kernel self-detection of corruption triggers a halt.

> > +/*
> > + * Update the STE/CD to the target configuration. The transition from the
> > + * current entry to the target entry takes place over multiple steps that
> > + * attempts to make the transition hitless if possible. This function takes care
> > + * not to create a situation where the HW can perceive a corrupted entry. HW is
> > + * only required to have a 64 bit atomicity with stores from the CPU, while
> > + * entries are many 64 bit values big.
> > + *
> > + * The difference between the current value and the target value is analyzed to
> > + * determine which of three updates are required - disruptive, hitless or no
> > + * change.
> > + *
> > + * In the most general disruptive case we can make any update in three steps:
> > + *  - Disrupting the entry (V=0)
> > + *  - Fill now unused qwords, execpt qword 0 which contains V
> > + *  - Make qword 0 have the final value and valid (V=1) with a single 64
> > + *    bit store
> > + *
> > + * However this disrupts the HW while it is happening. There are several
> > + * interesting cases where a STE/CD can be updated without disturbing the HW
> > + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> > + * because the used bits don't intersect. We can detect this by calculating how
> > + * many 64 bit values need update after adjusting the unused bits and skip the
> > + * V=0 process. This relies on the IGNORED behavior described in the
> > + * specification.
> > + */
> > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > +				 __le64 *entry, const __le64 *target)
> > +{
> > +	unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
> > +	__le64 unused_update[NUM_ENTRY_QWORDS];
> > +	u8 used_qword_diff;
> > +
> > +	used_qword_diff =
> > +		arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> > +	if (hweight8(used_qword_diff) > 1) {
> > +		/*
> > +		 * At least two qwords need their inuse bits to be changed. This
> > +		 * requires a breaking update, zero the V bit, write all qwords
> > +		 * but 0, then set qword 0
> > +		 */
> > +		unused_update[0] = entry[0] & (~writer->ops->v_bit);
> > +		entry_set(writer, entry, unused_update, 0, 1);
> > +		entry_set(writer, entry, target, 1, num_entry_qwords - 1);
> > +		entry_set(writer, entry, target, 0, 1);
> > +	} else if (hweight8(used_qword_diff) == 1) {
> > +		/*
> > +		 * Only one qword needs its used bits to be changed. This is a
> > +		 * hitless update, update all bits the current STE is ignoring
> > +		 * to their new values, then update a single "critical qword" to
> > +		 * change the STE and finally 0 out any bits that are now unused
> > +		 * in the target configuration.
> > +		 */
> > +		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> > +
> > +		/*
> > +		 * Skip writing unused bits in the critical qword since we'll be
> > +		 * writing it in the next step anyways. This can save a sync
> > +		 * when the only change is in that qword.
> > +		 */
> > +		unused_update[critical_qword_index] =
> > +			entry[critical_qword_index];
> > +		entry_set(writer, entry, unused_update, 0, num_entry_qwords);
> > +		entry_set(writer, entry, target, critical_qword_index, 1);
> > +		entry_set(writer, entry, target, 0, num_entry_qwords);
> 
> The STE is updated in 3 steps.
> 1) Update all bits from target (except the changed qword)
> 2) Update the changed qword
> 3) Remove the bits that are not used by the target STE.
> 
> In most cases we would issue a sync for 1) and 3) although the hardware ignores
> the updates, that seems necessary, am I missing something?

"seems [un]necessary", right?

All syncs are necessary because the way the SMMU HW is permitted to
cache on a qword by qword basis.

Eg with no sync after step 1 the HW cache could have:

  QW0 Not present
  QW1 Step 0 (Current)

And then instantly after step 2 updates QW0, but before it does the
sync, the HW is permitted to read. Then it would have:

  QW0 Step 2
  QW1 Step 0 (Current)

Which is illegal. The HW is allowed to observe a mix of Step[n] and
Step[n+1] only. Never a mix of Step[n-1] and Step[n+1].

The sync provides a barrier that prevents this. HW can never observe
the critical qword of step 2 without also observing only new values of
step 1.

The same argument is for step 3 -> next step 1 on a future update.
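
To make that concrete, here is a sketch of what each entry_set() step above
does (reconstructed, so it may not match the patch exactly): it stores only
the qwords that actually change, each with a single 64-bit write, and issues
the writer's sync whenever anything changed, so the HW can only ever observe
two adjacent steps:

static bool entry_set(struct arm_smmu_entry_writer *writer, __le64 *entry,
		      const __le64 *target, unsigned int start,
		      unsigned int len)
{
	bool changed = false;
	unsigned int i;

	for (i = start; len != 0; len--, i++) {
		if (entry[i] != target[i]) {
			WRITE_ONCE(entry[i], target[i]);
			changed = true;
		}
	}

	/* CFGI_STE/CFGI_CD + CMD_SYNC acts as the step barrier */
	if (changed)
		writer->ops->sync(writer);
	return changed;
}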

Regards,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-30 23:56     ` Jason Gunthorpe
@ 2024-01-31 14:34       ` Mostafa Saleh
  2024-01-31 14:40         ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Mostafa Saleh @ 2024-01-31 14:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Tue, Jan 30, 2024 at 07:56:11PM -0400, Jason Gunthorpe wrote:
> On Tue, Jan 30, 2024 at 10:42:13PM +0000, Mostafa Saleh wrote:
> 
> > On Thu, Jan 25, 2024 at 07:57:11PM -0400, Jason Gunthorpe wrote:
> > > As the comment in arm_smmu_write_strtab_ent() explains, this routine has
> > > been limited to only work correctly in certain scenarios that the caller
> > > must ensure. Generally the caller must put the STE into ABORT or BYPASS
> > > before attempting to program it to something else.
> > > 
> > > The iommu core APIs would ideally expect the driver to do a hitless change
> > > of iommu_domain in a number of cases:
> > > 
> > >  - RESV_DIRECT support wants IDENTITY -> DMA -> IDENTITY to be hitless
> > >    for the RESV ranges
> > > 
> > >  - PASID upgrade has IDENTIY on the RID with no PASID then a PASID paging
> > >    domain installed. The RID should not be impacted
> > > 
> > >  - PASID downgrade has IDENTIY on the RID and all PASID's removed.
> > >    The RID should not be impacted
> > > 
> > >  - RID does PAGING -> BLOCKING with active PASID, PASID's should not be
> > >    impacted
> > > 
> > >  - NESTING -> NESTING for carrying all the above hitless cases in a VM
> > >    into the hypervisor. To comprehensively emulate the HW in a VM we should
> > >    assume the VM OS is running logic like this and expecting hitless updates
> > >    to be relayed to real HW.
> > 
> > From my understanding, some of these cases are not implemented (at this point).
> > However, from what I see, most of these cases are related to switching from/to
> > identity, which the current driver would have to block in between, is my
> > understanding correct?
> 
> Basically
> 
> > As for NESTING -> NESTING,  how is that achieved? (and why?)
> 
> Through iommufd and it is necessary to reflect hitless transition from
> the VM to the real HW. See VFIO_DEVICE_ATTACH_IOMMUFD_PT
> 
> > AFAICT, VFIO will do BLOCKING in between any transition, and that domain
> > should never change while the a device is assigned to a VM.
> 
> It ultimately calls iommufd_device_replace() which avoids that. Old
> vfio type1 users will force a blocking, but type1 will never support
> nesting so it isn't relevant.
>
Thanks, I will check those.
> > > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > index 0ffb1cf17e0b2e..690742e8f173eb 100644
> > > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > > @@ -48,6 +48,22 @@ enum arm_smmu_msi_index {
> > >  	ARM_SMMU_MAX_MSIS,
> > >  };
> > >  
> > > +struct arm_smmu_entry_writer_ops;
> > > +struct arm_smmu_entry_writer {
> > > +	const struct arm_smmu_entry_writer_ops *ops;
> > > +	struct arm_smmu_master *master;
> > 
> > I see only master->smmu is used, is there a reason why we have this
> > struct instead?
> 
> The CD patches in part 2 require the master because the CD entry
> memory is shared across multiple CDs, so we iterate the SID list inside
> the update. The STE is the opposite: each STE has its own memory, so we
> iterate the SID list outside the update.
> 
> > > +struct arm_smmu_entry_writer_ops {
> > > +	unsigned int num_entry_qwords;
> > > +	__le64 v_bit;
> > > +	void (*get_used)(struct arm_smmu_entry_writer *writer, const __le64 *entry,
> > > +			 __le64 *used);
> > 
> > *writer is not used in this series, I think it would make more sense if
> > it's added in the patch that introduce using it.
> 
> Ah, I guess, I think it is used in the test bench.
>  
> > > +	void (*sync)(struct arm_smmu_entry_writer *writer);
> > > +};
> > > +
> > > +#define NUM_ENTRY_QWORDS (sizeof(struct arm_smmu_ste) / sizeof(u64))
> > > +
> > 
> > Isn't that just STRTAB_STE_DWORDS, also it makes more sense to not tie
> > this to the struct but with the actual hardware description that would
> > never change (but the struct can change)
> 
> The struct and the HW description are the same. The struct size cannot
> change. Broadly in the series STRTAB_STE_DWORDS is being dis-favoured
> for sizeof(struct arm_smmu_ste) now that we have the struct.
> 
> After part 3 there are only two references left to that constant, so I
> will likely change part 3 to remove it.

But arm_smmu_ste is defined based on STRTAB_STE_DWORDS. And this macro would
never change as it is tied to the HW. However, in the future we can update
“struct arm_smmu_ste” to hold a refcount for some reason,
then sizeof(struct arm_smmu_ste) is not the size of the STE in the hardware.
IMHO, any reference to the HW STE should be done using the macro.

> > > +/*
> > > + * Figure out if we can do a hitless update of entry to become target. Returns a
> > > + * bit mask where 1 indicates that qword needs to be set disruptively.
> > > + * unused_update is an intermediate value of entry that has unused bits set to
> > > + * their new values.
> > > + */
> > > +static u8 arm_smmu_entry_qword_diff(struct arm_smmu_entry_writer *writer,
> > > +				    const __le64 *entry, const __le64 *target,
> > > +				    __le64 *unused_update)
> > > +{
> > > +	__le64 target_used[NUM_ENTRY_QWORDS] = {};
> > > +	__le64 cur_used[NUM_ENTRY_QWORDS] = {};
> > > +	u8 used_qword_diff = 0;
> > > +	unsigned int i;
> > > +
> > > +	writer->ops->get_used(writer, entry, cur_used);
> > > +	writer->ops->get_used(writer, target, target_used);
> > > +
> > > +	for (i = 0; i != writer->ops->num_entry_qwords; i++) {
> > > +		/*
> > > +		 * Check that masks are up to date, the make functions are not
> > > +		 * allowed to set a bit to 1 if the used function doesn't say it
> > > +		 * is used.
> > > +		 */
> > > +		WARN_ON_ONCE(target[i] & ~target_used[i]);
> > > +
> > 
> > I think this should be a BUG. As we don't know the consequence for such change,
> > and this should never happen in a non-development kernel.
> 
> Guidance from Linus is to never use BUG, always use WARN_ON and try to
> recover. If people are running in a high-sensitivity production
> environment they should set the warn on panic feature to ensure any
> kernel self-detection of corruption triggers a halt.
> 
> > > +/*
> > > + * Update the STE/CD to the target configuration. The transition from the
> > > + * current entry to the target entry takes place over multiple steps that
> > > + * attempts to make the transition hitless if possible. This function takes care
> > > + * not to create a situation where the HW can perceive a corrupted entry. HW is
> > > + * only required to have a 64 bit atomicity with stores from the CPU, while
> > > + * entries are many 64 bit values big.
> > > + *
> > > + * The difference between the current value and the target value is analyzed to
> > > + * determine which of three updates are required - disruptive, hitless or no
> > > + * change.
> > > + *
> > > + * In the most general disruptive case we can make any update in three steps:
> > > + *  - Disrupting the entry (V=0)
> > > + *  - Fill now unused qwords, execpt qword 0 which contains V
> > > + *  - Make qword 0 have the final value and valid (V=1) with a single 64
> > > + *    bit store
> > > + *
> > > + * However this disrupts the HW while it is happening. There are several
> > > + * interesting cases where a STE/CD can be updated without disturbing the HW
> > > + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> > > + * because the used bits don't intersect. We can detect this by calculating how
> > > + * many 64 bit values need update after adjusting the unused bits and skip the
> > > + * V=0 process. This relies on the IGNORED behavior described in the
> > > + * specification.
> > > + */
> > > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > > +				 __le64 *entry, const __le64 *target)
> > > +{
> > > +	unsigned int num_entry_qwords = writer->ops->num_entry_qwords;
> > > +	__le64 unused_update[NUM_ENTRY_QWORDS];
> > > +	u8 used_qword_diff;
> > > +
> > > +	used_qword_diff =
> > > +		arm_smmu_entry_qword_diff(writer, entry, target, unused_update);
> > > +	if (hweight8(used_qword_diff) > 1) {
> > > +		/*
> > > +		 * At least two qwords need their inuse bits to be changed. This
> > > +		 * requires a breaking update, zero the V bit, write all qwords
> > > +		 * but 0, then set qword 0
> > > +		 */
> > > +		unused_update[0] = entry[0] & (~writer->ops->v_bit);
> > > +		entry_set(writer, entry, unused_update, 0, 1);
> > > +		entry_set(writer, entry, target, 1, num_entry_qwords - 1);
> > > +		entry_set(writer, entry, target, 0, 1);
> > > +	} else if (hweight8(used_qword_diff) == 1) {
> > > +		/*
> > > +		 * Only one qword needs its used bits to be changed. This is a
> > > +		 * hitless update, update all bits the current STE is ignoring
> > > +		 * to their new values, then update a single "critical qword" to
> > > +		 * change the STE and finally 0 out any bits that are now unused
> > > +		 * in the target configuration.
> > > +		 */
> > > +		unsigned int critical_qword_index = ffs(used_qword_diff) - 1;
> > > +
> > > +		/*
> > > +		 * Skip writing unused bits in the critical qword since we'll be
> > > +		 * writing it in the next step anyways. This can save a sync
> > > +		 * when the only change is in that qword.
> > > +		 */
> > > +		unused_update[critical_qword_index] =
> > > +			entry[critical_qword_index];
> > > +		entry_set(writer, entry, unused_update, 0, num_entry_qwords);
> > > +		entry_set(writer, entry, target, critical_qword_index, 1);
> > > +		entry_set(writer, entry, target, 0, num_entry_qwords);
> > 
> > The STE is updated in 3 steps.
> > 1) Update all bits from target (except the changed qword)
> > 2) Update the changed qword
> > 3) Remove the bits that are not used by the target STE.
> > 
> > In most cases we would issue a sync for 1) and 3) although the hardware ignores
> > the updates, that seems necessary, am I missing something?
> 
> "seems [un]necessary", right?
Yes, that's a typo.

> All syncs are necessary because the way the SMMU HW is permitted to
> cache on a qword by qword basis.
> 
> Eg with no sync after step 1 the HW cache could have:
> 
>   QW0 Not present
>   QW1 Step 0 (Current)
> 
> And then instantly after step 2 updates QW0, but before it does the
> sync, the HW is permitted to read. Then it would have:
> 
>   QW0 Step 2
>   QW1 Step 0 (Current)
> 
> Which is illegal. The HW is allowed to observe a mix of Step[n] and
> Step[n+1] only. Never a mix of Step[n-1] and Step[n+1].
> 
> The sync provides a barrier that prevents this. HW can never observe
> the critical qword of step 2 without also observing only new values of
> step 1.
> 
> The same argument is for step 3 -> next step 1 on a future update.

I see, thanks for the explanation.

Thanks,
Mostafa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers
  2024-01-31 14:34       ` Mostafa Saleh
@ 2024-01-31 14:40         ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-31 14:40 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Wed, Jan 31, 2024 at 02:34:23PM +0000, Mostafa Saleh wrote:

> But arm_smmu_ste is defined based on STRTAB_STE_DWORDS. And this macro would
> never change as it is tied to the HW. However, in the future we can update
> “struct arm_smmu_ste” to hold a refcount for some reason,
> then sizeof(struct arm_smmu_ste) is not the size of the STE in the hardware.
> IMHO, any reference to the HW STE should be done using the macro.

We can't do anything like that. The arm_smmu_ste is a HW structure
that overlays the actual memory that the SMMU is DMA'ing from. Its
size and memory layout cannot be changed.
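
For reference, the overlay as defined in arm-smmu-v3.h at this point in the
series is exactly the eight qwords the HW reads:

#define STRTAB_STE_DWORDS	8

struct arm_smmu_ste {
	__le64 data[STRTAB_STE_DWORDS];
};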

Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  2024-01-25 23:57 ` [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass Jason Gunthorpe
@ 2024-01-31 14:40   ` Mostafa Saleh
  2024-01-31 14:47     ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Mostafa Saleh @ 2024-01-31 14:40 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Hi Jason,

On Thu, Jan 25, 2024 at 07:57:12PM -0400, Jason Gunthorpe wrote:
> This allows writing the flow of arm_smmu_write_strtab_ent() around abort
> and bypass domains more naturally.
> 
> Note that the core code no longer supplies NULL domains, though there is
> still a flow in the driver that end up in arm_smmu_write_strtab_ent() with
> NULL. A later patch will remove it.
> 
> Remove the duplicate calculation of the STE in arm_smmu_init_bypass_stes()
> and remove the force parameter. arm_smmu_rmr_install_bypass_ste() can now
> simply invoke arm_smmu_make_bypass_ste() directly.
> 
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 89 +++++++++++----------
>  1 file changed, 47 insertions(+), 42 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 690742e8f173eb..38bcb4ed1fccc1 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1496,6 +1496,24 @@ static void arm_smmu_write_ste(struct arm_smmu_master *master, u32 sid,
>  	}
>  }
>  
> +static void arm_smmu_make_abort_ste(struct arm_smmu_ste *target)
> +{
> +	memset(target, 0, sizeof(*target));
> +	target->data[0] = cpu_to_le64(
> +		STRTAB_STE_0_V |
> +		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT));
> +}
> +
> +static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
> +{
> +	memset(target, 0, sizeof(*target));

I see this can be used with the actual STE. Although this is done at init,
briefly making the STE abort from “arm_smmu_make_bypass_ste” seems a bit
fragile to me; if we use this in the future in different scenarios, it
might break the hitless assumption. But no strong opinion though.

> +	target->data[0] = cpu_to_le64(
> +		STRTAB_STE_0_V |
> +		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS));
> +	target->data[1] = cpu_to_le64(
> +		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
> +}
> +
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  				      struct arm_smmu_ste *dst)
>  {
> @@ -1506,37 +1524,31 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  	struct arm_smmu_domain *smmu_domain = master->domain;
>  	struct arm_smmu_ste target = {};
>  
> -	if (smmu_domain) {
> -		switch (smmu_domain->stage) {
> -		case ARM_SMMU_DOMAIN_S1:
> -			cd_table = &master->cd_table;
> -			break;
> -		case ARM_SMMU_DOMAIN_S2:
> -			s2_cfg = &smmu_domain->s2_cfg;
> -			break;
> -		default:
> -			break;
> -		}
> +	if (!smmu_domain) {
> +		if (disable_bypass)
> +			arm_smmu_make_abort_ste(&target);
> +		else
> +			arm_smmu_make_bypass_ste(&target);
> +		arm_smmu_write_ste(master, sid, dst, &target);
> +		return;
> +	}
> +
> +	switch (smmu_domain->stage) {
> +	case ARM_SMMU_DOMAIN_S1:
> +		cd_table = &master->cd_table;
> +		break;
> +	case ARM_SMMU_DOMAIN_S2:
> +		s2_cfg = &smmu_domain->s2_cfg;
> +		break;
> +	case ARM_SMMU_DOMAIN_BYPASS:
> +		arm_smmu_make_bypass_ste(&target);
> +		arm_smmu_write_ste(master, sid, dst, &target);
> +		return;
>  	}
>  
>  	/* Nuke the existing STE_0 value, as we're going to rewrite it */
>  	val = STRTAB_STE_0_V;
>  
> -	/* Bypass/fault */
> -	if (!smmu_domain || !(cd_table || s2_cfg)) {
> -		if (!smmu_domain && disable_bypass)
> -			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
> -		else
> -			val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
> -
> -		target.data[0] = cpu_to_le64(val);
> -		target.data[1] = cpu_to_le64(FIELD_PREP(STRTAB_STE_1_SHCFG,
> -						STRTAB_STE_1_SHCFG_INCOMING));
> -		target.data[2] = 0; /* Nuke the VMID */
> -		arm_smmu_write_ste(master, sid, dst, &target);
> -		return;
> -	}
> -
>  	if (cd_table) {
>  		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
>  			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
> @@ -1582,21 +1594,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  }
>  
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
> -				      unsigned int nent, bool force)
> +				      unsigned int nent)
>  {
>  	unsigned int i;
> -	u64 val = STRTAB_STE_0_V;
> -
> -	if (disable_bypass && !force)
> -		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_ABORT);
> -	else
> -		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_BYPASS);
>  
>  	for (i = 0; i < nent; ++i) {
> -		strtab->data[0] = cpu_to_le64(val);
> -		strtab->data[1] = cpu_to_le64(FIELD_PREP(
> -			STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
> -		strtab->data[2] = 0;
> +		if (disable_bypass)
> +			arm_smmu_make_abort_ste(strtab);
> +		else
> +			arm_smmu_make_bypass_ste(strtab);
>  		strtab++;
>  	}
>  }
> @@ -1624,7 +1630,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
>  		return -ENOMEM;
>  	}
>  
> -	arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT, false);
> +	arm_smmu_init_bypass_stes(desc->l2ptr, 1 << STRTAB_SPLIT);
>  	arm_smmu_write_strtab_l1_desc(strtab, desc);
>  	return 0;
>  }
> @@ -3243,7 +3249,7 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>  	reg |= FIELD_PREP(STRTAB_BASE_CFG_LOG2SIZE, smmu->sid_bits);
>  	cfg->strtab_base_cfg = reg;
>  
> -	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents, false);
> +	arm_smmu_init_bypass_stes(strtab, cfg->num_l1_ents);
>  	return 0;
>  }
>  
> @@ -3954,7 +3960,6 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
>  	iort_get_rmr_sids(dev_fwnode(smmu->dev), &rmr_list);
>  
>  	list_for_each_entry(e, &rmr_list, list) {
> -		struct arm_smmu_ste *step;
>  		struct iommu_iort_rmr_data *rmr;
>  		int ret, i;
>  
> @@ -3967,8 +3972,8 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
>  				continue;
>  			}
>  
> -			step = arm_smmu_get_step_for_sid(smmu, rmr->sids[i]);
> -			arm_smmu_init_bypass_stes(step, 1, true);
> +			arm_smmu_make_bypass_ste(
> +				arm_smmu_get_step_for_sid(smmu, rmr->sids[i]));
>  		}
>  	}
>  
> -- 
> 2.43.0
>

Reviewed-by: Mostafa Saleh <smostafa@google.com>

Thanks,
Mostafa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  2024-01-31 14:40   ` Mostafa Saleh
@ 2024-01-31 14:47     ` Jason Gunthorpe
  2024-02-01 11:32       ` Mostafa Saleh
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-31 14:47 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Wed, Jan 31, 2024 at 02:40:24PM +0000, Mostafa Saleh wrote:
> > +static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
> > +{
> > +	memset(target, 0, sizeof(*target));
> 
> I see this can be used with the actual STE. Although this is done at init,
> briefly making the STE abort from “arm_smmu_make_bypass_ste” seems a bit
> fragile to me; if we use this in the future in different scenarios, it
> might break the hitless assumption. But no strong opinion though.

At init time, when that case happens, the STE table hasn't been
installed in the HW yet. This is why that specific code path has been
directly manipulating the STE and does not call the normal update path
with the sync'ing.

It is perhaps subtle; that is why it is a different flow.

(this is also why I moved the function order: it was a bit obscure to
see that all this stuff was sequenced right and that we were not
updating a live STE improperly)
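
A minimal sketch of the probe-time ordering this relies on (simplified from
the existing probe path; the wrapper function name here is only for
illustration):

static int arm_smmu_probe_flow_sketch(struct arm_smmu_device *smmu, bool bypass)
{
	int ret;

	/* Fills the bypass/abort STEs while the table is still CPU-private */
	ret = arm_smmu_init_structures(smmu);
	if (ret)
		return ret;

	/* Direct arm_smmu_make_bypass_ste() writes for RMR regions */
	arm_smmu_rmr_install_bypass_ste(smmu);

	/* Only here are STRTAB_BASE and SMMUEN programmed, making the
	 * table visible to the HW, hence no CFGI/sync before this. */
	return arm_smmu_device_reset(smmu, bypass);
}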

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions
  2024-01-25 23:57 ` [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions Jason Gunthorpe
@ 2024-01-31 14:50   ` Mostafa Saleh
  2024-01-31 15:05     ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Mostafa Saleh @ 2024-01-31 14:50 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Hi Jason,

On Thu, Jan 25, 2024 at 07:57:14PM -0400, Jason Gunthorpe wrote:
> This is preparation to move the STE calculation higher up in to the call
> chain and remove arm_smmu_write_strtab_ent(). These new functions will be
> called directly from attach_dev.
> 
> Reviewed-by: Moritz Fischer <mdf@kernel.org>
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 115 +++++++++++---------
>  1 file changed, 62 insertions(+), 53 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index df8fc7b87a7907..910156881423e0 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1516,13 +1516,68 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
>  		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
>  }
>  
> +static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
> +				      struct arm_smmu_master *master,
> +				      struct arm_smmu_ctx_desc_cfg *cd_table)
master already includes cd_table in "master->cd_table"; why do we need to
pass it separately?
> +{
> +	struct arm_smmu_device *smmu = master->smmu;
> +
> +	memset(target, 0, sizeof(*target));
> +	target->data[0] = cpu_to_le64(
> +		STRTAB_STE_0_V |
> +		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> +		FIELD_PREP(STRTAB_STE_0_S1FMT, cd_table->s1fmt) |
> +		(cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
> +		FIELD_PREP(STRTAB_STE_0_S1CDMAX, cd_table->s1cdmax));
> +
> +	target->data[1] = cpu_to_le64(
> +		FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
> +		FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> +		FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> +		FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
> +		((smmu->features & ARM_SMMU_FEAT_STALLS &&
> +		  !master->stall_enabled) ?
> +			 STRTAB_STE_1_S1STALLD :
> +			 0) |
> +		FIELD_PREP(STRTAB_STE_1_EATS,
> +			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0) |
> +		FIELD_PREP(STRTAB_STE_1_STRW,
> +			   (smmu->features & ARM_SMMU_FEAT_E2H) ?
> +				   STRTAB_STE_1_STRW_EL2 :
> +				   STRTAB_STE_1_STRW_NSEL1));
> +}
> +
> +static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
> +					struct arm_smmu_master *master,
> +					struct arm_smmu_domain *smmu_domain)
Similarly, master already has the domain in "master->domain".
> +{
> +	struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
> +
> +	memset(target, 0, sizeof(*target));
> +	target->data[0] = cpu_to_le64(
> +		STRTAB_STE_0_V |
> +		FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS));
> +
> +	target->data[1] = cpu_to_le64(
> +		FIELD_PREP(STRTAB_STE_1_EATS,
> +			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
> +
> +	target->data[2] = cpu_to_le64(
> +		FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
> +		FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
> +		STRTAB_STE_2_S2AA64 |
> +#ifdef __BIG_ENDIAN
> +		STRTAB_STE_2_S2ENDI |
> +#endif
> +		STRTAB_STE_2_S2PTW |
> +		STRTAB_STE_2_S2R);
> +
> +	target->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +}
> +
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  				      struct arm_smmu_ste *dst)
>  {
> -	u64 val;
> -	struct arm_smmu_device *smmu = master->smmu;
> -	struct arm_smmu_ctx_desc_cfg *cd_table = NULL;
> -	struct arm_smmu_s2_cfg *s2_cfg = NULL;
>  	struct arm_smmu_domain *smmu_domain = master->domain;
>  	struct arm_smmu_ste target = {};
>  
> @@ -1537,61 +1592,15 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
>  
>  	switch (smmu_domain->stage) {
>  	case ARM_SMMU_DOMAIN_S1:
> -		cd_table = &master->cd_table;
> +		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
>  		break;
>  	case ARM_SMMU_DOMAIN_S2:
> -		s2_cfg = &smmu_domain->s2_cfg;
> +		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
>  		break;
>  	case ARM_SMMU_DOMAIN_BYPASS:
>  		arm_smmu_make_bypass_ste(&target);
> -		arm_smmu_write_ste(master, sid, dst, &target);
> -		return;
> +		break;
>  	}
> -
> -	/* Nuke the existing STE_0 value, as we're going to rewrite it */
> -	val = STRTAB_STE_0_V;
> -
> -	if (cd_table) {
> -		u64 strw = smmu->features & ARM_SMMU_FEAT_E2H ?
> -			STRTAB_STE_1_STRW_EL2 : STRTAB_STE_1_STRW_NSEL1;
> -
> -		target.data[1] = cpu_to_le64(
> -			 FIELD_PREP(STRTAB_STE_1_S1DSS, STRTAB_STE_1_S1DSS_SSID0) |
> -			 FIELD_PREP(STRTAB_STE_1_S1CIR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> -			 FIELD_PREP(STRTAB_STE_1_S1COR, STRTAB_STE_1_S1C_CACHE_WBRA) |
> -			 FIELD_PREP(STRTAB_STE_1_S1CSH, ARM_SMMU_SH_ISH) |
> -			 FIELD_PREP(STRTAB_STE_1_STRW, strw));
> -
> -		if (smmu->features & ARM_SMMU_FEAT_STALLS &&
> -		    !master->stall_enabled)
> -			target.data[1] |= cpu_to_le64(STRTAB_STE_1_S1STALLD);
> -
> -		val |= (cd_table->cdtab_dma & STRTAB_STE_0_S1CTXPTR_MASK) |
> -			FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S1_TRANS) |
> -			FIELD_PREP(STRTAB_STE_0_S1CDMAX, cd_table->s1cdmax) |
> -			FIELD_PREP(STRTAB_STE_0_S1FMT, cd_table->s1fmt);
> -	}
> -
> -	if (s2_cfg) {
> -		target.data[2] = cpu_to_le64(
> -			 FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
> -			 FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
> -#ifdef __BIG_ENDIAN
> -			 STRTAB_STE_2_S2ENDI |
> -#endif
> -			 STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2AA64 |
> -			 STRTAB_STE_2_S2R);
> -
> -		target.data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> -
> -		val |= FIELD_PREP(STRTAB_STE_0_CFG, STRTAB_STE_0_CFG_S2_TRANS);
> -	}
> -
> -	if (master->ats_enabled)
> -		target.data[1] |= cpu_to_le64(FIELD_PREP(STRTAB_STE_1_EATS,
> -						 STRTAB_STE_1_EATS_TRANS));
> -
> -	target.data[0] = cpu_to_le64(val);
>  	arm_smmu_write_ste(master, sid, dst, &target);
>  }
>  
> -- 
> 2.43.0

Reviewed-by: Mostafa Saleh <smostafa@google.com>

Thanks,
Mostafa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions
  2024-01-31 14:50   ` Mostafa Saleh
@ 2024-01-31 15:05     ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-01-31 15:05 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Wed, Jan 31, 2024 at 02:50:42PM +0000, Mostafa Saleh wrote:
> > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > index df8fc7b87a7907..910156881423e0 100644
> > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> > @@ -1516,13 +1516,68 @@ static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
> >  		FIELD_PREP(STRTAB_STE_1_SHCFG, STRTAB_STE_1_SHCFG_INCOMING));
> >  }
> >  
> > +static void arm_smmu_make_cdtable_ste(struct arm_smmu_ste *target,
> > +				      struct arm_smmu_master *master,
> > +				      struct arm_smmu_ctx_desc_cfg *cd_table)
> master already includes cd_table in "master->cd_table"; why do we need to
> pass it separately?

Indeed, it could be like that, I will change it

> > +static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
> > +					struct arm_smmu_master *master,
> > +					struct arm_smmu_domain *smmu_domain)
> Similarly, master already has the domain in "master->domain".

A couple patches further on will delete master->domain.

Thanks,
Jason

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  2024-01-31 14:47     ` Jason Gunthorpe
@ 2024-02-01 11:32       ` Mostafa Saleh
  2024-02-01 13:02         ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Mostafa Saleh @ 2024-02-01 11:32 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Wed, Jan 31, 2024 at 10:47:02AM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 31, 2024 at 02:40:24PM +0000, Mostafa Saleh wrote:
> > > +static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
> > > +{
> > > +	memset(target, 0, sizeof(*target));
> > 
> > I see this can be used with the actual STE. Although this is done at init, but
> > briefly making the STE abort from “arm_smmu_make_bypass_ste”, seems a bit
> > fragile to me, in case we use this in the future in different scenarios, it
> > might break the hitless assumption. But no strong opinion though.
> 
> At init time, when that case happens, the STE table hasn't been
> installed in the HW yet. This is why that specific code path has been
> directly manipulating the STE and does not call the normal update path
> with the sync'ing.
> 
> It is perhaps subtle; that is why it is a different flow.
> 
> (this is also why I moved the function order: it was a bit obscure to
> see that all this stuff was sequenced right and that we were not
> updating a live STE improperly)

I agree, this is not an issue, but it was just confusing when I first read it
as “arm_smmu_make_bypass_ste” would make the STE transiently unavailable, and
I had to go and check the usage. Maybe we can add a comment to clarify how this
function is expected to be used.

Thanks,
Mostafa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 05/16] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste()
  2024-01-25 23:57 ` [PATCH v4 05/16] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste() Jason Gunthorpe
@ 2024-02-01 11:34   ` Mostafa Saleh
  0 siblings, 0 replies; 39+ messages in thread
From: Mostafa Saleh @ 2024-02-01 11:34 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Hi Jason,

On Thu, Jan 25, 2024 at 07:57:15PM -0400, Jason Gunthorpe wrote:
> Half the code was living in arm_smmu_domain_finalise_s2(), just move it
> here and take the values directly from the pgtbl_ops instead of storing
> copies.
> 
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 27 ++++++++++++---------
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  2 --
>  2 files changed, 15 insertions(+), 14 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 910156881423e0..9a95d0f1494223 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1552,6 +1552,11 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  					struct arm_smmu_domain *smmu_domain)
>  {
>  	struct arm_smmu_s2_cfg *s2_cfg = &smmu_domain->s2_cfg;
> +	const struct io_pgtable_cfg *pgtbl_cfg =
> +		&io_pgtable_ops_to_pgtable(smmu_domain->pgtbl_ops)->cfg;
> +	typeof(&pgtbl_cfg->arm_lpae_s2_cfg.vtcr) vtcr =
> +		&pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
> +	u64 vtcr_val;
>  
>  	memset(target, 0, sizeof(*target));
>  	target->data[0] = cpu_to_le64(
> @@ -1562,9 +1567,16 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  		FIELD_PREP(STRTAB_STE_1_EATS,
>  			   master->ats_enabled ? STRTAB_STE_1_EATS_TRANS : 0));
>  
> +	vtcr_val = FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, vtcr->tsz) |
> +		   FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, vtcr->sl) |
> +		   FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, vtcr->irgn) |
> +		   FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, vtcr->orgn) |
> +		   FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, vtcr->sh) |
> +		   FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, vtcr->tg) |
> +		   FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, vtcr->ps);
>  	target->data[2] = cpu_to_le64(
>  		FIELD_PREP(STRTAB_STE_2_S2VMID, s2_cfg->vmid) |
> -		FIELD_PREP(STRTAB_STE_2_VTCR, s2_cfg->vtcr) |
> +		FIELD_PREP(STRTAB_STE_2_VTCR, vtcr_val) |
>  		STRTAB_STE_2_S2AA64 |
>  #ifdef __BIG_ENDIAN
>  		STRTAB_STE_2_S2ENDI |
> @@ -1572,7 +1584,8 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  		STRTAB_STE_2_S2PTW |
>  		STRTAB_STE_2_S2R);
>  
> -	target->data[3] = cpu_to_le64(s2_cfg->vttbr & STRTAB_STE_3_S2TTB_MASK);
> +	target->data[3] = cpu_to_le64(pgtbl_cfg->arm_lpae_s2_cfg.vttbr &
> +				      STRTAB_STE_3_S2TTB_MASK);
>  }
>  
>  static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> @@ -2328,7 +2341,6 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
>  	int vmid;
>  	struct arm_smmu_device *smmu = smmu_domain->smmu;
>  	struct arm_smmu_s2_cfg *cfg = &smmu_domain->s2_cfg;
> -	typeof(&pgtbl_cfg->arm_lpae_s2_cfg.vtcr) vtcr;
>  
>  	/* Reserve VMID 0 for stage-2 bypass STEs */
>  	vmid = ida_alloc_range(&smmu->vmid_map, 1, (1 << smmu->vmid_bits) - 1,
> @@ -2336,16 +2348,7 @@ static int arm_smmu_domain_finalise_s2(struct arm_smmu_domain *smmu_domain,
>  	if (vmid < 0)
>  		return vmid;
>  
> -	vtcr = &pgtbl_cfg->arm_lpae_s2_cfg.vtcr;
>  	cfg->vmid	= (u16)vmid;
> -	cfg->vttbr	= pgtbl_cfg->arm_lpae_s2_cfg.vttbr;
> -	cfg->vtcr	= FIELD_PREP(STRTAB_STE_2_VTCR_S2T0SZ, vtcr->tsz) |
> -			  FIELD_PREP(STRTAB_STE_2_VTCR_S2SL0, vtcr->sl) |
> -			  FIELD_PREP(STRTAB_STE_2_VTCR_S2IR0, vtcr->irgn) |
> -			  FIELD_PREP(STRTAB_STE_2_VTCR_S2OR0, vtcr->orgn) |
> -			  FIELD_PREP(STRTAB_STE_2_VTCR_S2SH0, vtcr->sh) |
> -			  FIELD_PREP(STRTAB_STE_2_VTCR_S2TG, vtcr->tg) |
> -			  FIELD_PREP(STRTAB_STE_2_VTCR_S2PS, vtcr->ps);
>  	return 0;
>  }
>  
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index 65fb388d51734d..eb669121f1954d 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -609,8 +609,6 @@ struct arm_smmu_ctx_desc_cfg {
>  
>  struct arm_smmu_s2_cfg {
>  	u16				vmid;
> -	u64				vttbr;
> -	u64				vtcr;
>  };
>  
>  struct arm_smmu_strtab_cfg {
> -- 
> 2.43.0
>

Reviewed-by: Mostafa Saleh <smostafa@google.com>

Thanks,
Mostafa

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2024-01-25 23:57 ` [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev Jason Gunthorpe
@ 2024-02-01 12:15   ` Mostafa Saleh
  2024-02-01 13:24     ` Jason Gunthorpe
  0 siblings, 1 reply; 39+ messages in thread
From: Mostafa Saleh @ 2024-02-01 12:15 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Hi Jason,

On Thu, Jan 25, 2024 at 07:57:16PM -0400, Jason Gunthorpe wrote:
> The BTM support wants to be able to change the ASID of any smmu_domain.
> When it goes to do this it holds the arm_smmu_asid_lock and iterates over
> the target domain's devices list.
> 
> During attach of a S1 domain we must ensure that the devices list and
> CD are in sync, otherwise we could miss CD updates or a parallel CD update
> could push an out of date CD.
> 
> This is pretty complicated, and almost works today because
> arm_smmu_detach_dev() removes the master from the linked list before
> working on the CD entries, preventing parallel update of the CD.
> 
> However, it does have an issue where the CD can remain programed while the
> domain appears to be unattached. arm_smmu_share_asid() will then not clear
> any CD entriess and install its own CD entry with the same ASID
> concurrently. This creates a small race window where the IOMMU can see two
> ASIDs pointing to different translations.

I don’t see the race condition.

The current flow is as follows,
For SVA, if the asid was used by domain_x, it will do:

lock(arm_smmu_asid_lock)
Alloc new asid and set cd->asid.
lock(domain_x->devices_lock)
Write new CD with the new asid
unlock(domain_x->devices_lock)
unlock(arm_smmu_asid_lock)

For attach_dev (domain_y), if the device was attached to domain_z
//Detach old domain
lock(domain_z->devices_lock)
Remove master from old domain
unlock(domain_z->devices_lock)
Clear CD
//Attach new domain
lock(arm_smmu_asid_lock)
Allocate ASID
unlock(arm_smmu_asid_lock)

lock(domain_y->devices_lock)
Insert new master.
unlock(domain_y->devices_lock)

lock(arm_smmu_asid_lock)
Write CD
unlock(arm_smmu_asid_lock)


In case
1) domain_x == domain_z (old domain)
Writes to the CD are protected by domain_x->devices_lock, so either:
    a) The device will be removed, so the SVA code will not touch it, and the
    detach will clear the CD.
    b) The device CD will be updated by the SVA code with the new ASID, but then
    it will be removed from the domain and cleared.

I don’t see any case where we end up with a CD left programmed.

2) domain_x == domain_y(new domain)

Similarly the device would either see the new CD(new asid) or the old CD
then the new CD.

Can you please clarify the race condition? as it seems I am missing something.

> Solve this by wrapping most of the attach flow in the
> arm_smmu_asid_lock. This locks more than strictly needed to prepare for
> the next patch which will reorganize the order of the linked list, STE and
> CD changes.
> 
> Move arm_smmu_detach_dev() till after we have initialized the domain so
> the lock can be held for less time.
> 
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 22 ++++++++++++---------
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 9a95d0f1494223..539ef380f457fa 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -2612,8 +2612,6 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  		return -EBUSY;
>  	}
>  
> -	arm_smmu_detach_dev(master);
> -
>  	mutex_lock(&smmu_domain->init_mutex);
>  
>  	if (!smmu_domain->smmu) {
> @@ -2628,6 +2626,16 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  	if (ret)
>  		return ret;
>  
> +	/*
> +	 * Prevent arm_smmu_share_asid() from trying to change the ASID
> +	 * of either the old or new domain while we are working on it.
> +	 * This allows the STE and the smmu_domain->devices list to
> +	 * be inconsistent during this routine.
> +	 */
> +	mutex_lock(&arm_smmu_asid_lock);
> +
> +	arm_smmu_detach_dev(master);
> +
>  	master->domain = smmu_domain;
>  
>  	/*
> @@ -2653,13 +2661,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  			}
>  		}
>  
> -		/*
> -		 * Prevent SVA from concurrently modifying the CD or writing to
> -		 * the CD entry
> -		 */
> -		mutex_lock(&arm_smmu_asid_lock);
>  		ret = arm_smmu_write_ctx_desc(master, IOMMU_NO_PASID, &smmu_domain->cd);
> -		mutex_unlock(&arm_smmu_asid_lock);
>  		if (ret) {
>  			master->domain = NULL;
>  			goto out_list_del;
> @@ -2669,13 +2671,15 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  	arm_smmu_install_ste_for_dev(master);
>  
>  	arm_smmu_enable_ats(master);
> -	return 0;
> +	goto out_unlock;
>  
>  out_list_del:
>  	spin_lock_irqsave(&smmu_domain->devices_lock, flags);
>  	list_del(&master->domain_head);
>  	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>  
> +out_unlock:
> +	mutex_unlock(&arm_smmu_asid_lock);
>  	return ret;
>  }
>  
> -- 
> 2.43.0
> 

Thanks,
Mostafa

* Re: [PATCH v4 07/16] iommu/arm-smmu-v3: Compute the STE only once for each master
  2024-01-25 23:57 ` [PATCH v4 07/16] iommu/arm-smmu-v3: Compute the STE only once for each master Jason Gunthorpe
@ 2024-02-01 12:18   ` Mostafa Saleh
  0 siblings, 0 replies; 39+ messages in thread
From: Mostafa Saleh @ 2024-02-01 12:18 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

Hi Jason,

On Thu, Jan 25, 2024 at 07:57:17PM -0400, Jason Gunthorpe wrote:
> Currently arm_smmu_install_ste_for_dev() iterates over every SID and
> computes from scratch an identical STE. Every SID should have the same STE
> contents. Turn this inside out so that the STE is supplied by the caller
> and arm_smmu_install_ste_for_dev() simply installs it to every SID.
> 
> This is possible now that the STE generation does not inform what sequence
> should be used to program it.
> 
> This allows splitting the STE calculation up according to the call site,
> which following patches will make use of, and removes the confusing NULL
> domain special case that only supported arm_smmu_detach_dev().
> 
> Reviewed-by: Michael Shavit <mshavit@google.com>
> Reviewed-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com>
> Tested-by: Nicolin Chen <nicolinc@nvidia.com>
> Tested-by: Moritz Fischer <moritzf@google.com>
> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
> ---
>  drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 57 ++++++++-------------
>  1 file changed, 22 insertions(+), 35 deletions(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index 539ef380f457fa..cf3e348cb9abe1 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1588,35 +1588,6 @@ static void arm_smmu_make_s2_domain_ste(struct arm_smmu_ste *target,
>  				      STRTAB_STE_3_S2TTB_MASK);
>  }
>  
> -static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
> -				      struct arm_smmu_ste *dst)
> -{
> -	struct arm_smmu_domain *smmu_domain = master->domain;
> -	struct arm_smmu_ste target = {};
> -
> -	if (!smmu_domain) {
> -		if (disable_bypass)
> -			arm_smmu_make_abort_ste(&target);
> -		else
> -			arm_smmu_make_bypass_ste(&target);
> -		arm_smmu_write_ste(master, sid, dst, &target);
> -		return;
> -	}
> -
> -	switch (smmu_domain->stage) {
> -	case ARM_SMMU_DOMAIN_S1:
> -		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
> -		break;
> -	case ARM_SMMU_DOMAIN_S2:
> -		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
> -		break;
> -	case ARM_SMMU_DOMAIN_BYPASS:
> -		arm_smmu_make_bypass_ste(&target);
> -		break;
> -	}
> -	arm_smmu_write_ste(master, sid, dst, &target);
> -}
> -
>  static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
>  				      unsigned int nent)
>  {
> @@ -2439,7 +2410,8 @@ arm_smmu_get_step_for_sid(struct arm_smmu_device *smmu, u32 sid)
>  	}
>  }
>  
> -static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
> +static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master,
> +					 const struct arm_smmu_ste *target)
>  {
>  	int i, j;
>  	struct arm_smmu_device *smmu = master->smmu;
> @@ -2456,7 +2428,7 @@ static void arm_smmu_install_ste_for_dev(struct arm_smmu_master *master)
>  		if (j < i)
>  			continue;
>  
> -		arm_smmu_write_strtab_ent(master, sid, step);
> +		arm_smmu_write_ste(master, sid, step, target);
>  	}
>  }
>  
> @@ -2563,6 +2535,7 @@ static void arm_smmu_disable_pasid(struct arm_smmu_master *master)
>  static void arm_smmu_detach_dev(struct arm_smmu_master *master)
>  {
>  	unsigned long flags;
> +	struct arm_smmu_ste target;
>  	struct arm_smmu_domain *smmu_domain = master->domain;
>  
>  	if (!smmu_domain)
> @@ -2576,7 +2549,11 @@ static void arm_smmu_detach_dev(struct arm_smmu_master *master)
>  
>  	master->domain = NULL;
>  	master->ats_enabled = false;
> -	arm_smmu_install_ste_for_dev(master);
> +	if (disable_bypass)
> +		arm_smmu_make_abort_ste(&target);
> +	else
> +		arm_smmu_make_bypass_ste(&target);
> +	arm_smmu_install_ste_for_dev(master, &target);
>  	/*
>  	 * Clearing the CD entry isn't strictly required to detach the domain
>  	 * since the table is uninstalled anyway, but it helps avoid confusion
> @@ -2591,6 +2568,7 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  {
>  	int ret = 0;
>  	unsigned long flags;
> +	struct arm_smmu_ste target;
>  	struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev);
>  	struct arm_smmu_device *smmu;
>  	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
> @@ -2652,7 +2630,8 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  	list_add(&master->domain_head, &smmu_domain->devices);
>  	spin_unlock_irqrestore(&smmu_domain->devices_lock, flags);
>  
> -	if (smmu_domain->stage == ARM_SMMU_DOMAIN_S1) {
> +	switch (smmu_domain->stage) {
> +	case ARM_SMMU_DOMAIN_S1:
>  		if (!master->cd_table.cdtab) {
>  			ret = arm_smmu_alloc_cd_tables(master);
>  			if (ret) {
> @@ -2666,9 +2645,17 @@ static int arm_smmu_attach_dev(struct iommu_domain *domain, struct device *dev)
>  			master->domain = NULL;
>  			goto out_list_del;
>  		}
> -	}
>  
> -	arm_smmu_install_ste_for_dev(master);
> +		arm_smmu_make_cdtable_ste(&target, master, &master->cd_table);
> +		break;
> +	case ARM_SMMU_DOMAIN_S2:
> +		arm_smmu_make_s2_domain_ste(&target, master, smmu_domain);
> +		break;
> +	case ARM_SMMU_DOMAIN_BYPASS:
> +		arm_smmu_make_bypass_ste(&target);
> +		break;
> +	}
> +	arm_smmu_install_ste_for_dev(master, &target);
>  
>  	arm_smmu_enable_ats(master);
>  	goto out_unlock;
> -- 
> 2.43.0
>

Reviewed-by: Mostafa Saleh <smostafa@google.com>

Thanks,
Mostafa

* Re: [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass
  2024-02-01 11:32       ` Mostafa Saleh
@ 2024-02-01 13:02         ` Jason Gunthorpe
  0 siblings, 0 replies; 39+ messages in thread
From: Jason Gunthorpe @ 2024-02-01 13:02 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Thu, Feb 01, 2024 at 11:32:41AM +0000, Mostafa Saleh wrote:
> On Wed, Jan 31, 2024 at 10:47:02AM -0400, Jason Gunthorpe wrote:
> > On Wed, Jan 31, 2024 at 02:40:24PM +0000, Mostafa Saleh wrote:
> > > > +static void arm_smmu_make_bypass_ste(struct arm_smmu_ste *target)
> > > > +{
> > > > +	memset(target, 0, sizeof(*target));
> > > 
> > > I see this can be used with the actual STE. Although this is done at init,
> > > briefly making the STE abort from “arm_smmu_make_bypass_ste” seems a bit
> > > fragile to me; if we use this in the future in different scenarios, it
> > > might break the hitless assumption. No strong opinion though.
> > 
> > At init time, when that case happens, the STE table hasn't been
> > installed in the HW yet. This is why that specific code path has been
> > directly manipulating the STE and does not call the normal update path
> > with the sync'ing.
> > 
> > It is perhaps subtle; that is why it is a different flow.
> > 
> > (this is also why I moved the function order as it was a bit obscure
> > to see that indeed all this stuff was sequenced right and we were not
> > updating a live STE improperly)
> 
> I agree this is not an issue, but it was confusing when I first read it,
> since “arm_smmu_make_bypass_ste” would make the STE transiently unavailable,
> and I had to go and check the usage. Maybe we can add a comment to clarify
> how this function is expected to be used.

I added this remark:

@@ -1592,6 +1592,10 @@ static void arm_smmu_write_strtab_ent(struct arm_smmu_master *master, u32 sid,
        arm_smmu_write_ste(master, sid, dst, &target);
 }
 
+/*
+ * This can safely directly manipulate the STE memory without a sync sequence
+ * because the STE table has not been installed in the SMMU yet.
+ */
 static void arm_smmu_init_bypass_stes(struct arm_smmu_ste *strtab,
                                      unsigned int nent)
 {
@@ -3971,6 +3975,10 @@ static void arm_smmu_rmr_install_bypass_ste(struct arm_smmu_device *smmu)
                                continue;
                        }
 
+                       /*
+                        * STE table is not programmed to HW, see
+                        * arm_smmu_init_bypass_stes()
+                        */
                        arm_smmu_make_bypass_ste(
                                arm_smmu_get_step_for_sid(smmu, rmr->sids[i]));
                }

Jason

* Re: [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2024-02-01 12:15   ` Mostafa Saleh
@ 2024-02-01 13:24     ` Jason Gunthorpe
  2024-02-13 13:30       ` Mostafa Saleh
  0 siblings, 1 reply; 39+ messages in thread
From: Jason Gunthorpe @ 2024-02-01 13:24 UTC (permalink / raw)
  To: Mostafa Saleh
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Thu, Feb 01, 2024 at 12:15:53PM +0000, Mostafa Saleh wrote:
> Hi Jason,
> 
> On Thu, Jan 25, 2024 at 07:57:16PM -0400, Jason Gunthorpe wrote:
> > The BTM support wants to be able to change the ASID of any smmu_domain.
> > When it goes to do this it holds the arm_smmu_asid_lock and iterates over
> > the target domain's devices list.
> > 
> > During attach of a S1 domain we must ensure that the devices list and
> > CD are in sync, otherwise we could miss CD updates or a parallel CD update
> > could push an out of date CD.
> > 
> > This is pretty complicated, and almost works today because
> > arm_smmu_detach_dev() removes the master from the linked list before
> > working on the CD entries, preventing parallel update of the CD.
> > 
> > However, it does have an issue where the CD can remain programmed while the
> > domain appears to be unattached. arm_smmu_share_asid() will then not clear
> > any CD entries and will install its own CD entry with the same ASID
> > concurrently. This creates a small race window where the IOMMU can see two
> > ASIDs pointing to different translations.
> 
> I don’t see the race condition.
> 
> The current flow is as follows,
> For SVA, if the asid was used by domain_x, it will do:
> 
> lock(arm_smmu_asid_lock)
> Alloc new asid and set cd->asid.
> lock(domain_x->devices_lock)
> Write new CD with the new asid
> unlock(domain_x->devices_lock)
> unlock(arm_smmu_asid_lock)
> 
> For attach_dev (domain_y), if the device was attached to domain_z
> //Detach old domain
> lock(domain_z->devices_lock)
> Remove master from old domain
> unlock(domain_z->devices_lock)

At this moment all locks are dropped and the RID's CD entry continues
to use the ASID.

The racing BTM flow now runs and performs the sequence you described above:

arm_smmu_mmu_notifier_get()
 arm_smmu_alloc_shared_cd()
  arm_smmu_share_asid():
    arm_smmu_update_ctx_desc_devices() <<- Does nothing due to list_del above
    arm_smmu_tlb_inv_asid() <<-- Woops, we are invalidating an ASID that is still in a CD!
 arm_smmu_write_ctx_desc() <<-- Install a new translation on a PASID's CD

Now the HW can observe two installed CDs using the same ASID but they
point to different translations. This is illegal.

> Clear CD

Now we remove the RID's CD, but it is too late; the PASID CD is already
installed.

ASID/VMID lifecycle must be strictly contained to ensure the cache
remains coherent:

1. All programmed STE/CDs using the ASID/VMID must always point to the
   same translation

2. All references to a ASID/VMID must be removed from their STE/CDs
   before the ASID is flushed

3. The ASID/VMID must be flushed before it is assigned to a STE/CD
   with a new translation.

We solve this by requiring that the arm_smmu_asid_lock be held such
that the smmu_domain->devices list AND the actual content of the CD
tables are always observed to be consistent.
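
To make that concrete, here is a tiny user-space sketch of the locking shape.
It is a toy model I made up for illustration, not the driver code: a pthread
mutex stands in for arm_smmu_asid_lock, and the invented struct fields stand
in for devices-list membership and for the ASID programmed into the CD.

	/* toy model of the invariant -- not the driver code */
	#include <pthread.h>
	#include <stdbool.h>
	#include <stdio.h>

	static pthread_mutex_t asid_lock = PTHREAD_MUTEX_INITIALIZER;

	struct master {
		bool on_devices_list;	/* membership in smmu_domain->devices */
		int cd_asid;		/* ASID programmed in the CD, 0 = cleared */
	};

	/*
	 * attach path: the list manipulation and the CD write form one
	 * critical section, so a concurrent ASID change can never observe
	 * "not on the list" while the old ASID is still programmed.
	 */
	static void attach_dev(struct master *m, int new_asid)
	{
		pthread_mutex_lock(&asid_lock);
		m->on_devices_list = false;	/* detach from the old domain */
		m->cd_asid = 0;			/* clear the old CD */
		m->on_devices_list = true;	/* join the new domain */
		m->cd_asid = new_asid;		/* program the new CD */
		pthread_mutex_unlock(&asid_lock);
	}

	/*
	 * BTM-style ASID steal: under the same lock the devices list and the
	 * CD contents are consistent, so every CD still using old_asid is
	 * rewritten before old_asid can be reused for a new translation.
	 */
	static void share_asid(struct master *m, int old_asid, int new_asid)
	{
		pthread_mutex_lock(&asid_lock);
		if (m->on_devices_list && m->cd_asid == old_asid)
			m->cd_asid = new_asid;
		/* safe point: no CD references old_asid any more */
		pthread_mutex_unlock(&asid_lock);
	}

	int main(void)
	{
		struct master m = { .on_devices_list = true, .cd_asid = 1 };

		share_asid(&m, 1, 2);	/* runs concurrently in the real driver */
		attach_dev(&m, 3);
		printf("cd_asid=%d\n", m.cd_asid);
		return 0;
	}

In the real attach path the same critical section also covers the CD table
allocation and the STE programming, as in the patch above.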

Jason

* Re: [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev
  2024-02-01 13:24     ` Jason Gunthorpe
@ 2024-02-13 13:30       ` Mostafa Saleh
  0 siblings, 0 replies; 39+ messages in thread
From: Mostafa Saleh @ 2024-02-13 13:30 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: iommu, Joerg Roedel, linux-arm-kernel, Robin Murphy, Will Deacon,
	Moritz Fischer, Moritz Fischer, Michael Shavit, Nicolin Chen,
	patches, Shameer Kolothum

On Thu, Feb 01, 2024 at 09:24:43AM -0400, Jason Gunthorpe wrote:
> On Thu, Feb 01, 2024 at 12:15:53PM +0000, Mostafa Saleh wrote:
> > Hi Jason,
> > 
> > On Thu, Jan 25, 2024 at 07:57:16PM -0400, Jason Gunthorpe wrote:
> > > The BTM support wants to be able to change the ASID of any smmu_domain.
> > > When it goes to do this it holds the arm_smmu_asid_lock and iterates over
> > > the target domain's devices list.
> > > 
> > > During attach of a S1 domain we must ensure that the devices list and
> > > CD are in sync, otherwise we could miss CD updates or a parallel CD update
> > > could push an out of date CD.
> > > 
> > > This is pretty complicated, and almost works today because
> > > arm_smmu_detach_dev() removes the master from the linked list before
> > > working on the CD entries, preventing parallel update of the CD.
> > > 
> > > However, it does have an issue where the CD can remain programmed while the
> > > domain appears to be unattached. arm_smmu_share_asid() will then not clear
> > > any CD entries and will install its own CD entry with the same ASID
> > > concurrently. This creates a small race window where the IOMMU can see two
> > > ASIDs pointing to different translations.
> > 
> > I don’t see the race condition.
> > 
> > The current flow is as follows,
> > For SVA, if the asid was used by domain_x, it will do:
> > 
> > lock(arm_smmu_asid_lock)
> > Alloc new asid and set cd->asid.
> > lock(domain_x->devices_lock)
> > Write new CD with the new asid
> > unlock(domain_x->devices_lock)
> > unlock(arm_smmu_asid_lock)
> > 
> > For attach_dev (domain_y), if the device was attached to domain_z
> > //Detach old domain
> > lock(domain_z->devices_lock)
> > Remove master from old domain
> > unlock(domain_z->devices_lock)
> 
> At this moment all locks are dropped and the RID's CD entry continues
> to use the ASID.
> 
> The racing BTM flow now runs and performs the sequence you described above:
> 
> arm_smmu_mmu_notifier_get()
>  arm_smmu_alloc_shared_cd()
>   arm_smmu_share_asid():
>     arm_smmu_update_ctx_desc_devices() <<- Does nothing due to list_del above
>     arm_smmu_tlb_inv_asid() <<-- Woops, we are invalidating an ASID that is still in a CD!
>  arm_smmu_write_ctx_desc() <<-- Install a new translation on a PASID's CD
> 
> Now the HW can observe two installed CDs using the same ASID but they
> point to different translations. This is illegal.
> 
> > Clear CD
> 
> Now we remove the RID CD, but it is too late, the PASID CD is already
> installed.
> 
> ASID/VMID lifecycle must be strictly contained to ensure the cache
> remains coherent:
> 
> 1. All programmed STE/CDs using the ASID/VMID must always point to the
>    same translation
> 
> 2. All references to a ASID/VMID must be removed from their STE/CDs
>    before the ASID is flushed
> 
> 3. The ASID/VMID must be flushed before it is assigned to a STE/CD
>    with a new translation.
> 
> We solve this by requiring that the arm_smmu_asid_lock must be held
> such that the smmu_domains->devices list AND the actual content of the
> CD tables are always observed to be consistent.
> 
> Jason

I see, thanks a lot for the detailed explanation.
Maybe this can be added to the changelog so it’s documented somewhere.

Also, I guess this is mainly theoretical, as it requires the detached device to
issue DMA while being detached?

Thanks,
Mostafa

Thread overview: 39+ messages
2024-01-25 23:57 [PATCH v4 00/16] Update SMMUv3 to the modern iommu API (part 1/3) Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 01/16] iommu/arm-smmu-v3: Make STE programming independent of the callers Jason Gunthorpe
2024-01-26  4:03   ` Michael Shavit
2024-01-29 19:53   ` Moritz Fischer
2024-01-30 22:42   ` Mostafa Saleh
2024-01-30 23:56     ` Jason Gunthorpe
2024-01-31 14:34       ` Mostafa Saleh
2024-01-31 14:40         ` Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 02/16] iommu/arm-smmu-v3: Consolidate the STE generation for abort/bypass Jason Gunthorpe
2024-01-31 14:40   ` Mostafa Saleh
2024-01-31 14:47     ` Jason Gunthorpe
2024-02-01 11:32       ` Mostafa Saleh
2024-02-01 13:02         ` Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 03/16] iommu/arm-smmu-v3: Move arm_smmu_rmr_install_bypass_ste() Jason Gunthorpe
2024-01-29 15:07   ` Shameerali Kolothum Thodi
2024-01-29 15:43     ` Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 04/16] iommu/arm-smmu-v3: Move the STE generation for S1 and S2 domains into functions Jason Gunthorpe
2024-01-31 14:50   ` Mostafa Saleh
2024-01-31 15:05     ` Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 05/16] iommu/arm-smmu-v3: Build the whole STE in arm_smmu_make_s2_domain_ste() Jason Gunthorpe
2024-02-01 11:34   ` Mostafa Saleh
2024-01-25 23:57 ` [PATCH v4 06/16] iommu/arm-smmu-v3: Hold arm_smmu_asid_lock during all of attach_dev Jason Gunthorpe
2024-02-01 12:15   ` Mostafa Saleh
2024-02-01 13:24     ` Jason Gunthorpe
2024-02-13 13:30       ` Mostafa Saleh
2024-01-25 23:57 ` [PATCH v4 07/16] iommu/arm-smmu-v3: Compute the STE only once for each master Jason Gunthorpe
2024-02-01 12:18   ` Mostafa Saleh
2024-01-25 23:57 ` [PATCH v4 08/16] iommu/arm-smmu-v3: Do not change the STE twice during arm_smmu_attach_dev() Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 09/16] iommu/arm-smmu-v3: Put writing the context descriptor in the right order Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 10/16] iommu/arm-smmu-v3: Pass smmu_domain to arm_enable/disable_ats() Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 11/16] iommu/arm-smmu-v3: Remove arm_smmu_master->domain Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 12/16] iommu/arm-smmu-v3: Add a global static IDENTITY domain Jason Gunthorpe
2024-01-29 18:11   ` Shameerali Kolothum Thodi
2024-01-29 18:37     ` Jason Gunthorpe
2024-01-30  8:35       ` Shameerali Kolothum Thodi
2024-01-25 23:57 ` [PATCH v4 13/16] iommu/arm-smmu-v3: Add a global static BLOCKED domain Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 14/16] iommu/arm-smmu-v3: Use the identity/blocked domain during release Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 15/16] iommu/arm-smmu-v3: Pass arm_smmu_domain and arm_smmu_device to finalize Jason Gunthorpe
2024-01-25 23:57 ` [PATCH v4 16/16] iommu/arm-smmu-v3: Convert to domain_alloc_paging() Jason Gunthorpe
