Cache Invalidation Solution for Nested IOMMU

* Cache Invalidation Solution for Nested IOMMU
@ 2023-04-03  0:33 Nicolin Chen
  2023-04-03  7:26 ` Liu, Yi L
                   ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Nicolin Chen @ 2023-04-03  0:33 UTC (permalink / raw)
  To: Robin Murphy, jgg, kevin.tian
  Cc: yi.l.liu, eric.auger, baolu.lu, shameerali.kolothum.thodi,
	jean-philippe, iommu

Hi all,

Per discussion in the nested SMMUv3 series[1], we have come up
with a couple of ideas to accelerate the invalidation uAPI.

I have drafted two solutions and would like to collect all of
your inputs, so we shall decide which approach we would choose
to put in the next version.

The first version is simply to individually forward the entire
command. This can save a few CPU cycles from packing/unpacking
invalidation fields of the commands via a data structure, v.s.
the structure in v1[2].

[User Data]
struct iommu_hwpt_invalidate_arm_smmuv3 {
	__u64 cmd[2];
	__u32 error;
	__u32 __reserved;
};

[Driver Handler]
// This draft requires set_rid/unset_rid in the 2nd draft for
// command sanity to report errors.
static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
					   void *user_data)
{
	struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct arm_smmu_device *smmu = smmu_domain->smmu;
	u64 cmd[CMDQ_ENT_DWORDS];

	if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
		return;

	cmd[0] = inv_info->cmd[0];
	cmd[1] = inv_info->cmd[1];
	switch (cmd[0] & CMDQ_0_OP) {
	case CMDQ_OP_TLBI_NSNH_ALL:
		cmd[0] &= ~CMDQ_0_OP;
		cmd[0] |= CMDQ_OP_TLBI_NH_ALL;
		fallthrough;
	case CMDQ_OP_TLBI_NH_VA:
	case CMDQ_OP_TLBI_NH_VAA:
	case CMDQ_OP_TLBI_NH_ALL:
	case CMDQ_OP_TLBI_NH_ASID:
		cmd[0] &= ~CMDQ_TLBI_0_VMID;
		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID,
				     smmu_domain->s2->s2_cfg.vmid);
		arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
		break;
	case CMDQ_OP_CFGI_CD:
	case CMDQ_OP_CFGI_CD_ALL:
		arm_smmu_sync_cd(smmu_domain,
				 FIELD_GET(CMDQ_CFGI_0_SSID, cmd[0]), false);
		break;
	default:
		return;
	}
}

The second version is a bit further to support batching, by sharing
a kernel page via mmap().

Firstly, I drafted a pair of new ioctls and iommu_ops callbacks to
link vSID with pSID. This seems to be useful for Intel VT-d also.
https://github.com/nicolinc/iommufd/commit/cbec01245edc70e5de88fbc477d8f0af50b3b0ed

Then I added a new mmap interface to share kernel page(s) from the
Driver, to allow QEMU to write all TLBI commands as a single batch.
Then it can initiate the batch invalidation via another synchronous
hypercall.
https://github.com/nicolinc/iommufd/commit/ee717eb6d46b5285db1aae9172ecdfc70b9cd9ca

The two TMP changes are available in this repo:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04022023
This solution certainly needs some rework before it can be posted
in the next version for review. Yet again, I'd like to see if it
can be a good direction for us to adopt in v2.

The new set_rid/unset_rid ioctls and the mmap interface would be
essential for VCMDQ support that we'd like to achieve at the end
of this journey. So, personally I'd like to see it can be used at
this stage, by the generic SMMUv3 (and potentially VT-d) too.

Thanks
Nicolin

[1] https://lore.kernel.org/linux-iommu/ZBe3kxRXf+VbKy+m@Asurada-Nvidia/
[2] https://lore.kernel.org/linux-iommu/364cfbe5b228ab178093db2de13fa3accf7a6120.1678348754.git.nicolinc@nvidia.com/

^ permalink raw reply	[flat|nested] 35+ messages in thread